The degree to which transformers scale, as measured by loss (the training objective), is known remarkably well! See [1]. There's a formula in there for the minimum loss you could possibly achieve even with infinite compute and training data, and it's hardly less than Chinchilla's loss. The recent GPT-4 paper further reinforces that these scaling laws are real: the loss of the final model was predicted with high accuracy from the data and compute used.
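For reference, the parametric fit in [1] has the form L(N, D) = E + A/N^α + B/D^β, where N is parameter count, D is training tokens, and E is the irreducible loss left even at infinite scale. The constants below are my best recollection of the fitted values, so treat them as approximate; a minimal sketch:

```python
# Chinchilla-style parametric loss fit from [1]; constants quoted from
# memory and therefore approximate.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy (nats/token) for n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Roughly Chinchilla itself: ~70B params, ~1.4T tokens.
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94, not far above the floor E = 1.69
```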
Admittedly, the link between improvement in loss and improvement of capabilities may break down or be misleading.
However, there's just not much training data on the internet left unused, maybe an order of magnitude more. All books ever published (in English?) amount to a smaller dataset than the corpora already used for training. See [2] (which includes an accessible summary of much of [1]). And the scaling laws show that training data, rather than compute, is already the bottleneck.
Comparing loss between different training runs and hyperparameters isn't very accurate. LLaMA's loss metrics don't really match Chinchilla's, for instance: it went below the minimum possible loss stated by Chinchilla.
More importantly, these models are extremely sensitive to loss. Going from 2.0 to 1.8 might not seem like much, but it's a huge gain in performance.
GPT-2's loss was 2.57; GPT-3's was 2.0.
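To make that concrete: if those figures are cross-entropy in nats per token (as in the scaling-law papers), the gap in per-token perplexity is bigger than the raw loss numbers suggest. A quick sketch; if the figures are actually in bits, substitute 2**loss:

```python
import math

def perplexity(loss_nats_per_token: float) -> float:
    # Perplexity is exp of the cross-entropy when loss is in nats/token.
    return math.exp(loss_nats_per_token)

for loss in (2.57, 2.0, 1.8):
    print(f"loss {loss:.2f} nats/token -> perplexity {perplexity(loss):.1f}")

# loss 2.57 -> perplexity ~13.1
# loss 2.00 -> perplexity ~7.4
# loss 1.80 -> perplexity ~6.0
```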
And there is plenty of training data left. Perhaps not easily accessible, but it's there.
True that a scaling law only applies to models within a family, which allows some but not full freedom in hyperparameters. And that most of the minimum loss is just due to the inherent unpredictability of language, so 2.0 vs 1.8 bits should really be thought of as (say) 0.3 vs 0.1 bits of reducible loss plus an irrelevant 1.7 bits of randomness.
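Spelling out that decomposition, using the hypothetical 1.7-bit floor from above rather than any measured value:

```python
# Illustrative only: split total loss into an irreducible "entropy of
# language" floor plus a reducible modelling error. The 1.7-bit floor is
# the made-up figure from the comment above, not a measurement.
ENTROPY_FLOOR = 1.7  # bits/token, hypothetical

for total_loss in (2.0, 1.8):
    reducible = total_loss - ENTROPY_FLOOR
    print(f"total {total_loss} bits -> reducible {reducible:.1f} bits")

# Going from 2.0 to 1.8 bits total cuts the reducible part from 0.3 to 0.1,
# i.e. a 3x reduction in the part of the loss the model can actually improve.
```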
I hadn't actually looked at the LLaMA paper; that's an interesting note. However, AFAICT GPT-3, LLaMA and Chinchilla do not use the same tokenizer, so their losses are not comparable. GPT-2 and GPT-3 use the same custom BPE tokenizer. LLaMA uses SentencePiece, but that generates a vocabulary specific to the training data it's run on. Chinchilla used "a slightly modified SentencePiece tokenizer that does not apply NFKC normalisation. The vocabulary is very similar– 94.15% of tokens are the same as those used for training Gopher".
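One way to make cross-tokenizer comparisons at least roughly meaningful is to convert per-token loss into bits per UTF-8 byte of the underlying text, which removes the dependence on vocabulary size. A sketch with made-up numbers:

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Tokenizer-agnostic score: total cross-entropy over the corpus,
    expressed as bits per UTF-8 byte of the original text."""
    return mean_loss_nats_per_token * n_tokens / (n_bytes * math.log(2))

# The same 1,000,000-byte text, split into different numbers of tokens by
# two hypothetical tokenizers, each scored at 2.0 nats/token on average.
text_bytes = 1_000_000
for n_tokens in (250_000, 200_000):
    bpb = bits_per_byte(2.0, n_tokens, text_bytes)
    print(f"{n_tokens} tokens at 2.0 nats/token -> {bpb:.3f} bits/byte")

# Identical per-token losses, but the coarser tokenizer (200k tokens)
# corresponds to better per-byte compression; per-token numbers hide that.
```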
Even if there is a lot more text available, that doesn't mean it's good training material. And the better free sources are already used. E.g. LLaMA was trained on the 64% of GitHub with a compatible license (and you're not going to gather much more source code than that), all the free book texts they could find, all of arXiv, all English pages in CommonCrawl that were classified as "reference" quality, etc. arXiv, for example, isn't all scientific papers ever written, but it is a large fraction of them. All private emails stored by a large email service would probably be one of the biggest untapped valuable sources.