Cerebras is "training compute optimal". Llama appears to be trained far beyond that point. The tradeoff is that Llama is closer to inference optimal, i.e. better performance from a smaller model.
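Rough numbers to illustrate the tradeoff (token counts are approximate figures from the respective releases, and the common 6·N·D rule is only a coarse estimate of training FLOPs):

    # Back-of-envelope: tokens per parameter and approximate training FLOPs,
    # using C ~= 6 * N * D (N = params, D = training tokens). Token counts are approximate.
    models = {
        "Cerebras-GPT 13B": (13e9, 260e9),   # ~20 tokens/param (compute-optimal recipe)
        "LLaMA 13B":        (13e9, 1.0e12),  # ~77 tokens/param (trained well past that point)
    }
    for name, (n_params, n_tokens) in models.items():
        flops = 6 * n_params * n_tokens
        print(f"{name}: {n_tokens / n_params:.0f} tokens/param, ~{flops:.2e} training FLOPs")

Same parameter count, roughly four times the training compute for LLaMA, which is exactly the "spend more on training to get a cheaper-to-serve model" tradeoff.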
> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)
> We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license
Which is the point made elsewhere in these comments, e.g. https://news.ycombinator.com/item?id=35344192, and it also usefully shows how open Cerebras are: pretty open, but not as open as they would be if their goal were to fill in other companies' moats.
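To make the quoted distinction concrete, here's a back-of-envelope sketch using the C ≈ 6·N·D approximation and the ~20 tokens-per-parameter heuristic; the FLOP budget below is a made-up number I picked so the compute-optimal answer lands near a 13B model, not anything Cerebras published:

    import math

    def compute_optimal(c_flops, tokens_per_param=20):
        """Fixed compute budget: pick N and D jointly, with D ~= 20 * N."""
        # C ~= 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
        n = math.sqrt(c_flops / (6 * tokens_per_param))
        return n, tokens_per_param * n

    def fixed_params(c_flops, n_params):
        """Fixed parameter budget: spend all the compute on more tokens instead."""
        return c_flops / (6 * n_params)

    budget = 2.0e22  # hypothetical training budget in FLOPs
    n, d = compute_optimal(budget)
    print(f"compute-optimal: ~{n / 1e9:.1f}B params on ~{d / 1e9:.0f}B tokens")
    print(f"same budget, 6.7B params fixed: ~{fixed_params(budget, 6.7e9) / 1e9:.0f}B tokens")

Same budget, two very different recipes: the first minimises loss per FLOP spent on training, the second is what you do when the model has to fit on particular hardware.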
Indeed, but this is zero-shot performance. Fine-tuning for a task should get you pretty good results. I'm interested in seeing the results of an Alpaca method against this Cerebras 13B model.
Base model performance is what matters most, and it also carries over into fine-tuning quality. Practically, a model that's good out of the box with minimal fine-tuning is useful to more people. Since they optimised for training compute at a fixed budget, expect their models to lag behind Llama across the board. Their 6.7B version should lag behind GPT-J, assuming 20 tokens per parameter (~134B tokens).
The Pythia models are also worth checking out; they might be better than or matched to Cerebras-GPT at each size (although the authors warn they are not intended for deployment).
Conclusion: the landscape of top open models remains unchanged.
I agree fine-tuning for a task will give better results. Cerebras actually published some research on this front recently: sparse pre-training and dense fine-tuning (https://arxiv.org/abs/2303.10464). You can recover the accuracy of sparse pre-trained models with dense fine-tuning and reduce the FLOPs of the end-to-end pipeline by 2.5x compared to fully dense training.
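Toy arithmetic to show where a saving of that order can come from, assuming the hardware actually skips the pruned weights; the model size, token split, and sparsity level below are my own illustrative numbers, not the paper's:

    # Toy end-to-end FLOPs estimate for sparse pre-training + dense fine-tuning,
    # assuming sparse compute really skips zeroed weights (hardware-dependent).
    # All numbers below are illustrative, not from the paper.
    n_params      = 1.3e9    # model size
    pretrain_toks = 26e9     # pre-training tokens
    finetune_toks = 2.6e9    # dense fine-tuning tokens (a small fraction of pre-training)
    sparsity      = 0.75     # fraction of weights pruned during pre-training

    dense_flops  = 6 * n_params * pretrain_toks
    sparse_flops = 6 * n_params * (1 - sparsity) * pretrain_toks + 6 * n_params * finetune_toks
    print(f"dense end-to-end:  {dense_flops:.2e} FLOPs")
    print(f"sparse + dense FT: {sparse_flops:.2e} FLOPs ({dense_flops / sparse_flops:.1f}x fewer)")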
>I'm interested in seeing the results of an Alpaca method
You're comparing apples to oranges. The "Alpaca method" is a dataset generation method. Nothing about Alpaca's training is novel, interesting, or efficient; Alpaca used the same standard fine-tuning setup everyone else uses, on A100 clusters.
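To illustrate what "dataset generation method" means here: Alpaca's contribution was self-instruct-style data, i.e. prompting a strong existing model to expand a small pool of human-written seed instructions into tens of thousands of instruction/response pairs. A minimal sketch of that loop, with query_teacher_model standing in for whatever model API you would actually call (a hypothetical helper, not Alpaca's code):

    import json, random

    # A few hand-written seed tasks; Alpaca started from a pool of 175 of these.
    seed_instructions = [
        "Explain the difference between a list and a tuple in Python.",
        "Write a short poem about the ocean.",
        "Summarize the causes of the French Revolution in three sentences.",
    ]

    def query_teacher_model(prompt: str) -> str:
        """Hypothetical stand-in for a call to a strong existing LLM."""
        raise NotImplementedError("plug in your model/API of choice here")

    def generate_pairs(n_pairs: int):
        pairs = []
        for _ in range(n_pairs):
            examples = "\n".join(random.sample(seed_instructions, k=2))
            prompt = (
                "Here are some example instructions:\n"
                f"{examples}\n\n"
                "Write one new, different instruction and then answer it.\n"
                "Format: INSTRUCTION: ...\nRESPONSE: ..."
            )
            completion = query_teacher_model(prompt)
            instruction, _, response = completion.partition("RESPONSE:")
            pairs.append({
                "instruction": instruction.replace("INSTRUCTION:", "").strip(),
                "output": response.strip(),
            })
        return pairs

    if __name__ == "__main__":
        with open("synthetic_instructions.json", "w") as f:
            json.dump(generate_pairs(100), f, indent=2)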
If you mean the LoRA/PEFT training that people used to replicate Alpaca, that's also apples to oranges, because LoRA/PEFT is a fine-tuning method, not a pre-training method.
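For contrast, a LoRA-style fine-tune of an already pre-trained checkpoint with the Hugging Face peft library looks roughly like this; the model ID, target module name, and hyperparameters are my assumptions for a GPT-2-style checkpoint, not a tested recipe:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Assumed Hugging Face model ID for the 13B checkpoint; swap in whatever you actually use.
    model_name = "cerebras/Cerebras-GPT-13B"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # Cerebras-GPT uses a GPT-2-style architecture, so the fused attention projection
    # is named "c_attn"; other architectures use different module names.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the base weights
    # ...then train on an instruction dataset (e.g. Alpaca-style pairs) with the usual
    # Trainer loop; the base weights stay frozen and only the adapters update.

Only the small adapter matrices get gradients, which is why this is a fine-tuning trick and says nothing about the cost of pre-training the base model in the first place.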