
Here are the zero-shot accuracy numbers posted in the Hugging Face evaluation for Cerebras-GPT 13B vs. the results reported for LLaMA 13B in its paper:

    Model              BoolQ PIQA SIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
    LLaMA 13B          78.1  80.1 50.4 79.2      73.0       74.8  52.7  56.4
    Cerebras-GPT 13B   -     76.6 -    51.3      64.6       71.4  36.7  28.6


Cerebras-GPT is trained to be "training compute optimal". LLaMA appears to be trained far beyond that point. The tradeoff is that LLaMA is closer to inference-optimal, i.e. you get better performance out of a smaller model at inference time.
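For a rough sense of scale, the common C ≈ 6·N·D approximation (N parameters, D training tokens) puts numbers on that tradeoff. This is a back-of-the-envelope sketch, assuming the ~260B-token figure quoted further down for Cerebras-GPT 13B and the ~1T tokens the LLaMA paper reports for its 13B model; 6·N·D is itself only a rule of thumb:

    # Back-of-the-envelope training compute via the common C ~= 6 * N * D rule.
    # Assumed token counts: ~260B for Cerebras-GPT 13B (quoted later in this
    # thread) and ~1T for LLaMA 13B (from the LLaMA paper).
    def train_flops(params, tokens):
        """Approximate total training FLOPs as 6 * parameters * tokens."""
        return 6 * params * tokens

    cerebras_13b = train_flops(13e9, 260e9)  # ~2.0e22 FLOPs
    llama_13b = train_flops(13e9, 1e12)      # ~7.8e22 FLOPs

    print(f"Cerebras-GPT 13B: {cerebras_13b:.1e} FLOPs")
    print(f"LLaMA 13B:        {llama_13b:.1e} FLOPs")
    print(f"LLaMA 13B used ~{llama_13b / cerebras_13b:.1f}x the training compute")

Both are 13B-parameter models, so serving cost is the same; LLaMA's extra training tokens show up purely as better zero-shot scores.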


I guess it's something. It still goes to show how far open models are behind the proprietary SOTA.


From their Discord:

> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)

> We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license

Which is the point made elsewhere in these comments, e.g. https://news.ycombinator.com/item?id=35344192, and it also usefully shows how open Cerebras are. They're pretty open, but not as open as they would be if their goal were to fill in other companies' moats.
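To make the fixed-compute-budget vs fixed-parameter-budget distinction concrete, here's a small sketch using the 20-tokens-per-param rule from the quote together with the usual C ≈ 6·N·D compute approximation. This is my own illustration, not Cerebras's actual scaling-law fit to the Pile:

    import math

    TOKENS_PER_PARAM = 20       # the rule of thumb from the quote above
    FLOPS_PER_PARAM_TOKEN = 6   # usual C ~= 6 * N * D approximation

    def compute_optimal(budget_flops):
        """Fixed compute budget: solve C = 6 * N * D with D = 20 * N for N, D."""
        n = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
        return n, TOKENS_PER_PARAM * n

    def fixed_param_tokens(params, budget_flops):
        """Fixed parameter budget: spend the whole compute budget on tokens."""
        return budget_flops / (FLOPS_PER_PARAM_TOKEN * params)

    # A 13B model at 20 tokens/param implies ~260B tokens, matching the thread.
    budget = 6 * 13e9 * 260e9
    n, d = compute_optimal(budget)
    print(f"compute-optimal: {n / 1e9:.1f}B params on {d / 1e9:.0f}B tokens")

    # Pin the parameter count (e.g. to fit target hardware) and the same budget
    # goes much further in tokens -- the "best for use" regime LLaMA sits in.
    print(f"fixed 6.7B params: {fixed_param_tokens(6.7e9, budget) / 1e9:.0f}B tokens")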


Indeed, but this is zero-shot performance. Fine-tuning for a task should get you pretty good results. I'm interested in seeing the results of an Alpaca method against this Cerebras 13B model.


Base model performance is what matters most, and it also determines fine-tuning quality. Practically, a model that's good out of the box with minimal fine-tuning is useful to more people. Since they focused on being training compute optimal for a given budget, expect their models to lag behind LLaMA overall. Their 6.7B version should lag behind GPT-J, since 20 tokens per parameter works out to only ~134B training tokens.

The Pythia models are also worth checking out; they might be better than, or on par with, Cerebras-GPT at each size (although the authors warn they are not intended for deployment).

Conclusion: the landscape of top open models remains unchanged.


I agree fine-tuning for a task will give better results. Cerebras actually published some research on this front recently: sparse pre-training followed by dense fine-tuning (https://arxiv.org/abs/2303.10464). You can recover the accuracy of sparse pre-trained models with dense fine-tuning while reducing end-to-end pipeline FLOPs by 2.5x compared to fully dense training.
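For anyone curious what that pipeline looks like mechanically, here is a minimal PyTorch toy sketch of the idea: train with a fixed sparsity mask, then drop the mask and fine-tune densely. This is my own illustration of the general technique, not Cerebras's implementation (the paper works at GPT scale with its own sparsity schedules):

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy model standing in for a transformer block.
    model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # 1) Sparse pre-training: mask 75% of each weight matrix. Masked weights are
    #    zeroed in the forward pass, so most multiply-adds could be skipped on
    #    hardware that exploits sparsity.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.75)

    def pretrain_step(batch, targets):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(batch), targets)
        loss.backward()
        opt.step()
        return loss

    # ... run many sparse pre-training steps here ...

    # 2) Dense fine-tuning: make the pruning permanent, which leaves plain dense
    #    weight tensors (with zeros where the mask was). Subsequent fine-tuning
    #    updates every weight, recovering the accuracy lost to sparsity.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.remove(module, "weight")

    # ... fine-tune on the downstream task with all weights trainable ...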


>I'm interested in seeing the results of an Alpaca method

You're talking apples to oranges. The "Alpaca method" is a dataset generation method. Nothing about Alpaca's training method is novel, interesting, or efficient. Alpaca used the same standard training method everyone else uses, on A100 clusters.

If you mean the LoRA/PEFT training people used to replicate Alpaca, that is also apples to oranges, because LoRA/PEFT is a fine-tuning method, not a pre-training method.


One could take the Alpaca dataset, fine-tune this model using the LoRA/PEFT method, and compare against the Stanford Alpaca fine-tuned LLaMA model.

Presumably…
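As a sketch of what that would involve, using the Hugging Face peft library: attach LoRA adapters to the Cerebras-GPT checkpoint and fine-tune on the released Alpaca instruction data. The Hub id and the "c_attn" target module are assumptions (Cerebras-GPT appears as a GPT-2-style architecture on the Hub), and the training loop and evaluation are elided:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Assumed Hub id for the 13B checkpoint; swap in a smaller size to experiment.
    BASE = "cerebras/Cerebras-GPT-13B"

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE)

    # LoRA adds small low-rank adapters to the attention projections and freezes
    # the base weights, so only a tiny fraction of parameters is trained.
    # "c_attn" is the fused QKV projection in GPT-2-style models; adjust
    # target_modules if the layer names differ.
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()

    # From here, fine-tune on the Stanford Alpaca instruction/response pairs with
    # a standard causal-LM objective (e.g. the HF Trainer), then run the same
    # zero-shot evals as above to compare against the Alpaca-tuned LLaMA 13B.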


Have these models been trained on the same dataset? Otherwise it's an apples-to-oranges comparison.



