There’s discussion elsewhere in this thread about what Chinchilla actually means. I’ll only compare it to LLaMA.
Tl;dr: Chinchilla isn’t wrong, it’s just aimed at a different goal than the LLaMA paper.
There are three hyperparameters to tweak here: model size (parameter count), number of tokens pretrained on, and amount of compute available. End performance is, in theory, a function of these three.
You can think of this as an optimization problem.
Chinchilla says: if you have a fixed amount of compute, here’s the model size and number of tokens to train on for maximum performance.
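As a rough illustration (my own back-of-the-envelope sketch, not code from the paper): using the common approximation that training costs about 6 · N · D FLOPs, plus Chinchilla’s empirical finding of roughly ~20 tokens per parameter at the optimum, you can sketch the compute-optimal split like this:

```python
# Back-of-the-envelope sketch (my numbers, not the paper's code):
# uses the common approximation C ~ 6 * N * D training FLOPs, plus
# Chinchilla's empirical result of roughly ~20 tokens per parameter
# at the compute-optimal point.
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C ~ 6 * N * D with D ~ tokens_per_param * N
    # => C ~ 6 * tokens_per_param * N^2
    # => N ~ sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.8e23)  # roughly Chinchilla-70B's budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```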
A lot of the time, though, we have a fixed model size, because size impacts inference cost and latency. LLaMA operates in this territory: they chose to fix the model size instead of the amount of compute, and then trained on far more tokens than Chinchilla would call optimal for that size.
This could explain the performance gap between Cerebras models of size X and LLaMA models of size X: the LLaMA models of size X have way more compute (i.e., many more training tokens) behind them.
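For contrast, here’s the same back-of-the-envelope math in the fixed-model-size regime (the compute budget below is illustrative, not LLaMA’s actual one):

```python
# Same C ~ 6 * N * D approximation, but holding the model size fixed
# (the LLaMA-style regime) and asking how many tokens a given compute
# budget buys. The budget here is illustrative, not LLaMA's actual one.
def tokens_for_fixed_size(compute_flops, n_params):
    return compute_flops / (6 * n_params)

n_params = 7e9                                 # a 7B-parameter model
d = tokens_for_fixed_size(4.2e22, n_params)    # hypothetical budget
print(f"tokens ~ {d:.2e}, tokens/param ~ {d / n_params:.0f}")
# ~1e12 tokens, ~143 tokens/param: far past the ~20 tokens/param
# Chinchilla optimum, i.e. "overtrained" for its size, but cheaper to
# serve than the bigger model the same compute would have bought.
```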