The point of those smaller models is the "Cerebras Scaling Law for Compute-Optimal Training", the straight-line plot in the image at the top of their webpage when you click the link.

They want you to find the extrapolation reasonable: the line is so straight (on a log-FLOPs scale) for so long that it is tempting to extrapolate the Pile-loss consequences of continuing compute-optimal training to models larger than their biggest 13B one. The obvious caveat is that the extrapolation can't continue linearly much further, if for no other reason than that the test loss isn't going to go below zero; it will flatten out sooner than that (see the sketch below).

If you trained the smaller models beyond compute-optimality, it would mess up the straight line and make it look like diminishing returns on test loss arrive sooner.
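
A minimal numpy sketch of the kind of extrapolation that straight line invites; the FLOPs and loss values below are made-up placeholders purely for illustration, not Cerebras's actual figures:

    # Fit a line to (log10 FLOPs, Pile test loss) points and extrapolate it.
    # All numbers here are hypothetical placeholders, not real measurements.
    import numpy as np

    log_flops = np.array([19.0, 20.0, 21.0, 22.0])   # log10 of training FLOPs
    test_loss = np.array([2.6, 2.3, 2.0, 1.7])        # loss at each model scale

    slope, intercept = np.polyfit(log_flops, test_loss, 1)
    for lf in (23.0, 26.0, 30.0):
        print(lf, slope * lf + intercept)
    # Pushed far enough, the fitted line predicts a negative loss, which is
    # impossible for a log-perplexity, so the real curve must flatten first.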



> the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero

Isn’t the test loss logarithmic? If so it sure can go below zero.


According to https://pile.eleuther.ai/paper.pdf the test loss on the Pile is the log of the perplexity, and the perplexity is 2^H, where H is an entropy and therefore non-negative. So the perplexity is always at least one, and its log is always at least zero.

So yes, the test loss can be seen as a log, but no, it can't go below zero.

The intuition is that the test loss is the number of bits the model would need, on average, to encode each next token in the test part of the Pile, given the preceding tokens.
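
A minimal Python sketch of that bits-per-token intuition, using hypothetical probabilities a model might assign to the correct next tokens:

    # Test loss as average bits per next token; the probabilities are
    # hypothetical stand-ins for what a model would assign.
    import math

    token_probs = [0.5, 0.25, 0.9, 0.01]
    bits_per_token = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    perplexity = 2 ** bits_per_token

    print(bits_per_token, perplexity)
    # Each probability is at most 1, so every -log2(p) is at least 0: the
    # average (the test loss) can approach zero for a perfect model but
    # never go below it.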


Good point!



