The point of those smaller models is the "Cerebras Scaling Law for Compute-Optimal Training", the straight-line plot in the image at the top of their webpage when you click the link.

They want you to find the extrapolation reasonable: the line is so straight (on a log-FLOPs scale) for so long that it is tempting to extrapolate the Pile-loss consequences of continuing compute-optimal training to models larger than their biggest 13B one. The obvious caveat is that the extrapolation can't continue linearly much further, if for no other reason than that the test loss isn't going to go below zero; it will flatten out sooner than that (see the sketch below).

If you trained the smaller models beyond compute-optimality, it would mess up the straight line and make it look like diminishing returns on test loss arrive sooner.
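
A minimal numpy sketch of the kind of extrapolation that straight line invites; the FLOPs and loss values below are made-up placeholders purely for illustration, not Cerebras's actual figures:

    # Fit a line to (log10 FLOPs, Pile test loss) points and extrapolate it.
    # All numbers here are hypothetical placeholders, not real measurements.
    import numpy as np

    log_flops = np.array([19.0, 20.0, 21.0, 22.0])   # log10 of training FLOPs
    test_loss = np.array([2.6, 2.3, 2.0, 1.7])        # loss at each model scale

    slope, intercept = np.polyfit(log_flops, test_loss, 1)
    for lf in (23.0, 26.0, 30.0):
        print(lf, slope * lf + intercept)
    # Pushed far enough, the fitted line predicts a negative loss, which is
    # impossible for a log-perplexity, so the real curve must flatten first.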



> the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero

Isn’t the test loss logarithmic? If so it sure can go below zero.


According to https://pile.eleuther.ai/paper.pdf the test loss on the Pile is the log of the perplexity, and the perplexity is 2^H, where H is an entropy and therefore non-negative. So the perplexity is always at least one, and its log is always at least zero.

So yes, the test loss can be seen as a log, but no, it can't go below zero.

The intuition is that the test loss is the number of bits the model would need, on average, to encode each next token in the test part of the Pile, given the preceding tokens.
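
A minimal Python sketch of that bits-per-token intuition, using hypothetical probabilities a model might assign to the correct next tokens:

    # Test loss as average bits per next token; the probabilities are
    # hypothetical stand-ins for what a model would assign.
    import math

    token_probs = [0.5, 0.25, 0.9, 0.01]
    bits_per_token = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    perplexity = 2 ** bits_per_token

    print(bits_per_token, perplexity)
    # Each probability is at most 1, so every -log2(p) is at least 0: the
    # average (the test loss) can approach zero for a perfect model but
    # never go below it.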


Good point!



