The output of an autoregressive model is a probability for each token in the vocabulary to appear next after the input sequence. Computing these probabilities is strictly deterministic given the prior context and the model's weights.
Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is to pick the token with the highest probability. To allow for some creativity, a random number generator is used instead to choose among the possible tokens, weighted by their probabilities of course.
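As a rough sketch of the difference (the tokens and probabilities below are invented, not from any real model):

```python
import numpy as np

# Hypothetical next-token distribution for some prefix (values made up).
tokens = ["cat", "dog", "banana"]
probs = np.array([0.55, 0.30, 0.15])

# Greedy decoding: always take the single most probable token.
greedy_choice = tokens[int(np.argmax(probs))]   # -> "cat", every time

# Sampling: draw a token at random, weighted by the probabilities.
rng = np.random.default_rng()
sampled_choice = rng.choice(tokens, p=probs)    # usually "cat", sometimes not

print(greedy_choice, sampled_choice)
```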
Temperature scales the output probabilities. As temperature increases, the probabilities approach 1/vocabulary size and the output becomes uniformly random. For very small temperature values, text generation approaches greedy decoding.
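One common way to implement this is to divide the raw scores (logits) by the temperature before the softmax; a toy sketch with invented numbers:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0])   # made-up scores for three tokens

print(softmax_with_temperature(logits, 1.0))    # ordinary softmax
print(softmax_with_temperature(logits, 100.0))  # near-uniform: ~1/vocab size each
print(softmax_with_temperature(logits, 0.01))   # ~[1, 0, 0]: effectively greedy
```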
If all you want is a spam filter, you're better off replacing the output layer of an LLM with one that has just two outputs, and finetuning that on a public collection of spam mails plus some "ham" from your inbox.
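A hedged sketch of that approach, assuming the Hugging Face `transformers` library; the checkpoint name and example text are placeholders, not a recommendation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder base checkpoint; any model with a sequence-classification
# head variant would do.
base = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=2 swaps the usual LM head for a 2-way classification layer
# (spam vs. ham), randomly initialized and meant to be finetuned.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

inputs = tokenizer("Congratulations, you won a prize!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2): two class scores
# Untrained head: finetune on labeled spam/ham before trusting this output.
print(logits.softmax(dim=-1))
```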
My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.
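A toy illustration of that point; the two probabilities are invented, not taken from any real model:

```python
import numpy as np

# Suppose the model, constrained to answer with a single "yes"/"no" token,
# assigns these probabilities after reading the email.
p_yes, p_no = 0.8, 0.2

# Temperature 0 (greedy): the answer is deterministic.
answer_t0 = "yes" if p_yes > p_no else "no"

# Non-zero temperature: sample instead -- here the boolean flips about
# 20% of the time, which is pure noise for a classifier.
rng = np.random.default_rng()
answer_sampled = rng.choice(["yes", "no"], p=[p_yes, p_no])

print(answer_t0, answer_sampled)
```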
It's more nuanced than that in practice: this holds for the shims you see from API providers (e.g. OpenAI, Anthropic, Mistral).
With llama.cpp, it's actually not a great idea to set temperature to exactly 0: in practice, especially with smaller models, this leads to outright repetition or nonsense.
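For what it's worth, a hedged sketch of such a setup via the llama-cpp-python bindings; the model path, prompt, and sampling values are placeholders, not recommendations:

```python
from llama_cpp import Llama

# Placeholder path to a local GGUF model.
llm = Llama(model_path="./model.gguf")

# A small but non-zero temperature (plus a mild repeat penalty) tends to
# avoid the degenerate loops that pure greedy decoding can fall into with
# small models, while staying close to deterministic.
out = llm(
    "Summarize the following email in one sentence: ...",
    max_tokens=64,
    temperature=0.2,
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```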
I can't remember where I picked this up, but a few years back, without _some_ randomness, the most likely next token was always just the last token again.
I thought setting temperature to 0 would (extremely simple example) equate to a spam filter that sees:
- this is a spam email
and flags it, but if the sender adapts and says
- th1s is a spam email
it wouldn't be flagged as spam.