The output of an autoregressive model is a probability for each token in the vocabulary to appear next after the input sequence. Computing these probabilities is strictly deterministic given the prior context and the model's weights.
Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is to pick the token with the highest probability. To allow for some creativity, a random number generator is used instead to choose among the possible tokens, weighted by their probabilities of course.
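As a rough sketch of the difference (the tokens and probabilities below are invented, not from any real model):

```python
import numpy as np

# Hypothetical next-token distribution for some prefix (values made up).
tokens = ["cat", "dog", "banana"]
probs = np.array([0.55, 0.30, 0.15])

# Greedy decoding: always take the single most probable token.
greedy_choice = tokens[int(np.argmax(probs))]   # -> "cat", every time

# Sampling: draw a token at random, weighted by the probabilities.
rng = np.random.default_rng()
sampled_choice = rng.choice(tokens, p=probs)    # usually "cat", sometimes not

print(greedy_choice, sampled_choice)
```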
Temperature scales the output probabilities. As temperature increases, the probabilities approach 1/vocabulary size and the output becomes uniformly random. For very small temperature values, text generation approaches greedy decoding.
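One common way to implement this is to divide the raw scores (logits) by the temperature before the softmax; a toy sketch with invented numbers:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0])   # made-up scores for three tokens

print(softmax_with_temperature(logits, 1.0))    # ordinary softmax
print(softmax_with_temperature(logits, 100.0))  # near-uniform: ~1/vocab size each
print(softmax_with_temperature(logits, 0.01))   # ~[1, 0, 0]: effectively greedy
```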
If all you want is a spam filter, you're better off replacing the output layer of an LLM with one that has just two outputs, and finetuning that on a public collection of spam mails plus some "ham" from your inbox.
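A hedged sketch of that approach, assuming the Hugging Face `transformers` library; the checkpoint name and example text are placeholders, not a recommendation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder base checkpoint; any model with a sequence-classification
# head variant would do.
base = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=2 swaps the usual LM head for a 2-way classification layer
# (spam vs. ham), randomly initialized and meant to be finetuned.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

inputs = tokenizer("Congratulations, you won a prize!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2): two class scores
# Untrained head: finetune on labeled spam/ham before trusting this output.
print(logits.softmax(dim=-1))
```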
My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.
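A toy illustration of that point; the two probabilities are invented, not taken from any real model:

```python
import numpy as np

# Suppose the model, constrained to answer with a single "yes"/"no" token,
# assigns these probabilities after reading the email.
p_yes, p_no = 0.8, 0.2

# Temperature 0 (greedy): the answer is deterministic.
answer_t0 = "yes" if p_yes > p_no else "no"

# Non-zero temperature: sample instead -- here the boolean flips about
# 20% of the time, which is pure noise for a classifier.
rng = np.random.default_rng()
answer_sampled = rng.choice(["yes", "no"], p=[p_yes, p_no])

print(answer_t0, answer_sampled)
```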
It's more nuanced than that in practice: this holds for the shims you see from API providers (e.g. OpenAI, Anthropic, Mistral).
With llama.cpp, it's actually not a great idea to set temperature to exactly 0: in practice, especially with smaller models, this leads to outright repetition or nonsense.
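For what it's worth, a hedged sketch of such a setup via the llama-cpp-python bindings; the model path, prompt, and sampling values are placeholders, not recommendations:

```python
from llama_cpp import Llama

# Placeholder path to a local GGUF model.
llm = Llama(model_path="./model.gguf")

# A small but non-zero temperature (plus a mild repeat penalty) tends to
# avoid the degenerate loops that pure greedy decoding can fall into with
# small models, while staying close to deterministic.
out = llm(
    "Summarize the following email in one sentence: ...",
    max_tokens=64,
    temperature=0.2,
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```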
I can't remember where I picked this up, but a few years back, without _some_ randomness, the most likely next token was always just the last token again.
I thought setting temperature to 0 would (extremely simple example) equate to a spam filter that sees:
- this is a spam email
and flags it, but if the sender adapts and says
- th1s is a spam email
it wouldn't be flagged as spam.