Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (in some loosely defined sense) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.
Batch size is dynamic; in MoE the experts chosen apparently depend on the whole batch (not only your single inference request, which sounds weird to me, but I'm just an end user); no one has audited the inference pipeline for floating-point nondeterminism; and I'm not even sure that temperature 0 implies deterministic sampling. The usual formula divides the logits by the temperature before the softmax, i.e. p_i ∝ e^(z_i/T), so T = 0 means division by zero and isn't a valid value anyway; it has to be special-cased somehow.
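For reference, the usual convention for that special case is to treat temperature 0 as greedy argmax over the logits. A minimal sketch (the function name and shape are my own, not any particular library's API):

```python
import numpy as np

def sample(logits, temperature, rng=np.random.default_rng(0)):
    """Temperature-scaled sampling from raw logits.

    The softmax divides logits by T, so T = 0 would divide by zero;
    implementations typically special-case it as greedy argmax.
    """
    if temperature == 0:
        return int(np.argmax(logits))  # greedy decoding
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                # subtract max for numerical stability
    e = np.exp(z)
    p = e / e.sum()             # softmax probabilities
    return int(rng.choice(len(p), p=p))
```

Note that even as T approaches 0 the distribution only concentrates on the argmax; the hard switch to deterministic greedy decoding is a convention layered on top of the formula, not something the formula itself gives you.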
I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.
That being said, we do keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; I'd be crazy to assume that new models are a strict improvement across all tasks and domains.
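The regression-suite idea is roughly: run the same frozen inputs against the candidate model and refuse the upgrade if any previously passing task breaks. A toy sketch (the task, the checker, and the stand-in "models" are all made up for illustration; a real harness would call the inference API):

```python
def failing_tasks(run, tasks):
    """Return the names of tasks the model fails, so regressions are visible.

    `run` is a callable prompt -> completion; `tasks` maps a task name
    to (prompt, expected substring in the completion).
    """
    return [name for name, (prompt, expected) in tasks.items()
            if expected not in run(prompt)]

# toy stand-ins for real inference calls
tasks = {"arithmetic": ("What is 2+2?", "4")}
old_model = lambda prompt: "The answer is 4."
new_model = lambda prompt: "I cannot help with that."
```

Here `failing_tasks(old_model, tasks)` is empty while `failing_tasks(new_model, tasks)` reports the regression, which is the signal for skipping that version.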
I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, only with local models.