Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (in some loosely defined sense) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.
Batch size is dynamic; in MoE the experts chosen apparently depend on the whole batch (not only your single inference request, which sounds weird to me, but I'm just an end user); no one has audited the inference pipeline for floating-point nondeterminism; and I'm not even sure that temperature 0 implies deterministic sampling. The usual formula divides the logits by the temperature before the softmax, i.e. p_i ∝ e^(z_i/T), so T = 0 means division by zero and isn't a valid value anyway; it has to be special-cased somehow.
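For reference, the usual convention for that special case is to treat temperature 0 as greedy argmax over the logits. A minimal sketch (the function name and shape are my own, not any particular library's API):

```python
import numpy as np

def sample(logits, temperature, rng=np.random.default_rng(0)):
    """Temperature-scaled sampling from raw logits.

    The softmax divides logits by T, so T = 0 would divide by zero;
    implementations typically special-case it as greedy argmax.
    """
    if temperature == 0:
        return int(np.argmax(logits))  # greedy decoding
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                # subtract max for numerical stability
    e = np.exp(z)
    p = e / e.sum()             # softmax probabilities
    return int(rng.choice(len(p), p=p))
```

Note that even as T approaches 0 the distribution only concentrates on the argmax; the hard switch to deterministic greedy decoding is a convention layered on top of the formula, not something the formula itself gives you.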
I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.
That being said, we do keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; I'd be crazy to assume that new models are a strict improvement across all tasks and domains.
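The regression-suite idea is roughly: run the same frozen inputs against the candidate model and refuse the upgrade if any previously passing task breaks. A toy sketch (the task, the checker, and the stand-in "models" are all made up for illustration; a real harness would call the inference API):

```python
def failing_tasks(run, tasks):
    """Return the names of tasks the model fails, so regressions are visible.

    `run` is a callable prompt -> completion; `tasks` maps a task name
    to (prompt, expected substring in the completion).
    """
    return [name for name, (prompt, expected) in tasks.items()
            if expected not in run(prompt)]

# toy stand-ins for real inference calls
tasks = {"arithmetic": ("What is 2+2?", "4")}
old_model = lambda prompt: "The answer is 4."
new_model = lambda prompt: "I cannot help with that."
```

Here `failing_tasks(old_model, tasks)` is empty while `failing_tasks(new_model, tasks)` reports the regression, which is the signal for skipping that version.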
I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, only with local models.