Production inference is not deterministic because of sharding (i.e. parameter we...

Production inference is not deterministic because of sharding (i.e. parameter weights on several GPUs on the same machine or MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.