Hacker News

Production inference is not deterministic because of sharding (weights split across several GPUs, so the reduction order of partial results can vary), timing-based kernel selection (e.g. torch.backends.cudnn.benchmark picks kernels by benchmarking them, which can differ run to run), and batch-dependent expert routing in MoE models. If you need reproducible outputs, it's probably best to host a small model yourself.
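The sharding point comes down to floating-point addition not being associative: when partial sums are reduced across GPUs in a different order, the result can change. A minimal pure-Python sketch (the values are contrived to make the effect visible at double precision; real inference shows much smaller per-op drift that compounds across layers):

```python
# Floating-point addition is not associative, so the order in which
# sharded partial sums are reduced changes the result.
vals = [0.1, 1e16, -1e16, 0.1]

# One reduction order: left to right.
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]
# Another order, as if shards were combined differently.
reordered = (vals[1] + vals[2]) + (vals[0] + vals[3])

print(left_to_right)  # 0.1 -- the first 0.1 is absorbed into 1e16
print(reordered)      # 0.2
```

The same mechanism explains why batching matters in MoE models: which requests share a batch changes how work is grouped and reduced, so the same prompt can yield slightly different logits.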




