Hacker News

To run the real version with the benchmarks they give, it would need to be the non-quantized, non-distilled version. So I am guessing that means a cluster of 8 H200s if you want to be more or less up to date. They have B200s now, which are much faster but also much more expensive: $300,000+.

You will see people making quantized, distilled versions, but they never give benchmark results.



Oh, you can run Q8_0 / Q8_K_XL, which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + disk space. Via MoE layer offloading, it should function OK.
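A back-of-envelope sketch of where a number like that comes from. In the GGUF Q8_0 format, weights are stored in blocks of 32: one fp16 scale (2 bytes) plus 32 int8 values, i.e. 34 bytes per 32 weights, about 1.06 bytes per weight. The parameter count below is a hypothetical example, not a claim about any specific model:

```python
def q8_0_size_gb(n_params: float) -> float:
    """Approximate on-disk / in-memory size of a GGUF Q8_0 model.

    Q8_0 block layout: 2-byte fp16 scale + 32 int8 weights
    => 34 bytes per 32 weights ~= 1.0625 bytes per weight.
    (Ignores non-quantized tensors like embeddings and norms.)
    """
    bytes_per_weight = 34 / 32
    return n_params * bytes_per_weight / 1e9

# Hypothetical 671e9-parameter MoE model:
print(round(q8_0_size_gb(671e9)))  # -> 713
```

With MoE layer offloading, that total can be split across VRAM, system RAM, and disk rather than fitting entirely in GPU memory, which is why the headline VRAM figure can be well below the full model size.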


This should work well with MLX Distributed. The low-activation MoE is great for multi-node inference.
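The "low activation" point can be made concrete: with top-k routing, each token only runs k of the E experts, so the active parameter count (and hence per-node compute and inter-node traffic) is a small fraction of the total. A minimal sketch, with all numbers hypothetical:

```python
def active_params(expert_params: float, n_experts: int,
                  top_k: int, shared_params: float) -> float:
    """Parameters actually used per token in a top-k routed MoE.

    Only top_k of n_experts fire per token; attention/embedding
    ("shared") parameters are always active.
    """
    return shared_params + expert_params * top_k / n_experts

# Hypothetical model: 600B expert params, 256 experts, top-8 routing,
# 20B shared params -> ~38.75B active per token.
print(active_params(600e9, 256, 8, 20e9) / 1e9)  # -> 38.75
```

The lower the active fraction, the less work each node does per token, which is what makes sharding an MoE across several machines tolerable despite interconnect overhead.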


1. What hardware would that take? 2. Can you run a benchmark?





