To run the real version with the bench arks they give, it would be a nonquantized non distilled version. So I am guessing that is a cluster of 8 H200s if you want to be more or less up to date. They have B200s now which are much faster but also much more expensive. $300,000+
You will see people making quantized distilled versions but they never give benchmark results.
Oh you can run the Q8_0 / Q8_K_XL which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + Disk space. Via MoE layer offloading, it should function ok
You will see people making quantized distilled versions but they never give benchmark results.