Hacker News

To run the real version with the benchmarks they give, it would need to be the non-quantized, non-distilled version. So I am guessing that means a cluster of 8 H200s if you want to be more or less up to date. They have B200s now, which are much faster but also much more expensive: $300,000+.

You will see people making quantized, distilled versions, but they never give benchmark results.



Oh, you can run Q8_0 / Q8_K_XL, which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + disk space. Via MoE layer offloading, it should function OK.
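A back-of-envelope sketch of where a number like that comes from. In the GGUF Q8_0 format, weights are stored in blocks of 32: one fp16 scale (2 bytes) plus 32 int8 values, i.e. 34 bytes per 32 weights, about 1.06 bytes per weight. The parameter count below is a hypothetical example, not a claim about any specific model:

```python
def q8_0_size_gb(n_params: float) -> float:
    """Approximate on-disk / in-memory size of a GGUF Q8_0 model.

    Q8_0 block layout: 2-byte fp16 scale + 32 int8 weights
    => 34 bytes per 32 weights ~= 1.0625 bytes per weight.
    (Ignores non-quantized tensors like embeddings and norms.)
    """
    bytes_per_weight = 34 / 32
    return n_params * bytes_per_weight / 1e9

# Hypothetical 671e9-parameter MoE model:
print(round(q8_0_size_gb(671e9)))  # -> 713
```

With MoE layer offloading, that total can be split across VRAM, system RAM, and disk rather than fitting entirely in GPU memory, which is why the headline VRAM figure can be well below the full model size.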


This should work well with MLX Distributed. The low-activation MoE is great for multi-node inference.
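The "low activation" point can be made concrete: with top-k routing, each token only runs k of the E experts, so the active parameter count (and hence per-node compute and inter-node traffic) is a small fraction of the total. A minimal sketch, with all numbers hypothetical:

```python
def active_params(expert_params: float, n_experts: int,
                  top_k: int, shared_params: float) -> float:
    """Parameters actually used per token in a top-k routed MoE.

    Only top_k of n_experts fire per token; attention/embedding
    ("shared") parameters are always active.
    """
    return shared_params + expert_params * top_k / n_experts

# Hypothetical model: 600B expert params, 256 experts, top-8 routing,
# 20B shared params -> ~38.75B active per token.
print(active_params(600e9, 256, 8, 20e9) / 1e9)  # -> 38.75
```

The lower the active fraction, the less work each node does per token, which is what makes sharding an MoE across several machines tolerable despite interconnect overhead.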


1. What hardware would that take? 2. Can you run a benchmark?





