Any idea if there's a way to run this on 256GB RAM + 16GB VRAM with usable performance, even if barely?


Yes! A 3-bit (and maybe even 4-bit) quant can also fit! llama.cpp has MoE offloading, so your GPU holds the active experts and non-MoE layers; that way you only need 16GB to 24GB of VRAM. I wrote about how to do it in this section: https://docs.unsloth.ai/basics/qwen3-coder#improving-generat...
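
For concreteness, here's a minimal sketch of that setup using llama.cpp's tensor-override flag; the GGUF path and context size below are illustrative, so substitute your own:

    # Offload all layers to the GPU first, then override so the MoE
    # expert tensors stay in system RAM (non-MoE layers remain on GPU).
    ./llama-cli -m qwen3-coder-q3_k_xl.gguf \
        -ngl 99 \
        -ot ".ffn_.*_exps.=CPU" \
        -c 16384

The -ot regex matches the per-expert feed-forward tensors, which make up the bulk of the model's weight but are only sparsely activated, so keeping them in RAM costs far less throughput than offloading whole layers.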


Awesome documentation, I'll try this. Thank you!



