Interesting, so with enough memory bandwidth, even a server CPU has enough compute to do inference on a rather large model? Enough to compete with an M4 GPU?
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
For a single user submitting one or a few prompts at a time, compute is not the bottleneck; memory bandwidth is. This is because the entire model's weights must be read from memory for every generated token. This is also why multiplexing many prompts at the same time is relatively easy and effective: many matrix multiplications can happen in the time it takes to do a single fetch from memory.
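To make that concrete, here's a minimal back-of-the-envelope sketch (Python) of the bandwidth-bound upper limit on decode speed. The model sizes and bandwidth figures are illustrative assumptions, not measurements:

    # Rough roofline estimate: during single-stream decoding, every generated
    # token requires streaming all active weights from memory, so tokens/s is
    # bounded above by bandwidth / weight bytes (compute and KV cache ignored).
    def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                           mem_bandwidth_gb_s: float) -> float:
        weight_gb = params_billions * bytes_per_param  # GB read per token
        return mem_bandwidth_gb_s / weight_gb

    # Assumed numbers: a 70B dense model at ~4-bit quantization (0.5 bytes/param)
    # on a ~100 GB/s dual-channel DDR5 desktop vs. a ~500 GB/s Apple-Silicon-class part.
    print(max_tokens_per_sec(70, 0.5, 100))   # ~2.9 tok/s upper bound
    print(max_tokens_per_sec(70, 0.5, 500))   # ~14 tok/s upper bound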
> This is because the entire model's weights must be read from memory for every generated token.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at roughly the speed of a 3B-parameter model, since only about 3B parameters are active per token. It's completely realistic to run on a plain CPU with 20 GB of RAM for the model.
Yes, but a 400B parameter model at fp16 is 800 GB, right? So with 800 GB/s of memory bandwidth, you'd still only be able to stream the weights through once per second, i.e. about 1 token/s.
Edit: I actually forgot the MoE part, so that makes sense.
Approximately, yes. For MoE models, less bandwidth is required, since you're generally only processing the weights of one or a few experts per token. Which experts are used can change from token to token, though, so it's best if they all fit in RAM. The sort of machines hyperscalers use to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory this way allows near-linear increases in memory bandwidth and inference performance. It's effectively the same thing you're doing when adding GPUs.
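Extending the same rough estimate to MoE and to weights sharded across devices (again, all numbers are assumptions; interconnect overhead is ignored):

    # Per-token memory traffic only covers the *active* experts; sharding the
    # weights across devices lets each device stream its slice in parallel, so
    # aggregate bandwidth scales roughly linearly in the ideal case.
    def moe_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                           per_device_bw_gb_s: float, num_devices: int = 1) -> float:
        active_gb = active_params_b * bytes_per_param  # GB streamed per token
        return (per_device_bw_gb_s * num_devices) / active_gb

    # ~3B active params at fp16 on a single ~100 GB/s CPU (qwen3:30b-a3b style)
    print(moe_tokens_per_sec(3, 2, 100))        # ~17 tok/s upper bound
    # A dense 400B model at fp16 split across 8 devices of ~800 GB/s each
    print(moe_tokens_per_sec(400, 2, 800, 8))   # ~8 tok/s upper bound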
Out of curiosity I've repeatedly compared the tokens/sec of various open weight models and consistently come up with this: tokens/sec/USD is near constant.
If a $4,000 Mac does something at X tok/s, a $400 AMD PC on pure CPU does it at 0.1*X tok/s.
Assuming good choices for how that money is spent. You can always waste more money. As others have said, it's all about memory bandwidth. AMD's "AI Max+ 395" is gonna make this interesting.
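Spelling out the arithmetic behind that observation (X is an arbitrary placeholder, not a measured number):

    X = 20.0                # assumed tok/s on the $4,000 Mac
    print(X / 4000)         # 0.005 tok/s per USD
    print((0.1 * X) / 400)  # 0.005 tok/s per USD -- same cost efficiency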
And of course you can always just not have enough RAM to run the model at all. This tends to happen with consumer discrete GPUs, which don't have much VRAM since they were built for gaming.
https://learn.microsoft.com/en-us/azure/virtual-machines/siz...