Interesting, so with enough memory bandwidth, even a server CPU has enough compute to do inference on a rather large model? Enough to compete with an M4 GPU?
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
For a single user submitting one or a few prompts at a time, compute is not the bottleneck; memory bandwidth is. This is because the entire model's weights must be read from memory for every generated token. This is also why multiplexing many prompts at the same time is relatively easy and effective: many matrix multiplications can happen in the time it takes to do a single fetch from memory.
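To make that concrete, here's a minimal back-of-the-envelope sketch (Python) of the bandwidth-bound upper limit on decode speed. The model sizes and bandwidth figures are illustrative assumptions, not measurements:

    # Rough roofline estimate: during single-stream decoding, every generated
    # token requires streaming all active weights from memory, so tokens/s is
    # bounded above by bandwidth / weight bytes (compute and KV cache ignored).
    def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                           mem_bandwidth_gb_s: float) -> float:
        weight_gb = params_billions * bytes_per_param  # GB read per token
        return mem_bandwidth_gb_s / weight_gb

    # Assumed numbers: a 70B dense model at ~4-bit quantization (0.5 bytes/param)
    # on a ~100 GB/s dual-channel DDR5 desktop vs. a ~500 GB/s Apple-Silicon-class part.
    print(max_tokens_per_sec(70, 0.5, 100))   # ~2.9 tok/s upper bound
    print(max_tokens_per_sec(70, 0.5, 500))   # ~14 tok/s upper bound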
> This is because the entire model's weights must be read from memory for every generated token.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at roughly the speed of a 3B-parameter model, since only about 3B parameters are active per token. It's completely realistic to run on a plain CPU with 20 GB of RAM for the model.
Yes, but a 400B parameter model at fp16 is 800 GB, right? So with 800 GB/s of memory bandwidth, you'd still only be able to stream the weights through once per second, i.e. about 1 token/s.
Edit: I actually forgot the MoE part, so that makes sense.
Approximately, yes. For MoE models, less bandwidth is required, since you're generally only processing the weights of one or a few experts per token. Which experts are used can change from token to token, though, so it's best if they all fit in RAM. The sort of machines hyperscalers use to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory this way allows near-linear increases in memory bandwidth and inference performance. It's effectively the same thing you're doing when adding GPUs.
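Extending the same rough estimate to MoE and to weights sharded across devices (again, all numbers are assumptions; interconnect overhead is ignored):

    # Per-token memory traffic only covers the *active* experts; sharding the
    # weights across devices lets each device stream its slice in parallel, so
    # aggregate bandwidth scales roughly linearly in the ideal case.
    def moe_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                           per_device_bw_gb_s: float, num_devices: int = 1) -> float:
        active_gb = active_params_b * bytes_per_param  # GB streamed per token
        return (per_device_bw_gb_s * num_devices) / active_gb

    # ~3B active params at fp16 on a single ~100 GB/s CPU (qwen3:30b-a3b style)
    print(moe_tokens_per_sec(3, 2, 100))        # ~17 tok/s upper bound
    # A dense 400B model at fp16 split across 8 devices of ~800 GB/s each
    print(moe_tokens_per_sec(400, 2, 800, 8))   # ~8 tok/s upper bound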
Out of curiosity I've repeatedly compared the tokens/sec of various open weight models and consistently come up with this: tokens/sec/USD is near constant.
If a $4,000 Mac does something at X tok/s, a $400 AMD PC on pure CPU does it at 0.1*X tok/s.
Assuming good choices for how that money is spent. You can always waste more money. As others have said, it's all about memory bandwidth. AMD's "AI Max+ 395" is gonna make this interesting.
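Spelling out the arithmetic behind that observation (X is an arbitrary placeholder, not a measured number):

    X = 20.0                # assumed tok/s on the $4,000 Mac
    print(X / 4000)         # 0.005 tok/s per USD
    print((0.1 * X) / 400)  # 0.005 tok/s per USD -- same cost efficiency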
And of course you can always just not have enough RAM to run the model at all. This tends to happen with consumer discrete GPUs, which don't have much VRAM since they were built for gaming.
https://learn.microsoft.com/en-us/azure/virtual-machines/siz...