

What model are you running?

I suspect you're running a very large model like DeepSeek from coherent memory.

Keep in mind that this little DGX only has 128GB, which means it's limited to fairly small models such as Qwen3 Coder, where prompt processing is not an issue.

I'm not doubting your experience with the GH200, but it doesn't seem relevant here: the Spark's memory bandwidth becomes the bottleneck well before prompt processing does.
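To make the bandwidth argument concrete, here's a minimal back-of-the-envelope sketch (all figures are assumptions, not measurements): during decode, every generated token requires streaming the active weights through memory once, so bandwidth divided by bytes-per-token gives a hard ceiling on tokens/sec.

```python
# Rough decode-throughput ceiling from memory bandwidth alone.
# All numbers below are illustrative assumptions, not benchmarks.
def max_tokens_per_sec(bandwidth_gbps: float, active_params_b: float,
                       bytes_per_param: float) -> float:
    """Upper bound on decode tokens/s: bandwidth / bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / bytes_per_token

# Assumed figures: ~273 GB/s LPDDR5X bandwidth, 30B active parameters,
# 4-bit quantization (~0.5 bytes/param).
print(round(max_tokens_per_sec(273, 30, 0.5), 1))  # ~18.2 tokens/s ceiling
```

Batching raises aggregate throughput (the same weight stream serves many sequences), which is why single-stream decode is bandwidth-bound while prompt processing is compute-bound.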


I like the cut of your jib and your experience matches mine, but without real numbers this is all just piss in the wind (as far as online discussions go).


You're right; it's unfortunate I didn't keep the benchmarks around. I benchmark a lot of configurations and providers for my site and have a script I typically run that produces graphs for various batch sizes (https://ibb.co/0RZ78hMc).

The performance with offloading was just so bad I didn't even bother proceeding to the benchmark (without offloading you get typical H100 speeds).
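For anyone wanting real numbers, a batch-size sweep like the one described above can be sketched in a few lines (this is a generic harness, not the commenter's actual script; the `generate` callable and batch sizes are placeholders for whatever backend you run, e.g. a llama.cpp or vLLM call):

```python
import time

def benchmark_batches(generate, batch_sizes, n_tokens=128):
    """Measure aggregate tokens/sec at each batch size.

    `generate(batch_size, n_tokens)` is any callable that produces
    n_tokens per sequence for the given batch (hypothetical interface).
    """
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        generate(bs, n_tokens)
        elapsed = time.perf_counter() - start
        results[bs] = bs * n_tokens / elapsed  # tokens/sec across the batch
    return results

# Usage with a stand-in generator; swap in a real inference call.
fake_generate = lambda bs, n: time.sleep(0.01)
stats = benchmark_batches(fake_generate, [1, 4, 16])
```

Plotting `stats` per batch size makes the bandwidth-bound vs. compute-bound regimes visible at a glance.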



