
I would expect/hope that DGX would be able to make better use of its bandwidth than the M4 Max. Will need to wait and see benchmarks.


Matrix-vector multiplication for the feed-forward layers accounts for most of the bandwidth, as I understand it. There's not really a way to do it "better"; it's just a bunch of memory-bound dot products.
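To make the memory-bound claim concrete, here's a back-of-the-envelope sketch: during single-token decode, every weight has to be streamed from memory once per token, so bandwidth divided by model size gives an upper bound on throughput. The function name and the example numbers (model size, quantization, bandwidth figure) are illustrative assumptions, not benchmarks.

```python
# Rough upper bound on memory-bound decode speed: every weight is
# read once per generated token, so tokens/s <= bandwidth / model bytes.

def decode_tokens_per_sec(bandwidth_gb_s: float,
                          params_billions: float,
                          bytes_per_param: float) -> float:
    """Bandwidth-limited ceiling on tokens/sec for one forward pass per token."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# e.g. a 70B model at 4-bit (~0.5 bytes/param) on ~546 GB/s
print(round(decode_tokens_per_sec(546, 70, 0.5), 1))  # 15.6 tokens/s at best
```

Real throughput lands below this ceiling, which is exactly where the "can the cores actually consume the bandwidth" question comes in.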

(Posting this comment in hopes of being corrected and learning something).


The problem is that different parts of the SoC (CPU, GPU, NPU) may not actually be able to consume all of the bandwidth available to the system as a whole. This is why you'd need to benchmark: different chips may be able to feed their cores better than others.
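A crude way to see this gap yourself is a STREAM-style copy benchmark: measure what one block of the chip can sustain and compare it to the spec sheet. This is a minimal single-threaded CPU sketch (function name and buffer size are my own choices); a serious benchmark would use multiple threads and also run GPU/NPU kernels, since each block has its own path to memory.

```python
# Minimal STREAM-style copy benchmark: sustained bandwidth of one CPU thread.
import time
import numpy as np

def measured_copy_bandwidth_gb_s(n_bytes: int = 1 << 28, reps: int = 5) -> float:
    src = np.ones(n_bytes, dtype=np.uint8)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        np.copyto(dst, src)
        best = min(best, time.perf_counter() - t0)
    # A copy reads and writes n_bytes, so total traffic is 2 * n_bytes.
    return 2 * n_bytes / best / 1e9

print(f"~{measured_copy_bandwidth_gb_s():.0f} GB/s sustained by one CPU thread")
```

The measured number is typically well under the advertised aggregate bandwidth, which is the point.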


Ah, yeah. I guess that will become more common as we venture further into SoCs; I was just assuming "it's whatever the memory controller can do".


Training is performed in parallel with batching and is more FLOPs-heavy. I don't have an intuition for how memory-bandwidth-intensive updating the parameters is, but it shouldn't be much worse than a single forward pass.
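One way to see why batching makes training (and prefill) FLOPs-heavy rather than bandwidth-heavy: the arithmetic intensity of a weight-matrix multiply, FLOPs per byte of weights read, grows linearly with batch size, since the same weights are reused across the whole batch. A small sketch with illustrative dimensions (activation traffic is ignored for simplicity):

```python
# Arithmetic intensity of multiplying a (batch x d_in) activation
# matrix by a (d_in x d_out) weight matrix, counting only weight traffic.

def arithmetic_intensity(d_in: int, d_out: int, batch: int,
                         bytes_per_weight: float = 2.0) -> float:
    flops = 2 * d_in * d_out * batch           # one multiply + one add per term
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

print(arithmetic_intensity(8192, 8192, 1))    # 1.0 FLOP/byte: memory-bound
print(arithmetic_intensity(8192, 8192, 256))  # 256.0 FLOPs/byte: compute-bound
```

At batch size 1 (decode) every weight byte buys only one FLOP, so bandwidth is the ceiling; at large batch the same bytes feed hundreds of FLOPs and the tensor cores become the ceiling instead.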


It should. It has tensor cores, which should drastically improve prompt processing, and it should be well optimized for most AI apps.



