
ML training is a tightly interconnected, HPC-style workload (network bound), which is a market InfiniBand has been targeting for a long time.

Nvidia saw that, bought Mellanox, and made NCCL and its GPUs work really well with InfiniBand.

Large public clouds already have huge investments in Ethernet and don't want to be locked into Nvidia any further, so Nvidia does have a roadmap for Ethernet GPU clusters (roughly 1 year behind InfiniBand).

But if you are building your own Nvidia cluster, it would be silly to build it on Ethernet. Just buy exactly what Nvidia recommends; you are already locked in anyway.


