
InfiniBand and Ethernet are very different at the lowest levels. Ethernet interconnects use RoCE (RDMA over Converged Ethernet), which encapsulates InfiniBand transport in Ethernet, but you still pay the higher routing latency, and you need separate compute and storage networks to avoid queueing (lossless Ethernet).

https://community.fs.com/article/infiniband-vs-ethernet-whic...

https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
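
To make the layering concrete, here is a rough sketch with Scapy of what a RoCEv2 frame looks like on the wire: Ethernet / IP / UDP / InfiniBand transport headers (BTH) + payload. Only UDP destination port 4791 is the real registered RoCEv2 port; the address, source port, and placeholder BTH bytes are made up for illustration.

    # Rough sketch of RoCEv2 on-wire layering (illustrative, not a valid packet).
    from scapy.all import Ether, IP, UDP, Raw

    ib_transport = Raw(load=b"\x00" * 12 + b"payload")  # stand-in for BTH + data
    rocev2_pkt = (
        Ether()
        / IP(dst="10.0.0.2")                 # made-up destination
        / UDP(sport=49152, dport=4791)       # 4791 = registered RoCEv2 port
        / ib_transport
    )
    print(rocev2_pkt.summary())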

Also... don't underestimate the PCIe bottlenecks when you put 8x 400Gb/s NICs next to 8x GPUs. There are now ways to build a tree of PCIe switches and avoid overloading the root one: each GPU gets its own NIC and its own PCIe switch.
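
For a rough sense of scale, a back-of-the-envelope sketch (assumes PCIe Gen5 x16 per device and 128b/130b encoding, and ignores protocol header overhead, so the numbers are approximate):

    # Why each 400G NIC wants its own PCIe Gen5 x16 link rather than a shared root link.
    nic_count      = 8
    nic_rate_gbps  = 400                       # one 400G NIC at line rate
    pcie5_x16_gbps = 32 * 16 * (128 / 130)     # 32 GT/s * 16 lanes, 128b/130b encoding

    total_nic_gbps = nic_count * nic_rate_gbps
    print(f"one PCIe Gen5 x16 link : ~{pcie5_x16_gbps:.0f} Gb/s")
    print(f"one 400G NIC           : {nic_rate_gbps} Gb/s "
          f"(~{100 * nic_rate_gbps / pcie5_x16_gbps:.0f}% of a x16 link)")
    print(f"8 NICs together        : {total_nic_gbps} Gb/s "
          f"-> ~{total_nic_gbps / pcie5_x16_gbps:.1f} x16 links' worth of lanes")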




This is a great comment.

Our cluster is 128 GPUs into a single Dell switch... that should help with the queueing. We also have a separate east-west 100G network.

This is why we went with the Dell XE9680 chassis... people forget that PCIe switches are quite important at this level of compute. Dell has done a good job here.


Interesting, thanks. From the Wikipedia link, this seems like the probable culprit for why things break:

"Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered."


In practice it's easy to design an Ethernet network that doesn't reorder packets.
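
One reason this tends to hold (a toy sketch, not how any particular switch implements it): ECMP-style load balancing usually hashes the 5-tuple, so every packet of a given flow takes the same path and stays in order, while different flows get spread across links. The hash function and flows below are made up for illustration.

    # Toy 5-tuple ECMP hashing: same flow -> same uplink -> no path-induced reordering.
    import hashlib

    uplinks = ["link0", "link1", "link2", "link3"]

    def pick_uplink(src_ip, dst_ip, sport, dport, proto="udp"):
        key = f"{src_ip}|{dst_ip}|{sport}|{dport}|{proto}".encode()
        return uplinks[hashlib.sha256(key).digest()[0] % len(uplinks)]

    # Same flow always lands on the same link:
    print(pick_uplink("10.0.0.1", "10.0.0.2", 49152, 4791))
    print(pick_uplink("10.0.0.1", "10.0.0.2", 49152, 4791))
    # A different source port (a different flow) may take a different link:
    print(pick_uplink("10.0.0.1", "10.0.0.2", 49153, 4791))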


But if packets have to go over the internet at some point, aren't all bets off if you use UDP?


RoCE doesn't go over the Internet, most ISPs don't reorder packets, and UDP isn't treated specially.


RoCE does not encapsulate InfiniBand.


You are right. Sorry, I quoted the linked article. I haven't worked on the networking side to that level of detail.



