InfiniBand and Ethernet are very different at the lowest levels. Ethernet interconnects use RoCE (RDMA over Converged Ethernet), which essentially encapsulates InfiniBand transport in Ethernet, but you still pay higher routing latency, and you need separate compute and storage networks to avoid queueing problems (lossless Ethernet).
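For anyone who hasn't looked at the encapsulation itself, here's a rough Python sketch of the RoCEv2 layering (field layout simplified, and the opcode/QP/PSN values are made up for illustration): the InfiniBand transport header just becomes the payload of an ordinary UDP datagram on port 4791, which then gets normal IP and Ethernet headers around it.

    import struct

    ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

    def bth(opcode, dest_qp, psn, pkey=0xFFFF):
        """InfiniBand Base Transport Header (12 bytes), simplified:
        opcode | SE/M/PadCnt/TVer | P_Key | rsvd | DestQP(24b) | A/rsvd | PSN(24b)."""
        return (struct.pack("!BBH", opcode, 0, pkey)
                + b"\x00" + dest_qp.to_bytes(3, "big")
                + b"\x00" + psn.to_bytes(3, "big"))

    def rocev2_udp(src_port, ib_packet):
        """Wrap the IB transport packet in a UDP header (checksum left at 0).
        This datagram rides inside plain IPv4/IPv6 + Ethernet, which is why it
        routes like any other IP traffic -- and why PFC/ECN have to fake the
        lossless fabric that native InfiniBand gives you out of the box."""
        return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT,
                           8 + len(ib_packet), 0) + ib_packet

    pkt = rocev2_udp(49152, bth(opcode=0x0A, dest_qp=0x12, psn=1))
    print(len(pkt), "bytes: 8-byte UDP header + 12-byte BTH")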
Also... don't underestimate the PCIe bottlenecks when you put 8x 400Gb NICs + 8x GPUs in one box. There are ways now to build a tree of PCIe switches so you don't overload the main one: each GPU gets its own NIC and PCIe switch.
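To make the GPU-to-NIC pairing concrete, here's a hedged sketch (Linux sysfs only; the PCI addresses are placeholders for whatever `lspci` shows on your machine) that pairs each GPU with the NIC behind the same PCIe switch, so the RDMA path never has to cross the root complex. `nvidia-smi topo -m` will show you the same thing with less typing.

    import os

    # Placeholder PCI addresses -- substitute your own from `lspci`.
    GPUS = ["0000:18:00.0", "0000:3b:00.0"]
    NICS = ["0000:19:00.0", "0000:3c:00.0"]

    def bridge_chain(bdf):
        """Resolve the sysfs symlink for a PCI device; the parent directories
        are the chain of upstream bridges/switches back to the root complex."""
        return os.path.realpath(f"/sys/bus/pci/devices/{bdf}").split("/")[:-1]

    def common_depth(a, b):
        """Depth of the deepest shared upstream bridge: the higher it is, the
        closer the two devices sit (same PCIe switch vs. meeting at the root)."""
        depth = 0
        for x, y in zip(bridge_chain(a), bridge_chain(b)):
            if x != y:
                break
            depth += 1
        return depth

    # Pin each GPU to the NIC it shares the deepest PCIe switch with.
    for gpu in GPUS:
        print(gpu, "->", max(NICS, key=lambda nic: common_depth(gpu, nic)))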
Our cluster is 128 GPUs into a single Dell switch... that should help with the queueing. We also have a separate east-west 100G network.
This is why we went with the Dell XE9680 chassis... people forget that PCIe switches are quite important at this level of compute. Dell has done a good job here.
Interesting, thanks. From the Wikipedia link, this seems like the probable culprit for why things break:
"Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered."
https://community.fs.com/article/infiniband-vs-ethernet-whic...
https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
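That source-port rule is basically there for ECMP: switches hash the UDP 5-tuple to pick one of several equal-cost paths, so a flow that keeps one source port stays on one path and its packets arrive in order. A toy sketch of the idea (the hash and link count are made up; real switches use their own vendor hash):

    import zlib

    def ecmp_link(src_ip, dst_ip, sport, dport, n_links):
        """Toy ECMP: hash the UDP 5-tuple and pick one of n equal-cost links."""
        key = f"{src_ip}|{dst_ip}|{sport}|{dport}|udp".encode()
        return zlib.crc32(key) % n_links

    # Same source port -> same link every time -> in-order delivery.
    print(ecmp_link("10.0.0.1", "10.0.0.2", 49152, 4791, 8))
    print(ecmp_link("10.0.0.1", "10.0.0.2", 49152, 4791, 8))

    # Change the source port and the flow may land on a link with a different
    # queueing delay -- hello reordering.
    print(ecmp_link("10.0.0.1", "10.0.0.2", 49153, 4791, 8))

So anything that sprays one RoCE flow across multiple source ports, or an ECMP re-hash after a link flap, can deliver packets out of order, and RDMA NICs tend to handle that far less gracefully than TCP does.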