The network bandwidth between nodes is a bigger limitation than compute. The new...

fourthark · 2025-09-26T20:33:53 1758918833

Seems like training would be a better match, where you need tons of compute but don’t care about latency.

ronsor · 2025-09-27T03:56:53 1758945413

No, the problem is that with training, you do care about latency, and you need a crap-ton of bandwidth too! Think of the all_gather; think of the gradients! Inference is actually easier to distribute.

meehai · 2025-09-27T07:56:37 1758959797

Yeah, but if you can do topologies based on latencies you may get some decent tradeoffs. For example with N=1M nodes each doing batch updates in a tree manner, i.e the all reduce is actually layered by latency between nodes.

shaklee3 · 2025-09-27T04:38:13 1758947893

800Gbps