>distributed training networks

Now that's an idea. One bottleneck might be a limit on just how much you can parallelize training, though.



There's a ton of work in this area, and the reality is... it doesn't work for LLMs.

Moving from ~900 GB/s of GPU memory bandwidth (with InfiniBand interconnects between nodes) to 0.01-0.1 GB/s over the internet is brutal, roughly a 9,000x to 90,000x slowdown. That works for simple image classifiers, but I've never seen anything like a large language model trained this way in any meaningful amount of time.
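To put rough numbers on it, here's a back-of-envelope sketch in Python. The 7B parameter count, fp16 gradients, and bandwidth figures are illustrative assumptions, not measurements; it just computes how long one naive full-gradient sync would take at each speed.

    # Back-of-envelope: time for one naive full-gradient sync at different bandwidths.
    # Assumptions (illustrative only): 7B parameters, fp16 gradients (2 bytes each),
    # and the whole gradient is shipped once per optimizer step.
    PARAMS = 7e9
    BYTES_PER_PARAM = 2
    grad_bytes = PARAMS * BYTES_PER_PARAM      # ~14 GB per sync

    bandwidths_gb_per_s = {
        "datacenter interconnect": 100.0,      # NVLink/InfiniBand class
        "fast home internet": 0.1,             # ~1 Gbit/s
        "typical home internet": 0.01,         # ~100 Mbit/s
    }

    for name, bw in bandwidths_gb_per_s.items():
        seconds = grad_bytes / (bw * 1e9)
        print(f"{name:>25}: {seconds:8.1f} s per sync")

Even ignoring latency, that's minutes to tens of minutes per optimizer step on consumer links, which is why this approach needs heavy gradient compression or much less frequent synchronization to be viable at all.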


Maybe there is a way to train a neural network in a distributed fashion by training subsets of it and then propagating the aggregated weight changes to adjacent network segments. It wouldn't recover an interconnect slowdown of that magnitude, but it might still be useful depending on the topology of the network.
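In that spirit, here's a minimal toy sketch of periodic weight averaging ("local SGD" style), where each worker trains on its own shard for many local steps and only the averaged weights cross the slow link. Everything in it (the linear model, shard sizes, step counts) is a made-up illustration, not a description of any real system.

    import numpy as np

    # Toy sketch of periodic weight averaging: each worker trains on its own
    # data shard for LOCAL_STEPS gradient steps, then the workers average
    # weights once per round. Only that averaging step crosses the slow link.
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=8)

    def make_shard(n):
        X = rng.normal(size=(n, 8))
        y = X @ true_w + 0.1 * rng.normal(size=n)
        return X, y

    NUM_WORKERS = 4
    LOCAL_STEPS = 50          # local steps between syncs (the bandwidth-saving knob)
    ROUNDS = 20               # number of averaging rounds
    LR = 0.05

    shards = [make_shard(256) for _ in range(NUM_WORKERS)]
    w = np.zeros(8)           # shared starting point

    for _ in range(ROUNDS):
        local_weights = []
        for X, y in shards:
            w_local = w.copy()
            for _ in range(LOCAL_STEPS):       # cheap: stays on the worker
                grad = 2 * X.T @ (X @ w_local - y) / len(y)
                w_local -= LR * grad
            local_weights.append(w_local)
        # The only cross-worker communication: average weights once per round.
        w = np.mean(local_weights, axis=0)

    print("error vs true weights:", np.linalg.norm(w - true_w))

The knob is LOCAL_STEPS: more local work between syncs means less bandwidth, at the cost of slower or noisier convergence, and for LLM-scale models even the occasional weight exchange is still tens to hundreds of GB.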



