At about 39 minutes in, he's asked whether efforts analogous to SETI@home could be used to get around the scaling problems with training the next round of big models.
He gives a strongly Nvidia-oriented answer that I happen to think is dead wrong. Pushing ever more GPU/memory bandwidth into ever more expensive packages that are obsolete after a year or two isn't the approach I think will win in the end.
I think systems that eliminate the memory/compute distinction entirely, something like an FPGA but optimized for throughput instead of latency, are the way to go.
Imagine a network of machines that could each handle one layer of an LLM with no memory transfers; your bottleneck would just be getting the data between layers. GPT-4, for example, is likely 8 separate columns of 120 layers of 1024^2-parameter matrix multiplies. Assuming infinitely fast compute, you still have to transfer at least 2 KB of activations between layers for every token. Assuming PCI Express 7 at about 200 gigabytes/second, that's about 100,000,000 tokens/second across the whole computing fabric.
Flowing 13 trillion tokens through that would take 36 hours/epoch.
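A quick sanity check of that arithmetic, as a minimal sketch in Python (the 2 KB activation size, ~200 GB/s link speed, and 13-trillion-token corpus are the assumptions above):

    # Back-of-envelope check, using the figures above: ~2 KB of activations
    # per token crossing each layer boundary, ~200 GB/s links, 13T tokens.
    bytes_per_token = 2 * 1024                          # 1024-dim hidden state, 2 bytes/value
    link_bandwidth = 200e9                              # bytes/second
    tokens_per_sec = link_bandwidth / bytes_per_token   # ~1e8 tokens/s
    corpus_tokens = 13e12
    epoch_hours = corpus_tokens / tokens_per_sec / 3600
    print(f"{tokens_per_sec:.2e} tokens/s, {epoch_hours:.0f} hours/epoch")
    # -> 9.77e+07 tokens/s, 37 hours/epoch (close to the 36 quoted above)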
Doing all of that in one place is impressive. But if you can farm it out to a bunch of CPUs and network connections, you're transferring about 4 KB each way per token at each workstation. It wouldn't be unreasonable to aggregate all of those flows across the internet without anything super fancy. Even if it took a month per epoch, it could keep going for a very long time.
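As a rough sketch of what the farmed-out version might need, assuming you run many parallel copies of the layer pipeline (one way to read "aggregate all of those flows"), the ~4 KB-each-way figure above, a one-month epoch, and a purely illustrative ~100 MB/s of usable bandwidth per workstation:

    # Rough sizing of the farmed-out version. Assumptions: ~4 KB per token in
    # each direction at every node (presumably forward activations plus
    # backward-pass gradients), ~100 MB/s usable per connection (a guess),
    # a one-month epoch, and the 8 x 120 = 960-layer pipeline guessed above.
    bytes_per_token_each_way = 4 * 1024
    node_bandwidth = 100e6                       # bytes/second, per direction
    epoch_seconds = 30 * 24 * 3600
    corpus_tokens = 13e12
    layers_per_pipeline = 8 * 120

    tokens_per_pipeline = node_bandwidth / bytes_per_token_each_way * epoch_seconds
    pipelines = corpus_tokens / tokens_per_pipeline
    nodes = pipelines * layers_per_pipeline
    print(f"~{pipelines:.0f} parallel pipelines, ~{nodes:,.0f} machines total")
    # -> ~205 parallel pipelines, ~197,215 machines total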
I think you'd need higher arithmetic intensity, i.e. more compute per byte moved. Gradient descent is best suited to monolithic GPUs; there could be other possibilities for layer-distributed training.