My experience working on ML at a couple of FAANG-like companies is that GPUs actually tend to be too fast compute-wise, and models are often unable to come close to NVIDIA's theoretical FLOPS numbers. Very frequently, profiling shows the bottleneck is somewhere else entirely. It is very easy for your data-reading code to be the bottleneck. I have seen models where networking was the bottleneck and could not keep up with the compute, and we had to adjust the model architecture to reduce the amount of data transferred across the cluster per training step. Or maybe GPU memory bandwidth is the bottleneck: the key idea in the FlashAttention work is restructuring the attention kernel to cut reads and writes to the large-but-slow HBM and keep the working set in the smaller, faster on-chip SRAM. That is valuable work, but it is also the kind of work few people can do; it is rare for an engineer I have worked with to have the CUDA experience needed to write custom efficient kernels.

Some of the models I train use a lot of sparse tensors as features, and TensorFlow's sparse GPU support is rather bad, with many operations either falling back to CPU or, in some cases, a GPU sparse kernel that was slower than the CPU equivalent. Several times, densifying and padding tensors with a large fraction of zeros was faster than using the sparse kernels.
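To make that last point concrete, here's a minimal sketch (my own illustration, not the commenter's code; the shapes and ~5% density are made up) comparing tf.sparse.sparse_dense_matmul against densifying with tf.sparse.to_dense and running a plain tf.matmul. Which path wins depends heavily on the op, the sparsity level, and the hardware:

```python
# Minimal sketch: sparse matmul vs. "densify and pad with zeros" in TensorFlow.
# The shapes and ~5% density are invented for illustration; which path is faster
# depends on the op, sparsity and hardware.
import time

import tensorflow as tf

rows, cols, hidden = 4096, 4096, 512
density = 0.05  # fraction of non-zeros (assumed)

# Build a random sparse matrix with roughly `density` non-zeros.
mask = tf.random.uniform([rows, cols]) < density
indices = tf.where(mask)                       # int64 indices in row-major order
values = tf.random.normal([indices.shape[0]])
sparse_lhs = tf.sparse.SparseTensor(indices, values, dense_shape=[rows, cols])
dense_rhs = tf.random.normal([cols, hidden])

def bench(fn, iters=20):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        out = fn()
    _ = out.numpy()  # force any pending GPU work to finish
    return (time.perf_counter() - start) / iters

sparse_ms = bench(lambda: tf.sparse.sparse_dense_matmul(sparse_lhs, dense_rhs)) * 1e3
dense_lhs = tf.sparse.to_dense(sparse_lhs)     # the "densify" alternative
dense_ms = bench(lambda: tf.matmul(dense_lhs, dense_rhs)) * 1e3

print(f"sparse: {sparse_ms:.2f} ms/iter, densified: {dense_ms:.2f} ms/iter")
```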
I’m sure a few companies/models are optimized enough to fit the ideal case, but it's rare.
Edit: Another aspect of this is that which model architectures are "good" today is very hardware-driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPUs; the gap in training efficiency between the two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to fit GPUs badly performance-wise.
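For anyone who hasn't internalized why the gap is so much bigger on GPU: a recurrent model has to walk the sequence one step at a time, while attention over the whole sequence is just a few big matmuls the GPU can chew through in parallel. A toy NumPy sketch of the data flow (no learned weights or training loop, made-up shapes):

```python
# Toy illustration of why recurrence vs. attention matters on a GPU:
# the RNN loop is a serial chain over time steps, the attention path is a
# couple of large matmuls over the whole sequence at once. NumPy, made-up shapes.
import numpy as np

seq_len, d = 1024, 256
x = np.random.randn(seq_len, d)

# Recurrent: step t depends on step t-1, so this loop cannot be parallelized.
W = np.random.randn(d, d) * 0.01
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] + h @ W)

# Attention-style: all positions interact in one shot via dense matmuls,
# exactly the workload GPUs are built for.
scores = x @ x.T / np.sqrt(d)                                  # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # row-wise softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ x                                              # (seq_len, d)
```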
Let me reframe the question: assume it's not only 100x GPUs, but that all the performance bottlenecks you've mentioned are also solved or accelerated 100x.
What kind of improvement would we observe, given the current state of the models and our knowledge?
If I assume you mean LLM-like models similar to ChatGPT, that is pretty heavily debated in the community. Several years ago many people in the ML community believed we were at a plateau and that throwing more compute/money at the problem would not give significant improvements. Then LLMs did much better than expected as they scaled up, and they continue to improve on various benchmarks now.
So are we now at a performance plateau? I know people at OpenAI-like places who think AGI is likely in the next 3-5 years and is mostly a matter of scaling up context/performance plus a few other key bets. I know others who think it is unlikely within the next few decades.
My personal view is that I would expect a 100x speedup to make ML used even more broadly and to allow more companies outside the big players to have their own foundation models tuned for their use cases, or other specialized domain models outside language modeling. Even now I still see tabular datasets (recommender systems, pricing models, etc.) as the most common kind of work in industry jobs. As for the impact 100x compute would have on leading models like OpenAI's/Anthropic's, I honestly have little confidence in what will happen.
The rest of this is very speculative and I'm not sure of it, but my gut feeling is that we still need other algorithmic improvements, like better ways for models to store memories they can later query/search. Honestly, part of that is the math/CS background in me not wanting everything to end up being a hardware problem. The other part is that I'm doubtful human-like intelligence is so compute-expensive that we can't find more cost-efficient ways for models to learn, but maybe our nervous system is just much faster at parallel computation?
The human brain manages to work with about 0.3 kWh per day. Even if we say all of that is used for training "models", over twenty years that's only about 2,200 kWh, much less than what ChatGPT needed to train (500 MWh?). So there are obviously lots of things we could do to improve efficiency. On the other hand, our brains had hundreds of millions of years to be optimized for energy consumption.
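Taking the commenter's two estimates at face value (0.3 kWh/day for the brain, ~500 MWh for training), the arithmetic works out to roughly a 200x gap:

```python
# Quick check of the numbers above; both inputs are the commenter's estimates.
brain_kwh_per_day = 0.3
brain_total_kwh = brain_kwh_per_day * 365 * 20   # twenty years -> ~2,190 kWh
training_kwh = 500 * 1000                        # 500 MWh in kWh (speculated)
print(f"brain, 20 years: {brain_total_kwh:,.0f} kWh")
print(f"training / brain: ~{training_kwh / brain_total_kwh:.0f}x")   # ~228x
```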
A friend showed me some python code or something that demonstrates facial recognition by calculating the distance between facial features - eyes, nose...
I had never thought about this before but how do I recognize faces? I mostly recognize faces by context. And I don't have to match against a billion faces, probably a hundred or so? And I still suck at this.
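The friend's code isn't reproduced here, but the general shape of "recognize by distance between feature vectors" is something like this toy sketch. The feature values and the threshold are invented, and real systems use learned embeddings rather than hand-measured landmark distances, but the matching step is the same idea:

```python
# Toy nearest-neighbour face matching on feature vectors. The features
# (e.g. normalized eye spacing, nose-to-mouth distance, ...) and the threshold
# are invented; real pipelines use learned embeddings, but the matching step
# is the same distance comparison.
import numpy as np

known_faces = {
    "alice": np.array([0.42, 0.31, 0.18, 0.55]),
    "bob":   np.array([0.39, 0.35, 0.22, 0.48]),
}

def identify(query, gallery, threshold=0.05):
    # Pick the gallery face whose feature vector is closest in Euclidean distance.
    name, dist = min(
        ((n, np.linalg.norm(query - v)) for n, v in gallery.items()),
        key=lambda item: item[1],
    )
    return name if dist < threshold else "unknown"

print(identify(np.array([0.41, 0.32, 0.19, 0.54]), known_faces))  # -> alice
```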
The fact that the human brain works with 0.3 kWh per day likely doesn't mean much. How do we even start asking the question: is the human brain thermally (or, more generally, resource) constrained?
The brain is responsible for about 1/5 of the total energy expenditure (and therefore food requirement) of a human body. So yes, on a biological level, there are significant resource constraints on the human brain. What is less clear is whether this actually holds for the "computing" part (as contrasted with the "sustainment" part, think cell replacement).
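A back-of-the-envelope check of that 1/5 figure, assuming a typical ~2,000 kcal/day intake (my assumption, not a number from the thread): it lands around 19 W, the same ballpark as the ~0.3 kWh/day mentioned upthread.

```python
# Back-of-the-envelope: what does "1/5 of total energy expenditure" work out to?
# The 2,000 kcal/day intake is an assumed typical-adult figure, not from the thread.
KCAL_TO_KWH = 4184 / 3.6e6        # 1 kcal = 4184 J, 1 kWh = 3.6e6 J
brain_kwh_per_day = 2000 * KCAL_TO_KWH * (1 / 5)
brain_watts = brain_kwh_per_day * 1000 / 24
print(f"~{brain_kwh_per_day:.2f} kWh/day, i.e. ~{brain_watts:.0f} W")  # ~0.46 kWh/day, ~19 W
```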
I've seen improvement numbers of up to 12x, but after that the returns diminish so quickly that there's not really a point. 12x on training costs, I mean; it probably still won't get us AGI.
Well put. This was my experience when working for an AI startup too.
Frustratingly, it's also the hardest part to solve. Throwing more compute at the problem is easy, but diagnosing and then solving those other bottlenecks takes a great deal of time, not to mention experience across a number of specialty domains that don't often overlap.