I like to think (hope) that the next breakthrough will come not from these huge clusters, but from somebody tinkering with new ideas on a small local system.
I also wonder - is compute the main limiting factor today? Let's imagine there is an unlimited number of Nvidia chips available right now and energy is cheap - would using a cluster 100x the current biggest one result in a significant improvement? My naive intuition is: not really.
My experience working on ML at a couple of FAANG-like companies is that GPUs actually tend to be too fast compute-wise, and models often can't come close to Nvidia's theoretical FLOPS numbers. Very frequently the bottlenecks that show up in profiling are elsewhere. It is very easy for your data-reading code to be the bottleneck. I have seen models where networking was the bottleneck and could not keep up with the compute, and we had to adjust the model architecture to reduce the amount of data transferred across the cluster in each training step. Or maybe GPU memory bandwidth is your bottleneck. The key idea in the FlashAttention work is optimizing attention kernels to lower VRAM usage and stick to the smaller/faster SRAM. That is valuable work, but it's also the kind of work where it's rare for an engineer I've worked with to have the CUDA experience to write custom efficient kernels.

Some of the models I train use a lot of sparse tensors as features, and TensorFlow's sparse GPU kernels are rather bad, with many operations either falling back to CPU or, in some cases, the GPU sparse kernel being slower than the CPU equivalent. Several times, densifying and padding tensors with a large fraction of zeros was faster than using the sparse kernel.
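To illustrate the sparse-vs-dense point, something like the following is the kind of comparison I mean (toy shapes and sparsity, TensorFlow just as an example; which path wins depends entirely on the data and the kernel):

    import tensorflow as tf

    # Toy setup: a batch of very sparse feature rows times a dense weight matrix.
    # The shapes and the ~1% density here are made up for illustration.
    dense_features = (
        tf.cast(tf.random.uniform((1024, 4096)) < 0.01, tf.float32)
        * tf.random.normal((1024, 4096))
    )
    sparse_features = tf.sparse.from_dense(dense_features)
    weights = tf.random.normal((4096, 256))

    # Path 1: the sparse kernel.
    out_sparse = tf.sparse.sparse_dense_matmul(sparse_features, weights)

    # Path 2: just densify and use the ordinary matmul, which in practice
    # is sometimes the faster option, as noted above.
    out_dense = tf.matmul(tf.sparse.to_dense(sparse_features), weights)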
I'm sure a few companies/models are optimized enough to fit the ideal case, but it's rare.
Edit: Another aspect of this is that the model architectures that do well today are very hardware-driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPUs. The gap in training efficiency between those two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to be a bad fit for GPUs performance-wise.
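A toy illustration of that gap (grossly simplified; real LSTMs and attention layers have more pieces, and the shapes here are arbitrary):

    import numpy as np

    T, D = 128, 64
    x = np.random.randn(T, D)

    # Recurrent step: each hidden state depends on the previous one, so the
    # T steps are inherently sequential and hard to parallelize on a GPU.
    W_in, W_rec = np.random.randn(D, D) * 0.01, np.random.randn(D, D) * 0.01
    h = np.zeros(D)
    for t in range(T):
        h = np.tanh(x[t] @ W_in + h @ W_rec)

    # (Very) simplified attention: the whole sequence is handled with a few
    # big matrix multiplies, which is exactly the work GPUs are fastest at.
    scores = x @ x.T / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ x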
Let me reframe the question - assume it's not only 100x the GPUs, but that all the performance bottlenecks you've mentioned are also solved or accelerated 100x.
What kind of improvement would we observe, given the current state of the models and of our knowledge?
If I assume you mean LLM-like models similar to ChatGPT, that is pretty hotly debated in the community. Several years ago many people in the ML community believed we were at a plateau and that throwing more compute/money at the problem would not give significant improvements. Well, then LLMs did much better than expected as they scaled up, and they continue to improve on various benchmarks.
So are we at a performance plateau now? I know people at OpenAI-like places who think AGI is likely in the next 3-5 years and is mostly a matter of scaling up context/performance plus a few other key bets. I know others who think that is unlikely in the next few decades.
My personal view is that I would expect a 100x speedup to make ML used even more broadly and to allow more companies outside the big players to have their own foundation models tuned for their use cases, or other specialized domain models outside language modeling. Even now I still see tabular datasets (recommender systems, pricing models, etc.) as the most common thing to work on in industry jobs. As for the impact 100x compute would have on leading models like OpenAI's/Anthropic's, I honestly have little confidence in what will happen.
The rest of this is very speculative and I'm not sure of it, but my personal gut feeling is that we still need other algorithmic improvements, like better ways to represent stored memory that models can later query/search. Honestly, part of that is just the math/CS background in me not wanting everything to end up being a hardware problem. The other part is that I'm doubtful human-like intelligence is so compute-expensive that we can't find more cost-efficient ways for models to learn - but maybe our nervous system really is just much faster at parallel computation?
The human brain manages to work with about 0.3 kWh per day. Even if we say all of that is used for training "models", over twenty years that's only about 2,200 kWh - much less than what ChatGPT needed to train (500 MWh?). So there are obviously lots of things we can do to improve efficiency. On the other hand, our brains have had hundreds of millions of years to be optimized for energy consumption.
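Back-of-the-envelope check of those figures (both numbers are rough guesses to begin with):

    # Rough energy comparison; 0.3 kWh/day and 500 MWh are both guesses.
    brain_kwh = 0.3 * 365 * 20       # ~2,190 kWh over twenty years
    training_kwh = 500 * 1000        # 500 MWh expressed in kWh
    print(training_kwh / brain_kwh)  # roughly 230x more energy for training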
A friend showed me some Python code or something that demonstrates facial recognition by calculating the distances between facial features - eyes, nose...
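Something along these lines, I imagine (a toy sketch with made-up landmark coordinates, not the actual code my friend showed):

    import numpy as np

    # Hypothetical 2D landmark positions (two eyes, nose tip, two mouth corners)
    # for two faces; a real system would extract these with a landmark detector.
    face_a = np.array([[30, 40], [70, 40], [50, 60], [38, 80], [62, 80]], dtype=float)
    face_b = np.array([[31, 41], [69, 39], [50, 61], [37, 79], [63, 81]], dtype=float)

    def feature_vector(landmarks):
        # All pairwise distances between landmarks, normalized by the distance
        # between the eyes so the comparison is roughly scale-invariant.
        dists = np.linalg.norm(landmarks[:, None] - landmarks[None, :], axis=-1)
        upper = np.triu_indices(len(landmarks), k=1)
        return dists[upper] / dists[0, 1]

    # Small difference between the feature vectors -> probably the same face.
    score = np.linalg.norm(feature_vector(face_a) - feature_vector(face_b))
    print(score)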
I had never thought about this before but how do I recognize faces? I mostly recognize faces by context. And I don't have to match against a billion faces, probably a hundred or so? And I still suck at this.
The fact that the human brain works on 0.3 kWh per day likely doesn't mean much. How would we even start asking the question - is the human brain thermally (or, more generally, resource) constrained?
The brain is responsible for about 1/5 of the total energy expenditure (and therefore food requirement) of a human body. So yes, on a biological level, there are significant resource constraints on the human brain. What is less clear is whether this actually holds for the "computing" part (as contrasted with the "sustainment" part - think cell replacement).
I've seen improvement numbers of up to 12x, but after that the returns are so diminishing that there's not really a point. I mean 12x on training costs - and that probably still won't get us AGI.
Well put. This was my experience when working for an AI startup too.
Frustratingly, it's also the hardest part to solve. Throwing more compute at the problem is easy, but diagnosing and then solving those other bottlenecks takes a great deal of time, not to mention experience across a number of specialty domains that rarely overlap in one person.
> I also wonder - is compute the main limiting factor today?
So much of this is a black box that you have to wing it for lots of things and try stuff. The more compute you have, the more YOLO runs you can do.
For example, the research team I work with has about 1/20th the compute that a Google researcher has. This gives them (the Google folks) a massive advantage, because they can afford to train lots of random ideas and see what kind of advantage they get. We have to be much, much more measured and predict our outcomes better.
> the research team I work with has about 1/20th the compute that a Google researcher has. This gives them (the Google folks) a massive advantage
The search space could easily be so large that a less disciplined approach might yield fewer useful outcomes even with the compute advantage. Being forced to be focused and disciplined might actually be a big advantage to you.
It’s hard to be disciplined about a black box though. That’s one reason why we’re all speeding off at a thousand miles per hour on transformers - the architecture works, why try other things?
To be fair I don't think anybody saw the boom in LLMs coming from the initial Attention paper. At the time it was one of many ideas that sounded like they had potential.
> So much of this is a black box that you have to wing it for lots of things and try stuff
Isn't OP talking about exactly this _black box_ of gains?
But this is highly disappointing too - there are speed gains and tradeoffs to be had at every black-box layer, but instead of chasing those and actually locking in the improvements across all future experiments, we do YOLO runs and don't capture them at all.
I think you're right. It does seem like the models need exponentially larger datasets to get linear improvements, which are now in the realm of having to be bought from large social media companies. The next breakthrough will probably be doing something different rather than doing the same thing slightly better.
Given that the language time series itself turned out to store some form of intelligence, I wonder how much more of the human mind is trained on video-input/proprioception/action time series. I.e., make robots that make small decisions and take small actions, train on their experience, do more complicated actions, train on those experiences - there's your 10x, 100x, e^x training tokens. Save language for fine-tuning: language as a specific task of a general, world-interacting robot.
Yeah, I mean this already exists in the human brain - for example, we're more likely to be startled by something in our lower visual field than up in the sky. The reason? Snakes. A lot of animals have a startle reaction to snakes, as they're a common problem out there in the world.
I think the next step is spiking / time domain networks and evolutionary training techniques to deal with the inherent non-linearity of spiking activity. Also, maybe mix in some biologically plausible learning rules such as STDP.
I've seen very compelling results in small scale experiments. I've personally achieved basic tool use (reading a prompt buffer and writing to an output buffer) in an SNN simulation. The biggest takeaway for me is that the fitness function is the most important thing, followed by the selection criteria and the simulation parameters. Figuring out how to compact the total measure of fitness into a single float as the simulation progresses is non-trivial. I've found that dynamically adapting the fitness function and simulation parameters over time is essential.
To be clear, I don't know if I've achieved learning/generalization by the formal definition yet. I do know I can minimize the loss over a training set by merely mutating the weights of neurons with otherwise fixed delays and connectivity. I'm not even doing things like crossover or elitism yet. My next sprint will attempt varying the delays as well as the weights. I've also developed some really strong opinions about the ratio of excitatory to inhibitory neurons. I think inhibitory neurons are a band-aid for physical limitations: in a simulation, we can simply constrain the total allowed ticks or global activations per candidate, and the selection pressure will reject candidates that can't get the job done with the allotted resources.
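For anyone curious, the mutation-only loop I'm describing looks roughly like this (a toy sketch with made-up sizes and a placeholder fitness function, not my actual SNN simulation):

    import numpy as np

    rng = np.random.default_rng(0)
    N, POP, GENERATIONS = 100, 50, 200  # toy sizes, nowhere near the real runs

    def fitness(weights, inputs, targets):
        # Placeholder single-float fitness: negative loss of a fixed-connectivity
        # network where only the weights evolve. A real SNN run would replace
        # this with spike-based evaluation against the prompt/output buffers.
        return -np.mean((np.tanh(inputs @ weights) - targets) ** 2)

    inputs = rng.normal(size=(32, N))
    targets = rng.normal(size=(32, N))
    population = rng.normal(scale=0.1, size=(POP, N, N))

    for gen in range(GENERATIONS):
        scores = np.array([fitness(w, inputs, targets) for w in population])
        # Truncation selection: the top half become parents. No crossover and no
        # elitism - every candidate in the next generation is a mutated clone.
        parents = population[np.argsort(scores)[-POP // 2:]]
        clones = parents[rng.integers(len(parents), size=POP)]
        population = clones + rng.normal(scale=0.01, size=clones.shape)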
I think you can do the real deal on one powerful workstation. I've got networks with 10k neurons, and I can run through about 4000 generations x 250 candidates on a single Threadripper in a day, with the main bottleneck being memory bandwidth as the networks grow in size. Cloning them each generation starts to take a long time once you get into six figures. You really want as many generations per unit of time as possible; being able to watch it evolve at a meaningful rate can unlock new understanding about your fitness function.
> Figuring out how to compact the total measure of fitness into a single float as the simulation progresses is non-trivial. I've found that dynamically adapting the fitness function and simulation parameters over time is essential.
RL style training can let you train against a vector of losses...
Custom loss functions are a thing...
> I've found that dynamically adapting the fitness function and simulation parameters over time is essential
Did you just figure out epsilon-greedy? That's a very well-known technique...
Spiking / time domain... Non-linearities can already be captured by stacking non-linearly activated layers, or by how layers are connected. As for the time domain, CNNs, RNNs, ResNets, and U-Nets (now almost ancient examples) cover a lot of the same ground.
10K-neuron networks are tiny. I don't know what you're trying to accomplish, but I would suggest reading more of the literature, because it sounds like you're stuck in a local optimum of old ideas...
> I've seen very compelling results in small scale experiments.
Small scale experiments are approximately meaningless in ML. A lot of the clever tricks that improve results at smaller scales have their gains BTFO'd by the scaling you give up as you increase the size of the model.
The key insight from deep learning was that the gains from doing a smarter thing were dwarfed by doing a dumber thing a lot more.
For a smarter trick to be valuable, it needs to not only improve results at the small scale, it needs to not become a bottleneck in your system as you jack up the size.
If you look at the advancements over time, you see exactly how they correlate with Nvidia's architecture tocks and the availability of better cards for someone tinkering in their basement. We need better consumer cards for your dream to come true.
Yes, compute is absolutely the limiting factor today - not only because the space of hyperparameters is huge and more compute would make it easier (or even possible) to explore, but also, weirdly, because inference is becoming increasingly important for training, which means even more compute. A lot of work these days goes into getting better data, and it turns out that using an existing large model to create/curate data for you works really well.
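Roughly the shape of that curation step (the scoring function and threshold here are hypothetical, just to illustrate the idea):

    # Hypothetical curation loop: score candidate training documents with an
    # existing model and keep only the ones above some quality threshold.
    def curate(documents, score_with_model, threshold=0.7):
        kept = []
        for doc in documents:
            # score_with_model is assumed to return a quality score in [0, 1],
            # e.g. from a small classifier or an LLM asked to rate the document.
            if score_with_model(doc) >= threshold:
                kept.append(doc)
        return kept

    # Example usage with a trivial stand-in scorer:
    docs = ["short", "a longer, more substantive document about training data"]
    print(curate(docs, lambda d: min(len(d) / 40, 1.0)))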
Zuckerberg said even the smallest model isn't fully converged and continues to improve with more data. Larger models take a much larger amount of data to converge.
And since Llama is trained on 15T tokens, there are a lot more tokens remaining on the internet. And different modalities like video will have orders of magnitude more tokens and more information.
There would presumably be a limit we'd eventually reach with probabilistic syllable generators - we'd max out the workable context window, minimize hallucinations as far as possible, etc.
But they would still be extremely useful, even if very different from AGI.
There are a few videos by Sabine/Anton, and now also PBS, about microtubules and consciousness. It's starting to gain ground, and if I were an AI company right now I'd be breaking into a cold sweat.
Video for reference (check especially the ending summary, where they calculate more or less how much would be necessary just to replicate a human mind's way of achieving consciousness, if microtubules do in fact contribute to it):
the argument for quantum consciousness is quite interesting:
1. consciousness is really weird.
2. quantum is really weird.
3. what if the two were really connected?
edit: oh god, I just had a horrible thought: the universe is a simulation, and the reason consciousness requires "quantum" spookiness is that it's running on a different computer and plugged in as an ad hoc skill.
It is not much of an argument, though. It is at the same level as religion, in terms of explanatory power: “I don’t understand it, therefore it has to be a quantum effect”. We ought to be more humble than that. There are many very complicated things that we don’t understand and that have nothing to do with quantum mechanics.
I think it depends on the framing of the question, especially how you define compute.
- No, compute is not the limiting factor. The limiting factor is poorly optimized software (there's a joke: "10 years of hardware advancements have been entirely undone by 10 years of software advancements")
- No, compute is not the limiting factor. The limiting factor is that the electron is too big and the speed of light is too slow.
If we're talking about ML, then no, compute is not the limiting factor.
At least not if we define compute as the number of FLOPs we can process, rather than in terms of algorithms or resulting abilities. I'll admit I'm an outlier in this respect[0]. But it is worth recognizing that we now have exaflop machines that use tens of megawatts of energy, and they pale in comparison to what a 3 lb piece of fat and meat can do with only a handful of watts. In fact, our exascale computers don't even seem sufficient to simulate far smaller and far less intelligent creatures. Certainly scale is a factor (we see this pattern in apes too), but clearly there is more to it. And I think it should be obvious that scale isn't all you need, since we're the only ones - if it were that simple, we should see it more often. Even if scale is indeed all we need, we have no idea how much scale that actually is, nor does it follow that this is the best path forward, since that scale may be ludicrously large.

What we do know is that incredible feats can be done with what would constitute a rounding error at current scales (let alone future ones). I think we just want to believe this is the path forward, because if it is, then there is a clear direction. If it isn't, then we have to admit that we're still lost. But I think the real problem is that we treat being lost as a problem, or think that admitting we're lost somehow undermines or rejects the progress that we have made. Research is all about exploring the unknown: if you aren't at least a little lost, you're not exploring, you're reading a map. The irony is that "scale is all you need" denies a lot of the significant advancements we've made. Many smaller models perform far better than they did previously, and this is not due to knowledge transfer from larger models. Just look at any leaderboard: size is not the determining factor.
So I'd argue that if you want to advance AI, you should focus on smaller models. After all, smaller models are far easier to scale than larger models. They're also far easier to analyze and interpret, which is what gives us more information about how to light the way forward. But don't expect a smaller model that works better at small scale to immediately beat the larger models. Far too often I see a mistake, even by reviewers/experts, where a method is dismissed because it was developed by some poor grad student with limited compute and did not unilaterally defeat the big models. Of course that doesn't mean such proposed methods are better - but that's orthogonal to what I'm arguing.
[0] Obviously I'm not entirely alone. Yann LeCun is a clear believer, and it's why he's looking at JEPA models (I don't think that will be enough, but I think it's a better direction). And Chollet has become better known (at least outside the ML research community) and is a clear dissenter.
On another thread*, I mentioned that the XC2064 FPGA design is the paradigmatic example of an optimized, minimal "hardware API" if you're interested in making new kinds of logic and memory. Okay, you're probably thinking of higher-level designs, but I wanted to throw out a relevant, memorable example ASAP.