There is a tick-tock between searching for the dominant NN architecture (tick) and optimizing it for accuracy, compute, and inference latency and throughput (tock).
This particular tock is still playing out. The next tick does not feel imminent and will likely depend on when we discover the limits of transformers when it comes to solving for the long tail of use cases.
You have to consider that there is still some low-hanging fruit that would let you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple: you can just buy more GPUs and forget about the optimizations.
If a 100x improvement in performance is left on the table, then surely even lower-priority optimizations won't be implemented any time soon.
Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.
I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No.
What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?
Notice how the model has already learned the optimal attention scheme. You just need to give it less stuff to do and it will get faster automatically.
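To make that concrete, here is a minimal toy sketch (my own illustration, not code from any real serving stack) of a later layer attending only over a small fraction of the cached tokens, ranked by an importance score such as the attention mass they received in an earlier layer; the scoring scheme and names are assumptions:

```python
import numpy as np

def pruned_attention(q, k, v, keep_frac=0.1, scores_from_prev=None):
    """Toy single-head attention that only attends over the top
    `keep_frac` fraction of cached tokens, ranked by an importance
    score accumulated earlier (e.g. attention mass in a prior layer).

    q: (d,) query for the current position
    k, v: (n, d) cached keys/values for previous positions
    scores_from_prev: (n,) importance scores; None -> full attention
    """
    n, d = k.shape
    if scores_from_prev is not None:
        keep = max(1, int(n * keep_frac))
        idx = np.argsort(scores_from_prev)[-keep:]   # keep the "heavy" tokens
        k, v = k[idx], v[idx]                        # shrink the KV cache
    att = q @ k.T / np.sqrt(d)                       # logits over kept tokens
    att = np.exp(att - att.max())
    att /= att.sum()
    return att @ v                                   # weighted sum of values

# Attending over 10% of an 8k cache cuts the per-query cost of this
# layer by roughly 10x.
rng = np.random.default_rng(0)
n, d = 8192, 64
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
importance = rng.random(n)   # stand-in for scores from an earlier layer
out = pruned_attention(q, k, v, keep_frac=0.1, scores_from_prev=importance)
```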
My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But looking for better architectures is countered with "a regular architecture with more parameters does better, so what's the point of this?"
IMO, SSMs are an optimization. They don't represent enough of a fundamental departure from the kinds of things Transformers can _do_. So, while I like the idea of saving on energy costs, I speculate that such savings can be obtained with other optimizations while staying with transformer blocks. Hence, the motivation to change is a bit of an uphill battle here. I would love to hear counter-arguments to this view. :)
Furthermore, I think a replacement will require that we _understand_ what the current crop of models is doing mechanically. Some of it was motivated in [1].
Quadratic vs linear is not an optimization. It's a completely new game. With selective SSMs (Mamba), the win is that the recurrence can be trained in parallel via a log-depth associative scan: the work is linear in sequence length (versus quadratic for attention) and the parallel depth is logarithmic. If that's "just an optimization", it's a huge one.
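For concreteness, here is a toy sketch of the scan trick for a scalar linear recurrence h_t = a_t*h_{t-1} + b_t. The names and the Hillis–Steele variant are my own choices for illustration; Mamba's actual kernels use a more work-efficient scan over structured state:

```python
import numpy as np

def combine(a1, b1, a2, b2):
    # Composing h -> a1*h + b1 followed by h -> a2*h + b2
    return a1 * a2, a2 * b1 + b2

def scan_recurrence(a, b):
    """Compute h_t = a_t*h_{t-1} + b_t (with h_{-1} = 0) for all t using a
    Hillis-Steele inclusive scan: O(n log n) work, O(log n) depth when the
    offsets run in parallel (a Blelloch scan gets this down to O(n) work)."""
    A, B = a.copy(), b.copy()
    n, offset = len(a), 1
    while offset < n:
        # combine element t with the prefix ending at t - offset
        A2, B2 = A.copy(), B.copy()
        A2[offset:], B2[offset:] = combine(A[:-offset], B[:-offset],
                                           A[offset:], B[offset:])
        A, B = A2, B2
        offset *= 2
    return B   # B[t] is h_t, since the initial state is zero

# Check against the sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.random(16), rng.random(16)
h, seq = 0.0, []
for t in range(16):
    h = a[t] * h + b[t]
    seq.append(h)
assert np.allclose(scan_recurrence(a, b), seq)
```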
Okay, I respect your point of view. I am curious: what applications do you think SSMs enable that a Transformer cannot? I have always seen them as a drop-in replacement (like for like), but maybe there is more to it.
Personally, I think going linear instead of quadratic for a core operation that a system needs to do is by definition an optimization.
Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)
There are certainly tradeoffs to both. The general transformer motif scales very well on a number of axes, so it may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).
The solution to AGI is not deep learning. Maybe with more compute and a shitload of engineering it can produce some kind of baby AGI.
My bet would be on something other than gradient descent and backprop, but really I don't wish for any company or country to reach AGI, or any sophisticated AI at all...
Magical thinking. Nature uses gradient descent to evolve all of us and our companions on this planet. If something better were out there, we would see it at work in the natural world.
Are you also saying that thoughts are formed using gradient descent? I don't think gradient descent is an accurate way to describe either process in nature. Also, we don't know that we "see" everything that is happening, we don't even understand the brain yet.
It's a metaphor that's been used with the advancement of CPU designs at least as far back as the 80s or 90s. Intel uses it explicitly in their marketing nowadays, I believe.
The innovation is the amount of resources people are willing to spend right now. From looking at the research code it's clear that the whole field is basically doing a (somewhat) guided search in the entire space of possible layer permutations.
There seems to be no rhyme or reason, no scientific insight, no analysis. They just try a million different permutations, and whatever scores the highest on the benchmarks gets published.
There's definitely scientific insight and analysis.
E.g. "In-context Learning and Induction Heads" is an excellent paper.
Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypothesis over how these models store information, and provide experimental evidence.
The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.
Attention provides information routing. Again, that is pretty well-understood.
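A toy single-head block makes that division of labor explicit; the shapes, initialization, and missing normalization are simplifications for illustration, not a claim about any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_block(x, Wq, Wk, Wv, W1, W2):
    """One toy single-head transformer block. x: (n, d) token vectors."""
    n, d = x.shape
    # Attention: each token decides *where* to read from -- information routing.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    x = x + softmax(q @ k.T / np.sqrt(d)) @ v
    # MLP: a per-token two-layer key/value lookup -- associative memory + compute.
    x = x + np.maximum(x @ W1, 0.0) @ W2
    return x

rng = np.random.default_rng(0)
n, d, hidden = 4, 8, 32
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, hidden)) * 0.1, rng.normal(size=(hidden, d)) * 0.1
print(toy_block(x, Wq, Wk, Wv, W1, W2).shape)   # (4, 8)
```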
The rest is basically finding an optimal trade-off. These trade-offs are based on insights drawn from experimental data.
So this architecture is not so much accidental as it is general.
Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.
Note that not all brains are so severely damaged by this illusion. Most of them actually get pretty clearly that they are next to useless without their organic, social, and environmental companions.
yes. the flagged comment wanted to sell moffkalast a bridge, w/o further explanation. i interpreted it as him saying that the human brain isn't governed by evolution and calling moffkalast naive.
The only thing that has changed since 2018 is the most popular network structure to play with. The code looks the same as always: Python notebooks where someone manually calculated the size of each hard-coded layer to make it fit.
I don't know, shouldn't the AI then be stuck evaluating all possible AI implementations? And since it will face the halting problem, it won't be able to single out the very best one, though it will probably be able to return the best one reachable by exhaustive search given a capped amount of resources. That won't necessarily be better than what human beings could provide given an equivalent amount of resources.
There are too many papers throwing transformers on everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably go back to normal again soon.
I don't know what you mean by amazing for language. Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text to speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them, it's just that language tends to be a very common input or output.
I’ve occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for trying to max GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can; a simpler, static model architecture that trains faster can train on much more data.
So until the hardware allows for comparable (say, within 2-4x) throughput of samples per second, I expect model architectures to stay mostly static for the most effective models, and dynamic architectures to remain an interesting side area.
There are things like NAS (neural architecture search), but all you are doing is growing the search space and making the optimization problem much harder. Typically you do the architectural optimization by hand, using heuristics and past experiments as guidance.
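As a caricature of what automating that search looks like, here is a naive random-search NAS loop; the search space and the scoring function are placeholders standing in for "train the candidate and benchmark it", which is the part that actually dominates the cost:

```python
import random

# Placeholder search space: depth, width, and block type per layer.
SEARCH_SPACE = {
    "depth": [4, 8, 12],
    "width": [128, 256, 512],
    "block": ["attention", "conv", "mlp"],
}

def sample_architecture(rng):
    """Sample one candidate architecture as a list of layer configs."""
    depth = rng.choice(SEARCH_SPACE["depth"])
    return [{"width": rng.choice(SEARCH_SPACE["width"]),
             "block": rng.choice(SEARCH_SPACE["block"])}
            for _ in range(depth)]

def score(arch):
    """Stand-in for training the candidate and measuring a benchmark;
    here it's an arbitrary toy preference for ~2048 total width."""
    return (-abs(sum(layer["width"] for layer in arch) - 2048)
            + 10 * sum(layer["block"] == "attention" for layer in arch))

rng = random.Random(0)
best = max((sample_architecture(rng) for _ in range(100)), key=score)
print(len(best), "layers, score", score(best))
```

Every knob you add to SEARCH_SPACE multiplies the number of candidates, which is the "growing the search space" problem in miniature.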
I'm kind of shocked. I thought there would be more dynamism by now and I stopped dabbling in like 2018.