In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and training them on more data always makes them better. This has led to a science of "scaling laws", which studies just how much bigger models need to be, and how much more data we need to train them on, to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
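To give a sense of just how predictable: as one concrete example, the Chinchilla paper (Hoffmann et al., 2022) fits final training loss with a simple parametric form roughly like

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is the number of parameters and D is the number of training tokens, with fitted exponents somewhere around \alpha \approx 0.34 and \beta \approx 0.28 (those constants are quoted from memory, so treat them as approximate).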
Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.
I believe there is a lot of herding going on, driven more by the influence of people who had compute to play around with than by deeply insightful or principled exploration of networks.
> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state for any prompt prefix (e.g. a long system prompt, or statically retrieved context), and continued generations then pay the same cost irrespective of the prefix length.
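To make that concrete, here is a minimal sketch in plain NumPy. None of this is the actual Mamba parameterization – the dense matrices, the tanh, and all the names are made up for illustration – but it shows why caching a prefix state gives constant-cost continuation:

    import numpy as np

    d_state, d_in = 16, 8                               # fixed hidden-state and input sizes
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(d_state, d_state))  # toy state-transition matrix
    B = rng.normal(scale=0.1, size=(d_state, d_in))     # toy input projection

    def step(h, x):
        # Advance the recurrence by one token: O(1) work, state size never grows.
        return np.tanh(A @ h + B @ x)

    def precompute_prefix(prefix_tokens):
        # Run the prefix once and keep only the final hidden state.
        h = np.zeros(d_state)
        for x in prefix_tokens:
            h = step(h, x)
        return h  # shape (d_state,) no matter how long the prefix was

    # Cache the state of a long "system prompt" once...
    prefix = rng.normal(size=(10_000, d_in))
    h_cached = precompute_prefix(prefix)

    # ...then every continued generation starts from h_cached and pays the
    # same per-token cost, independent of the 10k-token prefix.
    h = h_cached
    for x in rng.normal(size=(5, d_in)):
        h = step(h, x)

Mamba's actual recurrence is a structured, input-dependent state-space update rather than a dense tanh RNN, but the caching logic is the same.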
But this also means that the amount of information that can be retained is constant irrespective of the prefix length. That might be a problem if the prefix is composed of essentially incompressible data.
Indeed: https://arxiv.org/pdf/2402.01032.pdf
Perhaps future iterations of SSMs will accommodate dynamically sized (but still sub-linearly growing) hidden states / memories!
Some prior comments said those architectures lack the memory (or something along those lines) of a transformer: that there's some weakness keeping people on transformers. If true, I'd also like to see tests across various domains with equivalent transformer and Mamba designs, to see whether that difference actually impacts anything. From there, we'd have a better idea about whether a Mamba-176B is worth the money.
For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752