
In case people are wondering why Mamba is exciting:

There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
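
Roughly, those fitted curves take a simple power-law shape. Here's a toy sketch of the Chinchilla-style parametric form (the constants below are illustrative placeholders, not the published fitted values):

    # Chinchilla-style scaling law: predicted loss as a function of
    # parameter count N and training tokens D: L(N, D) = E + A/N^alpha + B/D^beta.
    # All constants here are illustrative placeholders, not real fitted values.
    def predicted_loss(n_params, n_tokens,
                       E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
        return E + A / n_params**alpha + B / n_tokens**beta

    # Growing both model size and data buys a predictable (small) drop in loss.
    print(predicted_loss(7e9, 140e9))    # ~7B params, ~140B tokens
    print(predicted_loss(14e9, 280e9))   # twice the budget on both axes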

Transformers are great because they keep scaling and giving us better performance as they grow – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.

For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752



People have shown that even CNNs can match the performance of Transformers.

https://openreview.net/forum?id=TKIFuQHHECj#

I believe there is a lot of herding going on, driven more by the influence of people who had the compute to play around with than by any deeply insightful or principled exploration of networks.


you linked a paper about vision transformers...


Being used as a comparison...

From the abstract:

> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.


“RNN-mode inference” is also extremely exciting because you can precompute the hidden state for any prompt prefix (e.g. a long system prompt, or statically retrieved context), and continued generations then pay the same per-token cost irrespective of the prefix length.
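
A rough sketch of the idea with a generic recurrent interface (not the actual Mamba API; prefill/generate and the step(state, token) -> (new_state, logits) signature below are hypothetical):

    # Hypothetical recurrent interface: step(state, token) -> (new_state, logits).
    def prefill(step, init_state, prefix_tokens):
        # Run the recurrence over the prefix once; the result is a fixed-size state.
        state = init_state
        for tok in prefix_tokens:
            state, _ = step(state, tok)
        return state  # same size regardless of how long the prefix was

    def generate(step, cached_state, first_token, n_new, sample):
        # Every continuation starts from the cached state; cost per new token is O(1).
        state, tok, out = cached_state, first_token, []
        for _ in range(n_new):
            state, logits = step(state, tok)
            tok = sample(logits)
            out.append(tok)
        return out

    # Toy demo with a trivial "model" (state is a running sum, logits = state).
    toy_step = lambda s, t: (s + t, s + t)
    cached = prefill(toy_step, 0, [3, 1, 4, 1, 5])            # pay the prefix cost once
    print(generate(toy_step, cached, 2, 4, lambda l: l % 7))  # cheap continuations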


But this also means that the amount of information that can be retained is constant irrespective of the prefix length. That might be a problem if the prefix is composed of essentially incompressible data.
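
For a sense of scale, a back-of-the-envelope comparison (all shapes below are assumed, just to illustrate linear vs. constant growth in memory, and by extension in how much of the prefix can be "remembered"):

    # Assumed, illustrative dimensions only.
    def kv_cache_bytes(prefix_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
        # Transformer: keys + values per layer, per head, per position -> grows with prefix.
        return 2 * n_layers * n_heads * head_dim * prefix_len * bytes_per

    def ssm_state_bytes(n_layers=32, d_inner=8192, d_state=16, bytes_per=2):
        # SSM: a fixed-size recurrent state per layer, independent of prefix length.
        return n_layers * d_inner * d_state * bytes_per

    for L in (1_000, 100_000):
        print(L, kv_cache_bytes(L), ssm_state_bytes())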


Indeed: https://arxiv.org/pdf/2402.01032.pdf

Perhaps future iterations of SSMs will accommodate dynamically sized (but still non-linearly-growing) hidden states / memories!


I'd love to see someone who has the resources train a model bigger than 2.8b and show the scaling law still holds.


Some prior comments said those architectures lack the memory (or something along those lines) of a Transformer, and that this weakness is what keeps people using Transformers. If true, I'd also like to see tests across various domains with equivalent Transformer and Mamba designs to see whether that difference actually impacted anything. From there, we'd have a better idea about whether Mamba-176B is worth the money.



