In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and training them on more data always makes them better. This has led to a science of "scaling laws", which studies just how much bigger models need to be, and how much more data we need to train them on, to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
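To give a sense of just how predictable: as one concrete example, the Chinchilla paper (Hoffmann et al., 2022) fits final training loss with a simple parametric form roughly like

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is the number of parameters and D is the number of training tokens, with fitted exponents somewhere around \alpha \approx 0.34 and \beta \approx 0.28 (those constants are quoted from memory, so treat them as approximate).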
Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.
I believe there is a lot of herding going on, driven more by the influence of people who had compute to play around with than by deeply insightful or principled exploration of networks.
> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state for any prompt prefix (e.g. a long system prompt, or statically retrieved context), and continued generations then pay the same cost irrespective of the prefix length.
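To make that concrete, here is a minimal sketch in plain NumPy. None of this is the actual Mamba parameterization – the dense matrices, the tanh, and all the names are made up for illustration – but it shows why caching a prefix state gives constant-cost continuation:

    import numpy as np

    d_state, d_in = 16, 8                               # fixed hidden-state and input sizes
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(d_state, d_state))  # toy state-transition matrix
    B = rng.normal(scale=0.1, size=(d_state, d_in))     # toy input projection

    def step(h, x):
        # Advance the recurrence by one token: O(1) work, state size never grows.
        return np.tanh(A @ h + B @ x)

    def precompute_prefix(prefix_tokens):
        # Run the prefix once and keep only the final hidden state.
        h = np.zeros(d_state)
        for x in prefix_tokens:
            h = step(h, x)
        return h  # shape (d_state,) no matter how long the prefix was

    # Cache the state of a long "system prompt" once...
    prefix = rng.normal(size=(10_000, d_in))
    h_cached = precompute_prefix(prefix)

    # ...then every continued generation starts from h_cached and pays the
    # same per-token cost, independent of the 10k-token prefix.
    h = h_cached
    for x in rng.normal(size=(5, d_in)):
        h = step(h, x)

Mamba's actual recurrence is a structured, input-dependent state-space update rather than a dense tanh RNN, but the caching logic is the same.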
But this also means that the amount of information that can be retained is constant irrespective of the prefix length. That might be a problem if the prefix is composed of essentially incompressible data.
Indeed: https://arxiv.org/pdf/2402.01032.pdf
Perhaps future iterations of SSMs will accommodate dynamically sized (but still sub-linearly growing) hidden states / memories!
Some prior comments said those architectures lack the memory (or something along those lines) of a transformer: that there's some weakness keeping people on transformers. If true, I'd also like to see tests across various domains with equivalent transformer and Mamba designs, to see whether that difference actually impacts anything. From there, we'd have a better idea about whether a Mamba-176B is worth the money.
For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752