If I'm not mistaken, the largest Mamba model right now is 2.8B, and it's undertrained on low-quality data (the Pile only). The main problem is that it's new and unproven.
Should become very interesting once someone with both the data and significant financial backing takes the plunge and trains something of notable size. Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).
> low quality data (the Pile only)
The Pile is pretty good quality-wise. It's mostly the size (~300B tokens) that's limiting.
Eh, quality is subjective. There are good parts, like Books3 and arXiv, but a large chunk of it is Common Crawl, which has just about anything people put up on the internet: random IRC chat logs, HN and Reddit shitposts, YouTube subtitles that are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.