From what I can tell, all the large players in the space are continuing to develop on transformers, right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
Too new is definitely one thing. Someone is going to have to gamble on actually paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.
There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.
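To make the recall point concrete, here's a toy version of that kind of task (just a sketch; the exact setup in the paper differs, and the names here are mine):

    # Toy "associative recall" probe: the model sees key-value pairs, then a
    # query key, and must emit the matching value. A transformer can attend
    # straight back to the key token; a fixed-size SSM state has to have
    # already kept that pair around.
    import random

    def make_example(n_pairs=16, vocab=list("abcdefghijklmnop")):
        keys = random.sample(vocab, n_pairs)
        values = [random.choice(vocab) for _ in keys]
        query = random.choice(keys)
        answer = values[keys.index(query)]
        prompt = " ".join(f"{k}:{v}" for k, v in zip(keys, values)) + f" | {query}:"
        return prompt, answer

    prompt, answer = make_example()
    print(prompt)   # e.g. "c:f a:b ... | a:"
    print(answer)   # e.g. "b"

A transformer can look the key up at query time; an SSM has to have squeezed the pair into its fixed-size state, which is roughly the limitation the paper formalizes.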
Another element is that Mamba requires a very custom implementation, down to hand-fused CUDA kernels, which I expect would need to be supported in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
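For a rough sense of why the kernels matter, here's a naive reference version of the selective-scan recurrence (not the real implementation; shapes and parameter handling are my assumptions, with A/B/C already discretized per step):

    import torch

    def naive_selective_scan(x, A, B, C):
        # x: (batch, seq, d_inner)
        # A, B: (batch, seq, d_inner, d_state)   C: (batch, seq, d_state)
        batch, seq, d_inner = x.shape
        d_state = A.shape[-1]
        h = torch.zeros(batch, d_inner, d_state)
        ys = []
        for t in range(seq):
            # state update: h_t = A_t * h_{t-1} + B_t * x_t
            h = A[:, t] * h + B[:, t] * x[:, t].unsqueeze(-1)
            # readout: y_t = C_t · h_t
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))
        return torch.stack(ys, dim=1)

The update is inherently sequential over the time dimension, so a plain loop like this is hopeless at scale; the speed in the paper comes from a hardware-aware scan kernel that keeps the state in SRAM, and that kind of low-level work tends to need redoing for every serious training stack.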
It's a reasonably safe bet that Together is doing, or will do, a serious pretraining run with Mamba; if that succeeds, other players might start taking it more seriously.
Exactly this. Except there is zero chance they just looked at Mamba and went "meh, too new for us". People are definitely trying stuff. It takes a lot of fiddling around with a brand new model architecture to get something working well. OpenAI aren't going to give a running commentary on the state of all the things they are looking into.