There is a reasonable argument for that (I've heard the idea floated among multiple AI engineers that once you go past a certain scale, the architecture no longer matters much for eval results)
One of the biggest issues with testing all of this is that it takes a crap ton of GPUs to prove out any alternative to transformers beyond 1B params.
For example, I'm still waiting for someone to train a 1B-14B text-based diffusion network.
Finally, if this is truly the case (and all that really matters is size + dataset), then we really should use an architecture that is cheaper to train and run. And that's what RWKV represents here.
You can even run the quantized 7B model reasonably well on most laptops (try the rwkv-cpp / rwkv-cpp-node projects).
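To give a rough sense of what that looks like in practice, here is a minimal sketch of calling a quantized RWKV model from Node via rwkv-cpp-node. The `RWKV` class, `completion()` call, result shape, and the model filename are assumptions based on the project's README pattern and may differ from the published package, so treat it as illustrative rather than a drop-in snippet.

```ts
// Illustrative sketch only: the RWKV class and completion() signature are assumed
// from the rwkv-cpp-node README pattern and may not match the published package exactly.
import { RWKV } from "rwkv-cpp-node";

// Path to a ggml-quantized 7B model file (hypothetical filename), downloaded beforehand.
// Q8_0 / Q5_1 quantization keeps the 7B weights small enough for typical laptop RAM.
const MODEL_PATH = "./RWKV-4-Raven-7B-Q8_0.bin";

function main(): void {
  // Load the quantized model; inference runs CPU-only via the rwkv.cpp backend.
  const model = new RWKV(MODEL_PATH);

  // RWKV is an RNN, so the prompt is consumed into a fixed-size hidden state
  // rather than a growing attention cache, which is what keeps it cheap to run.
  const result = model.completion({
    prompt: "Q: What does RWKV stand for?\n\nA:",
    max_tokens: 64,
  });

  console.log(result.completion);
}

main();
```

The operational point is the one above: because all state is a fixed-size RNN hidden state, the same code path stays within laptop memory no matter how long the generation runs.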