
There is a reasonable argument for that (I've heard the idea go around among multiple AI engineers: once you go past a certain scale, the architecture doesn't matter much for eval results).

One of the biggest issues with testing all of this is that it takes a crap ton of GPUs to prove out any transformer alternative beyond 1B params.

For example, I'm still waiting for someone to train a 1B-14B text-based diffusion network.

Finally, if this is truly the case (and all that really matters is size + dataset), then we really should be using an architecture that is cheaper to train and run. That's what RWKV represents here.

You can even run the quantized 7B model at reasonable speed on most laptops (try the rwkv-cpp / rwkv-cpp-node projects).
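
A rough sketch of what that looks like with rwkv-cpp-node (going from memory here, so treat the class/method names, the options object, and the model filename as assumptions; check the project's README for the actual API and for how to get a quantized .bin model):

    // Sketch only: the RWKV class and completion() signature are from memory
    // and may not match the current rwkv-cpp-node API exactly.
    import { RWKV } from "rwkv-cpp-node";

    // Placeholder path: point this at an RWKV model converted to the
    // rwkv.cpp .bin format (a q4/q5 quantized 7B fits in most laptops' RAM).
    const raven = new RWKV("./RWKV-4-Raven-7B-q5_1.bin");

    // Generate a short completion on CPU.
    const res = raven.completion({
      prompt: "Q: What is RWKV?\n\nA:",
      max_tokens: 128,
      temperature: 1.0,
    });

    console.log(res.completion);

The point being: this all runs on CPU, since RWKV carries a fixed-size recurrent state per token instead of a transformer-style KV cache that grows with context length.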


