
There was no leap in research. Everything had to do with availability of compute.

Neural nets are quite old, and everyone knew they were universal function approximators. The reason models never took off was that training even a modestly sized model was very expensive. There was no realistically available hardware for it short of supercomputer clusters, which were all CPUs and thus wildly inefficient. But any researcher back then would have told you that you could figure anything out with neural nets.

Sometime in 2006, Nvidia realized that a lot of graphics compute was just generic parallel compute and released CUDA. People started using graphics cards for general computation. Then someone figured out you could actually train deep neural nets at decent speed.

Transformers weren't even that big of a leap. The paper makes it sound like some sort of novel architecture, but in essence, instead of feeding input × weights to the next layer, you compute input × matrix1, input × matrix2, and input × matrix3, and multiply the results together. And as you might guess, training it takes more hardware, because now you have to train three matrices rather than just one.
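
A minimal sketch of what those three matrices do in a single attention head, assuming NumPy; the names W_q, W_k, W_v and the toy shapes are just illustrative, and details like causal masking and multiple heads are left out:

    import numpy as np

    def attention_head(x, W_q, W_k, W_v):
        # x: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)
        Q = x @ W_q   # input x matrix1
        K = x @ W_k   # input x matrix2
        V = x @ W_v   # input x matrix3
        # multiply the results together: scaled dot-product attention
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V

    # toy usage
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                      # 4 tokens, d_model = 8
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = attention_head(x, W_q, W_k, W_v)           # shape (4, 8)

The point being: it is still just matrix multiplies, only three learned projections per layer instead of one.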

If we ever get something like an ASIC for ML, at a certain point we will be able to iterate on the architectures themselves. The optimal LLM may be a combination of CNN, RNN, and Transformer blocks, all intertwined.



> ever get something like an ASIC for ML

Is this what you're referring to?

[0] https://linearmicrosystems.com/using-asic-chips-for-artifici...


Sort of. Those are just inference chips. You wouldn't be able to iterate on the architecture with them.

In terms of the math, every single transformer can be expressed as a sequence of deep layers, so you could have an ASIC laid out in such a way that the architecture of the model depends on where you put the zeros.
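
A toy illustration of that idea, assuming NumPy; this is just a sketch of "zeros pick the wiring", not how any actual chip is laid out, and the mask patterns are invented for illustration:

    import numpy as np

    def masked_layer(x, W, mask):
        # The hardware computes the same dense product every time;
        # the zeros in `mask` decide which connections actually exist.
        return x @ (W * mask)

    rng = np.random.default_rng(0)
    d = 8
    W = rng.normal(size=(d, d))   # one generic, fully wired layer
    x = rng.normal(size=(d,))

    dense_mask = np.ones((d, d))                              # plain fully connected layer
    block_mask = np.kron(np.eye(2), np.ones((d // 2, d // 2)))  # two independent halves, like parallel branches/heads
    banded_mask = np.tril(np.triu(np.ones((d, d)), -1), 1)      # narrow band, like a local convolution-style stencil

    for name, m in [("dense", dense_mask), ("block", block_mask), ("banded", banded_mask)]:
        print(name, masked_layer(x, W, m).round(2))

Same silicon, same dense multiply; only the zero pattern changes what the layer effectively is.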



