
While the output is a single word (more precisely, a token), the internal activations are very high-dimensional and can already contain information about words that will only appear later. That information simply isn't surfaced by the final output layer. You can think of the internal feature vector as encoding the entire upcoming sentence/thought/paragraph/etc., with the last layer "projecting" it down to whatever the next word (token) has to be to continue expressing that "thought".
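
Here's a minimal sketch of that last-layer projection, with made-up sizes and numpy standing in for a real model (names like W_unembed are just illustrative):

    import numpy as np

    # Made-up sizes; real models use d_model in the thousands
    # and vocabularies of 50k+ tokens.
    d_model, vocab_size = 8, 10
    rng = np.random.default_rng(0)

    # Final hidden state for the current position: a high-dimensional
    # vector that can encode far more than just the next token.
    hidden = rng.normal(size=d_model)

    # The unembedding (output projection) matrix maps that vector down
    # to one score (logit) per vocabulary token.
    W_unembed = rng.normal(size=(d_model, vocab_size))
    logits = hidden @ W_unembed

    # Only this low-dimensional projection reaches the output; whatever
    # else the hidden state encodes about upcoming text stays internal.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    next_token = int(np.argmax(probs))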


But at some point the activations commit, with effectively 100% confidence, to the right word for the current slot. That word is output, and generation proceeds to the next one.

For a 500-token response, say, at some point the model was certain the first 25 words were the right ones, such that it will never take any of them back when it eventually computes the last 25.
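
A sketch of that append-only commitment as a greedy decoding loop, assuming a hypothetical model(tokens) callable that returns logits for the next position only:

    import numpy as np

    def generate(model, prompt_tokens, n_new):
        tokens = list(prompt_tokens)
        for _ in range(n_new):
            logits = model(tokens)                 # scores the next slot only
            tokens.append(int(np.argmax(logits)))  # commit; never revised
        return tokens

Nothing in the loop ever rewrites tokens[:-1]; each step only appends.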


This is true, but it doesn't mean the model settled on those first 25 without "considering" whether they could afterwards be continued meaningfully by a further 25. It does have some internal "lookahead" and generates text that "leads" somewhere. The rhyming example from the article is a great choice to illustrate this.
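
One way such lookahead gets tested empirically is with a linear probe: train a classifier on the hidden state at position t to predict the token at position t+k. A hedged sketch, with made-up random data standing in for real activations (a real experiment would extract states from an actual model and evaluate on held-out data):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_model, vocab = 1000, 64, 50
    hidden_states = rng.normal(size=(n, d_model))   # states at position t
    future_tokens = rng.integers(0, vocab, size=n)  # tokens at position t+k

    # Least-squares linear probe against one-hot targets.
    targets = np.eye(vocab)[future_tokens]
    W_probe, *_ = np.linalg.lstsq(hidden_states, targets, rcond=None)
    preds = (hidden_states @ W_probe).argmax(axis=1)
    # Above-chance accuracy on real held-out data would suggest the
    # state at t already encodes the word coming at t+k.
    accuracy = (preds == future_tokens).mean()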



