Yeah, it always seemed pretty wasteful to me. In every single forward pass the LLM basically starts from scratch, without any of the forward-looking plans it made the previous times, and has to figure out again what we are doing and where we are in the generation process. It's like the movie Memento: you wake up after an episode of amnesia, except you're waking up in the middle of typing out a sentence. You can look at the previously typed words, but you can't carry your future plans forward to the next word. At the next word, you (or your clone) wake up again and have to figure out from scratch what it is we're supposed to be typing.
The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain. That would basically turn the thing into a recurrent network though. And those are more difficult to train and have a host of issues. Maybe there will be a better way.
> The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain.
Hi! I lead interpretability research at Anthropic.
That's a great intuition, and the transformer architecture actually does exactly what you suggest! Activations from earlier time steps are sent forward to later time steps via attention. (This is another thing that's lost in the "models just predict the next word" framing.)
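To make that concrete, here's a minimal NumPy sketch of causal self-attention (my own toy names and shapes, not any particular library's API): row t of the output is a weighted mix of value vectors computed at positions up to t, which is exactly the "activations flowing forward" part.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (T, d) activations for T tokens; Wq/Wk/Wv: (d, d) projection matrices.
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # (T, d) each
    scores = q @ k.T / np.sqrt(d)                 # (T, T) pairwise attention scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                      # causal mask: no attending to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # row t mixes value vectors from positions <= t
```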
This has interesting practical implications -- for example, it's in some sense the deep reason costs can sometimes be reduced via "prompt caching".
I'm more of a vision person and haven't looked much into NLP transformers, but is this because the attention is masked so that each query can only look at keys/values from its own past? So when we are at token #5, token #3's query cannot attend to token #4's info? And hence the previously computed attention values and activations remain the same and can be cached, because they would come out the same in a new forward pass anyway?
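For what it's worth, here's a toy sketch of the caching idea described above (my own function and variable names, nothing from a real library): during generation you only compute the query/key/value for the newest token and append its key/value to a cache, because the causal mask guarantees the earlier entries would be recomputed identically anyway.

```python
import numpy as np

def generate_step(x_new, kv_cache, Wq, Wk, Wv):
    # x_new: (d,) activation of the newest token only.
    # kv_cache: list of (key, value) pairs from all earlier tokens.
    # The causal mask means those cached entries never change,
    # so we only do the attention work for the new token.
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    kv_cache.append((k, v))
    K = np.stack([key for key, _ in kv_cache])    # (T, d)
    V = np.stack([val for _, val in kv_cache])    # (T, d)
    scores = K @ q / np.sqrt(x_new.shape[0])      # new token attends to every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, kv_cache
```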
If you want to be precise, there are “autoregressive transformers” and “bidirectional transformers”. Bidirectional is a lot more common in vision. In language models you do see bidirectional models like BERT, but autoregressive is dominant.
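For concreteness, the only structural difference in the attention layer is the mask. A tiny sketch (toy arrays, not any specific implementation):

```python
import numpy as np

T = 4
# Autoregressive (GPT-style): token t can only attend to positions <= t.
autoregressive_mask = np.tril(np.ones((T, T), dtype=bool))
# Bidirectional (BERT-style, common in vision): every token attends to every token.
bidirectional_mask = np.ones((T, T), dtype=bool)
```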