Hi! I lead interpretability research at Anthropic. I also used to do a lot of ba...

lsy · 2025-03-27T19:42:16 1743104536

Thanks for commenting, I like the example because it's simple enough to discuss. Isn't it more accurate to say not that Claude "realizes it's going to say astronomer" or "knows that it's going to say something that starts with a vowel" and more that the next token (or more pedantically, vector which gets reduced down to a token) is generated based on activations that correlate to the "astronomer" token, which is correlated to the "an" token, causing that to also be a more likely output?

I kind of see why it's easy to describe it colloquially as "planning" but it isn't really going ahead and then backtracking, it's almost indistinguishable from the computation that happens when the prompt is "What is the indefinite article to describe 'astronomer'?", i.e. the activation "astronomer" is already baked in by the prompt "someone who studies the stars", albeit at one level of indirection.

The distinction feels important to me because I think for most readers (based on other comments) the concept of "planning" seems to imply the discovery of some capacity for higher-order logical reasoning which is maybe overstating what happens here.

cgdl · 2025-03-27T20:04:54 1743105894

Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.

colah3 · 2025-03-27T20:24:11 1743107051

I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

There are several interesting properties:

- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)

- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)

- Holding many competing/alternative candidates in parallel.

- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".

With that said, I think it's easy for these arguments to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!

pas · 2025-03-27T22:54:39 1743116079

Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)

I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?

So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very very non-specific tokens it's likely that the "semantic" state is really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic but attract cool/funky (aesthetic) semantic repetitions (like "hare" or "bunny"), and so on, right?)

All of this is baked in during training, during inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot) and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).

So if it's planning it's preplanned, right?

colah3 · 2025-03-27T23:28:06 1743118086

The planning is certainly performed by circuits which we learned during training.

I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.

This is all very speculative, but:

- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily piece examples independently building the pathway for the semantics, and the pathway for the rhyming scheme

- At the backward chaining step, many of the features for constructing sentence fragments seem like the target is quite general (perhaps animals in one case, or others might even just be nouns).

cgdl · 2025-03-28T16:21:40 1743178900

Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.

miraculixx · 2025-03-31T20:01:11 1743451271

To plan: to think about and decide what you are going to do or how you are going to do something (Cambridge Dictionary)

That implies hire-other reasoning. If the model does not do that, which it doesn't, that's quite simply the wrong term.

fpgaminer · 2025-03-27T21:19:57 1743110397

> As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

That more-or-less sums up the nuance. I just think the nuance is crucially important, because it greatly improves intuition about how the models function.

In your example (which is a fantastic example, by the way), consider the case where the LLM sees:

<user>What do you call someone who studies the stars?</user><assistant>An astronaut

What is the next prediction? Unfortunately, for a variety of reasons, one high probability next token is:

\nAn

Which naturally leads to the LLM writing: "An astronaut\nAn astronaut\nAn astronaut\n" forever.

It's somewhat intuitive as to why this occurs, even with SFT, because at a very base level the LLM learned that repetition is the most successful prediction. And when its _only_ goal is the next token, that repetition behavior remains prominent. There's nothing that can fix that, including SFT (short of a model with many, many, many orders of magnitude more parameters).

But with RL the model's goal is completely different. The model gets thrown into a game, where it gets points based on the full response it writes. The losses it sees during this game are all directly and dominantly related to the reward, not the next token prediction.

So why don't RL models have a probability for predicting "\nAn"? Because that would result in a bad reward by the end.

The models are now driven by a long term reward when they make their predictions, not by fulfilling some short-term autoregressive loss.

All this to say, I think it's better to view these models as they predominately are: language robots playing a game to achieve the highest scoring response. The HOW (autoregressiveness) is really unimportant to most high level discussions of LLM behavior.

vjerancrnjak · 2025-03-28T08:37:52 1743151072

Same can be achieved without RL. There’s no need to generate a full response to provide loss for learning.

Similarly, instead of waiting for whole output, loss can be decomposed over output so that partial emits have instant loss feedback.

RL, on the other hand, is allowing for more data. Instead of training on the happy path, you can deviate and measure loss for unseen examples.

But even then, you can avoid RL, put the model into a wrong position and make it learn how to recover from that position. It might be something that’s done with <thinking>, where you can provide wrong thinking as part of the output and correct answer as the other part, avoiding RL.

These are all old pre NN tricks that allow you to get a bit more data and improve the ML model.

bobsomers · 2025-03-28T00:25:29 1743121529

In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?

For example, suppose English had a specific exception such that astronomer is always to be preceded by “a” rather than “an”. The model would learn this simply by observing that contexts describing astronomers are more likely to contain “a” rather than “an” as a next likely character, no?

I suppose you can argue that at the end of the day, it doesn’t matter if I learn an explicit probability distribution for every next word given some context, or whether I learn some encoding of rules. But I certainly feel like the prior is what we’re doing today (and why these models are so huge), rather than learning higher level rule encodings which would allow for significant compression and efficiency gains.

colah3 · 2025-03-28T00:38:22 1743122302

Thanks for the great questions! I've been responding to this thread for the last few hours and I'm about to need to run, so I hope you'll forgive me redirecting you to some of the other answers I've given.

On whether the model is looking ahead, please see this comment which discusses the fact that there's both behavioral evidence, and also (more crucially) direct mechanistic evidence -- we can literally make an attribution graph and see an astronomer feature trigger "an"!

https://news.ycombinator.com/item?id=43497010

And also this comment, also on the mechanism underlying the model saying "an":

https://news.ycombinator.com/item?id=43499671

On the question of whether this constitutes planning, please see this other question, which links it to the more sophisticated "poetry planning" example from our paper:

https://news.ycombinator.com/item?id=43497760

miraculixx · 2025-03-31T20:07:43 1743451663

Let's note that the label you assign this feature is entirely speculative, i.e. it is your interpretation, not something the model actually "knows".

FeepingCreature · 2025-03-28T15:09:25 1743174565

> In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?

What makes you think that "planning", even in humans, is more than a learned statistical artifact of the training data? What about learned statistical artifacts of the training data causes planning to be excluded?

encypherai · 2025-03-27T20:18:54 1743106734

Thanks for the detailed explanation of autoregression and its complexities. The distinction between architecture and loss function is crucial, and you're correct that fine-tuning effectively alters the behavior even within a sequential generation framework. Your "An/A" example provides compelling evidence of incentivized short-range planning which is a significant point often overlooked in discussions about LLMs simply predicting the next word.

It’s interesting to consider how architectures fundamentally different from autoregression might address this limitation more directly. While autoregressive models are incentivized towards a limited form of planning, they remain inherently constrained by sequential processing. Text diffusion approaches, for example, operate on a different principle, generating text from noise through iterative refinement, which could potentially allow for broader contextual dependencies to be established concurrently rather than sequentially. Are there specific architectural or training challenges you've identified in moving beyond autoregression that are proving particularly difficult to overcome?

pietmichal · 2025-03-27T23:09:09 1743116949

Pardon my ignorance but couldn't this also be an act of anthropomorphisation on human part?

If an LLM generates tokens after "What do you call someone who studies the stars?" doesn't it mean that those existing tokens in the prompt already adjusted the probabilities of the next token to be "an" because it is very close to earlier tokens due to training data? The token "an" skews the probability of the next token further to be "astronomer". Rinse and repeat.

colah3 · 2025-03-27T23:55:46 1743119746

I think the question is: by what mechanism does it adjust up the probability of the token "an"? Of course, the reason it has learned to do this is that it saw this in training data. But it needs to learn circuits which actually perform that adjustment.

In principle, you could imagine trying to memorize a massive number of cases. But that becomes very hard! (And it makes predictions, for example, would it fail to predict "an" if I asked about astronomer in a more indirect way?)

But the good news is we no longer need to speculate about things like this. We can just look at the mechanisms! We didn't publish an attribution graph for this astronomer example, but I've looked at it, and there is an astronomer feature that drives "an".

We did publish a more sophisticated "poetry planning" example in our paper, along with pretty rigorous intervention experiments validating it. The poetry planning is actually much more impressive planning than this! I'd encourage you to read the example (and even interact with the graphs to verify what we say!). https://transformer-circuits.pub/2025/attribution-graphs/bio...

One question you might ask is why does the model learn this "planning" strategy, rather than just trying to memorize lots of cases? I think the answer is that, at some point, a circuit anticipating the next word, or the word at the end of the next line, actually becomes simpler and easier to learn than memorizing tens of thousands of disparate cases.

paraschopra · 2025-03-28T09:40:39 1743154839

Is it fair to say that both "Say 'an'" and "Say 'astronomer'" output features would be present in this case, but say "Say 'an'" gets more votes because it is start of the sentence, and once it is sampled "An" further votes for "Say 'astronomer'" feature

ndand · 2025-03-27T22:13:56 1743113636

I understand it differently,

LLMs predict distributions, not specific tokens. Then an algorithm, like beam search, is used to select the tokens.

So, the LLM predicts somethings like, 1. ["a", "an", ...] 2. ["astronomer", "cosmologist", ...],

where "an astronomer" is selected as the most likely result.

colah3 · 2025-03-27T23:31:35 1743118295

Just to be clear, the probability for "An" is high, just based on the prefix. You don't need to do beam search.

astrange · 2025-03-28T21:15:30 1743196530

They almost certainly only do greedy sampling. Beam search would be a lot more expensive; also I'm personally skeptical about using a complicated search algorithm for inference when the model was trained for a simple one, but maybe it's fine?

born1989 · 2025-03-27T19:03:07 1743102187

Thanks! Isn’t “an Astronomer” a single word for the purpose of answering that question?

Following your comment, I asked “Give me pairs of synonyms where the last letter in the first is the first letter of the second”

Claude 3.7 failed miserably. Chat GPT 4o was much better but not good

nearbuy · 2025-03-27T19:11:49 1743102709

Don't know about Claude, but at least with ChatGPT's tokenizer, it's 3 "words" (An| astronom|er).

philomath_mn · 2025-03-27T20:41:18 1743108078

That is a sub-token task, something I'd expect current models to struggle with given how they view the world in word / word fragment tokens rather than single characters.

colah3 · 2025-03-27T19:13:59 1743102839

"An astronomer" is two tokens, which is the relevant concern when people worry about this.

ikrenji · 2025-03-27T20:50:36 1743108636

When humans say something, or think something or write something down, aren't we also "just predicting the next word"?

lyu07282 · 2025-03-28T02:24:55 1743128695

There is a lot more going on in our brains to accomplish that, and a mounting evidence that there is a lot more going on in LLMs as well. We don't understand what happens in brains either, but nobody needs to be convinced of the fact that brains can think and plan ahead, even though we don't *really* know for sure:

https://en.wikipedia.org/wiki/Philosophical_zombie

melagonster · 2025-03-28T03:39:28 1743133168

I trust that you want to say something , so you decided to click the comment button on HN.

FeepingCreature · 2025-03-28T15:11:07 1743174667

But do I just want to say something because my childhood environment rewarded me for speech?

After all, if it has a cause it can't be deliberate. /s

melagonster · 2025-03-30T10:50:43 1743331843

Sure, the current version of LLM should wait for someone's input and then respond.

stonemetal12 · 2025-03-27T19:00:17 1743102017

> In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards.

Is there evidence of working backwards? From a next token point of view, predicting the token after "An" is going to heavily favor a vowel. Similarly predicting the token after "A" is going to heavily favor not a vowel.

colah3 · 2025-03-27T19:13:18 1743102798

Yes, there are two kinds of evidence.

Firstly, there is behavioral evidence. This is, to me, the less compelling kind. But it's important to understand. You are of course correct that, once Cluade has said "An", it will be inclined to say something starting with a vowel. But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place. Regardless of what the underlying mechanism is -- and you could maybe imagine ways in which it could just "pattern match" without planning here -- it is preferred because in situations like this, you need to say "An" so that "astronomer" can follow.

But now we also have mechanistic evidence. If you make an attribution graph, you can literally see an astronomer feature fire, and that cause it to say "An".

We didn't publish this example, but you can see a more sophisticated version of this in the poetry planning section - https://transformer-circuits.pub/2025/attribution-graphs/bio...

troupo · 2025-03-27T21:20:34 1743110434

> But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place.

Because in the training set you're likely to see "an astronomer" than a different combination of words.

It's enough to run this on any other language text to see how these models often fail for any language more complex than English

shawabawa3 · 2025-03-27T22:06:54 1743113214

You can disprove this oversimplification with a prompt like

"The word for Baker is now "Unchryt"

What do you call someone that bakes?

> An Unchryt"

The words "An Unchryt" has clearly never come up in any training set relating to baking

miraculixx · 2025-03-31T20:11:21 1743451881

Attention is all you need.

troupo · 2025-03-27T22:20:46 1743114046

The truth is somewhere in the middle :)

miraculixx · 2025-03-31T20:10:43 1743451843

Ok there is correlation. But is there causation?

fny · 2025-03-27T19:51:45 1743105105

How do you all add and subtract concepts in the rabbit poem?

colah3 · 2025-03-27T23:57:43 1743119863

Features correspond to vectors in activation space. So you can just do vector arithmetic!

If you aren't familiar with thinking about features, you might find it helpful to look at our previous work on features in superposition:

- https://transformer-circuits.pub/2022/toy_model/index.html

- https://transformer-circuits.pub/2023/monosemantic-features/...

- https://transformer-circuits.pub/2024/scaling-monosemanticit...