I struggled reading the papers. Anthropic’s white papers remind me of Stephen Wolfram: a huge pile of suggestive empirical evidence, but the claims are extremely vague - no definitions, just vibes - the empirical evidence seems selectively curated, and there’s not much effort spent building a coherent general theory.
Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?
I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.
>The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”
I'm honestly confused about what you're getting at here. It doesn't matter why Claude chose "rabbit" to plan around (it likely did so because of "carrot"); the point is that it thought about it beforehand. The "rabbit" concept is present as the model is about to write the first word of the second line, even though the word "rabbit" won't come into play until the end of the line.
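To make "the concept is present" a bit more concrete: one hand-wavy way to operationalize it (a toy sketch, not Anthropic's actual probe; `activations` and `rabbit_dir` are made-up placeholders) is to take a direction in activation space that tracks the "rabbit" feature and score every token position against it. The claim is that the score is already high at the first token of the second line, long before "rabbit" is actually written.

    import torch

    # Hypothetical placeholders: one residual-stream vector per token position,
    # plus a unit vector standing in for the "rabbit" feature. Neither comes
    # from the paper; real work would extract these from the model.
    seq_len, d_model = 20, 4096
    activations = torch.randn(seq_len, d_model)
    rabbit_dir = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

    # How strongly each position points along the "rabbit" direction.
    # "Planning" here would show up as a high score at the start of the
    # second line, not just at the token "rabbit" itself.
    scores = activations @ rabbit_dir   # shape: (seq_len,)
    print(scores)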
>later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”!
It's not supposed to rhyme. That's the point. They forced Claude to plan around a line ender that doesn't rhyme, and it did. Claude didn't choose the word "green"; Anthropic replaced the concept it was thinking ahead about with "green" and saw that the line changed accordingly.
> Here, we modified the part of Claude’s internal state that represented the "rabbit" concept. When we subtract out the "rabbit" part, and have Claude continue the line, it writes a new one ending in "habit", another sensible completion. We can also inject the concept of "green" at that point, causing Claude to write a sensible (but no-longer rhyming) line which ends in "green". This demonstrates both planning ability and adaptive flexibility—Claude can modify its approach when the intended outcome changes.
This all seems explainable via shallow next-token prediction. Why is it that subtracting the concept means the system can adapt and create a new rhyme instead of forgetting about the -bit rhyme, but overriding it with green means the system cannot adapt? Why didn't it say "green habit" or something? It seems like Anthropic is having it both ways: Claude continued to rhyme after deleting the concept, which demonstrates planning, but also Claude coherently filled in the "green" line despite it not rhyming, which... also demonstrates planning? Either that concept is "last word" or it's not! The two claims seem to be in tension, and maybe with n=2 instead of n=1 examples I would have a clearer idea of what they mean. As it stands it feels arbitrary and post hoc. More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
>Why is it that subtracting the concept means the system can adapt and create a new rhyme instead of forgetting about the -bit rhyme,
Again, the model has the first line in context and is then asked to write the second line. It is at the start of the second line that the concept they are talking about is 'born'. The point is to demonstrate that Claude thinks about what word the second line should end with and starts predicting the line based on that.
It doesn't forget about the -bit rhyme because that wouldn't make any sense: the first line ends with it, and you just asked it to write the second line. At this point the model is still choosing what word to end the second line with (even though rabbit has been suppressed), so of course it still thinks about a word that rhymes with the end of the first line.
The 'green' bit is different because this time Anthropic isn't just suppressing one option and letting the model choose from anything else; it's directly hijacking the model's first choice and forcing it to be something else. Claude didn't choose green, Anthropic did. That it still predicted a sensible line demonstrates that the concept they just hijacked is indeed responsible for determining how that line plays out.
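For what it's worth, the mechanical difference between the two interventions is easy to sketch. This is only a toy illustration under the assumption that "rabbit" and "green" behave like direction vectors in some intermediate activation; `hidden`, `rabbit_dir`, `green_dir` and the scale factor are invented for the example, not Anthropic's actual method.

    import torch

    # Hypothetical stand-ins: the activation at the position just before the
    # second line, plus unit vectors for the "rabbit" and "green" features.
    d_model = 4096
    hidden = torch.randn(d_model)
    rabbit_dir = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
    green_dir = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

    def suppress(hidden, direction):
        # "Subtracting out the rabbit part": remove the component along the
        # concept direction and leave the rest of the activation alone. The
        # model is still free to pick any other line ender (e.g. "habit").
        return hidden - (hidden @ direction) * direction

    def hijack(hidden, old_direction, new_direction, scale=5.0):
        # "Injecting green": remove the model's own choice and add a strong
        # component along a direction the experimenter picked instead.
        return suppress(hidden, old_direction) + scale * new_direction

    hidden_suppressed = suppress(hidden, rabbit_dir)           # model re-plans on its own
    hidden_hijacked = hijack(hidden, rabbit_dir, green_dir)    # the plan is forced

In the first case the model's own selection machinery still does the choosing; in the second it doesn't, which is exactly the asymmetry described above.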
>More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
They didn't rule out anything. You just didn't understand what they were saying.
>They forced Claude to plan around a line ender that doesn't rhyme, and it did. Claude didn't choose the word "green"; Anthropic replaced the concept it was thinking ahead about with "green" and saw that the line changed accordingly.
I think the confusion here comes from the extremely loaded word "concept", which doesn't really make sense here. At best, you can say that Claude planned for the next line to end with the word "rabbit", and that replacing the internal representation of that word with another word led the model to change its output.
I wonder how many more years will pass, and how many more papers Anthropic will have to release, before people realize that yes, LLMs model concepts directly, separately from the words used to name those concepts. This has been apparent for years now.
And at least in the case discussed here, this is even shown in the diagrams in the submission.
We'll all be living in a Dyson swarm around the sun as the AI eats the solar system around us and people will still be confident that it doesn't really think at all.
Agreed. They’ve discovered something, that’s for sure, but calling it “the language of thought” without concrete evidence is definitely begging the question.
Came here to say this. Their paper reeks of wishful thinking and of labeling things as what they would prefer them to be. They even note in one place that their replacement model has 50% accuracy, which is simply a fancy way of saying the model's result is pure chance and could be interpreted one way or the other. Like flipping a coin.
In reality, all that's happening is drawing samples from the probability distribution over the next token, given the tokens in the context window. That's what the model is designed to do, trained to do - and that's exactly what it does. More precisely, that is what the algorithm does, using the model weights, the input (the "prompt", tokenized), and the previously generated output, one token at a time. Unless the algorithm is started (by a human, ultimately), nothing happens. Note how entirely different that is from any living being that actually thinks.
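Just to make "one token at a time" concrete, here is a minimal sketch of that outer loop. The `model` is a placeholder assumed to return next-token logits for a sequence of token ids; it isn't any particular library's API.

    import torch

    def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0):
        # Autoregressive decoding: at each step the model gives a distribution
        # over the next token, conditioned on the prompt plus everything
        # generated so far; one token is sampled and fed back in.
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            logits = model(torch.tensor(ids))          # assumed: logits over the vocabulary
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            ids.append(next_id)
        return ids

That loop is the whole outer algorithm; the model weights only ever enter through the forward pass that produces the logits.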
All interpretation above and beyond that is speculative and all intelligence found is entirely human.
tangent: this is the second time today I've seen an HN commenter use "begging the question" with its original meaning. I'm sorry to distract with a non-helpful reply; it's just that I can't remember the last time I saw that phrase in the wild referring to a logical fallacy. Even begsthequestion.info [0] has given up the fight.
(I don't mind language evolving over time, but I also think we need to save the precious few phrases we have for describing logical fallacies)