> The models still _generate_ responses token-by-token, but they pick tokens _not_ by simply maximizing probability at each step.
This is also not how base training works. In base training the loss is computed over a whole context, which can be gigantic. It's never about just the previous token; it's about a whole response in context. The context could be an entire poem, a play, a worked solution to a programming problem, etc. So you would expect to see the same type of (apparent) higher-level planning from base-trained models, and indeed you do. You can easily verify this by downloading a base model from HF or similar and prompting it to complete a poem.
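A minimal sketch of that experiment, assuming the `transformers` library; the checkpoint name and prompt are just placeholders, any base (non-instruction-tuned) model from the hub will do:

```python
# Sketch: complete a poem with a *base* (non-instruction-tuned) model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any base checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Roses are red,\nViolets are blue,\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Tokens are sampled one at a time, yet the continuation tends to keep the
# poem's meter and rhyme going: the (apparent) planning falls out of
# conditioning on the whole context, not out of any lookahead mechanism.
output = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```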
The key differences between base and agentic models are 1) the latter behave like agents, and 2) the latter hallucinate less. But that isn't about planning (you still need planning to hallucinate something). It's more to do with post-base training specifically providing positive rewards for things which aren't hallucinations. Changing the way the reward function is computed during RL doesn't produce planning; it simply inclines the model to produce responses that are more like the RL targets.
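To make "changing the way the reward is computed" concrete, here's a toy REINFORCE-style sketch (names and the reward function are illustrative, not any lab's actual pipeline; `model`/`tokenizer` are assumed to be HF-style objects). The model still just assigns token probabilities in context; the reward only rescales how strongly each sampled response is reinforced, so nothing here adds a planning mechanism:

```python
import torch

def rl_step(model, optimizer, prompt_ids, reward_fn, tokenizer):
    # Sample a full response token-by-token, exactly as at inference time.
    response_ids = model.generate(prompt_ids, max_new_tokens=128, do_sample=True)

    # Log-probabilities of the sampled tokens given their context.
    # (For brevity this also scores the prompt tokens; a real implementation
    # would mask them out.)
    logits = model(response_ids).logits[:, :-1, :]
    targets = response_ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # reward_fn is the only thing that changes between, say, "reward fluent text"
    # and "reward grounded, non-hallucinated answers".
    reward = reward_fn(tokenizer.decode(response_ids[0]))

    # Higher-reward responses get their token log-probs pushed up; that's it.
    loss = -(reward * logprobs.sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```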
Karpathy has a good intro video on this. https://www.youtube.com/watch?v=7xTGNNLPyMI
In general the nitpicking seems weird. Yes, on a mechanical level, using a model is still about "given this context, what is the next token". No, that doesn't mean they don't plan, or have higher-level views of the overall structure of their response, or whatever.