You need to think about 1) the latent state and 2) the fact that part of the model is post-trained to bias the Markov chain towards abiding by the query, in the sense of the reward.
A way to look at it is that you effectively have 2 model "heads" inside the LLM, one which generates, one which biases/steers.
The MCMC is initialised from your prompt: the generator part samples from the language distribution it has learned, while the sharpening/filtering part biases towards continuations that are likely to have this chain end up with high reward. So the model regurgitates all the context deemed possibly relevant based on traces from the training data (including "tool use", which then injects additional context), and all those tokens shift the latent state into something that is more and more typical of your query.
Importantly, attention acts as a selector and has multiple heads, and these specialize, so (simplified) one head can maintain focus on your query and "judge" the latent state, while the rest follow that Markov chain until some subset of the generated and tool-injected tokens gives enough signal to the "answer now" gate that the model flips into "summarizing" mode, which then uses the latent state of all those tokens to actually generate the answer.
So you very much can think of it as sampling repeatedly from an MCMC with a bias, a learned stopping rule, and then having a model create the best possible combination of the traces, except that all this machinery is encoded in the same model weights, which get to reuse features between one another, for all the benefits and drawbacks that yields.
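If it helps, here's a deliberately crude toy version of that decomposition (all the names below - gen_logits, steer_logits, stop_gate, summarize - are made up for illustration; in a real LLM none of these are separable components, it's all the same weights):

    # Toy sketch of "biased chain + learned stopping rule + summarizer".
    # Every function below is an invented stand-in, not a real model component.
    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 100          # pretend vocabulary
    MAX_STEPS = 256      # budget for the "reasoning" rollout

    def gen_logits(trace):
        # stand-in for the pretrained next-token distribution
        return rng.normal(size=VOCAB)

    def steer_logits(trace, query):
        # stand-in for the post-trained bias towards tokens that tend
        # to end in high-reward completions for this query
        return 0.5 * rng.normal(size=VOCAB)

    def stop_gate(trace, query):
        # stand-in for the learned "answer now" signal
        return len(trace) > 32 and rng.random() < 0.1

    def summarize(trace, query):
        # stand-in for the final pass that turns the accumulated trace
        # / latent state into the visible answer
        return f"answer built from {len(trace)} trace tokens"

    def rollout(query):
        trace = []
        for _ in range(MAX_STEPS):
            logits = gen_logits(trace) + steer_logits(trace, query)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            trace.append(rng.choice(VOCAB, p=probs))
            if stop_gate(trace, query):
                break
        return summarize(trace, query)

    print(rollout("some query"))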
There was a paper around the time OF became a thing that showed that instead of doing CoT, you could just spend that token budget on K parallel shorter queries (by injecting something like "ok, to summarize" and "actually" to force completion) and pick the best one or take a majority vote. Since then RLHF has made longer traces more in-distribution (although there's another paper that showed, as of early 2025, you were trading reduced variance and peak performance as well as loss of edge cases for higher performance on common cases, though this might be ameliorated by now), but that's roughly how it broke down in 2024-2025.
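The parallel-K alternative looks roughly like this (sketch only; short_rollout is a hypothetical stand-in for "generate, then force an early completion", and the vote is over the final answers):

    # Spend the token budget on K short rollouts plus a majority vote,
    # instead of one long CoT trace. short_rollout is a dummy stand-in.
    from collections import Counter
    import random

    def short_rollout(query, budget, seed):
        # pretend each short, forced-completion rollout returns a candidate answer
        random.seed(seed)
        return random.choice(["42", "42", "41"])  # dummy answers

    def majority_vote(query, total_budget=1024, k=8):
        per_rollout = total_budget // k
        answers = [short_rollout(query, per_rollout, seed=i) for i in range(k)]
        answer, count = Counter(answers).most_common(1)[0]
        return answer, count / k

    print(majority_vote("what is 6 * 7?"))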
I'd encourage everyone to learn about Metropolis-Hastings Markov chain Monte Carlo and then squint at LLMs: think about what token-by-token generation of the long rollouts maps to in that framework, and consider that you can think of the stop token as a learned stopping criterion accepting (a substring of) the output.
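For reference, a minimal Metropolis-Hastings sampler looks like this (standard textbook version, nothing LLM-specific; the squinting exercise is mapping "propose a continuation" to token generation and "accept" to the learned stop/answer decision):

    # Minimal Metropolis-Hastings MCMC targeting a 1-D standard normal.
    # Propose a local move, accept with probability min(1, p(x')/p(x)).
    import math
    import random

    def log_target(x):
        return -0.5 * x * x  # unnormalized log-density of N(0, 1)

    def metropolis_hastings(n_samples=10_000, step=1.0, x0=0.0, seed=0):
        random.seed(seed)
        x, samples = x0, []
        for _ in range(n_samples):
            x_prop = x + random.gauss(0.0, step)        # symmetric proposal
            log_alpha = log_target(x_prop) - log_target(x)
            if random.random() < math.exp(min(0.0, log_alpha)):
                x = x_prop                               # accept the move
            samples.append(x)
        return samples

    samples = metropolis_hastings()
    print(sum(samples) / len(samples))  # should be near 0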
I have a tiny tiny podcast with a friend where we try to break down what parts of the hype are bullshit (muck) and which kernels of truth are there, if any. We started it partially as a place to scream into the void, partially to help the people who are anxious about AGI or otherwise being harmed by the hype. I think we have a long way to go in terms of presentation (breaking down very technical terms for an audience that is used to vague hype around "AI" is hard), but we cite our sources; maybe it'll be interesting for you to check out our shownotes.
I personally struggle with Gary Marcus critiques because whenever they are about "making AI work" they go into neurosymbolic "AI", which I have technical disagreements with, and I have _other_ arguments for the points he sometimes raises which I think are more rigorous, so it's difficult to be roughly in the same camp - but overall I'm happy someone with reach is calling BS as well.
Could you either release the dataset (raw but anonymized) for independent statistical evaluation or at least add the absolute times of each dev per task to the paper? I'm curious what the absolute times of each dev with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typer getting a big boost out of LLMs.
Also, cool work, very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
Cool, thanks a lot. Btw, I have a very tiny tiny (50 to 100 audience) podcast where we try to give context to what we call the "muck" of AI discourse (trying to ground claims in what we would call objectively observable facts/evidence, and then _separately_ giving our own biased takes). If you would be interested to come on it and chat => contact email in my profile.
I really believe in the importance of praising people and acknowledging their efforts, when they are kind and good human beings, and (to a much lesser degree) their successes.
But, and I mean this without snark: what value is your praise for what is good if I cannot trust that you will be critical of what is bad? Note that critique can be unpleasant but kind, and I don't care for "brutal honesty" (which is much more about the brutality than the honesty in most cases).
But whether it's the joint Slavic-German culture or something else, I much prefer for things to be _appropriate_, _kind_ and _earnest_ instead of just supportive or positive. Real love is despite a flaw, in full cognizance of it, not ignoring it.
Yeah, I live in Sweden and a compliment from a Swede about how I play music is completely meaningless to me. On the other hand, a compliment from my Bosnian or Croatian friends is a big deal.
> I really believe in the importance of praising people and acknowledging their efforts, when they are...
alive!
At the funeral of a controversial activist, where all the living activists sang their praise, I watched their child stand up and say "...where were you all when my dad was alive?"
I now go out of my way to tell people I admire them, if I do, while they are still here.
Not 'why do they work?' but rather 'what are they able to do, and what are they not?'
To understand why they work only requires an afternoon with an AI textbook.
What's hard is to predict the output of a machine that synthesises data from millions of books and webpages, and does so in a way alien to our own thought processes.
Check the actual paper for the types of sorts it actually got a speedup on :-) (hint: a few percentage points on larger n, similar to what PGO might find; the big speedup is for n around 8 or so, where it basically enumerated and found a sorting network)
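For context, a sorting network for small n is just a fixed sequence of compare-and-swap steps that is the same for every input; here's a hand-written 3-input example (a textbook network, not the one from the paper):

    # A fixed 3-input sorting network: three compare-and-swap steps in a
    # fixed order, independent of the data. Textbook example, not AlphaDev's.
    from itertools import permutations

    def cswap(v, i, j):
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]

    def sort3(v):
        cswap(v, 0, 1)
        cswap(v, 1, 2)
        cswap(v, 0, 1)
        return v

    # exhaustively verify it sorts every permutation of three elements
    assert all(sort3(list(p)) == [0, 1, 2] for p in permutations([0, 1, 2]))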
Nah, it's much simpler: the models aren't reliably able to recall the correct rule from memory - it's in the training set for sure.
This is another specialized synthetic data generation pipeline for a curriculum for one particular algorithm cluster to be encoded into the weights, no more, no less. They even mention quality control still being important.
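Conceptually such a pipeline is just a loop like this (hypothetical sketch with invented names - generate_problem, solve_with_reference, passes_quality_check - and sorting as a toy stand-in for the target algorithm family):

    # Hypothetical sketch of a synthetic-data curriculum for one algorithm
    # family: generate instances of increasing difficulty, produce reference
    # solutions, and keep only examples that pass a quality filter.
    import random

    def generate_problem(difficulty, rng):
        n = rng.randint(2, 2 + difficulty)   # bigger instances later in the curriculum
        return [rng.randint(0, 100) for _ in range(n)]

    def solve_with_reference(problem):
        return sorted(problem)                # ground-truth "teacher" solver

    def passes_quality_check(problem, solution):
        return len(set(problem)) > 1          # toy filter: drop degenerate examples

    def build_dataset(levels=5, per_level=1000, seed=0):
        rng = random.Random(seed)
        data = []
        for difficulty in range(1, levels + 1):
            for _ in range(per_level):
                p = generate_problem(difficulty, rng)
                s = solve_with_reference(p)
                if passes_quality_check(p, s):
                    data.append({"difficulty": difficulty, "input": p, "target": s})
        return data

    print(len(build_dataset()))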
My comment was half in gentle jest, but you have really found a relevant example there - thanks for sharing. I guess I have a potential PhD project if I ever want to get into marine biology: "The Dark Side of the Fish: How Crafty Parasites Outmaneuver High-Tech Pest Control"
I’ve noticed that selective pressure on shrimp in an aquarium leads to colour morphs very, very quickly. I bet lice are the same.
I’ve had tanks where shrimp will match their surroundings quite closely within a year or so. This would be due to some micro predator being present and picking off any babies that are easy to see. Shrimp have babies frequently and the only ones that survive in those conditions are able to blend in really well. Every 2.5 months or so a new generation becomes sexually mature and has dozens of babies every 8 weeks or so.
It’s a fun hobby because you can actually develop your own morphs in a matter of only years if you want to.
So it’s not necessarily evolution, but pigments developing (or not) due to environmental pressures.