People are drawing erroneous conclusions from this.
My read is that the paper demonstrates that, for a particular model (and the problems examined with it), giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger models to handle more complex tasks. (NB: humans trend similarly.)
My concern is that people are extrapolating from this to conclusions about LLMs generally, and this is not warranted.
The only part of this I find even surprising is the abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problems. (Again, though, maybe you can say humans are the same here. You can overthink things.)
You can absolutely extrapolate the results, because what this shows is that even when "reasoning" these models are still fundamentally repeating in-sample patterns, and that they collapse when faced with novel reasoning tasks above a small complexity threshold.
That is not a model-specific claim; it's a claim about the nature of LLMs.
For your argument to be true, there would need to be a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.
The authors don't say anything like this that I can see. Their conclusion specifically identifies these as weaknesses of current frontier models.
Furthermore, we have clearly seen increases in reasoning from previous frontier models to current frontier models.
If the authors could show, or did show, that both previous-generation and current-generation frontier models hit a wall at similar complexity, that would be something, but AFAIK they do not.
I guess the authors are making an important point (one that challenges the current belief and trend in AI): adding reasoning or thinking to a model (regardless of the architecture or generation) doesn’t always lead to a net gain. In fact, once you factor in compute costs and answer quality across problems of varying complexity, the overall benefit can sometimes turn out to be negative.
Human brains work the same way. Some of us are just better at analogy. I’ve worked with plenty of people who were unable to transfer knowledge of one area to another, identical area with different terms.