You can absolutely extrapolate the results, because what this shows is that even when "reasoning", these models are still fundamentally repeating in-sample patterns, and that they collapse when faced with novel reasoning tasks above a small complexity threshold.
That is not a model-specific claim; it's a claim about the nature of LLMs.
For your argument to hold, there would need to be a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.
I guess the authors are making an important point (one that challenges the current belief and trend in AI): adding reasoning or thinking to a model (regardless of architecture or generation) doesn't always lead to a net gain. In fact, once you factor in compute costs and answer quality across problems of varying complexity, the overall benefit can sometimes turn out to be negative.
The authors don't say anything like this that I can see. Their conclusion specifically identifies these as weaknesses of current frontier models.
Furthermore, we have clearly seen increases in reasoning ability from previous frontier models to current frontier models.
If the authors could show (or did show) that both previous-generation and current-generation frontier models hit a wall at similar complexity, that would be something; AFAIK they do not.
Human brains work the same way. Some of us are just better at analogy. I've worked with plenty of people who were unable to transfer knowledge from one area to another, identical area that merely used different terms.