In Figure 1 (bottom right) they show that the correct answer is found later in the response as the complexity goes higher.
In the description they even state that in incorrect responses the LRM often focuses on a wrong answer early and then runs out of tokens before being able to self-correct.
This seems obvious and indicates that it's simply a matter of scaling (a bigger token budget would lead to better performance on more complex tasks). Am I missing something?