The point they draw is that the results are non sequiturs. Reasoning models produce chains of thought (and are sometimes even correct in the chain of thought) and still produce a wrong, logically inconsistent answer. The extreme example of this is giving the model a step-by-step guide for completing the Towers program, and it still being unable to produce answers consistent with the provided plan. So not only is the model unable to produce a robust chain of thought on its own; even when that chain is correct and explicit, it cannot turn the information into a reasoned response.
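Assuming the "Towers" task here is the standard Tower of Hanoi setup, here is a minimal sketch (not the paper's actual harness) of what "consistent with the provided plan" could mean in practice: generate the canonical recursive move list, then check whether a model-produced move sequence is legal and actually solves the puzzle. All names below (`solve_hanoi`, `follows_plan`) are illustrative.

```python
def solve_hanoi(n, src=0, aux=1, dst=2):
    """Canonical recursive plan: a list of (disk, from_peg, to_peg) moves."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, dst, aux)
            + [(n, src, dst)]
            + solve_hanoi(n - 1, aux, src, dst))

def follows_plan(moves, n):
    """Check that a move sequence is legal and ends with all disks on the target peg."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # moved a disk that isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # placed a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))

if __name__ == "__main__":
    n = 3
    plan = solve_hanoi(n)
    print(follows_plan(plan, n))       # True: the canonical plan solves the puzzle
    print(follows_plan(plan[:-1], n))  # False: a truncated answer does not
```

The striking failure mode the authors describe is the gap this check makes concrete: the model can be handed something equivalent to `solve_hanoi` in the prompt and still emit a move sequence for which a check like `follows_plan` returns False.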