The goal isn't to assess the LLM's capability at solving any of those problems. The point isn't how good they are at blocks-world puzzles.
The point is to construct non-circular ways of quantifying model performance in reasoning. That the LLM has access to prior exemplars of any given problem is exactly the obstacle to establishing performance in reasoning, as opposed to historical synthesis.
Towers of Hanoi IS an algorithmic problem. It's a high-school/college-level problem when designing algorithms, and probably kid-level when solved intuitively, heuristically, or by brute force for a few disks (i.e. as when playing Mass Effect 1 or similar games that embed it as a minigame*).
The problems themselves aren't particularly interesting, I suppose. The interesting part is how the complexity of each problem scales as a function of the input size (e.g. the number of disks in the Tower of Hanoi).
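For concreteness, here's a minimal Python sketch (the move format and function names are my own choice, not from any benchmark): the classic recursive solution fits in a few lines, yet the minimum number of moves grows exponentially, as 2^n − 1 for n disks — which is exactly the kind of scaling that lets you test a model well past any memorized instance.

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Recursively solve Tower of Hanoi; return the list of (from, to) moves."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)  # move the top n-1 disks out of the way
    moves.append((src, dst))            # move the largest disk to its target
    hanoi(n - 1, aux, dst, src, moves)  # stack the n-1 disks back on top
    return moves

# The optimal solution length is 2^n - 1, so difficulty scales exponentially in n.
for n in range(1, 8):
    assert len(hanoi(n)) == 2**n - 1
```

Because the optimal move count is known in closed form, you can score a model's output exactly at any n without needing the instance to have appeared in training data.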