This is a discussion of the limits of the planning capabilities of a transformer model - specifically a transformer LLM.
I claim that a transformer architecture fundamentally cannot solve the following language task:
Generate three integers, separated by commas, and then a stop symbol. Don't generate anything else. The first integer is >500 digits long, and the other two integers are prime numbers, each >250 digits long, whose product is the first integer.
Of course it is difficult even to find such large prime numbers, but this is conceptually solvable given a very smart transformer. The problem is that a transformer cannot "forward-plan". As there is no hidden state carried between steps, no "memory" or "intention" about why a specific token was chosen can be passed to the future self.
To generate the first number, it is algorithmically only feasible to think of two prime numbers and multiply them. But when it is time to generate the second and third numbers, the transformer has forgotten which factors it chose. To solve the problem it has to factor the first number, which is computationally infeasible.
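To make the asymmetry concrete, here is a minimal Python sketch (assuming sympy is available; the digit bounds just mirror the task statement) of how trivially a program with ordinary state, i.e. one that can carry an "intention" forward, solves the task:

```python
# Minimal sketch: with ordinary program state the task is trivial, because
# the program remembers which primes it multiplied. Assumes sympy is installed.
from sympy import randprime

# The "intention": pick two primes with more than 250 digits each and keep them.
p = randprime(10**250, 10**251)
q = randprime(10**250, 10**251)

# Required output: the >500-digit composite first, then its two prime factors.
print(f"{p * q},{p},{q},STOP")

# A generator that had already emitted p*q without remembering p and q would
# instead have to factor a >500-digit semiprime, which is infeasible.
```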
By the above reasoning, one can say that transformers can only solve tasks that a row-of-humans model can solve.
A row-of-humans is a row of people who act as a language model. In sequence, each person says exactly one word after having heard all the previously spoken words.
This row-of-humans could similarly not solve the above task.
There are many other examples of such tasks that require active "forward-planning", i.e. telling your future self what your intention was when you took an action.
A row-of-humans cannot generate a text backwards. Well, of course they can, but my point is that it is qualitatively much more computationally difficult for a row-of-people than for an individual.
These issues can be resolved by prompt-engineering chain-of-thought behavior, i.e. using your output to take notes, store memory, and thereby pass information to yourself in the future.
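As an illustration, here is a hedged sketch of that idea, with a hypothetical `llm` completion function standing in for the model; the note is stripped afterwards so only the required output remains:

```python
# Sketch of the chain-of-thought workaround: let the model write its
# "intention" into the visible output first, then strip it away.
# `llm` is a hypothetical text-completion function, not a real API.

SCRATCHPAD_PROMPT = (
    "First write NOTE: followed by two primes, each >250 digits, of your choice.\n"
    "Then write ANSWER: followed by their product, the first prime, and the "
    "second prime (comma-separated), and STOP."
)

def solve_with_notes(llm) -> str:
    completion = llm(SCRATCHPAD_PROMPT)
    # The NOTE line carried the state forward; only the ANSWER part is kept.
    return completion.split("ANSWER:", 1)[1].strip()
```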
This is why transformers do very badly when additional constraints on the form of their output (like using the letter "e" only 10 times) are introduced on top of solving a task. I believe this is part of why an LLM is poor at generating sentences that end on a specific letter.
Another example is that generating a mystery novel is fundamentally more difficult for a row-of-people than for an individual. Say each person writes a chapter and suppose the first chapter sets out an ingenious mystery. The other people have to find a good resolution for the mystery, which is fundamentally harder than coming up with a resolution first and constructing a mystery around it.
This is basically how billion dollar franchises nowadays end up with botched stories lol.
Now, a row-of-people can still solve all of these tasks, but it has to act differently from an individual. It has to either use its output for memory and take notes, or in some cases use multi-agent reasoning (you can construct prisoner-hat puzzles for transformers).
While you could conceptually prompt-engineer your transformer to solve these problems, you would ideally want the transformer to prompt-engineer itself, i.e. come up on its own with an algorithm that a row-of-people can use to solve a problem, in situations where this is fundamentally different from the algorithm an individual would use. But such strategies are not included in the data we give transformers.
I could imagine a dataset of recorded row-of-people data where each language (completion) task is solved with row-of-people strategies, but a transformer is trained on data produced by individual humans and hence will not naturally use these strategies.
This is the first instance I have seen so far of a class of problems that are fundamentally difficult for a transformer, and hence I think it is important.
This is probably where your misconception lies: "Transformers are RNNs" https://arxiv.org/pdf/2006.16236.pdf (see section 3.4).
Even though an inner state is not specified explicitly like in an LSTM, for a transformer it's possible to view the inner state as implicitly defined by the inner features (deterministically derived from the layers' weights) applied to the past inputs.
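For concreteness, here is a rough single-head NumPy sketch (not the paper's implementation) of the linear-attention recurrence that motivates this view: the running sums S and z act as a hidden state carried from one token to the next:

```python
import numpy as np

def feature_map(x):
    # phi(x) = elu(x) + 1, the feature map used in the paper
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V):
    """RNN-style evaluation of causal linear attention.

    Q, K have shape (seq_len, d_k); V has shape (seq_len, d_v).
    S and z together play the role of the hidden state.
    """
    S = np.zeros((Q.shape[1], V.shape[1]))  # running sum of phi(k_j) v_j^T
    z = np.zeros(Q.shape[1])                # running sum of phi(k_j)
    out = []
    for q, k, v in zip(feature_map(Q), feature_map(K), V):
        S = S + np.outer(k, v)  # state update for this step
        z = z + k
        out.append(q @ S / (q @ z + 1e-6))
    return np.stack(out)
```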
Therefore transformers can learn to do the task by learning to commit a, b, and a*b to this inner state and write them out in the order a*b, a, b. (Though to train it you'll probably have to give it an initial random context so that it can produce different sequences: from this source of entropy it will learn to produce two hidden variables, which it will use to produce a, b, and a*b.)
I concede that it will not be an easy task to learn in this way, and that it's probably easier to train a chain-of-thought model where you allow it to use its output to remember the state (by writing it down), instead of having to memorize it in its weights. So if you train it to produce a, b, a*b, a, b, and then another transformer to extract the last three outputs a*b, a, b, it becomes very easy.
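A minimal sketch of that second, extraction stage (which doesn't even need to be a transformer), with small numbers standing in for the >250-digit primes:

```python
def extract_answer(chain_of_thought_output: str) -> str:
    # Keep only the last three comma-separated fields: a*b, a, b.
    fields = [f.strip() for f in chain_of_thought_output.split(",")]
    return ",".join(fields[-3:])

# The first model used its own output as memory: "a,b,a*b,a,b".
assert extract_answer("3,5,15,3,5") == "15,3,5"
```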