What a transformer can NOT do
14 points by golol on Jan 19, 2023 | 16 comments
This is a discussion of the limits of the planning capabilities of a transformer model - specifically a transformer LLM.

I claim that a transformer architecture fundamentally cannot solve the following language task:

Generate three integers, separated by commas, and then a stop symbol. Don't generate anything else. The first integer is >500 digits long, and the other two are prime numbers, each >250 digits long, whose product is the first integer.

Of course it is difficult to even find such large prime numbers, but this is conceptually solvable given a very smart transformer. The problem is that a transformer cannot "forward-plan". As there is no hidden state, no "memory" or "intention" about why a specific token was chosen can be passed to the future self. To generate the first number, the only algorithmically feasible approach is to think of two prime numbers and multiply them. But when it is time to generate the second and third numbers, the transformer has forgotten which factors it chose. To solve the problem it has to factor the first number, which is computationally infeasible.
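A rough sketch of the two regimes (in Python, assuming sympy; an illustration of the argument, not an actual transformer): a process with hidden state can carry a and b across the whole generation, while a process that only sees the text it has already emitted has to factor.

  import sympy

  def stateful_generate():
      # A process with hidden state: a and b stay "in mind" while ab is emitted,
      # so writing the second and third numbers is trivial.
      a = sympy.randprime(10**250, 10**251)
      b = sympy.randprime(10**250, 10**251)
      return f"{a*b},{a},{b}STOP"

  def stateless_continue(emitted):
      # A process that only sees the text emitted so far: once "ab," is written,
      # recovering a and b means factoring a >500-digit number.
      c = int(emitted.rstrip(","))
      a, b = sorted(sympy.factorint(c))  # computationally infeasible at this size
      return f"{a},{b}STOP"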

By the above reasoning, one can say that transformers can only solve tasks that a row-of-humans model can solve.

A row-of-humans is a row of people who act as a language model. In sequence, each person says exactly one word after having heard all the previously spoken words.

This row-of-humans could similarly not solve the above task.

There are many other examples of such tasks that require active "forward-planning", i.e. telling your future self what your intention was when you took an action.

A row-of-humans cannot generate a text backwards. Well, of course they can, but my point is that it is computationally qualitatively much more difficult for a row-of-people than for an individual.

These issues can be resolved by prompt-engineering chain-of-thought behavior, i.e. using your output to take notes, store memory, and thereby pass information to yourself in the future. This is also why transformers do very badly when additional constraints on the form of their output (like "use 'e' only 10 times") are introduced on top of solving a task: such form constraints leave no room to use the output as a scratchpad. I believe this is part of why an LLM cannot reliably generate sentences that end on a specific letter.
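For concreteness, a minimal sketch (assuming sympy) of what a chain-of-thought-style target for the prime-product task could look like: the notes are written first, so every later token only needs to read back what is already on the page.

  import sympy

  def scratchpad_target():
      # Order the output so nothing has to be planned before it is written down:
      # the factors appear as notes first, the required answer comes last.
      a = sympy.randprime(10**250, 10**251)
      b = sympy.randprime(10**250, 10**251)
      return f"NOTES: a={a} b={b}\nANSWER: {a*b},{a},{b}STOP"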

Another example: generating a mystery novel is fundamentally more difficult for a row-of-people than for an individual. Say each person writes a chapter, and suppose the first chapter sets out an ingenious mystery. The other people then have to find a good resolution for that mystery, which is fundamentally harder than first coming up with a resolution and then constructing the mystery around it. This is basically how billion-dollar franchises nowadays end up with botched stories lol.

Now, a row-of-people can still solve all of these tasks, but it has to act differently from an individual. It has to either use its output for memory and take notes, or in some cases use multi-agent reasoning (you can construct prisoner-hat puzzles for transformers).

While you could conceptually prompt-engineer your transformer to solve these problems, you will ideally want the transformer to prompt-engineer itself, i.e. come up on its own with an algorithm that a row-of-people can use to solve a problem in situations where this is fundamentally different from the algorithm an individual would use. But this is not included in the data we give transformers.

I could imagine a dataset recorded from a row-of-people, where each language (completion) task is solved with row-of-people strategies, but a transformer is trained on human data and hence will not naturally use these strategies.

This is the first instance I have seen so far of a class of problems that are fundamentally difficult for a transformer, and hence I think it is important.




>As there is no hidden state, no "memory" or "intention" can be passed to the future

This is probably your misconception: "Transformers are RNNs" https://arxiv.org/pdf/2006.16236.pdf (see Section 3.4)

Even though an inner state is not specified explicitly like in an LSTM, for a transformer it's possible to view the inner state as implicitly defined by the inner features (deterministically derived from the layer weights) applied to the past inputs.

Therefore transformers can learn to do the task by learning to write a, b, and ab into this inner state and emit them in the order ab,a,b. (Though to train it you'll probably have to give it an initial random context so it can produce different sequences: from this source of entropy it will learn to produce two hidden variables, which it will use to produce a, b, and ab.)

I concede that it will not be an easy task to learn in this way, and that it's probably easier to train a chain-of-thought model where you allow it to use its output to remember the state (by writing it down), instead of having to memorize it in its weights. So if you train it to produce a,b,ab,a,b and then another transformer to extract the last 3 outputs ab,a,b, it becomes very easy.
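A sketch of that second extraction stage (a plain string operation standing in for the second transformer):

  def extract_answer(sequence):
      # Input is the chain-of-thought sequence a,b,ab,a,b;
      # keep only the last three fields, i.e. ab,a,b.
      parts = sequence.split(",")
      ab, a, b = parts[-3], parts[-2], parts[-1]
      return f"{ab},{a},{b}"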


This is very surprising to me and I don't completely understand it.

Let me ask: does this crucially rely on deterministic behavior of the transformer?

The only way I can imagine what you say is that the transformer can see the past inputs and, if it is deterministic, can know what (internal) chain of thought it must have had, and hence what its plan (the factorization) would have been when it generated the output (the first number).

In this case, I have the impression that any amount of non-determinism, i.e. randomness, breaks this mechanism, and hence transformers with noise are weaker than RNNs with noise. Would you agree?

Furthermore, if I understand correctly, it does seem like it would be strange for a model to learn to use its knowledge of its own deterministic behavior to predict what the only strategy it could have chosen must have been. I just can't imagine a transformer solving the prime product generation task with this method, even if generating large primes and multiplying them was easy for it. (And I can easily imagine an RNN solving the task.)

It reminds me of solutions to infinite prisoner hat problems where you agree on some abstract well-ordering of possibilities...


No, it doesn't crucially rely on deterministic behavior (you can add some dropout layers if you want).

A transformer does see the past inputs. But it won't really use them, except to "serialize the answer" that it forms on its first time-step: from the random initial context vector, splitting that space into regions, it will generate two "numbers" as features, multiply them as features, and from this feature vector containing the answer it will just serialize the output. At no point will the network ever factorize anything.

As an exercise you can probably manually specify the weights that will do exactly the task I describe, by considering each feature as an individual bit and specifying the layer weights so that they behave as a multiplier logic circuit (with OR-gates, AND-gates and NOT-gates...).

(Therefore the architecture can at least express the solution even though the network will probably have a really hard time reaching it)
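A toy version of that exercise in Python (bits instead of features, plain functions instead of layer weights; just an illustration that multiplication reduces to AND/OR/NOT):

  def AND(x, y): return x & y
  def OR(x, y):  return x | y
  def NOT(x):    return 1 - x
  def XOR(x, y): return OR(AND(x, NOT(y)), AND(NOT(x), y))

  def full_adder(x, y, carry):
      s = XOR(XOR(x, y), carry)
      c = OR(AND(x, y), AND(carry, XOR(x, y)))
      return s, c

  def add_bits(a_bits, b_bits):
      # Ripple-carry addition over little-endian bit lists of equal length.
      out, carry = [], 0
      for x, y in zip(a_bits, b_bits):
          s, carry = full_adder(x, y, carry)
          out.append(s)
      out.append(carry)
      return out

  def multiply_bits(a_bits, b_bits):
      # Shift-and-add: AND each bit of b against a, then accumulate the partial products.
      acc = [0] * (len(a_bits) + len(b_bits))
      for i, bit in enumerate(b_bits):
          partial = [0] * i + [AND(bit, x) for x in a_bits]
          partial += [0] * (len(acc) - len(partial))
          acc = add_bits(acc, partial)[:len(acc)]
      return acc

  # multiply_bits([1, 1], [1, 0, 1]) -> bits of 15 (little-endian), i.e. 3 * 5.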

Alternatively, a transformer is capable of exactly over-fitting a dataset: so if you give it a dataset of ab,a,b examples it is perfectly capable of memorizing them exactly, and it will be able to produce the output you want even though it will not have understood anything.

Alternatively, if you want the network to implicitly discover chain-of-thought reasoning, then you will have to train the transformer to produce ????????ab,a,b. This is a much harder task to learn because the gradient has to flow across many time-steps and through discrete sampling (instead of just predicting by imitation from the known previous characters). This is a reinforcement learning problem that has to deal with credit assignment and a long time horizon. You can use a decision transformer. Or you can use another, larger LLM that has access to the task description to help sample intelligent ???????, train a standard transformer on these sampled inputs, define a reward model, and build datasets of increasing reward.

Even in this reinforcement learning setting, you can reach the optimal state without having to "plan". You can learn an optimal controller that plans implicitly by just solving the Bellman equation. Such a controller lives in the present and does not anticipate future outcomes. But you can also build controllers like Model Predictive Control that, at each time-step, make a plan, project it forward, and solve for the best next move, though that is much more computationally expensive.


Hmm, I would love for someone else to give their opinion because I find it very interesting but don't quite understand it yet. The way I see it, at the very beginning the transformer has a large number of choices for numbers a, b that allow it to solve the problem. If there is randomness present then it will (pseudo-)randomly choose a pair a, b with the intention to write ab,a,b.

After writing the first digit of ab, as I understand it, it wants to recover its features by doing the same operations as before on the past sequence. But as the computation of the features is non-deterministic, it can't arrive at the same pair a, b.

Let me try to specify a more difficult task for a transformer with randomness: I want you to generate exactly three numbers in the form c,a,b where a and b are prime numbers with 250-300 digits and c=ab. I want these numbers to be randomly chosen with a distribution so that the range of primes with 250-300 digits is approximately uniformly covered.

Suppose now a transformer has uniformly picked a and b and generated the first digit of ab. Let's in fact say it has already generated all the digits of ab. If the transformer has weights that make it now successfully print a, b (say with probability > 0.99), then you have constructed a method of factoring products of prime numbers in the 250-300 digit range, i.e. you just initialize the context window with the desired 500-600 digit number ab to be factorized, and let the transformer do its work.

I.e. such a transformer has to be computationally powerful enough to factorize large prime products with probability >0.99.

On the other hand an RNN or a human both with randomness can solve this task without having to be computationally powerful enough to do the factorization.
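The claimed reduction in code form (a sketch; model_continue is a hypothetical interface that returns the model's completion "a,b" given a context):

  import sympy

  def factor_with_claimed_transformer(c, model_continue, tries=10):
      # If a transformer really completed "c," with its factors "a,b" with
      # probability > 0.99, this loop would be a factoring method for
      # 500-600 digit products of two primes.
      for _ in range(tries):
          completion = model_continue(f"{c},")  # hypothetical call, returns "a,b"
          a_str, b_str = completion.split(",")[:2]
          a, b = int(a_str), int(b_str)
          if a * b == c and sympy.isprime(a) and sympy.isprime(b):
              return a, b
      return None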


>you just initialize the context window with the desired 500-600 digit number ab to be factorized

You also have to initialize the initial random context (before the prompt, aka the initial state) so that it is correctly correlated with your ab prompt, since that is what the model used to generate this number ab. For example, give it a feature vector corresponding to a, b, and ab written in binary (in the way the network has learned to encode them). It won't ever learn to invert the one-way function ab -> a,b from only being shown ab.

In practice, the learning signal will be quite weak, because it only arrives once the model has seen ab,a,b, i.e. at the last character, and has to be back-propagated through to the initial time-step. (In training, all the previous character-level predictions will amount to noise that it will have to learn to model first in order to ignore ("Benford's law" distortion??). Or you can just give a weight of zero to these intermediate character predictions that only add noise.)

About determinism vs. randomness: the task is fundamentally deterministic as it has a single answer, so noise won't help, but a non-deterministic network can and will learn to produce deterministic output without problem. In fact, if at any point the network outputs a wrong digit, it won't be able to recover (unless you give it some character like a backspace, in which case, each time it outputs a digit not corresponding to the initial state it wants to write, it will output a backspace as the next character and try its luck again), and the answer will be wrong. But the output will still look something like ab,a,b where a and b are real primes with some wrong digits (as if they had been corrupted by a noisy channel).


You can pass context in as another input, as in encoder->decoder cross-attention models. People are also working on data storage and retrieval with transformers (so you would query out to load in different attention vectors).

So you can just view that as a memory bank with paging. You do however need to teach the network the task you want to solve. So you'd have to train it on the math part.
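A toy sketch of that "memory bank" view (an ordinary dict standing in for retrieval of stored vectors; assuming sympy for the primes):

  import sympy

  memory_bank = {}

  def generate_product():
      # Choose the factors, write them to external memory keyed by their
      # product, and emit only the product.
      a = sympy.randprime(10**250, 10**251)
      b = sympy.randprime(10**250, 10**251)
      memory_bank[a * b] = (a, b)
      return a * b

  def generate_factors(c):
      # Later steps retrieve the factors instead of re-deriving (factoring) them.
      return memory_bank[c]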


Kinda depends on how you define a transformer solving the problem. I feel like you could fine-tune a transformer in the style of https://www.ai21.com/blog/jurassic-x-crossing-the-neuro-symb... to produce steps to solve it, i.e.

    "<initial prompt>" =>
    1. [LLM generate] write a python program to return an integer >500 digits with two different prime factors with >250 digits and return all three
    2. [execute code generated by 1 and return result]
ChatGPT did actually generate code that worked when I manually put in step 1; it'll be interesting to see what systems people chain together.


This was the Python code outputted by ChatGPT when I pasted in the original problem:

  import random
  import sympy
  
  def generate_large_int(num_digits):
      return int("".join(str(random.randint(0,9)) for _ in range(num_digits)))
  
  # Generate the first large integer
  first_int = generate_large_int(500)
  
  # Generate two prime factors of the first integer
  while True:
      factor1 = sympy.randprime(10**250, 10**251)
      factor2 = sympy.randprime(10**250, 10**251)
      if first_int % factor1 == 0 and first_int % factor2 == 0:
          break
  
  print(f"{first_int},{factor1},{factor2}","STOP")


Yeah to get the working result I had to reword the problem a little. In my post I kinda assume you could fine tune a model to do that. It is a little cheap though.
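For reference, a reworded version that does run (a sketch assuming sympy, not the exact code ChatGPT produced): pick the primes first, then multiply, so nothing ever has to be factored.

  import sympy

  # Choosing the factors first makes the product >500 digits by construction.
  factor1 = sympy.randprime(10**250, 10**251)
  factor2 = sympy.randprime(10**250, 10**251)
  first_int = factor1 * factor2

  print(f"{first_int},{factor1},{factor2}", "STOP")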


By a transformer I essentially mean a model (probably LLM) that operates on a context-window of tokens by appending a new token and then shifting.

If I understand correctly, the solution you give would be an example of what I mean by using prompt engineering to solve the problem. It is like building a meta-model using a transformer.


The individual human in your comparison is arguably cheating, as they are probably vocalising (or otherwise expressing, for people without an inner monologue) several thoughts inside their head before actually deciding to say anything. Conversely, if you allow a transformer to have an "inner monologue" too (a token stream it can build up without it counting as output), I don't see why it shouldn't be able to solve these types of problems.

A human connected to one of those EEG devices that can predict your rock-paper-scissors hand before you consciously know it and subjected to the constraint that they must not output (as detected by the device) anything but some result would appear rather limited, too.


Yes, exactly. The issues only arise if one doesn't allow the transformer the freedom to use the output for chain of thought. Furthermore, we can prompt-engineer a desired chain-of-thought behavior, but I am afraid that it would be difficult to have the transformer prompt-engineer itself.


As a sidenote, and somewhat related: suppose we train BackwardsGPT on the same sequences of tokens as GPT but reversed. The question is: what would this model be good at? Because it would be very bad at most things. But some tasks might be easier. By understanding what BackwardsGPT can do, we can understand what GPT naturally can't do.
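A minimal sketch of how the BackwardsGPT training data could be built (the token lists are purely illustrative):

  def reversed_corpus(token_sequences):
      # BackwardsGPT trains on every sequence with its token order flipped,
      # so "predict the next token" becomes "predict the preceding token".
      return [list(reversed(seq)) for seq in token_sequences]

  # e.g. the forward sequence ["ab", ",", "a", ",", "b"] becomes
  # ["b", ",", "a", ",", "ab"]: factors before product, which is the easy direction.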


Maybe not super relevant, but people used to do that for translation with RNNs. They reversed the input so that the hidden states at the end would be close to what was needed to start translating. Nobody really did it with enc-dec transformers, but it'd be a funny thought to pretrain a dec-only transformer with and without it to see if it helped.


Attention is Turing complete


Here I am not talking about computational power in this "absolute" sense but in the sense that a model A is fundamentally worse at doing a task than a model B if model A requires factorizing products of large primes to solve it while model B doesn't.

So when I say "possible" or "computationally feasible" I mean that the orders of complexity of various tasks might be unnecessarily high without a hidden state. One can then go on to construct examples where this really makes the difference between "practically possible" and "practically impossible for any alien computer of the size of earth".



