This paper is misusing the term "reasoning" in my opinion.
At no point does the LLM know that 5+6 = 11, and if asked to solve a problem in which 5+6 was an implicit component of the solution but not explicitly present in the text, it would be completely lost.
What is a reasoning task we could give an LLM that would demonstrate that it actually is not reasoning? It seems like that should be easy to construct, since a very simple task outside its training set should make it fail utterly, but I have yet to witness one.
1a. generate two numbers using: (random() % BIGNUM)
1b. ask the LLM to multiply them together
Any human who has learned multiplication can do this. AFAIU, LLMs cannot unless the computation exists within the training set. They have zero arithmetic reasoning capability.
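For anyone who wants to try this themselves, here is a minimal Python sketch of the 1a/1b test (the choice of BIGNUM and the prompt wording are arbitrary, my own):

    # Sketch of the 1a/1b test: generate two random numbers, hand the prompt to the LLM,
    # and check its answer against the product computed here.
    import random

    BIGNUM = 10**10  # arbitrary bound; the original just says BIGNUM
    a = random.randrange(BIGNUM)
    b = random.randrange(BIGNUM)
    print(f"Prompt for the LLM: what is {a} multiplied by {b}?")
    print(f"Correct answer, for checking: {a * b}")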
I just asked ChatGPT: "19191920 multiplied by 10292111772"
It said:
--------------
To multiply 19191920 by 10292111772, you can use the standard long
multiplication method as follows:
19191920
x 10292111772
-------------
19191920000 (the product of 19191920 and 1)
153535360000 (the product of 19191920 and 8)
1535353600000 (the product of 19191920 and 2)
-------------
196837644266310240 (the final product)
Therefore, the result of multiplying 19191920 by 10292111772 is 196837644266310240.
----------------------------------
This is completely wrong. It is not doing arithmetic, and it is not capable of doing arithmetic.
Many humans would not be able to solve that problem, especially those that are younger or have lower IQs, and obviously those that have not been taught multiplication. That doesn't mean they are devoid of all reasoning ability.
Furthermore, GPT-4 is able to multiply slightly shorter pairs of numbers, and experiments on smaller neural nets, such as one trained on a dataset consisting of 20% of all combinations of two four-digit numbers, show that they quickly learn to generalize and successfully multiply any four-digit pair, even though those pairs aren't in the training set.
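As a rough sketch of what the data preparation for that kind of experiment looks like (illustrative only, not the exact setup of those experiments):

    # Illustrative only: a deterministic ~20% train / 80% test split over all pairs of
    # four-digit numbers, without materializing the full 81-million-pair grid.
    import hashlib

    def in_training_set(a: int, b: int, fraction: float = 0.2) -> bool:
        digest = hashlib.sha256(f"{a},{b}".encode()).digest()
        return digest[0] / 256 < fraction

    def example(a: int, b: int) -> str:
        # The kind of text a small model would be trained on.
        return f"{a}*{b}={a * b}"

    # A pair that the split puts in the held-out set is one the model never saw verbatim,
    # so answering it correctly is generalization, not lookup.
    print(in_training_set(1234, 5678), example(1234, 5678))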
Try again, give an answer that actually works this time.
PROMPT: I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?
RESPONSE (edited and truncated): Building an adobe home in an area with heavy clay soil can be a good option for creating a well-insulated home [... ] The insulation properties of an adobe home will depend on several factors, including the thickness of the walls, the quality of the materials used, and the construction techniques used. Adobe walls tend to be thick, which can help to provide good insulation properties. The high thermal mass of adobe can also help to regulate temperature and humidity inside the house, keeping it cool in the summer and warm in the winter.
So, here it is clear that GPT is just regurgitating information contained in its training set. Why am I so confident? It is an extremely widespread misconception, repeated almost everywhere, that adobe homes are "well insulated", and almost everywhere this is written about, you will see the linkage made between high thermal mass and insulation.
But this is actually completely incorrect. Adobe is a very poor insulator - a typical wall has an R factor of around 2. Ask someone who knows and you will find that "adobe is not an insulator at all" (e.g. https://www.greenhomebuilding.com/QandA/adobe/mass.htm)
So I went back to GPT to see if it "knew" about the poor R value:
--------------------
PROMPT: what is the R value of an adobe wall
RESPONSE: The R-value of an adobe wall depends on several factors, including the thickness of the wall, the density of the material, and the specific mix of materials used in the construction. Generally speaking, adobe walls have an R-value of around 0.2 to 0.5 per inch of thickness, which is relatively low compared to some other building materials.
For example, a 12-inch thick adobe wall might have an R-value of around 2.4 to 6.0, depending on the specific mix of materials used. This means that an adobe home will typically require additional insulation, such as fiberglass batts or blown-in cellulose, to achieve a higher level of thermal performance.
-------------------------------------------------
OK, so it does "know" this, and what it has generated here is factually correct, but at odds with its initial response. If it "reasoned", it would know that, given this R value, an adobe wall is pretty terrible for insulation, and it would have focused on the details in this response when answering my first prompt.
Then it continues:
-----------------------------------
However, it's important to note that the insulation properties of adobe walls are not solely determined by the R-value [...]
-----------------------------------
and this final claim is completely incorrect. Insulation properties are 100% represented by R values, and anyone who actually knows this stuff would know this. It then goes on to repeat the stuff about thermal mass, which is important for how a house feels, but unrelated to its level of insulation and thus its heating requirements etc.
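To put numbers on "pretty terrible", using the per-inch figures from GPT's own response (the fiberglass comparison figure is mine, not GPT's):

    # R-value arithmetic, using the 0.2-0.5 per inch range from GPT's response above.
    r_per_inch_low, r_per_inch_high = 0.2, 0.5
    thickness_in = 12
    print(thickness_in * r_per_inch_low, thickness_in * r_per_inch_high)  # -> 2.4 6.0
    # For comparison (my figure): an ordinary 3.5-inch fiberglass batt is roughly R-13.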
Now, I imagine that given all this, one could do some prompt "engineering" to get GPT to spit out something that reflects the answer that a human who actually knew and could reason about this stuff might. But I have zero doubt that what you'd actually be doing is adjusting the vocabulary to make it more likely it would base its response on e.g. the greenhomebuilding.com article above. I do not believe there are any prompts, or anything else, in GPT or any other LLM, that will cause it to "reason" ... hmm, let's check the R value for adobe, nope that's pretty horrible, the house will not be well insulated unless you ....
Everyone knows it has limitations. You have to work within the limitations of the model. No one has claimed that GPT is AGI. Doesn't mean it's incapable of any degree of reasoning. Yes the prompt actually matters. It was trained a specific way to solve specific tasks, and can generalize to solve tasks it has not seen before.
Try this prompt: "Taking into account the r-value of adobe, I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?"
If I knew the r-value, I wouldn't need to ask an LLM.
The sort of logic systems that were the focus of a lot of AI work before "deep learning" came along would certainly have "taken the r-value of adobe" into account (had they been exposed to such knowledge). That's because they explicitly reason about things in the world that they are trained to reason about.
Gary Marcus has been quite usefully vocal about this. We used to try to build AI systems (some still are) based on the idea that you need a world model, and you need logic and inference and relationships.
LLMs have convinced, it seems, rather a lot of people that we can just discard all that - "the system will learn the patterns all by itself".
Marcus doesn't agree, and neither do I (not that my opinion is worth much).
You have a narrow definition of reasoning. Formally and technically, reasoning is solving a symbolic task through a sequence of steps. Yes, we know it's not conscious and it's not human reasoning.
> At no point does the LLM know that 5+6 = 11
Does it need to "know" that (by your narrow definition of "know") in order to reason about a word math problem?
> if asked to solve a problem in which 5+6 was an implicit component of the solution but not explicitly present in the text, it would be completely lost
Can you provide an example? What makes you believe it can't be trained to solve those too? That's just a higher abstraction over the language. Add more layers, more training, etc. Many humans cannot solve basic math word puzzles that this artificial neural network can already solve.
They do not "solve" word puzzles. They output text that appears to be the best response to the prompt, based on their training data. If the puzzle is solvable by doing this, then they get the answer right. If the puzzle is not solvable doing that, they are unlikely to get the answer right.
If I ask you to multiply two (largeish) numbers together, you will be able to do so, using an algorithm/process that you can apply to the multiplication of any two numbers, whether anyone has ever told you about those numbers before or not.
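That algorithm is completely mechanical. Here it is written out as code (my sketch, just to make the point that the procedure applies to any pair of digit strings; it says nothing about how an LLM works internally):

    # Grade-school long multiplication over digit strings: a mechanical procedure that
    # works for any pair of numbers, whether or not anyone has seen them before.
    def long_multiply(x: str, y: str) -> str:
        result = [0] * (len(x) + len(y))          # little-endian digit accumulator
        for i, dx in enumerate(reversed(x)):
            carry = 0
            for j, dy in enumerate(reversed(y)):
                total = result[i + j] + int(dx) * int(dy) + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(y)] += carry
        return "".join(map(str, reversed(result))).lstrip("0") or "0"

    print(long_multiply("19191920", "10292111772"))  # 197525385759282240, not GPT's answer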
LLMs cannot do this. Give them a math problem that doesn't exist in their training set and they cannot solve it. This has been demonstrated many times.
> Give them a math problem that doesn't exist in their training set and they cannot solve it.
They routinely solve math problems (and other reasoning tasks) that don't exist in their training set. Examples were in that paper I linked to. This is one of the incredible emergent properties of LLMs / deep neural networks.
Try it out today on GPT-4. Make up your own math problems and go for it.
First of all, multiplying two numbers together is not "shifting the goal posts", but an absolutely basic test of any system that is claimed to be able to do mathematical reasoning. I know that LLMs are not well suited for this, and that's because they cannot do arithmetic (among other things).
So I tried a word puzzle that would also require simple multiplication:
------------------------------
PROMPT: i am going to cycle 1600 miles, with 234 miles on gravel roads. on paved roads i will ride at 1929288282 millimeters per second but on gravel I will ride at 0.00000000202 parsecs per second. How long will the journey take?
-------------------------------
Now, I have to commend GPT on its ability to understand how you solve a problem like this, though that's not really very surprising given the huge numbers of such problems that exist in written materials. It precisely broke the problem down in a way that I suppose you could call "reasoning", but I would call "copying the formula for solving puzzles like this".
And how did it do with the actual math?
----------------
0.00000000202 parsecs per second is equivalent to 7499.6103827 miles per hour (mph), which we can calculate by converting parsecs to miles (1 parsec = 3.26 light-years = 19,173,511,840,000 miles) and dividing by the number of seconds in an hour:
Utterly and completely wrong. Same issue with the millimeters per second computation.
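For the record, here is what the arithmetic actually gives (a quick check using GPT's own parsec-to-miles figure):

    # Checking the conversions GPT botched, using its own 1 parsec = 19,173,511,840,000 miles figure.
    MM_PER_MILE = 1_609_344                      # 1 mile = 1,609,344 mm
    PARSEC_MILES = 19_173_511_840_000

    paved_mph  = 1_929_288_282 / MM_PER_MILE * 3600   # ~4.3 million mph
    gravel_mph = 0.00000000202 * PARSEC_MILES * 3600  # ~139 million mph, nowhere near GPT's 7499.6

    total_hours = (1600 - 234) / paved_mph + 234 / gravel_mph
    print(round(total_hours * 3600, 2), "seconds for the whole ride")  # a bit over one second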
It is completely obvious why LLMs cannot do this. They cannot perform even basic arithmetic reasoning, and even more fundamentally, the ONLY capability they have is to create likely responses to prompts. For some things, this is extraordinarily (and scarily) powerful. But it is not reasoning.
You are so narrowly fixated on this one specific domain which is an edge case with LLMs.
ChatGPT was trained specifically to solve "natural language understanding tasks".
You are missing the forest for the trees and ignoring everything else it's good at solving outside of its training set.
The emergent properties of neural networks should not be so casually dismissed.
You're arguing that since it's not perfect at arithmetic reasoning, it's not capable of any degree of reasoning in any domain. That is an oversimplification and just doesn't make logical sense.
I do not believe that any part of any current or future LLM (i.e. using the same fundamental architecture) is capable of reasoning, or in fact, capable of anything other than, essentially, doing a really, really, really good job of generating the next word in a response.
I am all about emergent properties of neural networks, but I absolutely do not believe that LLMs have them, specifically because of the way they are designed.
However, as to the specifics, people who seem to believe otherwise claim that they can reason, and so that's merely one specific angle of attack: to show that they cannot reason, and that in fact, everything they do is implicitly contained in their training set.
As I've already said, what they can do is enormously powerful and in many (most?) ways entirely unexpected, so I still regard the advent of LLMs as extremely significant, from both a practical and a scientific point of view. I think it may force a revision in the most basic aspects of understanding human speech behavior, for example.
Nevertheless, I do not believe that anybody is served by believing that these systems can do things that they cannot. I do not understand why the seemingly magical results of these systems are leading so many into a denial of what they actually do.
> I am all about emergent properties of neural networks, but I absolutely do not believe that LLMs have them, specifically because of the way they are designed.
Even when faced with evidence that contradicts your beliefs and proves that you are wrong?
LLMs are a type of neural network. What fundamentally prevents LLMs from having emergent abilities while other neural networks do have them?
How do you explain the emergent abilities that we have actually observed in LLMs?
Here is an example of an emergent ability of ChatGPT that you can try yourself right now.
Give it this prompt: "Write a short play where "Karen" (who behaves as the Karen meme) is on a romantic date with Hunter S. Thompson. They are eating at a Chinese restaurant."
It is able to script their interaction and dialog in a way that makes sense in the context of the setting and their personalities, including an absurd meme character. That is an emergent ability that it was not trained to do and is certainly not in its training set.
Try it. Add different characters. Ask it to rewrite the play using pirate metaphors. You can go deep into its "mind" and see the emergent abilities at play. Just apply some creativity and skip the boring arithmetic problems, as that's a well accepted weakness of this type of model.
What do you mean we don't have evidence? It has already been presented to you. You choose to reject it for reasons I truly don't understand. Search on Google Scholar if you want a more academic explanation. You can try it out on ChatGPT right now and see it yourself.
> The level of naivete around this stuff is quite incredible.
"Emergent properties" has a formal definition in the literature.
In machine learning we don't explicitly program the machine to understand anything. It automatically learns patterns in the data. That's the entire point of machine learning. With such a large neural network and training set obviously it's hard to predict all of its capabilities due to the sheer scale of it all. Of course we cannot predict exactly how it will model things.
Take this for example. No one programmed it to understand Go. It learned by itself and became the world champion. That's what deep learning is capable of.
Solving natural language understanding tasks requires reasoning, by definition. I think you are sticking to a narrow definition of reasoning that is not very technical. Formally there are many types of reasoning in AI.
But LLMs do not solve natural language understanding in any of the meanings that the phrase meant before LLMs. Instead, they throw a completely new technique at it that completely sidesteps the need for language understanding, and what do you know? For the purposes of responding in meaningful, generally sensible ways, it works amazingly well. And that is incredibly cool. But it doesn't solve all of the problems that more historical approaches to machine language "understanding" were concerned with.
But there is no world representation inside an LLM, only text (words, letters) representations, so nothing the LLM does can be based on reasoning in a traditional sense.
I would wager that if we build an LLM based on a training data set, and then we rebuild it with a heavily edited version of that data set that explicitly excludes certain significant areas of human discourse, the LLM will be severely impaired in its apparent ability to "reason" about anything connected with the excluded areas. That sounds as if it ought to surprise you, since you think they are capable of reasoning beyond the training set. It wouldn't surprise me at all, since I do not believe that is what they are doing.
LLMs contain a model of human speech (really text) behavior that is almost unimaginably more complex than anything we've built before. But by itself that doesn't mean very much with respect to general reasoning ability. The fact that LLMs can convince you otherwise points, to me, to the richness of the training data in suitable responses to almost any prompt; suitable, that is, for the purpose of persuading you that there is some kind of reasoning occurring. But there is not. The fact that neither you nor I can really build a model (hah!) of what the LLM is actually doing doesn't change that.
> But LLMs do not solve natural language understanding in any of the meanings that the phrase meant before LLMs.
Are you saying that NLP as a field of research did not exist before LLMs? This is a continuation of research that has been in progress for decades.
> But there is no world representation inside an LLM, only text (words, letters) representations, so nothing the LLM does can be based on reasoning in a traditional sense.
Not true. The model has learned a representation of semantic relationships between words and concepts at multiple levels of abstraction. That is the entire point. That's what it was trained to do.
It's a vast and deep neural network with a very high dimensional representation of the data. Those semantic/meaning relations are automatically learned and encoded in the model.
> It's a vast and deep neural network with a very high dimensional representation of the data.
the data is text, so ...
It's a vast and deep neural network with a very high dimensional representation of *text*
And yes, to some extent, text represents the world in interesting ways. But not adequately, IMO.
If you were an alien seeking to understand the earth, starting with humans' textual encoding thereof might be a place to start. But its inadequacies would rapidly become evident, I claim, and you would realize that you need a "vast and deep representation" of the actual planet.
> Are you saying that NLP as a field of research did not exist before LLMs? This is a continuation of research that has been in progress for decades.
Of course I'm not saying that (the first sentence). Part of my whole point is that LLMs are to NLPs as rockets are to airplanes. They're fundamentally a "rip it up and start again" approach, that discards almost everything everyone knew about NLP. The results are astounding, but the connection with, yes, "traditional" NLP is tenuous.
> Part of my whole point is that LLMs are to NLPs as rockets are to airplanes.
Yes, it is deep learning applied to NLP. It makes the old designs obsolete.
> the data is text
It is not randomly generated text. There are patterns in that text. It was trained to model the semantics or "meaning" in the text. There is a structure in the text which the machine has recognized.
It automatically learned a model of many concepts without any of those concepts being explicitly programmed into it. That's the entire point of machine learning.
> But not adequately, IMO.
It is adequate for some things and not adequate for other things.
It seems that all you are saying is that GPT is not AGI and doesn't have human level of understanding and reasoning. No one disagrees with that.
> It seems that all you are saying is that GPT is not AGI and doesn't have human level of understanding and reasoning. No one disagrees with that.
No, I'm saying that LLMs perform what is essentially an amazing, really hard to understand party trick that happens to be so good it is actually useful. They may be indicative (surprisingly) of how a lot of actual human speech production happens (or may not), and if so, that's really good to discover. That this is surprising represents a bit of a failure of our imagination, but it also leaves untouched the parts of speech production that don't use this sort of process.
Here's a prompt:
"If there is a bob outside, and a bib inside, and the bib cannot see the bob, can the bob see the bib?"
Answer:
"If the bib cannot see the bob due to being inside, it is likely that the bob, which is outside, also cannot see the bib. The reason for the bib not being able to see the bob is presumably due to some kind of barrier, such as a wall or a window, which would likely obstruct the view for both the bib and the bob."
Prompt:
"What type of reasoning did you just apply to solve that task?"
Answer:
"In solving the task, I applied abductive reasoning. Abductive reasoning involves inferring the most likely explanation for a set of observations or statements. In this case, I inferred that a barrier (e.g., a wall or window) is the most likely explanation for why the bib cannot see the bob, and thus, the bob likely cannot see the bib either."
ChatGPT solves the problem by a mechanical reasoning process, not a "party trick". It's a machine, not a person at a party. Surely you don't believe that it has seen that exact sequence of questions and replies in its training set and is simply regurgitating. You can argue about whether it's actually using "abductive reasoning" or not, but that is beside the point, which is that it involves some mechanical reasoning process over an interpretation of the prompt. It's not a simple regurgitation.
AlphaZero learned to play Go starting with nothing but the rules of the game. What is it regurgitating there?
Alright so deep learning, the state of the art of AI, is a "party trick". AlphaZero is likewise a party trick. No "true" reasoning involved.
> Like actual reasoning.
You're relying on intuition and personal beliefs of what constitutes "true" reasoning instead of formal rigorous mathematical definitions of reasoning. The general concept of reasoning includes what the language models are doing when they solve natural language understanding tasks, by definition.
So what I'm saying is, GPT "knows" what a cat is. It "knows" what an orange is. It has inferred these concepts from the data set.
Imagine approaching someone who is tripping on LSD and demanding they immediately solve a 10 digit multiplication problem, then saying "AHA! You cannot solve it, therefore you are incapable of any reasoning whatsoever!"
We are talking about reasoning in a general sense. There are many types of reasoning in AI which I'm sure you know how to look up and read about. "Traditional" is not one of the categories.
I don't agree that LLMs can't reason, but after literally saying "Make up your own math problems and go for it", his doing that and it failing really isn't moving the goal posts.
LLMs are not good at math. But this is a subset of reasoning.
Chain-of-thought on logical inference tasks (using fake labels so we are outside the training set) shows they can do reasonably well at these.
Nevertheless, it's likely that the best approach for pure reasoning tasks will be to connect an LLM to a real inference engine (Datalog or something) and rely on the LLM to perform the mapping to the inference engine's inputs and outputs. This is similar to the "System 1" and "System 2" models of human thought.
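A toy sketch of that split, where the symbolic part does the actual inference and the LLM only translates to and from it (a real Datalog engine would replace the naive forward-chaining loop below; the fact/rule encoding here is my own illustration):

    # "LLM translates, symbolic engine reasons": the LLM's only job would be to map prose
    # into facts/rules like these and to render derived facts back into prose.
    from typing import List, Set, Tuple

    Fact = Tuple[str, str, str]           # e.g. ("socrates", "is_a", "human")
    Rule = Tuple[Fact, Fact]              # (if-pattern, then-pattern); "?x" is a variable

    def forward_chain(facts: Set[Fact], rules: List[Rule]) -> Set[Fact]:
        # Repeatedly apply rules until no new facts can be derived.
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for (_, p, o), (_, p2, o2) in rules:
                for (fs, fp, fo) in list(derived):
                    if fp == p and fo == o:          # pattern matches; ?x binds to fs
                        new_fact = (fs, p2, o2)
                        if new_fact not in derived:
                            derived.add(new_fact)
                            changed = True
        return derived

    # What the LLM might extract from "Socrates is a human. All humans are mortal."
    facts = {("socrates", "is_a", "human")}
    rules = [(("?x", "is_a", "human"), ("?x", "is_mortal", "yes"))]
    print(forward_chain(facts, rules))
    # -> includes ("socrates", "is_mortal", "yes"), which the LLM would phrase back as prose

The point of the design is that the deduction itself is done by something with explicit, checkable semantics, while the LLM handles the fuzzy natural-language ends of the pipeline.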