
What is a reasoning task we could give an LLM that would demonstrate that it actually is not reasoning? It seems like that should be easy to construct as a very simple task outside its training set would fail utterly, but I have yet to witness one.


1a. generate two numbers using: (random() % BIGNUM)

1b. ask the LLM to multiply them together

Any human who has learned multiplication can do this. AFAIU, LLMs cannot unless the computation exists within the training set. They have zero arithmetic reasoning capability.
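
Concretely, the harness for this test is trivial to write. A minimal sketch in Python (BIGNUM and the prompt wording here are arbitrary choices, not part of the proposal itself):

  import random

  BIGNUM = 10**12  # arbitrary; anything well beyond memorized times tables will do

  # 1a. generate two random numbers
  a = random.randrange(BIGNUM)
  b = random.randrange(BIGNUM)

  # 1b. the prompt to hand to the LLM, plus the exact product to grade it against
  prompt = f"What is {a} multiplied by {b}?"
  expected = a * b  # Python ints are arbitrary precision, so this is exact

  print(prompt)
  print("expected:", expected)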

I just asked ChatGPT: "19191920 multipled by 10292111772"

It said:

--------------

  To multiply 19191920 by 10292111772, you can use the standard long 
  multiplication method as follows:

     19191920
  x 10292111772
  -------------
    19191920000 (the product of 19191920 and 1)
    153535360000 (the product of 19191920 and 8)
    1535353600000 (the product of 19191920 and 2)
  -------------
  196837644266310240  (the final product)

  Therefore, the result of multiplying 19191920 by 10292111772 is 196837644266310240.
----------------------------------

This is completely wrong. It is not doing arithmetic, and it is not capable of doing arithmetic.
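
For the record, Python's arbitrary-precision integers give the actual product, which is nowhere near ChatGPT's answer:

  >>> 19191920 * 10292111772
  197525385759282240
  >>> 19191920 * 10292111772 == 196837644266310240  # ChatGPT's answer
  False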


Many humans would not be able to solve that problem, especially those who are younger or have lower IQs, and obviously those who have not been taught multiplication. That doesn't mean they are devoid of all reasoning ability.

Furthermore, GPT-4 is able to multiply slightly shorter pairs of numbers, and experiments on smaller neural nets, such as one trained on a dataset consisting of 20% of all combinations of two four-digit numbers, show that they quickly learn to generalize and successfully multiply any four-digit pair, even pairs that aren't in the training set.
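
The kind of split those experiments use is easy to set up; here is a rough sketch in Python (the 20% sampling scheme and the example formatting are guesses at the setup, not the actual experiment):

  import random

  LOW, HIGH = 1000, 10000     # four-digit operands
  TRAIN_FRACTION = 0.2        # roughly 20% of all pairs go into training

  def in_train(a, b):
      # deterministically assign each (a, b) pair to train or held-out test
      return random.Random(a * 10000 + b).random() < TRAIN_FRACTION

  def example(a, b):
      # one text example per pair, e.g. "1234 * 5678 = 7006652"
      return f"{a} * {b} = {a * b}"

  # sample a few held-out pairs the model never sees during training
  rng = random.Random(0)
  held_out = []
  while len(held_out) < 5:
      a, b = rng.randrange(LOW, HIGH), rng.randrange(LOW, HIGH)
      if not in_train(a, b):
          held_out.append((f"{a} * {b} = ", a * b))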

Try again, give an answer that actually works this time.


All you have shown here is that its arithmetic reasoning is flawed, not that it cannot reason at all.

"Reasoning" is a much broader term than you think it is.


For my next test, I used some knowledge I happen to have.

----------------------------------------------------------

PROMPT: I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?

RESPONSE (edited and truncated): Building an adobe home in an area with heavy clay soil can be a good option for creating a well-insulated home [... ] The insulation properties of an adobe home will depend on several factors, including the thickness of the walls, the quality of the materials used, and the construction techniques used. Adobe walls tend to be thick, which can help to provide good insulation properties. The high thermal mass of adobe can also help to regulate temperature and humidity inside the house, keeping it cool in the summer and warm in the winter.

-------------------------------------------------------

So, here it is clear that GPT is just regurgitating information contained in its training set. Why am I so confident? It is an extremely widespread misconception, repeated almost everywhere, that adobe homes are "well insulated", and almost everywhere this is written about, you will find the linkage between high thermal mass and insulation.

But this is actually completely incorrect. Adobe is a very poor insulator - a typical wall has an R factor of around 2. Ask someone who knows and you will find that "adobe is not an insulator at all" (e.g. https://www.greenhomebuilding.com/QandA/adobe/mass.htm)

So I went back to GPT to see if it "knew" about the poor R value:

--------------------

PROMPT: what is the R value of an adobe wall

RESPONSE: The R-value of an adobe wall depends on several factors, including the thickness of the wall, the density of the material, and the specific mix of materials used in the construction. Generally speaking, adobe walls have an R-value of around 0.2 to 0.5 per inch of thickness, which is relatively low compared to some other building materials.

For example, a 12-inch thick adobe wall might have an R-value of around 2.4 to 6.0, depending on the specific mix of materials used. This means that an adobe home will typically require additional insulation, such as fiberglass batts or blown-in cellulose, to achieve a higher level of thermal performance.

-------------------------------------------------

OK, so it does "know" this, and what it has generated here is factually correct, but at odds with its initial response. If it "reasoned", it would know that, given this R value, an adobe wall is pretty terrible for insulation, and it would have focused on these details when answering my first prompt.

Then it continues:

-----------------------------------

However, it's important to note that the insulation properties of adobe walls are not solely determined by the R-value [...]

-----------------------------------

and this final claim is completely incorrect. Insulation properties are 100% represented by R values, and anyone who actually knows this stuff would know this. It then goes on to repeat the stuff about thermal mass, which is important for how a house feels, but unrelated to its level of insulation and thus its heating requirements etc.
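
To make the R-value point concrete, the "reasoning" step the model skipped is one line of arithmetic: steady-state heat loss through a wall is area x temperature difference / R. A quick Python illustration (the wall size, temperature difference, and the R-20 comparison wall are made-up illustrative numbers):

  # Q (BTU/hr) = area (ft^2) * delta_T (F) / R-value
  area_ft2 = 8 * 10      # one 8 ft x 10 ft wall section
  delta_t_f = 40         # indoor/outdoor temperature difference

  def heat_loss_btu_per_hr(r_value):
      return area_ft2 * delta_t_f / r_value

  print(heat_loss_btu_per_hr(2.4))   # 12" adobe, low end of GPT's range: ~1333 BTU/hr
  print(heat_loss_btu_per_hr(20))    # a typical insulated framed wall:   ~160 BTU/hr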

Now, I imagine that given all this, one could do some prompt "engineering" to get GPT to spit out something that reflects the answer a human who actually knew and could reason about this stuff might give. But I have zero doubt that what you'd actually be doing is adjusting the vocabulary to make it more likely it would base its response on e.g. the greenhomebuilding.com page linked above. I do not believe there are any prompts, or anything else, in GPT or any other LLM, that will cause it to "reason" ... hmm, let's check the R value for adobe, nope that's pretty horrible, the house will not be well insulated unless you ....


Everyone knows it has limitations. You have to work within the limitations of the model. No one has claimed that GPT is AGI. Doesn't mean it's incapable of any degree of reasoning. Yes the prompt actually matters. It was trained a specific way to solve specific tasks, and can generalize to solve tasks it has not seen before.

Try this prompt: "Taking into account the r-value of adobe, I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?"

These edge case "gotchas" are missing the point.


If I knew the r-value, I wouldn't need to ask an LLM.

The sort of logic systems that were the focus of a lot of AI work before "deep learning" came along would certainly have "taken the r-value of adobe" into account (had they been exposed to such knowledge). That's because they explicitly reason about things in the world that they are trained to reason about.

Gary Marcus has been quite usefully vocal about this. We used to try to build AI systems (some still are) based on the idea that you need a world model, and you need logic and inference and relationships.

LLMs have convinced, it seems, rather a lot of people that we can just discard all that - "the system will learn the patterns all by itself".

Marcus doesn't agree, and neither do I (not that my opinion is worth much).



