I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.
I've never seen this question quantified in a really compelling way, and while it's interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models, like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
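To make the mechanism concrete, here's a minimal sketch of that loop (Python, with `generate` as a hypothetical stand-in for one round of sampling from whatever model you like, not any particular vendor's API):

    # Minimal sketch of the "reasoning model" loop described above.
    # `generate` is a hypothetical stand-in for one round of next-token
    # sampling against some model; it is not a real API.
    def answer_with_reasoning(question: str, generate) -> str:
        # Step 1: sample intermediate "reasoning" tokens. This is just more
        # model output, conditioned on the question.
        reasoning = generate(question + "\nLet's think step by step.")

        # Step 2: sample the final answer, conditioned on the question *and*
        # the freshly generated reasoning text.
        final_answer = generate(question + "\n" + reasoning + "\nFinal answer:")

        # If the reasoning contains garbage, loops, or "But wait!" detours,
        # that text is now part of the context the final answer is conditioned
        # on; there is no separate mechanism that filters it out.
        return final_answer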
The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
> There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.
Clearly an LLM is not "omniscient". It doesn't require a post to refute that; OP obviously doesn't mean it literally. It's an analogy describing two semi- (fairly?) independent axes: one for breadth of knowledge, and one for something closer to intelligence, the ability to "reason" from smaller components of knowledge, the opposite of which is dim-witted.
So at one extreme you'd have something completely unable to generalize or synthesize new results, only able to respond correctly if the input identically matches something it has seen before, but which has seen and stored a ton. At the other extreme would be something that knows only a very small set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.
It's a great analogy and a useful way to think about the problem. And it took me multiple paragraphs to write what OP expressed in two sentences via a great analogy.
LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.
> What we do have are very good pattern matchers and probabilistic data generators
This is an unhelpful description. An object is more than the sum of its parts, and higher-level behaviors emerge. The statement is factually correct, yet it's the equivalent of saying a computer is nothing more than a collection of gates and wires and so shouldn't be discussed at a higher level of abstraction.
Language matters. Using language that accurately describes concepts and processes is important. It might not matter to a language model which only sees patterns, but it matters to humans.
So when we label the technical processes and algorithms these tools use as something that implies a far greater level of capability, we're only doing a disservice to ourselves. Maybe not to those of us who are getting rich on the market hype that these labels fuel, but certainly to the general population that doesn't understand how the technology works. If we claim that these tools have super-human intelligence, yet they fail basic tasks, how do we explain this? More importantly, if we collectively establish a false sense of security and these tools are adopted in critical processes that human lives depend on, who is blamed when they fail?
> This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.
No, because we have descriptive language to describe a collection of gates and wires by what it enables us to do: perform arbitrary computations, hence a "computer". These were the same tasks that humans used to do before machines took over, so the collection of gates and wires is just an implementation detail.
Pattern matching, prediction, data generation, etc. are the tasks that modern AI systems allow us to do, yet you want us to refer to this as "intelligence" for some reason? That makes no sense to me. Maybe we need new higher level language to describe these systems, but "intelligence", "thinking", "reasoning" and "wit" shouldn't be part of it.
>There's nothing "omniscient" or "dim-witted" about these tools
I disagree; that seems like quite a good way of describing them. All language is a bit inexact.
Also, I don't buy that we are no closer to AI than ten years ago - there seems to be a lot going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - look at AlphaEvolve, for example: https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
> I figure it's hard to argue that that is not at least somewhat intelligent?
The fact that this technology can be very useful doesn't imply that it's intelligent. My argument is about the language used to describe it, not about its abilities.
The breakthroughs we've had have come because there is a lot of utility in finding patterns in data, which humans aren't very good at. Many of our problems can be boiled down to this task. So when we have vast amounts of data and compute at our disposal, we can be easily impressed by results that seem impossible for humans.
But this is not intelligence. The machine has no semantic understanding of what the data represents. The algorithm is optimized for generating specific permutations of tokens that match something it previously saw and was rewarded for. Again, very useful, but there's no thinking or reasoning there. The model doesn't have an understanding of why the wolf can't be close to the goat, or how a cabbage tastes. It's trained on enough data and algorithmic tricks that its responses can fool us into thinking it does, but this is just an illusion of intelligence. This is why we need to constantly feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry", or that it doesn't generate racially diverse but historically inaccurate images.
>The machine has no semantic understanding of what the data represents.
How do you define "semantic understanding" in a way that doesn't ultimately boil down to saying they don't have phenomenal consciousness? Any functional concept of semantic understanding is captured to some degree by LLMs.
Typically when we attribute understanding to some entity, we recognize some substantial abilities in the entity in relation to that which is being understood. Specifically, the subject recognizes relevant entities and their relationships, various causal dependences, and so on. This ability goes beyond rote memorization, it has a counterfactual quality in that the subject can infer facts or descriptions in different but related cases beyond the subject's explicit knowledge. But LLMs excel at this.
>feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry"
This failure mode has nothing to do with LLMs lacking intelligence and everything to do with how tokens are represented. They do not see individual characters, but sub-word chunks. It's like expecting a human to count the pixels in an image it sees on a computer screen. While not impossible, it's unnatural to how we process images and therefore error-prone.
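For example, here's a quick way to see the sub-word chunking (assuming the `tiktoken` library; the exact split depends on which tokenizer a given model uses):

    # Illustration of sub-word tokenization, assuming the `tiktoken` library.
    # The exact split varies by tokenizer; the point is only that the model
    # sees multi-character chunks, not individual letters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])  # a few chunks, not ['s', 't', 'r', ...]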
You don't need phenomenal consciousness. You need consistency.
LLMs are not consistent. This is unarguable. They will produce a string of text that says they have solved a problem and/or done a thing when neither is true.
And sometimes they will do it over and over, even when corrected.
Your last paragraph admits this.
Tokenisation on its own simply cannot represent reality accurately and reliably. It can be tweaked so that specific problems can appear solved, but true AI would be based on a reliable general strategy which solves entire classes of problems without needing this kind of tweaking.
This is a common category of error people commit when talking about LLMs.
"True, LLMs can't do X, but a lot of people don't do X well either!"
The problem is, when you say humans have trouble with X, what you mean is that human brains are fully capable of X, but sometimes they do, indeed, make mistakes. Or that some humans haven't trained their faculties for X very well, or whatever.
But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.
These things are not comparable.
So, to your specific point: When an LLM is inconsistent, it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data. When a human is inconsistent, it is because they got distracted, didn't learn enough about this particular subject, or otherwise made a mistake that they can, if their attention is drawn to it, recognize and correct.
LLMs cannot. They can only be told they made a mistake, which prompts them to try again (because that's the pattern that has been trained into them for what happens when told they made a mistake). But their next try won't have any better odds of being correct than their previous one.
>But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.
This is the very point of contention. You don't get to just assume it.
> it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data.
Another highly contentious point you are just outright assuming. LLMs are modelling the world, not just "predicting the next token". Some examples here[1][2][3]. Anyone claiming otherwise at this point is not arguing in good faith. It's interesting how the people with the strongest opinions about LLMs don't seem to understand them.
OK, sure; there is some evidence potentially showing that LLMs are constructing a world model of some sort.
This is, however, a distraction from the point, which is that you were trying to make claims that the described lack of consistency in LLMs shouldn't be considered a problem because "humans aren't very consistent either."
Humans are perfectly capable of being consistent when they choose to be. Human variability and fallibility cannot be used to handwave away lack of fundamental ability in LLMs. Especially when that lack of fundamental ability is on empirical display.
I still hold that LLMs cannot be consistent, just as TheOtherHobbes describes, and you have done nothing to refute that.
Address the actual point, or it becomes clear that you are the one arguing in bad faith.
You are misrepresenting the point of contention. The question is whether LLMs lack of consistency undermines the claim that they "understand" in some relevant sense. But arguing that lack of consistency is a defeater for understanding is itself undermined by noting that humans are inconsistent but do in fact understand things. It's as simple as that.
If you want to alter the argument by saying humans can engage in focused effort to reach some requisite level of consistency for understanding, you have to actually make that argument. It's not at all obvious that focused effort is required for understanding or that a lack of focused effort undermines understanding.
You also need to contend with the fact that LLMs aren't really a single entity, but a collection of personas, and what you get, and its capabilities, depends to a large degree on how you prompt it. Even if the entity as a whole is inconsistent between prompts, the right subset might very well be reliably consistent. There's also the temperature setting, which artificially injects randomness into the LLM's output; the LLM itself is entirely deterministic. It's not at all obvious how consistency relates to LLM understanding.
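To make the temperature point concrete, a minimal sketch of temperature sampling over a model's output logits (numpy; illustrative only, not tied to any particular inference stack):

    # Sketch of temperature sampling over output logits.
    # As temperature -> 0 the choice becomes effectively deterministic (argmax);
    # higher temperatures flatten the distribution and inject randomness.
    import numpy as np

    def sample_token(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
        rng = rng or np.random.default_rng()
        if temperature <= 1e-6:
            return int(np.argmax(logits))        # greedy: same input, same output
        scaled = logits / temperature            # <1 sharpens, >1 flattens
        probs = np.exp(scaled - scaled.max())    # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))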
Feel free to do some conceptual work to make an argument; I'm happy to engage with it. What I'm tired of are these half-assed claims and incredulity that people don't take them as obviously true.
I imagine if you asked the LLM why the wolf can't be close to the goat, it would give a reasonable answer. I realise it does this by using permutations of tokens, but I think you have to judge intelligence by the results rather than the mechanism; otherwise you could argue humans can't be intelligent because they are just a bunch of neurons that find patterns.
Actually, I think the Chinese room fits my idea. It's a silly thought experiment that would never work in practice. If you tried to make one, you would judge it unintelligent because it wouldn't work. Or at least not in the way Searle implied - he basically proposed a lookup table.
We have had programs that can give good answers to some hard questions for a very long time now. Watson won Jeopardy back in 2011, but it still wasn't very good at replacing humans.
So that isn't a good way to judge intelligence. Computers are so fast and have so much data that you can make programs to answer just about anything pretty well; LLMs do that, just more automatically. But they still don't automate the logical parts, only the lookup of knowledge. We don't know how to train large logic models, just large language models.
LLMs are not the only model type, though? There's a plethora of architectures and combinations being researched.. And even transformers are starting to be able to do cool sh1t on knowledge graphs; also interesting is progress on autoregressive physics PDE (partial differential equation) models.. and it can't be too long until some providers of actual biological neural nets show up on openrouter (probably a lot less energy- and capital-intense to scale up brain goo in tanks compared to gigawatt GPU clusters).. combine that zoo of "AI" specimens using M2M, MCP etc. and the line between mock and "true" intelligence will blur, escalating our feeble species into ASI territory.. good luck to us.
> There's a plethora of architectures and combinations being researched
There were a plethora of architectures and combinations being researched before LLMs, and it still took a very long time to find the LLM architecture.
> the line between mock and "true"intelligence will blur
Yes, I think this will happen at some point. The question is how long it will take, not if it will happen.
The only thing that can stop this is if intermediate AI is good enough to give every human a comfortable life but still isn't good enough to think on its own.
It's easy to imagine such an AI being developed: imagine a model that can learn to mimic humans at any task, but still cannot update itself without losing those skills and becoming worse. Such an AI could be trained to perform every job on earth, as long as we don't care about progress.
If such an AI is developed, and we don't quickly solve the remaining problems to get an AI that can advance science on its own, it's likely our progress stalls entirely there, as humans will no longer have a reason to go to school to advance science.
This approach to defining “true” intelligence seems flawed to me because of examples in biology where semantic understanding is in no way relevant to function. A slime mold solving a maze doesn’t even have a brain, yet it solves a problem to get food. There’s no knowing that it does that, no complex signal processing, no self-perception of purpose, but nevertheless it gets the food it needs. My response to that isn’t to say the slime mold has no intelligence, it’s to widen the definition of intelligence to include the mold. In other words, intelligence is something one does rather than has; it’s not the form but the function of the thing. Certainly LLMs lack anything in any way resembling human intelligence, they even lack brains, but they demonstrate a capacity to solve problems I don’t think is unreasonable to label intelligent behavior. You can put them in some mazes and LLMs will happen to solve them.
>LLMs lack anything in any way resembling human intelligence
I think intelligence has many aspects, from moulds solving mazes to chess, etc. I find LLMs resemble very much the rapid human language responses where you say something without thinking about it first. They are not very good at thinking, though. And hopeless if you were to, say, hook one up to a robot and tell it to fix your plumbing.
While it's debatable whether slime molds showcase intelligence, there's a substantial difference between their behavior and modern AI systems. The organism was never trained to traverse a maze. It simply behaves the same way it would in its natural habitat, seeking out food in this case, which we interpret as "solving" a human-made problem. To get an AI system to do the same, we would have to "train" it on large amounts of data that specifically included maze solving. And this training wouldn't carry over to any other type of problem, for which we would also need to train it specifically.
When you consider how humans and other animals learn, knowledge is carried over. I.e. if we learn how to solve a maze on paper, we can carry this knowledge over to solve a hedge maze. It's a contrived example, but you get the idea. When we learn, we build out a web of ideas in our minds which we can later use while thinking to solve other types of problems, or the same problems in different ways. This is a sign of intelligence that modern AI systems simply don't have. They're showing an illusion of intelligence, which as I've said before, can still be very useful.
My alternative definition would be something like this. Intelligence is the capacity to solve problems, where a problem is defined contextually. This means that what is and is not intelligence is negotiable in situations where the problem itself is negotiable. If you have water solve a maze, then yes the water could be said to have intelligence, though that would be a silly way to put it. It’s more that intelligence is a material phenomenon, and things which seem like they should be incredibly stupid can demonstrate surprisingly intelligent behavior.
LLMs are leagues ahead of viruses or proteins or water. If you put an LLM into a code editor with access to error messages, it can solve a problem you create for it, much like water flowing through a maze. Does it learn or change? No, everything is already there in the structure of the LLM. Does it have agency? No, it’s a transparently deterministic mapping from input to output. Can it demonstrate intelligent behavior? Yes.
That's an interesting way of looking at it, though I do disagree. Mainly because, as you mention, it would be silly to claim that water is intelligent if it can be used to solve a problem. That would imply that any human-made tool is intelligent, which is borderline absurd.
This is why I think it's important that if we're going to call these tools intelligent, then they must follow the processes that humans do to showcase that intelligence. Scoring high on a benchmark is not a good indicator of this, in the same way that a human scoring high on a test isn't. It's just one convenient way we have of judging this, and a very flawed one at that.
I keep trying this wolf-cabbage-goat problem with various permutations, let's say just a wolf and a cabbage, no goat mentioned. At some step the goat materializes in the answer. I tell it there is no goat, and yet it answers again and the goat is there.
I am not sure we are on the same page: the point of my response is that this paper is not enough to prevent exactly the argument you just made.
In any event, if you want to take issue with this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely accepted enough to support not just one but several of their papers in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why it's better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.
Output orientation - Is the output similar to what a human would create if they were to think?
Process orientation - Is the machine actually thinking when we say it's thinking?
I met someone who once drew a circuit diagram from memory. However, they didn’t draw it from inputs, operations, to outputs. They started drawing from the upper left corner, and continued drawing to the lower right, adding lines, triangles and rectangles as need be.
Rote learning can help you pass exams. At some point, it’s a meaningless difference between the utility of “knowing” how engineering works, and being able to apply methods and provide a result.
This is very much the confusion at play here, so both points are true.
1) These tools do not “Think”, in any way that counts as human thinking
2) the output is often the same as what a thinking human would create.
IF you are concerned with only the product, then what’s the difference? If you care about the process, then this isn’t thought.
To put it in a different context. If you are a consumer, do you care if the output was hand crafted by an artisan, or do you just need something that works.
If you are a producer in competition with others, you care if your competition is selling Knock offs at a lower price.
> IF you are concerned with only the product, then what’s the difference?
The difference is substantial. If the machine was actually thinking and it understood the meaning of its training data, it would be able to generate correct output based on logic, deduction, and association. We wouldn't need to feed it endless permutations of tokens so that it doesn't trip up when the input data changes slightly. This is the difference between a system with _actual_ knowledge, and a pattern matching system.
The same can somewhat be applied to humans as well. We can all either memorize the answers to specific questions so that we pass an exam, or we can actually do the hard work, study, build out the complex semantic web of ideas in our mind, and acquire actual knowledge. Passing the exam is simply a test of a particular permutation of that knowledge, but the real test is when we apply our thought process to that knowledge and generate results in the real world.
Modern machine learning optimizes for this memorization-like approach, simply because it's relatively easy to implement, and we now have the technical capability where vast amounts of data and compute can produce remarkable results that can fool us into thinking we're dealing with artificial intelligence. We still don't know how to model semantic knowledge that doesn't require extraordinary amounts of resources. I believe classical AI research in the 20th century leaned more towards this direction (knowledge-based / expert systems, etc.), but I'm not well versed in the history.
But if you need a submarine that can swim as agilely as a fish, then we still aren't there yet; fish are far superior to submarines in many ways. So submarines might be faster than fish, but there are so many maneuvers that fish can do that the submarine can't. It's the same here with thinking.
So just like computers are better than humans at multiplying numbers, there are still many things we need human intelligence for, even in today's era of LLMs.
The point here (which is from a quote by Dijkstra) is that if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to.
So if an LLM generates working code, correct translations, valid points relating to complex matters and so on it doesn't matter if it does so by thinking or by some other mechanism.
> if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to
But the point is that the desired result isn't achieved; we still need humans to think.
So we still need a word for what humans do that is different from what an LLM does. If you are saying there is no difference, then how do you explain the vast difference in capability between humans and LLMs?
Submarines and swimming is a great metaphor for this, since submarines clearly don't swim and thus have very different abilities in water: way better in some ways but way worse in others. So using that metaphor, it's clear that LLM "thinking" cannot be described with the same words as human thinking, since it's so different.
>If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?
No I completely agree that they are different, like swimming and propulsion by propellers - my point is that the difference may be irrelevant in many cases.
Humans haven't been able to beat computers in chess since the 90s, long before LLMs became a thing. Chess engines from the 90s were not at all "thinking" in any sense of the word.
It turns out "thinking" is not required in order to win chess games. Whatever mechanism a chess engine uses gets better results than a thinking human does, so if you want to win a chess game, you bring a computer, not a human.
What if that also applies to other things, like translation of languages, summarizing complex texts, writing advanced algorithms, realizing implications from a bunch of seemingly unrelated scientific papers, and so on. Does it matter that there was no "thinking" going on, if it works?
> I think AI maximalists will continue to think that the models are in fact getting less dim-witted
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang, i.e. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only by dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long-range thinking might require more substantial architectural changes (e.g. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.
People are doing all manner of very sophisticated inferency stuff now - it just tends to be extremely expensive for now and... people are keeping it secret.
If it were good enough to replace people, then it wouldn't be too expensive: they would have launched it, replaced a bunch of people, and made trillions of dollars by now.
So at best their internal models are still just performance multipliers, unless some breakthrough happened very recently. It might be a bigger multiplier, but that still keeps humans with jobs etc. and thus doesn't revolutionize much.
I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.
I think that commenter was disagreeing with this line:
> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"
It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.
Sorry, I can't edit it any more, but what I was trying to say is that if the authors are correct that this distinction is philosophically meaningful, then that is the conclusion. If they are not correct, then all their papers on this subject are basically meaningless.