LLMs are parameter-based encodings of linguistic representations of the world. Relative to robot predictive control problems, they are low-dimensional and static. They are batch-trained using supervised learning and are not designed to manage real-time shifts in the external world or in the reward space. They work because they operate in abstract, rule-governed spaces like language and mathematics. They are ill-suited to predictive control tasks. They are the IBM 360s of AI. Even so, they are astonishing achievements.
LeCun is right to say that continuous self-supervised (hierarchical) learning is the next frontier, and that means we need world models. I'm not sure that JEPA is the right tool to get us past that frontier, but at the moment there are not a lot of alternatives on the table.
See, I don't get why people say that the world is somehow more complex than the world of mathematics. I think that is because people don't really understand what mathematics is. A computer game, for example, is pure mathematics, minus the players, but the players can also be modelled just by their observed digital inputs/outputs.
So the world of mathematics is really the only world model we need. If we can build a self-supervised entity for that world, we can also deal with the real world.
Now, you may have an argument by saying that the "real" world is simpler and more constrained than the mathematical world, and therefore if we focus on what we can do in the real world, we might make progress quicker. That argument I might buy.
> So the world of mathematics is really the only world model we need. If we can build a self-supervised entity for that world, we can also deal with the real world.
In theory I think you are kind of right, in that you can model a lot of real-world behaviour using maths, but it's an extremely inefficient lens to view much of the world through.
Consider something like playing catch on a windy day. If you wanted to model that mathematically there is a lot going on: you've got the ball interacting with gravity, the fluid dynamics of the ball moving through the air, the changing wind conditions, etc. Yet this is a very basic task that many humans can do without really thinking about it.
Put more succinctly, there are many things we'd think of as very basic which need very complex maths to approach.
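To make the catch example concrete, here's a toy sketch of just the ballistic part of the problem: a point-mass ball with gravity, quadratic air drag, and a gusty crosswind, integrated with plain Euler steps (all constants are rough, made-up values):

```python
# Toy model of a thrown ball: gravity + quadratic drag + gusty crosswind.
# Constants are rough placeholders, not measured values.
import math
import random

g = 9.81            # gravity, m/s^2
drag_coeff = 0.001  # lumped 0.5 * rho * Cd * A for a small ball (assumed)
mass = 0.057        # kg, roughly a tennis ball
dt = 0.01           # integration step, s

def simulate_throw(vx, vy, wind_mean=3.0, wind_gust=2.0):
    """Return (horizontal distance, flight time) for one throw."""
    x = y = t = 0.0
    while y >= 0.0:
        wind = wind_mean + wind_gust * (random.random() - 0.5)  # crude gusts
        rel_vx, rel_vy = vx - wind, vy          # velocity relative to the air
        speed = math.hypot(rel_vx, rel_vy)
        ax = -(drag_coeff / mass) * speed * rel_vx
        ay = -g - (drag_coeff / mass) * speed * rel_vy
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
        t += dt
    return x, t

print(simulate_throw(vx=10.0, vy=8.0))
```

And even this leaves out spin, the catcher's own movement, hand trajectory planning, and perception; a kid does the whole thing without writing down a single equation.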
This view of simulation is just wrong and does not correspond at all to human perception.
Firstly, games aren't mathematics. They are low-quality models of physics. Mathematics cannot say what will happen in reality; mathematics can only describe a model and say what happens in the model. Mathematics alone cannot say anything about the real world, so a world model doing only mathematics cannot say anything about the world either.
Secondly, and far worse for your premise, humans do not need these mathematical models. I do not need to understand the extremely complex mechanical problem of opening a door in order to open a door. A world model which tries to understand the world based on mathematics has to. This makes any world model based on mathematics strictly inferior and totally unsuited to the goals.
The world of mathematics is only a language. The (Platonic) concepts go from simple to very complex, but at the base stands a (dynamic and evolving) language.
The real world, however, is far more complex and perhaps rooted in a universal language, but one we don’t know (yet) and ultimately try to describe and order through all scientific endeavors combined.
This philosophy is an attempt to point out that you can create worlds from mathematics, but we are far from describing or simulating ‘Our World’ (Platonic concept) in mathematics.
Danijar Hafner just left DeepMind. He's behind the Dreamer series of models which are IMO the most promising direction for world models anyone has come up with yet. I'm wondering where he's headed. Maybe he could end up at LeCun's startup?
In Dreamer 4 they are able to train an agent to play Minecraft with enough skill to obtain diamonds, without ever playing the game at all. Only by watching humans play. They first build a world model, then train the agent purely in scenarios imagined by the world model, requiring zero extra data or experience. Hopefully it's obvious how generating data from a world model might be useful for training agents in domains where we don't have datasets like the entire internet just sitting around ready-made for us to use.
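For anyone curious what "train purely in imagination" means mechanically, here is a heavily simplified sketch of the general recipe (this is not the actual Dreamer 4 code or architecture; the module sizes, toy data, and the whole setup are invented for illustration): fit a latent dynamics model on logged transitions, then update the policy only on rollouts the model itself generates.

```python
# Sketch of "learn a world model, then train the agent in its imagination".
# Everything here is a toy stand-in, not the Dreamer 4 implementation.
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim = 8, 2, 16

# World model: encode observation -> latent, predict next latent and reward.
encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.Tanh())
dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, latent_dim), nn.Tanh())
reward_head = nn.Linear(latent_dim, 1)
policy = nn.Sequential(nn.Linear(latent_dim, act_dim), nn.Tanh())

wm_params = [*encoder.parameters(), *dynamics.parameters(), *reward_head.parameters()]
wm_opt = torch.optim.Adam(wm_params, lr=1e-3)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def world_model_step(obs, act, next_obs, rew):
    """Stage 1: fit the world model on offline transitions (stand-in for human play logs)."""
    z, z_next = encoder(obs), encoder(next_obs).detach()
    z_pred = dynamics(torch.cat([z, act], dim=-1))
    loss = ((z_pred - z_next) ** 2).mean() + ((reward_head(z).squeeze(-1) - rew) ** 2).mean()
    wm_opt.zero_grad(); loss.backward(); wm_opt.step()

def imagination_step(start_obs, horizon=10):
    """Stage 2: improve the policy purely on imagined rollouts -- no environment calls."""
    z = encoder(start_obs).detach()
    imagined_return = torch.zeros(start_obs.shape[0])
    for _ in range(horizon):
        a = policy(z)
        z = dynamics(torch.cat([z, a], dim=-1))   # dream forward one step
        imagined_return = imagined_return + reward_head(z).squeeze(-1)
    loss = -imagined_return.mean()                # maximize predicted return
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()

# Fake "demonstration" data just to make the sketch self-contained.
obs = torch.randn(64, obs_dim)
act = torch.randn(64, act_dim)
next_obs = obs + 0.1 * torch.randn_like(obs)
rew = torch.randn(64)

for _ in range(200):
    world_model_step(obs, act, next_obs, rew)
for _ in range(200):
    imagination_step(obs)
```

The interesting property is in the second loop: the agent never touches real data again, it only interacts with the model's predictions, which is why this kind of setup matters for domains where real experience is scarce or expensive.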
And the pendulum swings back toward representation. It is becoming clear that the LLM approach is not adequate to reach what John McCarthy called human-level intelligence:
Between us and human-level intelligence lie many problems. They can be summarized as that of succeeding in the "common-sense informatic situation". [1]
> It is becoming clear that the LLM approach is not adequate to reach what John McCarthy called human-level intelligence
Perhaps paradoxically, if/as this becomes a consensus view, I can be more excited about AI. I am an "AI skeptic" not in principle, but with respect to the current intertwined investment and hype cycles surrounding "AI".
Absent the overblown hype, I can become more interested in the real possibilities (both the immediate ones, using existing ML methods, and the remote, theoretical capabilities that follow from what I think about minds and computers in general) again.
I think when this blows over I can also feel freer to appreciate some of the genuinely cool tricks LLMs can perform.
I always felt like one of the reasons LLMs are so good is that they piggyback on the many years that have gone into developing language as an information representation/compression format. I don’t know if there’s anything similar a world model can take advantage of.
That being said, there have been models which are pretty effective at other things that don’t use language, so maybe it’s a non-issue.
I think there is a lot of merit to this approach. Ultimately we live in a world guided by physics and by macro-level perception driven by our senses and our own motor control. Of course Newtonian physics is not the be-all and end-all -- cell biology or quantum mechanics works on a very different level... but what is important here is that we know that human beings understand these things and make novel discoveries about them using a thinking apparatus that was pre-trained on large-scale Newtonian physics. I've found that even in advanced mathematics my mind always uses low-level geometric analogies. So the "embeddings" or priors that can be obtained are probably much better than what can be done through text correlation as with LLMs. It's very different to learn the word "bounce" through observation of a physical model of a ball bouncing vs. seeing what other words it co-occurs with.
Because they are smart enough to realize that current LLM tech is nearing a dead end and cannot serve as full AGI without actual knowledge of the real world, even ignoring context and hallucination issues.
I played with Marble yesterday, Fei-Fei/World Labs' new product.
It is the most impressed I've been with an AI experience since the first time I saw a model one-shot material code.
Sure, it's an early product. The visual output reminds me a lot of early SDXL. But just look at what's happened to video in the last year and to images in the last three. The same thing is going to happen here, and fast, and I see the vision for generative worlds for everything from gaming/media to education to RL/simulation.
I wasn't actually able to use it because the servers were overloaded. What exactly impressed you (or, more generally, what does it actually let you do at the moment)?
What you get is a 3D room based on the prompt/image. It rewrites your prompt to a specific format. Overall the rooms tend to be detailed and imaginative.
Then you can fly around the room like in Minecraft creative mode. Really looking forward to more editing features/infill to augment this.
I’m trying to understand the conversation around “world models.” Why is Tesla’s FSD rarely mentioned in these discussions? Their system perceives, reasons, and acts in the physical world, and they train it using large-scale simulation/digital-twin environments. In what sense does FSD not count as a world model—or does it, and I’m missing something?
I don't know why you're focusing on Tesla to the exclusion of more successful self-driving efforts like Waymo, but yeah, cars moving around in and predicting the real world are pretty interesting in this regard.
Every time I see LeCun talk about world models, I can’t help but think it is also just a tweak on the fundamentals of what is behind current LLM technology. In the end it’s still neural networks. To me, having to “teach” the model how physics works makes me think it can’t be true AGI either.
In "From Words to Worlds: Spatial Intelligence is AI’s Next Frontier", Li states directly "I’m not a philosopher", then proceeds to make a philosophical argument that elevates visual perception as the basis for the evolution of intelligence.
I think video and agentic and multimodal models have led to this point, but actually making a world model may prove to be long and difficult.
I feel LeCun is correct that LLMs as of now have limitations that call for an architectural overhaul. LLMs currently have a problem with context rot, and this would hamper an effective world model if the world disintegrates and becomes incoherent and hallucinated over time.
It's doubtful whether investors would be in for the long haul, which may explain the behavior of Sam Altman in seeking government support. The other approaches described in this article may be more investor-friendly, as there is a more immediate return from creating a 3D asset or a virtual simulation.
A trillion dollars are now riding on that white whale. An entire naval fleet is being raised for the purposes of chasing down that whale. LeCun and Fei-Fei merely believe that the whale is in a different ocean.
With all due respect, AI is ultimately a capital game. World models aren’t where real B2B customer revenue comes from—at least compared to today’s LLMs; they’re mainly a better story for raising huge amounts of private capital. Hopefully they figure out how to build the next-gen AI architecture along the way.
The most useful models are image, video, and audio models. It makes sense that we'd make the video models more 4D aware.
Text really hogged all the attention. Media is where AI is really going to shine.
Some of the most profitable models right now are in music, image, and video generation. A lot of people are having a blast doing things they could legitimately never do before, and real working professionals are able to use the tools to get 1000x more done - perhaps providing a path to independence from bigger studios, and certainly more autonomy for those not born into nepotism.
As long as companies don't over-raise like OpenAI, there should be a smooth gradient from next gen media tools to revolutionary future stuff like immersive VR worlds that you can bend like the Matrix or Holodeck.
And I'll just be exceedingly chuffed if we get open source and highly capable world models from the Chinese that keep us within spitting distance of the unicorns.
Fundamentally, what AGI is trying to do is encode the ability to do logic and reasoning. Tokens, images, video, and audio are all just information of different entropy density that is the output of that logical reasoning process, or of an emulation of it.
I mean both, and in AI today, they’re deeply intertwined. The “capital game” isn’t just about money—it’s about access to compute, talent, and time. Whoever has the resources can experiment, iterate, and potentially uncover the next big architecture. That financial power naturally translates into influence—control over the market, narrative, and ecosystem. In practice, the investment game and the market ruler’s game often become the same thing.
AI might be the biggest transfer of wealth from the rich to the poor in history. Billions have been poured into closed-source models, which have led directly and indirectly to open-weight models being available to everyone.
It's not just the cost, but the freedom to do what you want... With open-weight models I can run them on my own hardware on the edge, work with data I am not cool with uploading, experiment with different interfaces, use them for things the original trainers did not intend, even retrain the model a bit.
I am developing a p2p program where the model runs on the end user's computer. So I don't even need to pay money for each user or run a bunch of infrastructure to monetize them. It is a game changer and allows for a completely different architecture.
That’s awesome, but I think we’re kinda talking past each other. I was responding to the claim that these models represent the largest wealth transfer from rich to poor in history. In order for that to be true, these models, closed or open, need to have value for average people. I don’t see that at all. Most use it as a glorified google, some are actively harmed by the sycophantic tendencies of the models.
Edit: I’d like to add that I personally get a lot of value out of the models. They’ve helped me learn to do frontend development very quickly at my job. That said, that hasn’t translated into higher pay. The expectations have risen with employee capacity.
Well that makes sense. Perhaps it is not a transfer of wealth to the poor, but a transfer of power to the middle class.
I would say this: in the future I think we are gonna have all sorts of robotics that will be able to use LLMs and vision models and stuff to do basic reasoning and coordination to automate a ton of tasks. The average person is basically going to be able to fit a micro-factory in their house that can knit all of their clothes, make circuit boards for all of the computers they need, stitch their wounds together, and such.
In the future, we won't even need to engage in the economy of mass production, and we will basically all be low effort self-sufficient sustainable farmers and manufacturers due to AI reducing the effectiveness of economies of scale.
No one will have conventional jobs, so we will each recreate the old economy on a tiny scale to avoid the expensive monopolies. A single person's job would be like operating a tiny factory that produces a certain type of insulin or a certain antibiotic, or some sort of resistor or tobacco or something. Like the idea of family farms extended to the industrial domain.
And all of this progress is being undertaken at massive cost, and given away for free, by AI companies that think it will have the exact opposite effect, which is the monetizable one.
I think that LLMs can be used as a far more advanced search than Google. Imagine you have some project that requires a certain part. You could spend hours browsing the internet for the best deal, or you could run a local LLM that scrapes websites, does the shipping calculations, and runs a reasoning model to decide if an item is a good fit based on the criteria you give it. You essentially have the shopping done for you; it is just a matter of one person designing the framework and open-sourcing it.
Most searching isn't so much finding a direct answer to your query, but scoping out a general field of information where you don't even know what it is you want to know. LLMs give us the opportunity to script general reasoning tasks.
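A rough sketch of what that kind of scripted shopping loop could look like, assuming a locally hosted model behind an OpenAI-style completion endpoint (the URL, response format, shipping formula, and candidate data below are all hypothetical placeholders, and the scraping step is omitted):

```python
# Hypothetical "do the shopping for me" loop: price + shipping, then ask a
# local model to judge each candidate against the user's criteria.
import json
import urllib.request

LOCAL_LLM_URL = "http://localhost:8080/v1/completions"  # assumed local server

def ask_local_llm(prompt: str) -> str:
    """Send a prompt to a locally hosted model (endpoint and format assumed)."""
    body = json.dumps({"prompt": prompt, "max_tokens": 200}).encode()
    req = urllib.request.Request(LOCAL_LLM_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

def shipping_estimate(weight_kg: float) -> float:
    # Placeholder flat-rate formula; a real script would query carriers.
    return 5.0 + 2.5 * weight_kg

def score_candidates(candidates, criteria):
    results = []
    for c in candidates:
        landed = c["price"] + shipping_estimate(c["weight_kg"])
        verdict = ask_local_llm(
            f"Criteria: {criteria}\n"
            f"Item: {c['title']}, total landed cost ${landed:.2f}.\n"
            "Answer GOOD or BAD with one sentence of reasoning."
        )
        results.append((c["title"], landed, verdict.strip()))
    return sorted(results, key=lambda r: r[1])   # cheapest landed cost first

if __name__ == "__main__":
    candidates = [
        {"title": "Part A (used)", "price": 14.0, "weight_kg": 0.3},
        {"title": "Part B (new)", "price": 22.0, "weight_kg": 0.2},
    ]
    for row in score_candidates(candidates, "under $25 landed, ships this week"):
        print(row)
```

The point is just that once the pieces exist, the whole reasoning chain becomes scriptable.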
Maybe it is bad or neutral for labor in the short term, but in the long run I think it is worse for capital. A lot of the moat that capital has is the ability to organize labor. If anyone with a computer can do the work of 100 men, then when the 100 men get laid off they will all ask themselves "why can't I also just start a competitor where I automate all the tasks in the company?".
Just to clarify, are you starting from the point of view that AI simply does all the jobs we currently have? Were this the case, they’d also surely build robots and design AI themselves, yeah? Labor as we know it wouldn’t exist anymore, because it would simply be impossible for humans to do useful work.
We’re willing to fork over money for things because those things require human effort to obtain, and we’d rather not expend it. In this new world, everything from the extraction of raw materials to the production of advanced technology would require no human effort. If our modern notions of property still persisted, however, then this doesn’t mean that people would simply have whatever they wanted. You need trees to get apples, you need a hole in the ground to get coal. Ultimately the limiting factor on everything would come down to land. Labor-time is replaced with land-time, because the land works itself. Not having land in this society would be like not having limbs or a brain in ours. You would have nothing to exchange in order to get the things you needed.
So I’d say that either the notion of property itself would change, or people without property would die until everyone had some amount of it, and people would generally occupy their time with various games that shuffled around the limited amount of land available as a proxy for social status. The flawed assumption that you make is that people would all have some amount of land in which to make their microfactory, but this would only be the case after lots of people died.
Not really. It would be good at doing generic, well-defined tasks but bad at doing specialized, novel tasks. You would still need some humans in the loop to get to the bottom of niche problems.
I agree that it would still go to hell without some type of Georgism or UBI or socialism. I agree that wealth will transfer to companies that control industrial means of production (like 3M or mining companies or Intel or something), but it will also transfer out of companies whose moat is based on control of human capital (like accounting, software development, and law).
I think that even before AI, we are already seeing this sort of "land is everything" economy. Physical labor has largely been automated in the industrial revolution. Intellectual labor has been displaced not by newfangled AI mechanisms, but by information storage mediums and general pre-AI automation. If you are an artist, you are competing with all of the art that came before you. If you are an engineer, you are reinventing the wheel working on some sort of project that, if open sourced, would only need to be done once.
A major sense in which AI eliminates jobs is by acting as a bypass for copyright: it allows you to plausibly make a near-copy of something without a license. There is simply not an infinite amount of demand in the economy for intellectual labor. The thing that destroys the world as we know it is not so much AI, but information sharing and de-duplication of work. Open source would have destroyed the economy if AI didn't.
So everyone is working in the service sector now; it's unsustainable. Property prices keep going up, fertility rate keeps going down.
Pretty similar to social media in a lot of ways. They've strip mined the commons and provided us a corporate controlled walled garden to compensate us for our loss.
If I was smarter, I would have predicted not only that everyone else would figure out that world models are a critical step, but that as a direct consequence the term "world model" would lose all meaning. Maybe next time. That said, LeCun's concept in the blog post is the only one worthy of the title.
The naming collision here is unfortunate since the two kinds of models described couldn't be any more different in purpose. Maybe JEPA-type world models should explicitly be called "predictive world models".
When you thought to yourself, "I think therefore I am," in what language did you think it? In English? The English language is an artifact of a community of English speakers. You can't have a language with grammatical rules without a community of speakers to make that language.
Almost nobody in the English-speaking community has direct access to the internals of your mind. The community learns things through consensus, e.g. via the scientific method. We know things in English via a community of English-speaking scientists, journalists, historians, etc. etc. Wittgenstein calls these the "structures of life," the ordinary day-to-day work we do to figure out what's true and false, likely and unlikely.
As you're probably aware, the scientific method has long struggled to find a "mind" in the brain doing the thinking; all we can find are just atoms, molecules, neurons, doing things, having behaviors. We can't find "thoughts" in the atoms. As far as our ordinary day-to-day scientific method is concerned, we can't find a "mind."
But "cogito ergo sum" isn't part of the scientific method. We don't believe "cogito ergo sum" because reproducible experiments have shown it to be true. "Cogito ergo sum" proposes a way of knowing disconnected from the messy structures of life we use in English.
So, perhaps you'd say, "oh, good point, I suppose I didn't think 'cogito ergo sum' in English or Latin or whatever, I thought it in a private language known only to me. From this vantage point, I only have direct knowledge of my own existence and my own perceptions in the present moment (since the past is uncertain), but at least I can have 100% certainty of my own existence in that language."
The problem is, you really can't have a private language, not a language with words (terms) and grammatical rules and logical inferences.
Suppose you assigned a term S to a particular sensation you're having right now. What are the rules of S? What is S and what is not S? Are there any rules for how to use S? How would you know? How would you enforce those rules over time? In a private language, there's no difference between using the term S "correctly" or "incorrectly." There are no rules in a private language; there can't be. Even mathematical proofs are impossible when every term in the proof means anything you want.
Descartes didn't originally write "cogito ergo sum" in Latin. He originally published it in French, "je pense, donc je suis." But in Europe, where Descartes was writing, Latin was the universal language, the one known to all sorts of people across the continent. For Descartes, Latin was the language of empire, the language every civilized person knew because their ancestors were forced to learn it at the point of a sword, the language of absolutes.
Wittgenstein has a famous line, "Whereof one cannot speak, thereof one must be silent." So must we be silent about "cogito ergo sum." "cogito ergo sum" isn't valid in Latin; "je pense, donc je suis" isn't valid in French. It could only be valid in an unspeakable private language, a language with no grammatical rules, no logic, where true and false are indistinguishable. "Cogito ergo sum" could only be valid in an unusable language where everything is meaningless.
That's a lot of words to claim that language has to exist before thought can, which gets disproved in an instant when your audience points to the large number of fauna on earth that have no language and yet display thought.
That's not what I'm arguing. The argument is that "cogito ergo sum" is invalid, which is part of an argument against the existence of a "mind" above and beyond what the brain does in a living body. The atoms are all there is.
I don't think I have a "mind" above and beyond my body, and I don't think you do, either. Animals can remember stuff, solve puzzles, and express pain, just like you or I do. We do all that with our brains, not with our "minds."
The problem with making universal assertions, as opposed to existential assertions, is that a single counterexample is all that is necessary to prove the assertion wrong.
> That's not what I'm arguing.
Okay; your argument is difficult to digest because, unlike most philosophy arguments, you neither lead nor end with the actual thesis; you present a book-length text as support for a thesis that is never stated.
> The argument is that "cogito ergo sum" is invalid, which is part of an argument against the existence of a "mind" above and beyond what the brain does in a living body. The atoms are all there is.
What's your thesis, then? "Cogito ergo sum is invalid" is hardly a thesis. Maybe you are asserting that there is no "mind" above and beyond the living brain, which would be a universal claim, not an existential one.
If that is indeed your claim, then it's not a testable/falsifiable one anyway; you are going to require instead a sequence of premises that are each accepted by the audience you wish to sway, with intermediate conclusions that are likewise accepted by the audience, before you present your final conclusion based exclusively on the premises list.
A narrative is not a good way to present a philosophical argument, especially when it is a counterargument to an argument that was presented (even if only verbally at the time) in the standard logical format I described.
A better way to convince anyone that a formally presented argument (as cogito ergo sum was) is invalid (or unsound) is to attack the premises. It is not normal to ignore the premises of the original argument and present premises of your own.
(PS. It's been a long time since I was in a formal logic philosophy class and maybe things have changed, but they haven't (I hope!) changed so much that logic is completely thrown out the window in favour of narrative)
Language, and especially its mechanics like grammar, are entirely a distraction w.r.t. "cogito ergo sum". The underlying argument it points to is language-independent.
Correct. Here is the stub of a reply I can't be assed to finish right now:
Words and language refer to sensations (P.I. §244: "How do words refer to sensations?").
Sensations can exist independently of language to refer to them (P.I. §256: "—But suppose I didn’t have any natural expression for the sensation, but only had the sensation?").
Thus it can be possible for one to experience the cogito, the mere act of awareness, independently of language. The point of the cogito is its self-evidence, prior to language even entering the picture as a sign standing for or referring to the self-evident sensation of conscious awareness.
I note that you keep saying "cogito" without the "ergo."
"I think therefore I am" is invalid, and what's wrong with it is the "therefore," the idea that you knew one thing, and you drew a "logical" conclusion from it, in a "prior to language" environment where words have no meaning, where "true" and "false" are indistinguishable, and logic is impossible.
Logic requires words. "Logical" means "verbal," from the Greek logos (λόγος). You can't have a logical argument (you can't draw a conclusion) from the instantaneous standpoint of someone "experiencing" cogito, where words mean whatever you want, or nothing at all.
The experience you're having is not a logical argument. As a sentence, "cogito ergo sum" is invalidated as soon as you write it down in a shared language.
I'm sure it feels right to you! But you can't actually say anything true about it in English, or Latin, or any other shared language.
For, on the one hand, there is the real world, and on the other, a whole system of symbols about that world which we have in our minds. These are very, very useful symbols; all civilization depends on them; but like all good things they have their disadvantages, and the principal disadvantage of symbols is that we confuse them with reality, just as we confuse money with actual wealth; and our names about ourselves, our ideas of ourselves, our images of ourselves, *with* ourselves.
Now of course, reality, from a philosopher's point of view, is a dangerous word. A philosopher will ask me, what do I mean by reality? Am I talking about the physical world of nature, or am I talking about a spiritual world, or what?
And to that I have a very simple answer. When we talk about the material world, that is actually a philosophical concept - so in the same way, if I say that reality is spiritual, that's also a philosophical concept - and reality itself is not a concept.
Reality is - [...]
... and we won't give it a name.
The last refuge of the Cartesian is always, "My argument is correct in an ineffable way that I couldn't possibly write down."
"Cogito ergo sum" presents itself as a self-evident deduction, the one guaranteed universally agreeable truth, but, when you investigate it a little… oh, well, it's really more of a vibe than an argument, and isn't "logical argument" really a monkey-mind distraction from the indescribable lightness of existence?
If you define "logic" as requiring words, then it's only a model of causality, which is real entirely irrespective of life.
You're demanding that language perfectly convey an abstract argument, which is obviously unreasonable, and saying that since it can't do that we can't discuss tricky subjects at all, which if you take this line of reasoning seriously is all of them. So how about you "remain silent".
The first link is about consciousness. The second link argues that language is not thought. The third link argues that intentions as "discrete mental states" may not be found in the brain.
This does not necessarily disprove the existence of a world model, and these papers are not directly dealing with the concept. As shown by how LLMs work, the world model (and how the brain thinks) may be implicit rather than an explicit philosophical/psychological construct within the neural net of the brain.
Unlike with neural nets and LLMs, neuroscientists and the like have no way of taking out a human brain and hooking it up to a computer to run experiments on what our neurons are doing. This software is the next best thing we have at the moment for determining what neurons can do and how they work.
Perhaps there is a dialectical synthesis to be made between your position, which I interpret as something like "there do not exist discrete Cartesian states within the brain," and how neural nets learn concepts implicitly through statistics.
The first link is about how philosophy and psychology are used to describe brain-cognitive behavior research, which has limited explanatory capability compared to a hypothetical interpretation using the field's own vocabulary instead of terms borrowed from other fields.
The second link is about an AI that detects consciousness in coma patients.
The third link is about how coma is associated with a low-complexity and high-predictability passive cortical state. Kickstarting the brain to a high-complexity and low-predictability state of cortical dynamics is a sign of recovery back to consciousness.
The LLM grift is burned up, so this is the next thing. It has just enough new magic tricks to wow the VCs who don't really get what's going on here. I think this comment from the article says it all:
“Taking images and turning them into 3D environments using gaussian splats, depth and inpainting. Cool, but that’s a 3D GS pipeline, not a robot brain.”
One problem with VR and VFX is how expensive it is, in terms of man-hours, to create immersive worlds. This significantly reduces that cost, has applications in all sorts of ways, and could realistically improve the availability of content in VR and reduce movie production costs. And that’s just the obvious applications (ignoring that these world models can be used to train AI itself).
Who wants to spend time consuming AI art? If the costs are low, then there is no moat to creating movies or Gaussian-splat VR games, and therefore no reason to spend money on movies or VR splat games.
Is the artist the paint brush or the mind behind it creating the vision?
A lot of VFX today is automated, and things are possible that were just too cost-prohibitive before. You could say "who wants to see digital art?" The moat is the artist realizing their vision - for the same dollar spend you get significantly more art or higher-quality art (e.g., a first pass by AI with humans doing the refinement steps).
The boom in television, for example, is because of plummeting production and distribution costs.
I'm sure there are other valid reasons, but I think the most obvious one is that LLMs are not improving as fast as the money asks for, so we're moving on to the next buzzword.