Thanks for the reply and adding some additional context. I'm also a vision researcher, fwiw (I'll be at CVPR if you all are).
(Some of this will be for benefit of other HN non-researcher readers)
I'm hoping you can provide some more context. Are these trained on a single video moving through these environments, where the camera is not turning? What I am trying to understand is what is being generated vs what is being recalled.
It may be a more contentious view, but I do not think we're remotely ready to call these systems "world models" if they are primarily performing recall. Maybe this is bias from an education in physics (I have a degree in it), but world modeling is not just about creating consistent imagery; it is about actually recovering the underlying physics of the videospace (as opposed to the reality the videos come from). I've yet to see a demonstration of a model that comes anywhere near this, or that convinces me we're on the path towards it.
The key difference here is whether we are building Doom, which has system requirements of 100MB of disk and 8MB of RAM with minimal computation, or an extremely decompressed version that requires 4GB of disk and a powerful GPU to run only the first level, and can't even get critical game dynamics right, like shooting the right enemy (GameNGen).
The problem is not the ability to predict future states from previous ones; the problem is the ability to recover /causal structures/ from observation.
Critically, a p̶h̶y̶s̶i̶c̶s̶ world model is able to process a counterfactual.
Our video game is able to make predictions, even counterfactual predictions, with its engine. Of course, this isn't generated by observation and environment interaction; it is generated through directed programming and testing (where the testing includes observing and probing the environment). If the goal were just that, then our diffusion models would comparatively be a poor contender. It's the wrong metric. The coherence is a consequence of the world modeling (i.e. the game engine), but coherence can also be developed from recall. Recall alone will be unable to make a counterfactual.
Certainly we're in a research phase and need to make tons of improvements, but we can't make those improvements if we blindly let our models cheat the physics and only pick up that "user clicks fire" correlates with "monster usually dies when user shoots". LLMs have tons of similar problems with such shortcuts, and the physics will tell you that you are not going to be able to pick up such causal associations without some very specific signals to observe. Unfortunately, causality cannot be determined from observation alone (a well known physics result![0]). You end up with many models that generate accurate predictions yet are indistinguishable from one another without careful factorization, probing, and often careful integration of various other such models. It is this much harder and more nuanced task that is required of a world model, rather than memory.
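To make the "observation alone is not enough" point concrete, here's a toy sketch (the setup and names are my own illustration, not from any of the systems discussed): two causal structures — "fire button kills monster" vs "a hidden script drives both the button animation and the monster's death" — that produce identical observational data but diverge under intervention.

```python
import random

random.seed(0)

def model_a_obs():
    # Ground truth A: X -> Y (pressing fire causes the monster to die)
    x = random.randint(0, 1)
    return x, x

def model_b_obs():
    # Ground truth B: hidden Z -> X and Z -> Y (a confounder drives both)
    z = random.randint(0, 1)
    return z, z

def model_a_do(x):
    # Intervention in A: force X; Y still follows X
    return x, x

def model_b_do(x):
    # Intervention in B: force X; Y still follows the hidden Z
    z = random.randint(0, 1)
    return x, z

n = 20_000
obs_a = [model_a_obs() for _ in range(n)]
obs_b = [model_b_obs() for _ in range(n)]
# Observationally identical: in both models, X == Y in every sample,
# so no amount of passive watching can tell A from B.
print(all(x == y for x, y in obs_a), all(x == y for x, y in obs_b))  # True True

do_a = [model_a_do(1) for _ in range(n)]
do_b = [model_b_do(1) for _ in range(n)]
# Under the counterfactual do(X=1) they diverge: P(Y=1) is 1.0 in A, ~0.5 in B
print(sum(y for _, y in do_a) / n)  # 1.0
print(sum(y for _, y in do_b) / n)  # ~0.5
```

A model trained purely on the observational samples has no signal to prefer A over B, which is exactly why recall-driven coherence can't be taken as evidence of a recovered causal structure.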
Essentially, do we have "world models" or "cargo cult world models" (recall or something else)?
That's the context of my data question: to help us differentiate the two. Certainly the work is impressive, and tbh I do believe there is quite a bit of utility in the cargo cult setting, but we should also be clear about what is being claimed and what isn't.
I'm also interested in how you're trying to address the causal modeling problem.
[0] There is much discussion of the Duhem-Quine thesis, which is a much stronger claim than the one I stated. There's the famous Michelson-Morley experiment, which did not actually rule out an aether but rather only showed that it had no directionality. Or we could even use the classic Heisenberg Uncertainty Principle, which revolutionized quantum mechanics by showing that there are things that are unobservable, leading to Schrödinger's Cat (and some weird multiverse hypotheses). And we even have String Theory, where the main gripe remains that it is indistinguishable from other TOEs because the differences in their predictions are non-observable.