
My suspicion is that you're leaving some important parts of your logic unstated, such as a belief in a magical property within humans called "understanding", which you don't define.

The ability of video models to generate novel video consistent with physical reality shows that they have extracted important invariants - physical law - out of the data.

It's probably better not to muddle the discussion with ill-defined terms such as "intelligence" or "understanding".

I have my own beef with the "AGI is nigh" crowd, but this criticism amounts to word play.



It feels like, if these image and video generation models were really recovering fundamental laws from the training data, they should at least be able to re-create a scene from a different angle.
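
That's easy enough to probe. A sketch, assuming the Hugging Face diffusers library (the checkpoint name, prompts, and seed are just examples): ask for the same scene from two viewpoints and check whether the geometry agrees.

  # Sketch: probe a text-to-image model for view consistency.
  # Assumes the Hugging Face diffusers library; checkpoint is an example.
  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")

  # Same scene, same seed; only the viewpoint phrase changes.
  g = torch.Generator("cuda").manual_seed(0)
  front = pipe("a red mug on a wooden desk, front view", generator=g).images[0]
  g = torch.Generator("cuda").manual_seed(0)
  side = pipe("a red mug on a wooden desk, side view", generator=g).images[0]

  front.save("front.png")
  side.save("side.png")
  # If the model had recovered scene geometry, both renders would depict
  # the same mug; in practice the shape and details usually drift.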


"Allegory of the cave" comes to mind, when trying to describe the understanding that's missing from diffusion models. I think a super-model with such qualifications would require a number of ControlNets in a non-visual domains to be able to encode understanding of the underlying physics. Diffusion models can render permutations of whatever they've seen fairly well without that, though.


I'm very familiar with the allegory of the cave, but I'm not sure I understand where you're going with the analogy here.

Are you saying that it is not possible to learn about dynamics in a higher-dimensional space from a lower-dimensional projection? This is clearly not true in general.

E.g., video models learn that even though they only ever see and output 2d data, objects have different sides, in a fashion consistent with our 3d reality.
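
As a toy illustration (numpy only; no claim about what video models do internally): two 2d projections are already enough to pin down a 3d point exactly.

  # Toy triangulation: two orthographic 2d projections determine a 3d point.
  import numpy as np

  P1 = np.array([[1., 0., 0.],
                 [0., 1., 0.]])  # camera 1: drops z
  P2 = np.array([[0., 0., 1.],
                 [0., 1., 0.]])  # camera 2: drops x
  x_true = np.array([2., -1., 3.])

  obs = np.concatenate([P1 @ x_true, P2 @ x_true])  # the 2d "shadows"
  A = np.vstack([P1, P2])                           # rank 3: fully determined
  x_est, *_ = np.linalg.lstsq(A, obs, rcond=None)
  print(x_est)  # [ 2. -1.  3.]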

The distinction you (and others in this thread) are making is purely one of degree - how much generalization has been achieved, and how well - not one of category.



