
My suspicion is that you're leaving some important parts of your logic unstated, such as a belief in a magical property within humans called "understanding", which you don't define.

The ability of video models to generate novel video consistent with physical reality shows that they have extracted important invariants - physical law - out of the data.

It's probably better not to muddle the discussion with ill-defined terms such as "intelligence" or "understanding".

I have my own beef with the "AGI is nigh" crowd, but this criticism amounts to word play.



It feels like, if these image and video generation models were really recovering fundamental laws from the training data, they should at least be able to re-create a scene from a different angle.
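
That's easy enough to probe. A sketch, assuming the Hugging Face diffusers library (the checkpoint name, prompts, and seed are just examples): ask for the same scene from two viewpoints and check whether the geometry agrees.

  # Sketch: probe a text-to-image model for view consistency.
  # Assumes the Hugging Face diffusers library; checkpoint is an example.
  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")

  # Same scene, same seed; only the viewpoint phrase changes.
  g = torch.Generator("cuda").manual_seed(0)
  front = pipe("a red mug on a wooden desk, front view", generator=g).images[0]
  g = torch.Generator("cuda").manual_seed(0)
  side = pipe("a red mug on a wooden desk, side view", generator=g).images[0]

  front.save("front.png")
  side.save("side.png")
  # If the model had recovered scene geometry, both renders would depict
  # the same mug; in practice the shape and details usually drift.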


"Allegory of the cave" comes to mind, when trying to describe the understanding that's missing from diffusion models. I think a super-model with such qualifications would require a number of ControlNets in a non-visual domains to be able to encode understanding of the underlying physics. Diffusion models can render permutations of whatever they've seen fairly well without that, though.


I'm very familiar with the allegory of the cave, but I'm not sure I understand where you're going with the analogy here.

Are you saying that it is not possible to learn about dynamics in a higher-dimensional space from a lower-dimensional projection? This is clearly not true in general.

E.g., video models learn that even though they only ever see and output 2d data, objects have different sides, in a fashion consistent with our 3d reality.
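
As a toy illustration (numpy only; no claim about what video models do internally): two 2d projections are already enough to pin down a 3d point exactly.

  # Toy triangulation: two orthographic 2d projections determine a 3d point.
  import numpy as np

  P1 = np.array([[1., 0., 0.],
                 [0., 1., 0.]])  # camera 1: drops z
  P2 = np.array([[0., 0., 1.],
                 [0., 1., 0.]])  # camera 2: drops x
  x_true = np.array([2., -1., 3.])

  obs = np.concatenate([P1 @ x_true, P2 @ x_true])  # the 2d "shadows"
  A = np.vstack([P1, P2])                           # rank 3: fully determined
  x_est, *_ = np.linalg.lstsq(A, obs, rcond=None)
  print(x_est)  # [ 2. -1.  3.]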

The distinction you (and others in this thread) are making is purely one of degree - how much generalization has been achieved, and how well - not one of category.



