Depends. Given the low res, a 3x64x64 pixel-space image is actually smaller than the latents you'd get from encoding a higher-res image with models like VQGAN or the Stable Diffusion VAE at their native resolutions.
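A quick back-of-the-envelope check, assuming the standard SD v1 VAE (8x spatial downsampling, 4 latent channels) applied at its 512x512 native resolution:

```python
# Element-count comparison: low-res pixel-space image vs. a higher-res
# image encoded to latents. Assumes the SD v1 VAE's 8x downsampling
# factor and 4 latent channels.

pixel_image = 3 * 64 * 64          # 64x64 RGB: 12,288 values
sd_latent = 4 * (512 // 8) ** 2    # 4x64x64 latent: 16,384 values

print(f"64x64 RGB image:          {pixel_image} values")
print(f"512x512 -> SD VAE latent: {sd_latent} values")
```

So the pixel-space model is working with fewer values than the latent model would be, despite the lower visual resolution.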
It's easier to get a sense of what's going wrong with a pixel-space model, though. With latent space, there's always the question of how color is represented / how entangled it is with other structure and semantics.
Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step.