Depends. Given the low res, a 3x64x64 pixel-space image is actually smaller than the latents you'd get from encoding a higher-res image with models like VQGAN or the Stable Diffusion VAE at their native resolutions.
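A quick back-of-the-envelope check, assuming the standard SD v1 VAE (8x spatial downsampling, 4 latent channels) applied at its 512x512 native resolution:

```python
# Element-count comparison: low-res pixel-space image vs. a higher-res
# image encoded to latents. Assumes the SD v1 VAE's 8x downsampling
# factor and 4 latent channels.

pixel_image = 3 * 64 * 64          # 64x64 RGB: 12,288 values
sd_latent = 4 * (512 // 8) ** 2    # 4x64x64 latent: 16,384 values

print(f"64x64 RGB image:          {pixel_image} values")
print(f"512x512 -> SD VAE latent: {sd_latent} values")
```

So the pixel-space model is working with fewer values than the latent model would be, despite the lower visual resolution.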
It's easier to get a sense of what's going wrong with a pixel-space model, though. With latent space, there's always the question of how color is represented / how entangled it is with other structure and semantics.
Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step.