Seems like a pretty reasonable estimate. If it costs about $2 an hour to rent a decent GPU, that's 18s of compute per penny, which sounds pretty doable for one frame.
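Rough back-of-the-envelope for that number, assuming a flat $2/hour rental rate:

```python
# Seconds of GPU time you get per cent, at an assumed $2/hour rental price.
hourly_rate_usd = 2.00
cents_per_hour = hourly_rate_usd * 100
seconds_per_cent = 3600 / cents_per_hour
print(seconds_per_cent)  # 18.0 -> ~18s of compute per penny
```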
Think inference time was on the order of 4-5 seconds per image on a V100, which you can rent for around $0.80 an hour, though you can get way better GPUs like A100s for ~$1.10/h now. But of course this is at 64px resolution in pixel space.
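Putting those two numbers together (4-5s per image, ~$0.80/hour for a V100 -- the figures above, not measured by me):

```python
# Rough per-image cost at the V100 numbers quoted above.
v100_rate_usd_per_hour = 0.80
cost_per_second = v100_rate_usd_per_hour / 3600
for seconds_per_image in (4, 5):
    cost = seconds_per_image * cost_per_second
    print(f"{seconds_per_image}s/image -> ~${cost:.5f} per image")
# ~$0.0009-0.0011 per 64px image, i.e. roughly a tenth of a cent each
```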
If you wanted to do this at high res, you would definitely use a latent diffusion model. The autoencoder is almost free to run and drastically reduces the dimensionality of high-res images, which makes it much cheaper to run the autoregressive diffusion model for multiple steps.
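For a rough sense of the savings, here's the dimensionality math assuming a Stable-Diffusion-style VAE with 8x spatial downsampling and 4 latent channels (assumed numbers, just to illustrate the point):

```python
# How much an SD-style autoencoder shrinks the tensor the diffusion model actually sees.
h, w, c = 512, 512, 3            # high-res frame in pixel space
lh, lw, lc = h // 8, w // 8, 4   # latent grid after the encoder (assumed 8x downsample, 4 channels)
pixels = h * w * c
latents = lh * lw * lc
print(pixels, latents, pixels / latents)  # 786432 16384 48.0 -> ~48x fewer values per denoising step
```

The encode/decode happens once per frame, while the diffusion model runs for many steps, so shrinking the per-step tensor is where nearly all the cost savings come from.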