I think the bigger question is whether it would be stable enough. Many SD-like models struggle with consistency across multiple images (i.e. frames) even when the content doesn't change much. Would be a cool problem to see tackled.
Temporal coherence is definitely an issue with these types of models, though I haven't tested it with ColorDiffusion. Assuming you're not doing anything autoregressive (frame to frame) to enforce temporal coherence, you can also parallelize the colorization across frames, which helps keep cost and turnaround time down.
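Roughly what I mean, as a minimal sketch (with `colorize_batch` standing in for whatever colorization model you'd actually run):

```python
import torch

def colorize_movie(gray_frames, colorize_batch, batch_size=16):
    """gray_frames: (N, 1, H, W) grayscale frames.

    Frames are independent (no frame-to-frame autoregression),
    so batches can be farmed out across GPUs/workers freely.
    """
    out = []
    for i in range(0, gray_frames.shape[0], batch_size):
        batch = gray_frames[i:i + batch_size].cuda()
        with torch.no_grad():
            out.append(colorize_batch(batch).cpu())  # (B, 3, H, W) color frames
    return torch.cat(out)
```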
Tbh the most cost-effective option would be a conditional GAN, though.
24 frames per second * 60 seconds per minute * 90 minute movie length = 129600 frames
If you could get cost to a penny per frame, that's about $1,300. And I'd bet you could easily get it an order of magnitude less than that, so $130 or so?
And that's assuming you do 100% of frames and don't have any clever tricks there.
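Quick sanity check on the arithmetic, using the same numbers as above:

```python
frames = 24 * 60 * 90        # 129,600 frames in a 90-minute movie
print(frames * 0.01)         # ~= $1,296 at one cent per frame
print(frames * 0.001)        # ~= $130 if you can get it 10x cheaper per frame
```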
Seems like a pretty reasonable estimate. If it costs about $2 an hour to rent a decent GPU, that's 18 seconds per penny, which sounds pretty doable for running one frame.
Think inference time was on the order of 4-5 seconds per image on a V100, which you can rent for like $0.80 an hour, though you can get way better GPUs like A100s for ~$1.10/h now. But of course this is at 64px resolution in pixel space.
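Putting those numbers together (assuming ~5 s/frame on a ~$0.80/h V100; actual rates vary by provider):

```python
sec_per_frame = 5
usd_per_hour = 0.80
cost_per_frame = sec_per_frame * usd_per_hour / 3600    # ~= $0.0011 per frame
total = cost_per_frame * 24 * 60 * 90                   # ~= $144 for 129,600 frames
print(cost_per_frame, total)
```

So well under a penny per frame at 64px, before any parallelization or frame-skipping tricks.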
If you wanted to do this at high res, you would definitely use a latent diffusion model. The autoencoder is almost free to run, and it reduces the dimensionality of high-res images significantly, which makes it a lot cheaper to run the diffusion model's iterative denoising for multiple steps.
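For a sense of the savings, here's a minimal sketch using the diffusers AutoencoderKL (the sd-vae-ft-mse checkpoint is just an assumption here; swap in whatever autoencoder you'd actually use):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

img = torch.randn(1, 3, 512, 512)                 # stand-in for a 512x512 frame
with torch.no_grad():
    latent = vae.encode(img).latent_dist.sample()
print(latent.shape)  # (1, 4, 64, 64): ~48x fewer values than 3x512x512,
                     # so every denoising step runs on a much smaller tensor
```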