I think the bigger question is whether it would be stable enough. Many SD-like models struggle with consistency across multiple images (i.e. frames) even when the content doesn't change much. Would be a cool problem to see tackled.
Temporal coherence is definitely an issue with these types of models, though I haven't tested it with ColorDiffusion. Assuming you're not doing anything autoregressive (frame to frame) to enforce temporal coherence, you can also parallelize the colorization across frames, which helps keep cost and turnaround time down.
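Roughly what I mean, as a minimal sketch (with `colorize_batch` standing in for whatever colorization model you'd actually run):

```python
import torch

def colorize_movie(gray_frames, colorize_batch, batch_size=16):
    """gray_frames: (N, 1, H, W) grayscale frames.

    Frames are independent (no frame-to-frame autoregression),
    so batches can be farmed out across GPUs/workers freely.
    """
    out = []
    for i in range(0, gray_frames.shape[0], batch_size):
        batch = gray_frames[i:i + batch_size].cuda()
        with torch.no_grad():
            out.append(colorize_batch(batch).cpu())  # (B, 3, H, W) color frames
    return torch.cat(out)
```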
Tbh the most cost-effective option would be a conditional GAN, though.
24 frames per second * 60 seconds per minute * 90 minute movie length = 129600 frames
If you could get cost to a penny per frame, that's about $1,300. And I'd bet you could easily get it an order of magnitude less than that, so $130 or so?
And that's assuming you do 100% of frames and don't have any clever tricks there.
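Quick sanity check on the arithmetic, using the same numbers as above:

```python
frames = 24 * 60 * 90        # 129,600 frames in a 90-minute movie
print(frames * 0.01)         # ~= $1,296 at one cent per frame
print(frames * 0.001)        # ~= $130 if you can get it 10x cheaper per frame
```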
Seems like a pretty reasonable estimate. If it costs about $2 an hour to rent a decent GPU, that's 18 seconds per penny, which sounds pretty doable for running one frame.
Think inference time was on the order of 4-5 seconds per image on a V100, which you can rent for like $0.80 an hour, though you can get way better GPUs like A100s for ~$1.10/h now. But of course this is at 64px resolution in pixel space.
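Putting those numbers together (assuming ~5 s/frame on a ~$0.80/h V100; actual rates vary by provider):

```python
sec_per_frame = 5
usd_per_hour = 0.80
cost_per_frame = sec_per_frame * usd_per_hour / 3600    # ~= $0.0011 per frame
total = cost_per_frame * 24 * 60 * 90                   # ~= $144 for 129,600 frames
print(cost_per_frame, total)
```

So well under a penny per frame at 64px, before any parallelization or frame-skipping tricks.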
If you wanted to do this at high res, you would definitely use a latent diffusion model. The autoencoder is almost free to run, and it reduces the dimensionality of high-res images significantly, which makes it a lot cheaper to run the diffusion model's iterative denoising for multiple steps.
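For a sense of the savings, here's a minimal sketch using the diffusers AutoencoderKL (the sd-vae-ft-mse checkpoint is just an assumption here; swap in whatever autoencoder you'd actually use):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

img = torch.randn(1, 3, 512, 512)                 # stand-in for a 512x512 frame
with torch.no_grad():
    latent = vae.encode(img).latent_dist.sample()
print(latent.shape)  # (1, 4, 64, 64): ~48x fewer values than 3x512x512,
                     # so every denoising step runs on a much smaller tensor
```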