I see what you mean. I think that you can happily scale the B&W image down, run the model, and then scale the chroma information back up.
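A rough sketch of that downscale, predict, upscale idea, in case it helps. Everything here is an assumption for illustration: `predict_chroma` is a hypothetical stand-in for whatever model is used (it takes a small luma image and returns Cb/Cr in [0, 1]), the downscale is a plain box average, the upscale is nearest-neighbour, and the colour conversion is full-range BT.601 as used by JPEG.

```python
import numpy as np

def box_downscale(img, factor):
    """Average non-overlapping factor x factor blocks of a 2-D image."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def nearest_upscale(img, factor):
    """Repeat each pixel factor times along both spatial axes."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def ycbcr_to_rgb(y, cb, cr):
    """Full-range BT.601 conversion; all inputs in [0, 1], neutral chroma is 0.5."""
    r = y + 1.402 * (cr - 0.5)
    g = y - 0.344136 * (cb - 0.5) - 0.714136 * (cr - 0.5)
    b = y + 1.772 * (cb - 0.5)
    return np.clip(np.stack([r, g, b], axis=-1), 0.0, 1.0)

def colorize(luma, predict_chroma, factor=4):
    """Predict chroma at low resolution, then recombine with full-resolution luma.

    predict_chroma is a hypothetical callable: (h, w) luma -> (h, w, 2) Cb/Cr.
    """
    h, w = luma.shape
    h, w = h - h % factor, w - w % factor   # crop so the size divides evenly
    luma = luma[:h, :w]
    small = box_downscale(luma, factor)     # the model only sees the small image
    chroma_small = predict_chroma(small)
    chroma = nearest_upscale(chroma_small, factor)
    return ycbcr_to_rgb(luma, chroma[..., 0], chroma[..., 1])
```

The detail comes entirely from the original luma, so the model only has to get the low-resolution colour roughly right. A smoother upscale (bilinear, say) would probably look better than nearest-neighbour at chroma edges.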
Something I was thinking about after writing the comment is that the model is probably trained on chroma-subsampled images. Digital cameras effectively do it with the Bayer filter, and video cameras add 4:2:0 (or similar) subsampling when they compress the image. So the AI is probably biased towards "look like this photo was taken with a digital camera" rather than "actually reconstruct the colors of the image". What effect this actually has, I don't know!
Good point, I hadn’t realized that you only need to predict chroma! That actually greatly simplifies things.
re. chroma subsampling in training data: this is actually a big problem, and a good generative model will absolutely learn to reproduce chroma-subsampled values (or even JPEG artifacts!). You can get around it by applying random downscaling with antialiasing during training, which averages away most of the subsampling and compression artifacts before the model sees them.
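For what it's worth, a minimal sketch of that augmentation, with assumptions labelled: the function name and `max_factor` are made up for illustration, and the box average below stands in for whatever antialiased resize a real training pipeline would use. The idea is that 4:2:0 stores chroma at half resolution in each direction, so after shrinking by 2x or more every output pixel is backed by at least one real chroma sample.

```python
import numpy as np

def random_antialiased_downscale(img, rng, max_factor=4):
    """Shrink an (H, W, C) image by a random integer factor in [2, max_factor].

    Averaging each factor x factor block acts as the antialiasing filter.
    Returns the smaller image and the factor that was drawn.
    """
    factor = int(rng.integers(2, max_factor + 1))
    h, w, c = img.shape
    h2, w2 = h - h % factor, w - w % factor   # crop so the size divides evenly
    blocks = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, c)
    return blocks.mean(axis=(1, 3)), factor
```

In a training loop this would run on each image before it is split into luma (input) and chroma (target), so both come from the cleaned-up version.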