It took a lot of failed experiments; the model kept converging to greyscale / sepia images. I think one of the ways I fixed it was by adding a greyscale encoder to the arch and using its output embedding as additional conditioning. I can't remember whether I only added it to the UNet input or injected it at various stages of the UNet down pass.
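
Roughly what that could look like, as a minimal PyTorch sketch and not the actual code from that project. Everything named here is hypothetical: `to_greyscale`, `GreyscaleEncoder`, `TinyDownPath`, the channel widths, and the `inject_at_stages` flag are made up for illustration, and the greyscale image is assumed to be derived from an RGB reference. The flag switches between the two options mentioned above: conditioning only at the UNet input, or re-injecting the embedding at each down stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def to_greyscale(rgb: torch.Tensor) -> torch.Tensor:
    """Convert (B, 3, H, W) RGB to (B, 1, H, W) luma with BT.601 weights."""
    weights = torch.tensor([0.299, 0.587, 0.114], device=rgb.device, dtype=rgb.dtype)
    return (rgb * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)


class GreyscaleEncoder(nn.Module):
    """Hypothetical small conv encoder: greyscale image -> spatial embedding."""

    def __init__(self, embed_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, embed_channels, kernel_size=3, padding=1),
        )

    def forward(self, grey: torch.Tensor) -> torch.Tensor:
        return self.net(grey)


class TinyDownPath(nn.Module):
    """Stand-in for a UNet down pass, only to show where conditioning can enter."""

    def __init__(self, in_channels=3, embed_channels=8, widths=(32, 64),
                 inject_at_stages=False):
        super().__init__()
        self.inject_at_stages = inject_at_stages
        # Option A: the embedding is always concatenated to the input channels.
        self.stem = nn.Conv2d(in_channels + embed_channels, widths[0], 3, padding=1)
        # Option B: each down stage also receives the (resized) embedding.
        extra = embed_channels if inject_at_stages else 0
        self.stages = nn.ModuleList()
        prev = widths[0]
        for width in widths:
            self.stages.append(nn.Conv2d(prev + extra, width, 3, stride=2, padding=1))
            prev = width

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.stem(torch.cat([x, cond], dim=1))
        for stage in self.stages:
            if self.inject_at_stages:
                # Resize the embedding to the current feature-map resolution.
                c = F.interpolate(cond, size=h.shape[-2:], mode="bilinear",
                                  align_corners=False)
                h = torch.cat([h, c], dim=1)
            h = F.silu(stage(h))
        return h


# Usage: a random batch stands in for the reference images and noisy inputs.
rgb = torch.rand(2, 3, 32, 32)
noisy = torch.randn(2, 3, 32, 32)
cond = GreyscaleEncoder()(to_greyscale(rgb))
features_input_only = TinyDownPath(inject_at_stages=False)(noisy, cond)
features_multi_stage = TinyDownPath(inject_at_stages=True)(noisy, cond)
```

Both variants give the network explicit luminance information, so it only has to learn the colour on top of it. Which of the two was actually used is not something this sketch can settle; injecting at every stage costs a few extra input channels per block.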