
Huh. I thought those got projected into the (512) shared CLIP space before getting passed to the conditional blocks.


Only the text encoder portion of CLIP is included; otherwise the checkpoint would be much bigger.

Also, I was slightly wrong: the first dimension may not always be 77, since apparently the tokenizer doesn't pad. Test notebook here: https://colab.research.google.com/drive/192PDIbc2XiI1HgJQSdN...


Nope, and this is why you can't use images as prompts without workarounds! SD doesn't use the shared CLIP space; it conditions on the text encoder's output before the projection.
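To make the distinction concrete, here's a toy NumPy sketch of the two outputs. The dimensions (77 tokens, 768-dim hidden states, 512-dim shared space) match CLIP ViT-L/14 as used by Stable Diffusion, but all the weights here are random placeholders, not real model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions loosely modeled on CLIP ViT-L/14:
# 77 tokens, 768-dim hidden states, 512-dim shared projection space.
seq_len, d_model, d_shared = 77, 768, 512

# Per-token hidden states from the text encoder's final layer: shape (77, 768).
# This is the sequence SD's cross-attention blocks condition on.
hidden_states = rng.standard_normal((seq_len, d_model))

# The shared CLIP space is only reached after pooling (e.g. taking the
# EOS-token state) and applying a learned projection: shape (512,).
text_projection = rng.standard_normal((d_model, d_shared))
pooled = hidden_states[-1]            # stand-in for the EOS-token state
shared_embedding = pooled @ text_projection

print(hidden_states.shape)    # (77, 768) -> what SD consumes
print(shared_embedding.shape) # (512,)    -> shared image/text space
```

The image encoder only ever produces the pooled 512-dim vector, so there is no per-token sequence to feed into cross-attention, which is why an image can't stand in for a text prompt without extra machinery.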



