
Huh. I thought those got projected into the (512) shared CLIP space before getting passed to the conditional blocks.


Only the text encoder portion of CLIP is included; otherwise the checkpoint would be much bigger.

Also, I was slightly wrong: the first dimension may not always be 77, since apparently the tokenizer doesn't pad. Test notebook here: https://colab.research.google.com/drive/192PDIbc2XiI1HgJQSdN...


Nope, and this is why you can't use images as prompts without workarounds! SD doesn't use the shared CLIP space; it conditions on the text encoder's output before the projection.
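To make the distinction concrete, here's a toy NumPy sketch of the two outputs. The dimensions (77 tokens, 768-dim hidden states, 512-dim shared space) match CLIP ViT-L/14 as used by Stable Diffusion, but all the weights here are random placeholders, not real model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions loosely modeled on CLIP ViT-L/14:
# 77 tokens, 768-dim hidden states, 512-dim shared projection space.
seq_len, d_model, d_shared = 77, 768, 512

# Per-token hidden states from the text encoder's final layer: shape (77, 768).
# This is the sequence SD's cross-attention blocks condition on.
hidden_states = rng.standard_normal((seq_len, d_model))

# The shared CLIP space is only reached after pooling (e.g. taking the
# EOS-token state) and applying a learned projection: shape (512,).
text_projection = rng.standard_normal((d_model, d_shared))
pooled = hidden_states[-1]            # stand-in for the EOS-token state
shared_embedding = pooled @ text_projection

print(hidden_states.shape)    # (77, 768) -> what SD consumes
print(shared_embedding.shape) # (512,)    -> shared image/text space
```

The image encoder only ever produces the pooled 512-dim vector, so there is no per-token sequence to feed into cross-attention, which is why an image can't stand in for a text prompt without extra machinery.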



