GLIDE is NOT DALL-E. DALL-E is an autoregressive transformer (basically GPT-3), while GLIDE is a diffusion model. They share some similarities, but the major difference is that the transformer generates the image sequentially from top to bottom, pixel by pixel (technically, token by token), so one can condition it only on the text and on the part of the image that has already been generated. Diffusion models, by contrast, predict all pixels at once and refine them over several iterations, so one can naturally trade compute for result quality (run more inference iterations) and, besides sampling, do other image manipulation tasks such as text-prompted inpainting.
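To make the contrast concrete, here is a toy sketch in Python. It is not GLIDE's or DALL-E's actual code: `toy_next_token` and `toy_denoise` are hypothetical stand-ins for trained networks, and the "inpainting" is just the simplest possible version of the idea (re-imposing known pixels after every step), which is an assumption on my part rather than how GLIDE implements it.

```python
import random

def autoregressive_sample(next_token, length):
    """Emit tokens one by one; each step sees only the tokens before it."""
    tokens = []
    for _ in range(length):
        tokens.append(next_token(tokens))
    return tokens

def diffusion_sample(denoise, length, steps, known=None, seed=0):
    """Start from noise and refine every position at each step.

    `known` maps index -> value for pixels that must stay fixed (toy inpainting).
    """
    rng = random.Random(seed)
    known = known or {}
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in range(steps, 0, -1):
        x = denoise(x, t)              # all positions updated together
        for i, value in known.items():
            x[i] = value               # re-impose the known region
    return x

# Hypothetical stand-ins for trained networks.
def toy_next_token(prefix):
    return len(prefix)                 # depends only on what came before

def toy_denoise(x, t):
    return [0.5 * v for v in x]        # pull every value halfway toward 0

print(autoregressive_sample(toy_next_token, 4))
print(diffusion_sample(toy_denoise, 4, steps=2))
print(diffusion_sample(toy_denoise, 4, steps=20, known={0: 1.0}))
```

The sequential sampler can only ever look at its prefix, while the iterative one takes a `steps` knob (more steps, closer to the toy target) and can pin any subset of positions, not just the top of the image.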
Hosted demo, "Logic puzzle" example:
"On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book.
The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right.
Which book is the leftmost book?"
Answer:
> The black book
Same puzzle with the question "Which book is the rightmost book?"
Answer:
> The black book
I also tried this problem on GPT-3 and Codex; neither could solve it.
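For reference, the puzzle is small enough to check by brute force. A quick sketch in Python, assuming "to the left/right of" means anywhere on that side rather than immediately adjacent (with the adjacent reading the constraints have no solution):

```python
from itertools import permutations

BOOKS = ["gray", "red", "purple", "blue", "black"]

def solve():
    """Return every left-to-right ordering that satisfies all four clues."""
    solutions = []
    for order in permutations(BOOKS):
        pos = {book: i for i, book in enumerate(order)}  # 0 = leftmost
        if (pos["red"] > pos["gray"]                     # red right of gray
                and pos["black"] < pos["blue"]           # black left of blue
                and pos["blue"] < pos["gray"]            # blue left of gray
                and pos["purple"] == len(order) - 2):    # second from right
            solutions.append(order)
    return solutions

solutions = solve()
for order in solutions:
    print(" | ".join(order))
# prints: black | blue | gray | purple | red
```

So there is exactly one valid ordering: the first answer above (black is leftmost) is correct, but the rightmost book is red, not black.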