If you play with text-to-image generators and get a feel for how they "compress" 5 billion images into a 16GB model from which coherent pictures can be probabilistically generated, you can apply that same feel to probabilistic language generation -- and trust it about as much.
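(Back-of-the-envelope, taking those figures at face value: 16 GB spread over 5 billion images comes out to roughly 3 bytes per image, so nothing is stored verbatim -- only statistical regularities survive, which is exactly why the output is plausible rather than reliable.)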
You're staring at a lovely image, decide to ignore the eight fingers on the left hand, and don't realize until five minutes later that your hero has three legs.
For the purpose you describe, that works though!