It's the same thing. Predict the next pixel, or the next token (same way you han...

It's the same thing. Predict the next pixel, or the next token (same way you handle regular images), or infill missing tokens (MAE is particularly cool lately). Those induce the abstractions and understanding which get tapped into.