There are too many papers throwing transformers at everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably settle back to normal soon.
I don't know what you mean by "amazing for language." Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text-to-speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them; it's just that language tends to be a very common input or output.
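To make "inherently multi-modal" concrete, here's a rough PyTorch sketch (made-up dimensions, untrained weights, no positional encodings) of the point: the same encoder consumes text tokens and image patches alike, because both are just sequences of embeddings.

```python
# Sketch only: the same transformer encoder handles text and images,
# since both modalities get mapped to (batch, seq_len, d_model) sequences.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

# Text path: token ids -> embedding lookup -> (batch, tokens, d_model)
text_embed = nn.Embedding(num_embeddings=10_000, embedding_dim=d_model)
text_ids = torch.randint(0, 10_000, (2, 32))       # batch of 2, 32 tokens each
text_seq = text_embed(text_ids)                    # (2, 32, 256)

# Image path: flattened 16x16 RGB patches -> linear projection -> same shape convention
patch_proj = nn.Linear(16 * 16 * 3, d_model)
patches = torch.randn(2, 196, 16 * 16 * 3)         # 14x14 patches per image
image_seq = patch_proj(patches)                    # (2, 196, 256)

# Identical encoder, identical forward pass, for both modalities.
print(encoder(text_seq).shape)                     # torch.Size([2, 32, 256])
print(encoder(image_seq).shape)                    # torch.Size([2, 196, 256])
```

Swap in audio frames or robot state vectors and nothing about the encoder changes; only the tokenizer/projection in front of it does.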
There's still some room for experimentation if you care about memory/power efficiency, e.g. mixture-of-experts (MoE) models, but those aren't as well understood yet.
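For anyone curious what the MoE trade-off looks like, here's a toy sketch (hypothetical class and made-up sizes, nothing like a production implementation) of top-k expert routing. The point is that only a couple of the expert feed-forward networks run per token, so active compute per token stays small even when total parameter count is large.

```python
# Toy mixture-of-experts layer: a router picks top-k experts per token,
# so only a fraction of the feed-forward parameters run for any given token.
# Real systems (Switch Transformer, Mixtral, etc.) add load-balancing losses
# and fused expert kernels on top of this.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # mix only the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 256)
print(TinyMoE()(tokens).shape)                      # torch.Size([10, 256])
```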