You're right that a good graph compiler will do this for you. There may still be times, such as when you're interfacing with another library, when you need to switch a matrix between row-major and column-major layouts.
The next operation might need the data in column-major order to read it fast, so you might have to transpose first. And these may be concurrent stages of a processing pipeline.
Now I'm curious: how many times do you have to read the full matrix on the GPU before the total cost of strided column reads exceeds that of a one-off transpose followed by sequential row reads? I know it depends on lots of things; I'm after a rough estimate.
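One way to frame the question is a back-of-the-envelope model. Everything here is an assumption rather than a measurement: take one coalesced pass over the matrix as 1 unit, a strided column pass as `slowdown` units, and the one-off transpose as `transpose_cost` units (about 2 if it is one coalesced read plus one coalesced write). Transposing wins once `n * slowdown > transpose_cost + n`, i.e. `n > transpose_cost / (slowdown - 1)`. The function name and parameters are hypothetical.

```python
import math

def break_even_reads(slowdown, transpose_cost=2.0):
    """Smallest number of full reads for which transposing first is cheaper.

    slowdown:       assumed cost of a strided pass relative to a coalesced one.
    transpose_cost: assumed cost of the transpose, in coalesced passes.
    """
    if slowdown <= 1.0:
        # Strided reads are no slower than coalesced ones: never pays off.
        return math.inf
    # Smallest integer n with n > transpose_cost / (slowdown - 1).
    return math.floor(transpose_cost / (slowdown - 1.0)) + 1

for s in (1.5, 2.0, 4.0):
    print(s, break_even_reads(s))
```

Under this model the break-even point drops to a single read as soon as strided access is more than 3x slower than coalesced access. The real slowdown depends on the hardware, element size, and cache behaviour, so it would have to be measured.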
It's quite rare. Usually problems are tiled anyway, and you can amortize the cost of having data in the "wrong" layout by loading coalesced in whatever layout is best for your data and then transposing inside your tile, which lives in much faster on-chip memory.
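To make the tiling idea concrete, here is a minimal CPU-side sketch of the index pattern in plain Python. It is not a GPU kernel: `buf` only stands in for the fast on-chip tile, and the function name and `tile` parameter are made up for illustration. The point is that both the load from `src` and the store to `dst` walk memory contiguously, and the transposition happens while reading the small buffer.

```python
def tiled_transpose(src, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix held in a flat list."""
    dst = [None] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            h = min(tile, rows - r0)  # tile height (edge tiles may be smaller)
            w = min(tile, cols - c0)  # tile width
            # Load: each tile row is a contiguous slice of the source,
            # the access pattern a GPU would coalesce.
            buf = [src[(r0 + i) * cols + c0:(r0 + i) * cols + c0 + w]
                   for i in range(h)]
            # Store: for each j, consecutive i hit consecutive addresses
            # in dst (a cols x rows row-major matrix).
            for j in range(w):
                for i in range(h):
                    dst[(c0 + j) * rows + (r0 + i)] = buf[i][j]
    return dst

print(tiled_transpose([0, 1, 2, 3, 4, 5], rows=2, cols=3))  # [0, 3, 1, 4, 2, 5]
```

In a real kernel the consumer would usually read the tile transposed and carry on with its own computation instead of writing `dst` back out at all.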
The one pure transpose case that does come up occasionally is an in-place non-square transpose, where there is a rich literature of very fussy algorithms. If someone managed to make any headway with compiler optimization there, I'd be interested.
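For anyone wondering why the non-square case is fussy: the transpose is a permutation of the flat array, and the textbook approach is to follow its cycles. Below is a minimal sequential sketch in plain Python, not one of the tuned algorithms from the literature. It cheats by keeping a `visited` flag per element, which is exactly the extra storage the serious algorithms work hard to avoid. The function name is made up for illustration.

```python
def transpose_in_place(a, rows, cols):
    """Rearrange flat row-major rows x cols data into cols x rows order."""
    n = rows * cols
    visited = [False] * n  # O(n) flags: simple, but not truly in-place
    for start in range(n):
        if visited[start]:
            continue
        i = start
        carried = a[start]
        while True:
            # Element (r, c) at index r*cols + c moves to index c*rows + r.
            dest = (i % cols) * rows + i // cols
            # Drop the carried value at its destination, pick up what was there.
            a[dest], carried = carried, a[dest]
            visited[i] = True
            i = dest
            if i == start:
                break

m = [0, 1, 2, 3, 4, 5]  # 2 x 3, row-major
transpose_in_place(m, 2, 3)
print(m)  # [0, 3, 1, 4, 2, 5]
```

The cycles have irregular lengths and scattered access patterns, which is a large part of why doing this efficiently and in parallel is hard.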
Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?