*Why* do we ever need to transpose a matrix? Isn't it better to simply combine t...

hogepodge · 2025-06-07T00:25:15 1749255915

You're right that a good graph compiler will do this for you. There may still be times, like when you're interfacing with another library, where you'll need to switch a matrix between row-major and column-major layouts.

meindnoch · 2025-06-07T12:33:10 1749299590

Serious linear algebra libraries take a flag that tells them whether the elements are column-major or row-major.
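
To illustrate the idea, here is a minimal pure-Python sketch of a multiply that takes a layout flag per operand instead of requiring a transposed copy. The `at` and `matmul` helpers and the `"row"`/`"col"` flag values are made up for this example; real BLAS interfaces express the same thing through transpose arguments on GEMM and, in CBLAS, a layout argument.

```python
def at(buf, rows, cols, r, c, layout):
    """Read element (r, c) of a rows x cols matrix stored flat in `buf`."""
    if layout == "row":
        return buf[r * cols + c]   # row-major: rows are contiguous
    if layout == "col":
        return buf[c * rows + r]   # column-major: columns are contiguous
    raise ValueError("layout must be 'row' or 'col'")


def matmul(a, a_layout, b, b_layout, m, k, n):
    """Return the m x n product A @ B as a flat row-major list.

    A is m x k and B is k x n; each may be stored in either layout.
    No operand is ever copied or transposed, only indexed differently.
    """
    out = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            out[i * n + j] = sum(
                at(a, m, k, i, p, a_layout) * at(b, k, n, p, j, b_layout)
                for p in range(k)
            )
    return out


# The same 2x2 matrix A = [[1, 2], [3, 4]] in both layouts.
a_row, a_col = [1, 2, 3, 4], [1, 3, 2, 4]
b_row = [5, 6, 7, 8]  # B = [[5, 6], [7, 8]], row-major
print(matmul(a_row, "row", b_row, "row", 2, 2, 2))
print(matmul(a_col, "col", b_row, "row", 2, 2, 2))
```

Both calls produce the same product, because the flag only changes how the buffer is indexed. A production library would pick a different loop order or kernel per layout rather than index element by element.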

throwawayabcdef · 2025-06-07T00:20:45 1749255645

The next operation might need the data in column-major order to read it fast, so you might have to transpose first. And these may be concurrent stages of a processing pipeline.

viraptor · 2025-06-07T01:27:03 1749259623

Now I'm curious: how many times do you have to fully read the matrix on the GPU before the total cost of reading columns exceeds the cost of a one-off transpose followed by sequential row reads? I know it depends on lots of things; I'm after a rough estimate.

saagarjha · 2025-06-07T09:59:56 1749290396

It's quite rare. Usually problems are tiled anyway and you can amortize the cost of having data in the "wrong" layout by loading coalesced in whatever is the best layout for your data and then transposing inside your tile, which gives you access to much faster memory.
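
A rough CPU-side sketch of that tiling pattern, in plain Python rather than an actual GPU kernel. The function name `tiled_transpose` and the `tile` parameter are assumptions for illustration: the contiguous slices stand in for coalesced global loads, and the small `block` list stands in for the fast on-chip memory where the transposed access happens.

```python
def tiled_transpose(src, rows, cols, tile=4):
    """Transpose a rows x cols row-major flat list, one tile at a time."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            # Load the tile with contiguous row reads from the source.
            block = [src[r * cols + c0 : r * cols + c1] for r in range(r0, r1)]
            # Walk the tile column by column; the strided access only
            # touches the small tile, and each run written to dst is
            # contiguous in the output.
            for c in range(c1 - c0):
                base = (c0 + c) * rows + r0
                for r in range(r1 - r0):
                    dst[base + r] = block[r][c]
    return dst


print(tiled_transpose([0, 1, 2, 3, 4, 5], rows=2, cols=3, tile=2))
# -> [0, 3, 1, 4, 2, 5]
```

In a real kernel the tile would usually feed the computation directly instead of being written back out, which is why a separate transpose pass is rarely needed.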

stephencanon · 2025-06-07T15:50:32 1749311432

The one pure transpose case that does come up occasionally is an in-place non-square transpose, where there is a rich literature of very fussy algorithms. If someone managed to make any headway with compiler optimization there, I'd be interested.
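
For a sense of why this is fussy, here is a minimal sketch of the textbook cycle-following approach, assuming a row-major matrix stored in a flat Python list. The function name `transpose_in_place` is made up for this example. It uses constant extra space but re-walks cycles to find their leaders, so it is far from the tuned algorithms in the literature.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored flat in `a`, in place.

    Afterwards `a` holds the cols x rows transpose, also row-major.
    """
    n = rows * cols
    if len(a) != n:
        raise ValueError("buffer size does not match rows * cols")
    if n <= 2:
        return  # nothing moves for 0, 1, or 2 elements
    last = n - 1

    def dest(i):
        # Element (r, c) at index r*cols + c moves to index c*rows + r,
        # which equals (i * rows) mod (n - 1) for every i except the last.
        return i if i == last else (i * rows) % last

    for start in range(1, last):
        # Follow the cycle; handle it only from its smallest index.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j < start:
            continue  # this cycle was already rotated from an earlier start
        # Rotate the cycle, carrying one displaced element at a time.
        carried = a[start]
        j = dest(start)
        while j != start:
            a[j], carried = carried, a[j]
            j = dest(j)
        a[start] = carried


m = [1, 2, 3, 4, 5, 6]  # 2 x 3: [[1, 2, 3], [4, 5, 6]]
transpose_in_place(m, rows=2, cols=3)
print(m)  # -> [1, 4, 2, 5, 3, 6], i.e. 3 x 2: [[1, 4], [2, 5], [3, 6]]
```

The cycles have irregular lengths and scattered access patterns, which is what makes an efficient, cache-friendly or parallel version hard.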

fulafel · 2025-06-07T06:14:46 1749276886

This could make Mojo look even better, as it would be more compute-heavy and the last-step thread reduction would be less relevant.