Highly efficient matrix transpose in Mojo

daft_pink · 2025-06-07T13:52:24 1749304344

I think Mojo’s lack of being a true open product and existing to drive profits at Modular has really held it back.

It’s just really impractical to use a licensed programming language in 2025.

totalperspectiv · 2025-06-07T15:35:34 1749310534

My impression is that this is on purpose on their part. They’ve repeatedly stated that by 2026 they will open source the compiler, and I think they’ve wanted a slow adoption ramp in order to spend some more time getting it right first.

Possibly rose-tinted glasses on my part, but I’m optimistic for 2026. Chris Lattner has a pretty strong track record of getting these things right.

veidr · 2025-06-07T15:57:57 1749311877

Yeah, and he's clearly trying to avoid what happened to Swift[1]. Although the danger of "corporate owner priorities dictate releasing half-baked/awful changes" risk is still there, Lattner himself has more influence within Modular (obviously, as co-founder and CEO) than he did at Apple, so it may work out better this time.

[1]: https://news.ycombinator.com/item?id=30416070

melodyogonna · 2025-06-07T16:34:49 1749314089

Yeah, Mojo's development has been pretty transparent. Chris publishes technical documents for most features and takes community feedback into account. A recent example is here: https://forum.modular.com/t/variable-bindings-proposal-discu...

Btw, Mojo's development is a masterclass in language development and community building, it's been fun watching Chris go back to fix technical debts in existing features rather than proceeding with adding new features.

daft_pink · 2025-06-07T20:19:36 1749327576

It was just very difficult primarily because of the way the license limitations and install steps made it difficult to drop it into the existing python tooling ecosystem.

I haven’t tried it in a long time, but as it’s a Python superset, I tried to drop it into my jupyter notebook docker container and you had to agree to license terms and register your email and install a modular package that contained a bunch of extra things.

If you want to get widespread adoption for a python superset, you would probably want to get it included in the official jupyter docker images as people who do this sort of programming like to use a jupyter repl, but they just made it so difficult.

I’m no open source zealot and I’m happy to pay for software, but I think the underlying language needs to be a lot more open to be practical.

GeekyBear · 2025-06-07T16:44:40 1749314680

> he's clearly trying to avoid what happened to Swift

Also to MLIR while Lattner was at Google:

> MLIR was born—a modular, extensible compiler infrastructure designed to bring order to the chaos. It brought forth a foundation that could scale across hardware platforms, software frameworks, and the rapidly evolving needs of machine learning. It aimed to unify these systems, and provide a technology platform that could harmonize compute from many different hardware makers.

But unification is hard. What started as a technical project quickly turned into a battleground: open-source governance, corporate rivalries, and competing visions all collided. What could have been a straightforward engineering win became something much more complicated.

https://www.modular.com/blog/democratizing-ai-compute-part-8...

sestep · 2025-06-06T19:53:19 1749239599

I'm not an expert in this space, but is this meaningful? I'd assume that it's more common to fuse together transposition with an operation that precedes or follows it (e.g. matmul), which should be far more efficient than materializing the entire transposition in memory if it's just an intermediate value.

musebox35 · 2025-06-07T06:24:49 1749277489

Matrix transpose is a canonical example of a memory bound operation and often used to showcase optimization in a particular programming language or library. See for example the cutlass matrix transpose tutorial from Jay Shah of flash attention 3 paper: https://research.colfax-intl.com/tutorial-matrix-transpose-i...

saagarjha · 2025-06-07T09:58:07 1749290287

Unfortunately the issue (alluded to in the blog post you linked) is that transposes do absolutely no work but memory loads. Sure, they test that you can swizzle your accesses, but modern accelerators are all about pipelining and feeding matrix multiply units, which is considerably harder than loading from memory as fast as possible. Actually, even the Mojo post barely beats CUDA for most of its kernels, because you can hit memory bandwidth for transpose on the latest hardware using techniques from 5-10 years ago. This is definitely not true for more interesting operations.

musebox35 · 2025-06-07T12:39:50 1749299990

I totally agree that the resulting kernel will be rarely useful. I just wanted to highlight that it is a commonly used educational exercise to showcase how to optimize for memory throughput. If the post showed how to fuse a transpose + rmsnorm epilogue to a gemm then the kernel would be more functional but the blog post would be much harder to follow for newcomers.

Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.

saagarjha · 2025-06-16T11:41:52 1750074112

It's less that the result is kind of useless and more that hitting memory throughput on a simple algorithm like this is not very difficult. It takes a complex example to actually have trouble doing this.

simon_vtr · 2025-06-07T13:51:47 1749304307

That was exactly my reason to write this blogpost and optimise transpose. It is a simple educational yet not trivial example to learn the basics.

saagarjha · 2025-06-07T10:07:27 1749290847

> This kernel archives a bandwidth of 1056.08 GB/s which is faster than the 875.46 GB/s we archived using CUDA. I believe that to be the reason because we use the PTX api for TMA transfers in Mojo.

I can't say for sure because I couldn't find the CUDA kernel but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all, which is mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you are doing is a transpose you don't need any of this to go fast (though it might simplify your indexing code…?)

simon_vtr · 2025-06-07T13:47:03 1749304023

The kernels I mention in CUDA use all the equivalent logic like the Mojo kernels. You can find them on my GitHub: https://github.com/simveit/effective_transpose You may want to provide a faster kernel on H100 via PR and I will merge after checking it’s faster.

londons_explore · 2025-06-06T23:33:19 1749252799

Why do we ever need to transpose a matrix?

Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?

hogepodge · 2025-06-07T00:25:15 1749255915

You're right that a good graph compiler will do this for you. There still may be times, like if you're interfacing with another library, where you'll need to switch a matrix between row major or column major layouts.

meindnoch · 2025-06-07T12:33:10 1749299590

Serious linear algebra libraries expect a flag that tells if elements are column-major or row-major.

throwawayabcdef · 2025-06-07T00:20:45 1749255645

The next operation might need the data in column major order to read it fast. So you might have to transpose first. And these maybe be concurrent stages of a processing pipeline.

viraptor · 2025-06-07T01:27:03 1749259623

Now I'm curious, how many times do you have to fully read the matrix in GPU for the total impact of reading columns to be higher than one-off actual transpose and then sequential row reads? I know it depends on lots of things, I'm after a rough estimate.

saagarjha · 2025-06-07T09:59:56 1749290396

It's quite rare. Usually problems are tiled anyway and you can amortize the cost of having data in the "wrong" layout by loading coalesced in whatever is the best layout for your data and then transposing inside your tile, which gives you access to much faster memory.

stephencanon · 2025-06-07T15:50:32 1749311432

The one pure transpose case that does come up occasionally is an in-place non-square transpose, where there is a rich literature of very fussy algorithms. If someone managed to make any headway with compiler optimization there, I'd be interested.

fulafel · 2025-06-07T06:14:46 1749276886

This could make Mojo look even better as it would ld be more compute heavy and the last step thread reduction would be less relevant.

totalperspectiv · 2025-06-07T15:36:45 1749310605

In the coarse graining code, you use an @parameter-for. Doesn’t that lead to some pretty large code size unrolling that? Or is that less of an issue on GPU?

Great write up! I learned a lot!

simon_vtr · 2025-06-07T16:09:30 1749312570

It doesn’t. The batch size is just 8. This is a very good trick and often needed to archive peak performance in memory bound kernels. You can checkout the equivalent code in cuda aswell :)

thunkingdeep · 2025-06-07T14:37:00 1749307020

Is the word archive used in place of achieve? I’m not sure if there is a terminology issue that I don’t understand in this post…

graycat · 2025-06-07T15:48:02 1749311282

Fast matrix transpose? Agree for a transposed matrix, just change the indexing arithmetic that converts row i and column j to an offset in the storage for the matrix and then remember that this is a transposed matrix. Some software object semantics could make this easy for other software to use.

jjtheblunt · 2025-06-07T19:23:30 1749324210

i think the problem with changing the indexing arithmetic is that you could end up with arithmetic incompatible with vector instructions in hardware that you're hoping to use for parallelism.

graycat · 2025-06-08T01:22:47 1749345767

> vector instructions

Gee, for the polar decomposition, Gauss-Seidel, etc., looked really hard for those in my IBM PC/XT and couldn't find any!!!

arjvik · 2025-06-06T20:09:11 1749240551

Where's the 14%? Looks like their final kernels show a 0.14% improvement of Mojo over the equivalent CUDA kernel?

77pt77 · 2025-06-06T20:18:42 1749241122

It looks because it does.

>(2771.35/2775.49 - 1) * 100 = -.14916285052369131300

Flagged.

timmyd · 2025-06-06T20:53:01 1749243181

Updated the title to the original. I did base the numbers on

"This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) which is still impressive

olaf · 2025-06-08T11:06:00 1749380760

AFAIK ‚let‘ was removed from the language, for me it‘s a big turn-off that the Python compatibility aspect has such a high priority. Or did I overlook something?

colesantiago · 2025-06-06T19:54:03 1749239643

Does anyone use Mojo in production at all or are even hiring for Mojo?

melodyogonna · 2025-06-07T09:30:34 1749288634

Modular (the company behind Mojo) uses it in production. I imagine that if they have any clients then those also use Mojo in production - albeit indirectly - since all the GPU kernels used by Modular are written in Mojo.

iandanforth · 2025-06-07T11:33:23 1749296003

I'm probably just ignorant but shouldn't the graphic of the tiled transpose have the green vector column-oriented in the final matrix?

somethingsome · 2025-06-07T14:05:58 1749305158

The colors are reading writing operations ;)

You have global memory and shared memory, the global is slower.

You read in rows in the global memory (faster than reading columns)

You write in columns in the shared memory (slower than in rows, but the shared memory is fast, this is the transpose operation)

You read in rows in the shared memory (very fast)

You write in rows in the global memory (faster than writing in columns)

The idea behind that tiling is to hide the slow part in a memory that is faster.

almostgotcaught · 2025-06-07T00:30:06 1749256206

As someone said below - you'd never write just a transpose kernel - it'll be fused into something else.

saagarjha · 2025-06-07T09:53:19 1749289999

Look the frontier AI companies need something other than reversing binary trees to give interview candidates

almostgotcaught · 2025-06-07T15:18:49 1749309529

No one is going to ask this on an interview. Used to be matmul. These days it's FA.

saagarjha · 2025-06-16T11:42:54 1750074174

I think these would be rather difficult to fit in a standard 1-hour interview slot

vlan121 · 2025-06-06T20:03:56 1749240236

Mojos compiler is closed source. Thats a big no-no

dgurchenkov · 2025-06-06T23:22:59 1749252179

I work on Mojo. The whole compiler, runtime etc. will get open sourced, most likely within a year. It is just a matter of time and us getting all the required work done.

https://docs.modular.com/mojo/faq/#open-source

xiphias2 · 2025-06-07T12:48:09 1749300489

,,will get open sourced'' means closed source, parent wrote the same

GeekyBear · 2025-06-07T15:56:42 1749311802

Chris Lattner (the CEO of Modular) was previously the technical lead behind the creation of LLVM, Clang and Swift, all of which were open sourced.

He has a bit of a track record already.

xiphias2 · 2025-06-07T17:53:15 1749318795

Sure, but at that time he was employed by Apple for example.

Now he's making a for profit company and there's already MAX and MAX Enterprise stuff to not trust that the open source part would be competitive with already great inferencing frameworks for example.

almostgotcaught · 2025-06-07T00:31:58 1749256318

> runtime

Are you talking about your libc equivalent or MAX?

dgurchenkov · 2025-06-07T16:41:55 1749314515

Both.

Mojo standard library is already open source. Mojo at the moment does not need a runtime (but if it ever needs one it'd get open sourced). My point was, Mojo as a whole, as a programming language & a reference implementation, will definitely get open sourced.

MAX itself is a bigger beast to work with, and I am out of my depth to talk about it. I think it'll get open sourced as well, just the timeline might be different (shorter or longer, IDK).

melodyogonna · 2025-06-07T07:11:18 1749280278

I wonder if there is a reason for not using the high level abstractions provided by Modular

saagarjha · 2025-06-07T10:01:05 1749290465

Most interesting algorithms (e.g. with dynamic shapes, mixed computation) are typically better scheduled by hand.

Q6T46nT668w6i3m · 2025-06-07T15:53:10 1749311590

Sure, but Modular’s mission was to provide abstractions to minimize these types of optimizations.

totalperspectiv · 2025-06-07T15:38:04 1749310684

I’d also add that Mojo is new, and people are still feeling it out by trying to 1:1 things with Cuda.

jsnell · 2025-06-06T20:09:30 1749240570

The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized, the original is "Highly efficient matrix transpose in Mojo".

Also, the improvement is 0.14%, not 14% making the editorialized linkbait particularly egregious.

timmyd · 2025-06-06T23:20:19 1749252019

[op here] To be clear: Yes, there are 3 kernels - you can see those in the linked github at the end of the article if you clicked that. These are:

transpose_naive - Basic implementation with TMA transfers

transpose_swizzle - Adds swizzling optimization for better memory access patterns

transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling

Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:

transpose_naive: 1056.08 GB/s (32.0025% of max)

transpose_swizzle: 1437.55 GB/s (43.5622% of max)

transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)

via the GitHub - simveit/efficient_transpose_mojo

Comparing to the CUDA implementations mentioned in the article:

Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s

Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s

Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s

So there is highly efficient matrix transpose in Mojo

All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).

The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.

jsnell · 2025-06-07T00:15:04 1749255304

Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).

Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.

timmyd · 2025-06-07T00:23:42 1749255822

thanks jsnell - i did they and they appreciated the comment above, and unflagged it. i appreciate it!

atomicapple · 2025-06-06T20:14:20 1749240860

I think the OP based the title off of "This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the final kernels for whatever reason

jebarker · 2025-06-06T20:13:00 1749240780

Yeah, it seems like the blog post is just meant to be an example of how to do something in Mojo and not a dunk on CUDA.

timmyd · 2025-06-06T23:45:57 1749253557

FWIW I didnt take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes it go faster thats awesome.

baal80spam · 2025-06-06T20:10:30 1749240630

0.14% is within the limits of statistical error. So this is a nothing-"article".

jsnell · 2025-06-06T20:14:21 1749240861

I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.

voronar · 2025-06-06T20:09:07 1749240547

Mr. Mojo Risin'

noracists · 2025-06-06T20:12:18 1749240738

htrp · 2025-06-06T20:02:42 1749240162

Left unsaid, the 14% improvement in performance came at the cost of increasing dev time by 35%

bravetraveler · 2025-06-06T20:13:40 1749240820

Reminds me of this, lol:

> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."

14% all the time vs 35% some of the time

edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone