So there is a highly efficient matrix transpose in Mojo.
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.
Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
transpose_naive - Basic implementation with TMA transfers
transpose_swizzle - Adds swizzling optimization for better memory access patterns
transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
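The swizzle step is the classic XOR trick for avoiding shared-memory bank conflicts in tiled transposes: each row of the tile stores its elements at a permuted column, so that reading a *column* of the tile hits distinct banks. A minimal Python sketch of the index math (illustrative only; the names and the exact swizzle in the repo may differ):

```python
# Hypothetical sketch of an XOR-swizzle pattern used to avoid shared-memory
# bank conflicts in tiled transpose kernels. Names are illustrative, not
# taken from the simveit/efficient_transpose_mojo repository.

TILE = 32  # tile width in elements (one element per bank)

def swizzle_col(row: int, col: int) -> int:
    """Map a logical column to a swizzled physical column within a tile row."""
    return col ^ (row % TILE)

# Within each row the swizzle is a permutation, so nothing is overwritten:
for row in range(TILE):
    assert {swizzle_col(row, c) for c in range(TILE)} == set(range(TILE))

# Reading one logical *column* across all rows now touches TILE distinct
# physical columns (i.e. distinct banks), so the transposed read is
# conflict-free:
for col in range(TILE):
    assert {swizzle_col(r, col) for r in range(TILE)} == set(range(TILE))
```

Thread coarsening ("batched") then just has each thread move several elements per tile, amortizing index arithmetic and increasing memory-level parallelism.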
Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:
transpose_naive: 1056.08 GB/s (32.0025% of max)
transpose_swizzle: 1437.55 GB/s (43.5622% of max)
transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via GitHub: simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
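The quoted speedups follow directly from these bandwidth figures; a quick check:

```python
# Verify the claimed speedups from the bandwidth numbers above.
mojo = {"naive": 1056.08, "swizzle": 1437.55, "batched": 2775.49}
cuda = {"naive": 875.46, "swizzle": 1251.76, "batched": 2771.35}

for name in mojo:
    speedup_pct = (mojo[name] / cuda[name] - 1) * 100
    print(f"{name}: +{speedup_pct:.1f}%")
# naive comes out to ~20.6% faster, swizzle ~14.8%, and batched ~0.1%
# (the 4.14 GB/s difference), matching the summary below.
```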