I totally agree that the resulting kernel will rarely be useful. I just wanted t...

saagarjha · 2025-06-16T11:41:52 1750074112

It's less that the result is kind of useless and more that reaching peak memory throughput on a simple algorithm like this is not very difficult. It takes a more complex example to actually have trouble doing so.

simon_vtr · 2025-06-07T13:51:47 1749304307

That was exactly my reason for writing this blog post and optimising transpose. It is a simple, educational, yet nontrivial example for learning the basics.