I totally agree that the resulting kernel will rarely be useful. I just wanted to point out that it is a commonly used educational exercise for showing how to optimize for memory throughput. If the post showed how to fuse a transpose + RMSNorm epilogue into a GEMM, the kernel would be more useful in practice, but the blog post would be much harder for newcomers to follow.
Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.
It's less that the result is kind of useless and more that reaching peak memory throughput on a simple algorithm like this is not very difficult. It takes a more complex example before doing so actually becomes a challenge.