At first, there are 16 fetches per row × column pair, 1024 in total. Then it is observed that an input row needs to be fetched only once per output row, reducing the count to 8 fetches per row plus 8 per row × column pair: 8 * 8 + 8 * 64 = 576 in total. This requires the same 16 numbers to be kept in registers.
But then it is claimed that by doing one quadrant at a time, all that is needed is 64 fetches per quadrant, or 256 fetches in total. But that assumes we can keep 4 rows and 4 columns, at 8 numbers per row or column, i.e. 64 numbers, in registers! If we can only keep 16 numbers as above, each row of the quadrant is going to take 40 fetches, and we get 160 fetches per quadrant, or 640 fetches in total: a pessimization relative to the 576 above!
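To make the counting concrete, here is a sketch (pure Python; an 8×8 matmul is assumed, as in the post) that reproduces the four fetch counts above:

```python
# Count scalar fetches for an 8x8 matmul C = A @ B under the register
# strategies discussed above. This only counts fetches; no arithmetic.
N = 8

# Naive: fetch a full A row and B column for every output element.
naive = N * N * (N + N)  # 16 fetches per element -> 1024

# Row reuse (16 registers): fetch each A row once per output row,
# but refetch the B column for every output element.
row_reuse = N * N + N * N * N  # 8*8 + 8*64 -> 576

# Quadrant, 64 registers: fetch 4 A rows and 4 B columns once per
# 4x4 output quadrant; 4 quadrants in total.
quadrant_64regs = 4 * (4 * N + 4 * N)  # 4 * 64 -> 256

# Quadrant, but only 16 registers: per quadrant row, fetch one A row (8)
# plus all 4 B columns again (32) -> 40; 4 rows x 4 quadrants.
quadrant_16regs = 4 * (4 * (N + 4 * N))  # 4 * 160 -> 640

print(naive, row_reuse, quadrant_64regs, quadrant_16regs)
# -> 1024 576 256 640
```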
That’s a valid point - I’m assuming infinite register capacity at that point in the post.
The next section discusses what you’re talking about, i.e. how to deal with finite register/shared-memory capacity by splitting the k dimension. I’ll mention the shared/register memory limitation sooner to avoid confusion.
The main problem with your blog post is that it beats around the bush rather than getting to the point. Overall, it feels like the post explains tiling in the reverse order of what is needed to understand it.
"How effective is tiling?" and "Why tiling tiling is so fast" should be at the end, while the key section "Why there's a limit to tiling" which should be front and center is in the middle, followed by a subversion of the entire concept in "How to sidestep tiling limits"
It's also incredibly jarring to read this:
"Wondering how we were able to reduce memory usage "for free"? Indeed, the reduction wasn't free. In fact, we paid for this reduction a different way — by incurring more writes."
This is, again, completely backwards. Let's assume you don't have a cache at all: you'd have to write everything out to DRAM every single time. The opposite is also true: imagine you had an infinite number of registers. Every addition operation would accumulate into a register, which is a write operation. Hence, the number of write operations doesn't change.
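A sketch of that claim (pure Python, N = 8 assumed): every multiply-add performs exactly one accumulating write, so the write count is N³ no matter whether the accumulator lives in DRAM, cache, or a register; tiling only changes the destination.

```python
# Count accumulating writes in a plain N x N matmul.
N = 8
A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

writes = 0
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]  # one write per multiply-add
            writes += 1

print(writes)  # -> 512, i.e. N**3, independent of loop order or tiling
```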
Really, the main points should be in this order:
1. Matrix multiplication works best with square or almost-square matrices.
2. Registers and SRAM (including caches) are limited, forcing you to process matrices of finite size (aka tiles).
3. The memory hierarchy means that the biggest matrix you can store grows at each level of the hierarchy.
4. You can split matrix multiplication using inner and outer products.
5. Outer products take few inputs and have many outputs/accumulators; inner products take many inputs and have few outputs/accumulators.
6. You want to calculate the biggest outer product you can get away with, since this significantly reduces the memory needed to store inputs and maximizes the number of cycles spent doing calculations. Once you hit the limit, you want to reuse the accumulators, so you calculate inner products of outer products.
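Points 4–6 can be sketched in a few lines of NumPy (8×8 matrices assumed for illustration): the full product is a sum of outer products over k, each consuming 2N inputs while updating all N² accumulators.

```python
import numpy as np

# C = A @ B expressed as a sum of rank-1 outer products over k.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))

C = np.zeros((8, 8))
for k in range(8):
    # One column of A and one row of B (16 inputs) update all 64 accumulators.
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```

Splitting the k loop into chunks gives the "inner products of outer products" in point 6: each chunk is an outer-product tile, and the chunks are summed into the same accumulators.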
I see, thanks for the feedback - the current blog post’s flow certainly isn’t optimal. I’ll try reordering to eliminate jarring bits and see how it flows.