You have two kinds of memory: global memory and shared memory. Global memory is much slower.
You read rows from global memory (consecutive threads touch consecutive addresses, so the reads are coalesced and much faster than reading columns).
You write those values as columns into shared memory (slower than writing rows, but shared memory is fast; this step is the actual transpose).
You read rows from shared memory (very fast).
You write rows to global memory (coalesced again, so much faster than writing columns).
The idea behind the tiling is to move the slow, strided access pattern out of global memory and into shared memory, where it costs far less.
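
The steps above can be sketched as a CUDA kernel. This is only a minimal illustration under a few assumptions that are not in the original answer: the names `transposeTiled` and `TILE`, a 32x32 tile, row-major `float` matrices, and an extra padding column in the shared tile, which is a common way to avoid bank conflicts on the column-wise writes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile size, one thread per tile element

// in is rows x cols (row-major), out is cols x rows (row-major).
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // The +1 padding keeps the column-wise writes below free of bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input

    // Read a row from global memory (coalesced),
    // write it as a column into shared memory (the transpose).
    if (x < cols && y < rows)
        tile[threadIdx.x][threadIdx.y] = in[y * cols + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output

    // Read a row from shared memory, write a row to global memory (coalesced).
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.y][threadIdx.x];
}

int main()
{
    const int rows = 100, cols = 70;  // deliberately not multiples of TILE
    std::vector<float> hIn(rows * cols), hOut(cols * rows);
    for (int i = 0; i < rows * cols; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Check out[c][r] == in[r][c] for every element.
    bool ok = true;
    for (int r = 0; r < rows && ok; ++r)
        for (int c = 0; c < cols; ++c)
            if (hOut[c * rows + r] != hIn[r * cols + c]) { ok = false; break; }

    std::printf("%s\n", ok ? "transpose ok" : "transpose MISMATCH");
    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

Both global accesses index memory with `threadIdx.x` as the fastest-varying term, so neighbouring threads hit neighbouring addresses; the only strided access is the write into `tile`, which lives in shared memory.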