
Using AVX/FMA and unrolling loops gets you very little of the way toward fast (>80% of peak) GEMM code. These are very much intro steps that don't take into account many important ideas: cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for a real high-performance GEMM.
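
To make that concrete, here is a minimal sketch (not any library's actual code) of the cache-blocked loop structure that fast GEMMs are built around. The block sizes MC/KC/NC are placeholders a real implementation would tune to the cache hierarchy, and real libraries additionally pack panels of A and B into contiguous buffers and hand-write the register-tiled micro-kernel:

    /* Sketch of a cache-blocked GEMM loop nest (C += A*B, row-major).
       Assumes m, n, k are multiples of the block sizes; MC/KC/NC are
       illustrative values, not tuned ones. */
    enum { MC = 128, KC = 256, NC = 512 };

    void gemm_blocked(int m, int n, int k,
                      const double *A, const double *B, double *C)
    {
        for (int jc = 0; jc < n; jc += NC)          /* NC columns of B/C */
            for (int pc = 0; pc < k; pc += KC)      /* KC-deep panel     */
                for (int ic = 0; ic < m; ic += MC)  /* MC rows of A/C    */
                    /* this macro-tile now fits in cache; a real library
                       calls a hand-written MR x NR micro-kernel here */
                    for (int i = ic; i < ic + MC; ++i)
                        for (int j = jc; j < jc + NC; ++j) {
                            double acc = C[i * n + j];
                            for (int p = pc; p < pc + KC; ++p)
                                acc += A[i * k + p] * B[p * n + j];
                            C[i * n + j] = acc;
                        }
    }

The point is that the loop order, blocking, and packing are where most of the performance comes from; the vectorization a compiler can do on its own only pays off once the data movement is already organized like this.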


I just did a test of Intel-compiled BLAS against OpenBLAS, and it came out at about 6 GFLOP/s vs 150 GFLOP/s, so I must admit that I was wrong here. Maybe in some sense 4% is not bad, but it's certainly not good. My faith in current compilers has been shaken quite a bit today.
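
For anyone reproducing numbers like these: GEMM GFLOP/s figures are conventionally computed as 2·m·n·k operations over wall time. A minimal timing sketch, assuming a CBLAS implementation such as OpenBLAS is linked and using a hypothetical n = 2048:

    /* Minimal GEMM timing sketch. Assumes a CBLAS library is linked
       (e.g. -lopenblas); n = 2048 is just a placeholder size. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 2048;
        double *A = malloc(sizeof *A * n * n);
        double *B = malloc(sizeof *B * n * n);
        double *C = calloc((size_t)n * n, sizeof *C);
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 0.5; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.1f GFLOP/s\n", 2.0 * n * n * n / secs / 1e9); /* 2*m*n*k flops */
        free(A); free(B); free(C);
        return 0;
    }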

Anyway, I have come to eat crow. Thank you for your insight and for helping me get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.


The gap between matrix multiplication implementations is enormous. It gets even more extreme on GPUs, where I've seen the difference between a naïve kernel and cuBLAS reach 1000x, possibly 10000x. I have a lot of faith in myself as an optimization person to be able to beat compilers. I can even beat MKL and hipBLAS if I focus on specific shapes and sizes. But trying to beat cuBLAS at anything makes me feel like Saddam Hussein when they pulled him out of that bunker.


I'm sure there's more to it, but just comparing the profile output shows aggressive use of prefetch and broadcast instructions.
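
For what it's worth, here is a minimal AVX2/FMA fragment showing the broadcast-plus-prefetch pattern that shows up in such kernels. It is illustrative only; the offsets, prefetch distance, and single accumulator are placeholders, not anyone's production kernel:

    /* Inner-loop fragment: splat one element of A, FMA against a strip of B,
       prefetch ahead. Compile with -mavx2 -mfma. */
    #include <immintrin.h>

    void fma_broadcast_demo(const double *a, const double *b, double *c, int k)
    {
        __m256d acc = _mm256_loadu_pd(c);                 /* 4 doubles of C */
        for (int p = 0; p < k; ++p) {
            _mm_prefetch((const char *)(b + 4 * (p + 8)), _MM_HINT_T0);
            __m256d a0 = _mm256_broadcast_sd(a + p);      /* splat A(i,p)   */
            __m256d b0 = _mm256_loadu_pd(b + 4 * p);      /* B(p, j..j+3)   */
            acc = _mm256_fmadd_pd(a0, b0, acc);           /* acc += a0*b0   */
        }
        _mm256_storeu_pd(c, acc);
    }

A real kernel keeps a full MR x NR tile of accumulators in registers and unrolls over p; this only shows why broadcast and prefetch dominate the instruction mix.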


BLIS does that in its kernels. I've tried it but was never able to get better than half as good as MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write the output.
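
A toy sketch of why parallelizing over k forces that synchronization: each thread produces a partial product over its own k-slice, and the partial sums all land on the same entries of C, so the accumulation has to go through atomics or a lock. OpenMP is used here only to keep the example short; this is the shape of the problem, not BLIS's code:

    /* Split-k parallel GEMM sketch: per-thread partial products over a slice
       of k, then synchronized accumulation into the shared C. */
    #include <stdlib.h>
    #include <omp.h>

    void gemm_split_k(int m, int n, int k,
                      const double *A, const double *B, double *C)
    {
        #pragma omp parallel
        {
            int nt = omp_get_num_threads(), t = omp_get_thread_num();
            int p0 = (int)((long long)k * t / nt);
            int p1 = (int)((long long)k * (t + 1) / nt);
            double *Cpart = calloc((size_t)m * n, sizeof *Cpart);

            for (int i = 0; i < m; ++i)
                for (int p = p0; p < p1; ++p)
                    for (int j = 0; j < n; ++j)
                        Cpart[i * n + j] += A[i * k + p] * B[p * n + j];

            /* every thread touches the same C entries, hence the atomic
               (an array of locks over tiles of C is the other option) */
            for (int i = 0; i < m * n; ++i) {
                #pragma omp atomic
                C[i] += Cpart[i];
            }
            free(Cpart);
        }
    }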


I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which presented the code block alongside NumPy's performance. Give that code to a 2024 Fortran compiler and it will get enough right to produce reasonable machine code.



