
Porting SSE code to AVX (with equivalent instructions and a proper vzeroupper) will increase performance in most cases (the only case where it can be slower, off the top of my head, is Sandy Bridge). The same is not true for AVX to AVX-512.


It will increase performance if you have a sufficient amount of dense data on input.

When that’s the case, especially if the numbers being crunched are 32-bit floats, there’s not much point in doing it on the CPU at all; GPGPUs are far more efficient for such tasks.

However, imagine sparse matrix * dense vector multiplication. If you rarely have more than 4 consecutive non-zero elements in the rows of the input matrix, with large gaps between the non-zero elements, moving from SSE to AVX or AVX-512 will decrease performance: you’ll just be wasting electricity multiplying by zeros.


So in some sense very similar to SKX behavior? The first implementation of an instruction set extension requires judicious use, while later implementations don't (which is something to be upset about... those "later implementations" should have been available quite some time ago).

This is also ignoring the fact that none of these penalties come into play if you use the AVX-512 instructions with 256-bit or 128-bit vectors. (That still has significant benefits, thanks to the much nicer set of shuffles, dedicated mask registers, etc.)


AVX to AVX-512 will "increase performance in most cases": https://www.researchgate.net/figure/Speedup-from-AVX-512-ove...




