Hmm. It's a good resource, but I'm pretty bearish on autovectorization done in this manner.
After trying OpenCL / CUDA / ROCm-style programming, it's clear that writing explicitly parallel code is in fact easier than expected (albeit with a ton of study involved... but I bet anyone can learn it if they put their mind to it).
If CPU-SIMD is really needed for some reason, I expect that the languages that will be most convenient are explicitly-parallel systems like OpenMP or ISPC.
In particular, look at these restrictions: https://cvw.cac.cornell.edu/vector/coding_vectorizable
> The loop must be countable at runtime.
> There should be a single control flow within the loop.
> The loop should not contain function calls.
These three restrictions are extremely limiting!! CUDA / OpenCL / ROCm allow all of these constructs. Control flow may have terrible performance in CUDA / OpenCL, but it's allowed, because it's convenient: if the programmer can't think of any way to solve a problem aside from a few more if-statements / switch-statements, then we should let them, even if it's inefficient.
That's the thing: we know that SIMD machines, and languages designed for SIMD machines, can handle dynamic loop counts and more than one if-statement (albeit with a branch-divergence penalty). We also find it extremely convenient to decompose our problems into functions and sub-functions.
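As a concrete sketch (all names here are made up for illustration, this isn't from the linked resource): the loop below trips two of those rules at once, a data-dependent branch plus a call to a helper the compiler can't see into, and most autovectorizers will simply give up on it. The same per-element logic is unremarkable CUDA.

    // A shape autovectorizers tend to reject: a second control-flow path in
    // the loop body plus a call to a function they cannot see into.
    float process(float x);   // hypothetical helper defined in another file

    void cpu_filter(float *out, const float *in, int n, float limit) {
        for (int i = 0; i < n; ++i) {
            if (in[i] > limit)            // branch inside the loop
                out[i] = process(in[i]);  // opaque function call
            else
                out[i] = 0.0f;
        }
    }

    // The same logic per element is ordinary CUDA: the branch just becomes
    // (possibly divergent) execution, and __device__ calls are ordinary calls.
    __device__ float process_dev(float x);  // hypothetical device-side helper

    __global__ void gpu_filter(float *out, const float *in, int n, float limit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // explicit lane index
        if (i < n)                                      // runtime bound via a guard
            out[i] = (in[i] > limit) ? process_dev(in[i]) : 0.0f;
    }

The guard "if (i < n)" is also how the dynamic trip count gets handled on the GPU side: launch enough threads and mask off the excess.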
---------------------
OpenMP, in contrast, looks like it's learning from OpenCL / CUDA. The #pragma omp parallel for simd construct is moving closer and closer to CUDA/OpenCL parity, allowing for convenient "if" statements and dynamic loop counts.
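For instance, something along these lines compiles today with a recent GCC or Clang and -fopenmp (how well a given compiler actually vectorizes it is another matter); scale_clamped and saturate are placeholder names, and this is a sketch of the direction rather than a claim about any particular implementation:

    // OpenMP's answer to the "no function calls" rule: ask the compiler to
    // generate a vector version of the helper alongside the scalar one.
    #pragma omp declare simd
    float saturate(float v, float limit) {
        return (v > limit) ? limit : v;   // a branch inside the callee is fine
    }

    // Runtime trip count, a function call, and the intent to vectorize and
    // parallelize stated explicitly up front.
    void scale_clamped(float *out, const float *in, int n, float limit) {
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            out[i] = saturate(in[i] * 2.0f, limit);
    }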
CPU-SIMD is here to stay, and I think learning how to use it is very important. But autovectorization from the compiler (without any programmer assist) looks like a dead end. The compiler gets a LOT of help when the programmer states things in terms of "threadIdx.x" and other explicitly-SIMD concepts.
Besides, if the programmer is forced to learn all of these obscure rules / obscure programming methods (countable loops / only one control flow within the loop / etc. etc.), they're pretty much learning a sublanguage without any syntax to indicate that they've switched languages. Discoverability is really bad.
If I instead say "#pragma omp parallel for simd" before using some obscure OpenMP features, any C/C++ programmer these days will notice that something unusual is going on and search for those terms before reading the rest of the for-loop.
A for-loop written in "autovectorized" style has no such indicator, no such "discoverability" to teach non-SIMD programmers what the hell is going on.
---------
Kind of a shame, because this resource is excellently written and still worth a read IMO, even if I think the tech is a bit of a dead end.
> SIMD is generally written by hand in assembly. Libjpegturbo or ffmpeg have a lot of handwritten SIMD for various platforms.
Note that CPU-SIMD written in this manner is a subdiscipline called SWAR: SIMD-within-a-register. When we compare/contrast SWAR-SIMD (such as SSE / AVX / NEON / Altivec, etc. etc.) vs GPU-SIMD (NVidia CUDA, AMD ROCm), we notice that the two styles are extremely similar at the assembly language level (!!!).
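The name is easiest to see in its narrowest form, where the "register" is a plain 64-bit integer; the vector registers behind SSE / AVX / NEON are the same idea, just wider. A tiny sketch of the classic carry-isolating byte add:

    #include <stdint.h>

    // Add eight independent byte lanes packed into one uint64_t, without
    // letting carries bleed across lane boundaries (each lane wraps mod 256).
    static inline uint64_t add_bytes_swar(uint64_t x, uint64_t y) {
        const uint64_t HI = 0x8080808080808080ULL;  // top bit of every byte
        uint64_t low = (x & ~HI) + (y & ~HI);       // add the low 7 bits per lane
        return low ^ ((x ^ y) & HI);                // recombine the top bits
    }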
There are a few differences that lead to major performance contrasts, but when an AMD Vega GPU executes the "V_ADD_CO_U32" assembly instruction, it's very similar to the Intel AVX-512 "vpaddd" instruction (except the Vega GPU operates over 64x32-bit lanes, while AVX-512 is just 16x32-bit).
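On the CPU side, that vpaddd is what a one-line intrinsic compiles down to: sixteen 32-bit lanes per instruction, versus the 64 lanes of a Vega wavefront. A minimal sketch (assumes an AVX-512F target; the function name is made up):

    #include <immintrin.h>

    // c[i] = a[i] + b[i], 16 int32 lanes per iteration; _mm512_add_epi32
    // lowers to a single vpaddd over a 512-bit register.
    void add_i32_avx512(int *c, const int *a, const int *b, int n) {
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            _mm512_storeu_si512(c + i, _mm512_add_epi32(va, vb));
        }
        for (; i < n; ++i)   // scalar tail for the leftover elements
            c[i] = a[i] + b[i];
    }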
So at the assembly language level, I assert that GPU-SIMD and CPU-SIMD+SWAR are more similar than most people think.
------------
The primary difference is in the expectations of what the compiler can or can't do.
In particular: CPU-SIMD+SWAR programmers have been programming in either assembly language (back in the MMX days) or compiler intrinsics because... let's be frank... the compiler still sucks. Autovectorization sucks so much that our only choice is to dip down into intrinsics and/or assembly language directly.
In contrast, GPU-SIMD programmers have created more and more elaborate compilers / transformations to support a greater variety of programming structures. It's possible to program in a SIMD manner today using a "high-level language" like CUDA / OpenCL.
-------
Based on my experiments in the GPU-programming world, I'm convinced that the OpenMP committee "gets" it. They too seem to have seen the benefits of the higher-level abstractions that CUDA / OpenCL offer, and they are beginning to translate those abstractions into OpenMP. The #pragma omp simd constructs are marching in the right direction (albeit many years behind CUDA / OpenCL, but at least they're headed the right way).
Not all of us are willing to use intrinsics / raw assembly language for the rest of our days. :-) Some of us are looking for ways to bring the performance benefits of SIMD into a higher-level language. There will always be a place for intrinsics / raw assembly, but we definitely prefer "most" code to be written in a more easily understood manner.