There is a fourth way of writing SIMD code that is not among the three mentioned in the article: using language built-in features for SIMD.
For C and C++ these are the GCC vector extensions (also available in Clang); for Rust (nightly) it's std::simd.
Compared to the other methods, this gives you portability to different CPU architectures, the ability to use the basic arithmetic operators (+, -, *, /), vector widths wider than the CPU supports, and compiler optimizations that would be inhibited by inline assembly.
It may not cover every instruction available on the CPU, but there is a zero-cost fallback to intrinsics when you need a particular CPU instruction (e.g. _mm_rcp_ps).
I've written a significant amount of SIMD code in C, C++ and Rust using builtins and it's quite a pleasant experience compared to other methods.
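To make this concrete, here's a minimal sketch using the GCC/Clang vector extensions (the type and function names are just illustrative):

```c++
#include <immintrin.h>  // only needed for the intrinsic fallback below

typedef float v4f __attribute__((vector_size(16)));  // four packed floats

// Ordinary arithmetic operators compile to SIMD instructions.
v4f madd(v4f a, v4f b, v4f c) {
    return a * b + c;
}

// Zero-cost fallback to an intrinsic for an instruction the extensions
// don't expose directly (approximate reciprocal); vector types of the
// same size can be cast to and from __m128.
v4f recip(v4f a) {
    return (v4f)_mm_rcp_ps((__m128)a);
}
```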
It works and is easier to read, but there's an important catch: it is not obvious when the compiler falls back to an inefficient scalar implementation that might even be less efficient than not using SIMD at all. Example: https://godbolt.org/z/McKfxjbaj
The example you show has integer division and multiplication.
Neither SSE nor AVX has an integer division instruction, so neither GCC nor Clang will emit SIMD code for the division.
For the multiplication you need to bump GCC up to `-O3` to get a SIMD instruction, but Clang manages it even at `-O2`. Both emit a `vpmullq` instruction.
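For reference, the kind of code in question looks roughly like this (a sketch of the same operation mix, not the exact code behind the godbolt link):

```c++
typedef long long v4i64 __attribute__((vector_size(32)));  // four 64-bit ints

v4i64 scale(v4i64 a, v4i64 b, v4i64 c) {
    // With suitable -march flags the multiplication can become a single
    // vpmullq, but there is no SIMD integer division instruction, so the
    // division gets quietly scalarized into per-element divides.
    return (a * b) / c;
}
```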
At a general level, I agree with you: when writing SIMD code you need to keep an eye on your benchmark results and/or the compiler-generated assembly. Typically you're doing optimization work anyway when you reach for SIMD, so keeping a close eye on the performance is a good idea regardless.
The ever-popular Mr Lemire often uses this technique, I think it's way easier to follow than straight up assembly. See for instance [1], his post about prefix-recognition in strings (not randomly chosen, it was simply the first post with code that I found when I searched for "simd").
The article you link to uses Intel intrinsics, not built-in vector extensions.
Intrinsics are probably the most popular way of doing SIMD, but they are not portable to other CPUs and are quite difficult to discover due to shorthand naming conventions and the lack of higher-level documentation.
Each of the Intel intrinsic functions is well documented, but figuring out which ones you need out of the 600 or so available functions is not easy. The situation is worse for ARM.
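For comparison, the same multiply-add from the earlier sketch written with Intel intrinsics (illustrative only): it works only on x86, and the names have to be looked up rather than guessed.

```c++
#include <immintrin.h>

// _mm_mul_ps / _mm_add_ps: "multiply/add packed single-precision" floats.
__m128 madd_sse(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
```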
> Each of the Intel intrinsic functions is well documented, but figuring out which ones you need out of the 600 or so available functions is not easy.
So true. One of the most valuable skills to develop when using Intel intrinsics is fluent navigation of, and intuition for, the intrinsics library.
Nowadays things are much easier though: there are a lot of indexing websites and tools where you can search for and find intrinsics very quickly. Intel already had an interactive intrinsics guide 10 years ago.
I believe these drawbacks are mostly solved by our github.com/google/highway library, which provides portable intrinsics, reasonably documented and grouped into categories :)
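From memory, the core loop with Highway looks roughly like this (check the library's quickstart for the exact pattern; the function and variable names are just an example, and remainder handling is omitted):

```c++
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Adds two float arrays; the same source targets SSE/AVX/NEON/SVE/etc.
void AddArrays(const float* a, const float* b, float* out, size_t n) {
    const hn::ScalableTag<float> d;   // widest vector on the target
    const size_t lanes = hn::Lanes(d);
    for (size_t i = 0; i + lanes <= n; i += lanes) {
        const auto va = hn::Load(d, a + i);
        const auto vb = hn::Load(d, b + i);
        hn::Store(hn::Add(va, vb), d, out + i);
    }
    // Remainder elements omitted for brevity.
}
```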
C# has had a stable cross-platform SIMD API since .NET 7 (mind you, it had x86 and ARM intrinsics even before that, including SIMD, but version 7 added the API unification of Vector128/256/512, the last of those arriving in .NET 8, giving a single code path for multiple platforms).
It is very nice to use and produces readable(!) code.
No, autovectorization is a lie and doesn't work. We're not stupid, we did it the way the article does it for a reason.
The main problems preventing it from working are that the compiler doesn't have enough information about alignment (especially if it's intentionally unaligned) or aliasing.
It tends to get things especially wrong when it autovectorizes code that was already partly hand vectorized; it ends up emitting the preludes (which correct for misalignment) twice.
It does work better for languages like Fortran that were designed for it; contrary to popular belief, C is not actually a low-level language and isn't that easy to optimize.
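A small sketch of the aliasing problem (illustrative names): without a no-overlap guarantee the compiler must assume dst and src may alias, so it either keeps the loop scalar or guards the vectorized version with runtime overlap checks.

```c++
// Possible aliasing between dst and src limits the vectorizer.
void scale(float* dst, const float* src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// Promising "no overlap" with restrict removes that obstacle
// (GCC/Clang spell it __restrict in C++).
void scale_noalias(float* __restrict dst, const float* __restrict src,
                   float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```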
It’s been a while (10 years?) since I wrote NEON and vf code (different arch but sentiment remains) and every now and then I used to run a quick disassembly to verify that my optimisation hadn’t been undone by the compiler. The first time it happens you kinda lose faith in the compiler.
There are (at least) two vectorisers in LLVM. One does loops, the other acts on basic blocks. https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer. Both really like alignment metadata on pointers. They certainly have limitations, but the hit rate is reasonably good.
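One way to hand that alignment information to the vectorisers in GCC/Clang is __builtin_assume_aligned (a sketch; names are illustrative):

```c++
// Telling the compiler the pointer is 32-byte aligned lets it skip the
// misalignment prelude when vectorizing the loop.
void add_one(float* p, int n) {
    float* ap = static_cast<float*>(__builtin_assume_aligned(p, 32));
    for (int i = 0; i < n; ++i)
        ap[i] += 1.0f;
}
```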