There is a fourth way of writing SIMD code that is not among the three mentioned in the article: using language built-in features for SIMD.
For C and C++ these are the GCC vector extensions (also available in Clang); for Rust (nightly) it's std::simd.
Compared to the other methods, this gives you portability to different CPU architectures, the ability to use the basic arithmetic operators (+, -, *, /), vector widths wider than the CPU supports, and compiler optimizations that would be inhibited by inline assembly.
It may not cover every instruction available on the CPU, but there is a zero-cost fallback to intrinsics when you need a particular CPU instruction (e.g. _mm_rcp_ps).
I've written a significant amount of SIMD code in C, C++ and Rust using builtins and it's quite a pleasant experience compared to other methods.
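To make this concrete, here's a minimal sketch using the GCC/Clang vector extensions (the type and function names are just illustrative):

```c++
#include <immintrin.h>  // only needed for the intrinsic fallback below

typedef float v4f __attribute__((vector_size(16)));  // four packed floats

// Ordinary arithmetic operators compile to SIMD instructions.
v4f madd(v4f a, v4f b, v4f c) {
    return a * b + c;
}

// Zero-cost fallback to an intrinsic for an instruction the extensions
// don't expose directly (approximate reciprocal); vector types of the
// same size can be cast to and from __m128.
v4f recip(v4f a) {
    return (v4f)_mm_rcp_ps((__m128)a);
}
```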
It works and is easier to read, but there's an important catch: it is not obvious when the compiler falls back to an inefficient scalar implementation that might even be less efficient than not using SIMD at all. Example: https://godbolt.org/z/McKfxjbaj
The example you show has integer division and multiplication.
Neither SSE nor AVX has an integer division instruction, so neither GCC nor Clang will emit SIMD code for the division.
For the multiplication you need to bump GCC up to `-O3` to get a SIMD instruction, but Clang manages it even at `-O2`. Both emit a `vpmullq` instruction.
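For reference, the kind of code in question looks roughly like this (a sketch of the same operation mix, not the exact code behind the godbolt link):

```c++
typedef long long v4i64 __attribute__((vector_size(32)));  // four 64-bit ints

v4i64 scale(v4i64 a, v4i64 b, v4i64 c) {
    // With suitable -march flags the multiplication can become a single
    // vpmullq, but there is no SIMD integer division instruction, so the
    // division gets quietly scalarized into per-element divides.
    return (a * b) / c;
}
```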
At a general level, I agree with you: when writing SIMD code you need to keep an eye on your benchmark results and/or the compiler-generated assembly. Typically you're doing optimization work anyway when you reach for SIMD, so keeping a close eye on the performance is a good idea regardless.
The ever-popular Mr Lemire often uses this technique, I think it's way easier to follow than straight up assembly. See for instance [1], his post about prefix-recognition in strings (not randomly chosen, it was simply the first post with code that I found when I searched for "simd").
The article you link to uses Intel intrinsics, not built-in vector extensions.
Intrinsics are probably the most popular way of doing SIMD, but they are not portable to other CPUs and are quite difficult to discover due to shorthand naming conventions and the lack of higher-level documentation.
Each of the Intel intrinsic functions is well documented, but figuring out which ones you need out of the 600 or so available functions is not easy. The situation is worse for ARM.
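For comparison, the same multiply-add from the earlier sketch written with Intel intrinsics (illustrative only): it works only on x86, and the names have to be looked up rather than guessed.

```c++
#include <immintrin.h>

// _mm_mul_ps / _mm_add_ps: "multiply/add packed single-precision" floats.
__m128 madd_sse(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
```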
> Each of the Intel intrinsic functions is well documented, but figuring out which ones you need out of the 600 or so available functions is not easy.
So true. One of the most valuable skills to develop when using Intel intrinsics is fluent navigation of, and intuition for, the intrinsics library.
Nowadays things are much easier though: there are a lot of indexing websites and tools where you can search for and find intrinsics very quickly. Intel already had an interactive intrinsics guide 10 years ago.
I believe these drawbacks are mostly solved by our github.com/google/highway library, which provides portable intrinsics, reasonably documented and grouped into categories :)
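From memory, the core loop with Highway looks roughly like this (check the library's quickstart for the exact pattern; the function and variable names are just an example, and remainder handling is omitted):

```c++
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Adds two float arrays; the same source targets SSE/AVX/NEON/SVE/etc.
void AddArrays(const float* a, const float* b, float* out, size_t n) {
    const hn::ScalableTag<float> d;   // widest vector on the target
    const size_t lanes = hn::Lanes(d);
    for (size_t i = 0; i + lanes <= n; i += lanes) {
        const auto va = hn::Load(d, a + i);
        const auto vb = hn::Load(d, b + i);
        hn::Store(hn::Add(va, vb), d, out + i);
    }
    // Remainder elements omitted for brevity.
}
```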
C# has had a stable cross-platform SIMD API since .NET 7 (mind you, it had x86 and ARM intrinsics even before that, including SIMD, but version 7 added the API unification of Vector128/256/512, the last of those arriving in .NET 8, giving a single code path for multiple platforms).
It is very nice to use and produces readable(!) code.
No, autovectorization is a lie and doesn't work. We're not stupid, we did it the way the article does it for a reason.
The main problems preventing it from working are that the compiler doesn't have enough information about alignment (especially if it's intentionally unaligned) or aliasing.
It tends to get things especially wrong when it autovectorizes code that was already partly hand vectorized; it ends up emitting the preludes (which correct for misalignment) twice.
It does work better for languages like Fortran that were designed for it; contrary to popular belief, C is not actually a low-level language and isn't that easy to optimize.
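A small sketch of the aliasing problem (illustrative names): without a no-overlap guarantee the compiler must assume dst and src may alias, so it either keeps the loop scalar or guards the vectorized version with runtime overlap checks.

```c++
// Possible aliasing between dst and src limits the vectorizer.
void scale(float* dst, const float* src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// Promising "no overlap" with restrict removes that obstacle
// (GCC/Clang spell it __restrict in C++).
void scale_noalias(float* __restrict dst, const float* __restrict src,
                   float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```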
It’s been a while (10 years?) since I wrote NEON and vf code (different arch but sentiment remains) and every now and then I used to run a quick disassembly to verify that my optimisation hadn’t been undone by the compiler. The first time it happens you kinda lose faith in the compiler.
There are (at least) two vectorisers in LLVM. One does loops, the other acts on basic blocks. https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer. Both really like alignment metadata on pointers. They certainly have limitations, but the hit rate is reasonably good.
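One way to hand that alignment information to the vectorisers in GCC/Clang is __builtin_assume_aligned (a sketch; names are illustrative):

```c++
// Telling the compiler the pointer is 32-byte aligned lets it skip the
// misalignment prelude when vectorizing the loop.
void add_one(float* p, int n) {
    float* ap = static_cast<float*>(__builtin_assume_aligned(p, 32));
    for (int i = 0; i < n; ++i)
        ap[i] += 1.0f;
}
```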