Writing x86 SIMD using x86inc.asm (2017)

exDM69 · 2024-03-26T08:15:02 1711440902

There is a fourth way of writing SIMD code that is not among the three mentioned in the article: using language built-in features for SIMD.

For C and C++ this is GCC vector extensions (also available in Clang), for Rust (nightly) it's std::simd.

Compared to other methods, this gives portability to different CPU architectures, ability to use basic arithmetic operators (+, -, *, /), vector widths wider than the CPU supports and compiler optimizations that would be inhibited by inline assembly.

It may not cover every instruction available in the CPU, but there is a zero cost fallback to intrinsics when you need a particular CPU instruction (e.g. __mm_rcp_ps).

I've written a significant amount of SIMD code in C, C++ and Rust using builtins and it's quite a pleasant experience compared to other methods.

aengelke · 2024-03-26T10:08:39 1711447719

It works and is easier to read, but there's an important catch: it is not obvious when the compiler falls back to an inefficient scalar implementation that might even be less efficient than not using SIMD at all. Example: https://godbolt.org/z/McKfxjbaj

exDM69 · 2024-03-26T11:12:00 1711451520

The example you show has integer division and multiplication.

SSE or AVX instructions don't have an integer division instruction so neither GCC or Clang will emit SIMD code for the division.

For the multiplication you need to bump up GCC to `-O3` optimization to get SIMD instruction but Clang will give the correct answer even on `-O2`. Both emit a `vpmullq` instruction.

At a general level, I agree with you: when writing SIMD code you need to keep an eye on your benchmark results and/or the compiler generated assembly. Typically you're doing some optimization work when working with SIMD so keeping a close eye one the performance is a good idea anyways.

unwind · 2024-03-26T08:38:52 1711442332

Very good point, thanks!

The ever-popular Mr Lemire often uses this technique, I think it's way easier to follow than straight up assembly. See for instance [1], his post about prefix-recognition in strings (not randomly chosen, it was simply the first post with code that I found when I searched for "simd").

[1]: https://lemire.me/blog/2023/07/14/recognizing-string-prefixe...

exDM69 · 2024-03-26T08:47:04 1711442824

The article you link to uses Intel intrinsics, not built-in vector extensions.

Intrinsics are probably the most popular method of doing SIMD but they are not portable to other CPUs and are quite difficult to discover due to shorthand naming conventions and lack of general level documentation.

Each of the Intel intrinsic functions are well documented but figuring out which ones you need out of the 600 or so available functions is not easy. Situation is worse for ARM.

Galanwe · 2024-03-26T12:27:22 1711456042

> Each of the Intel intrinsic functions are well documented but figuring out which ones you need out of the 600 or so available functions is not easy.

So true. One of the most valuable skill to grow when using Intel intrinsics is a fluent navigation and intuition of the intrinsics library.

Nowadays things are much easier though, there are a lot of indexing websites and softwares where you can query, search, find intrinsics very quickly. Intel had an interactive intrinsics doc already 10 years ago.

unwind · 2024-03-26T09:52:49 1711446769

Oops, that's true! I should have dug harder. Thanks for pointing that out.

janwas · 2024-03-26T10:33:58 1711449238

I believe these drawbacks are mostly solved with our github.com/google/highway library which provides portable intrinsics, reasonably documented and grouped into categories :)

neonsunset · 2024-03-26T12:29:53 1711456193

C# has stable cross-platform SIMD API since .NET 7 (mind you, it had x86 and ARM intrinsics even before, including SIMD, but version 7 added API unification of Vector128/256/512(8) to have a single code path for multiple platforms).

It is very nice to use and produces readable(!) code.

pjmlp · 2024-03-26T12:08:02 1711454882

.NET and Java (in preview) also have SIMD support.

taminka · 2024-03-26T09:33:14 1711445594

aren’t compiler built in simd vector types practically equivalent to what the compiler would vectorise scalar code to anyway

astrange · 2024-03-26T17:18:25 1711473505

No, autovectorization is a lie and doesn't work. We're not stupid, we did it the way the article does it for a reason.

The main problems preventing it from working are that the compiler doesn't have enough information about alignment (especially if it's intentionally unaligned) or aliasing.

It tends to get things especially wrong when it autovectorizes code that was already partly hand vectorized; it ends up emitting the preludes (which correct for misalignment) twice.

It does work better for languages like Fortran made for it - contrary to popular belief C is not actually a low level language and isn't that easy to optimize.

huppeldepup · 2024-03-26T10:50:58 1711450258

It’s been a while (10 years?) since I wrote NEON and vf code (different arch but sentiment remains) and every now and then I used to run a quick disassembly to verify that my optimisation hadn’t been undone by the compiler. The first time it happens you kinda lose faith in the compiler.

exDM69 · 2024-03-26T11:02:06 1711450926

No, compilers may be able to vectorize basic loops, but they won't for example turn four adjacent additions into a vector addition.

JonChesterfield · 2024-03-26T11:24:49 1711452289

There's (at least) two vectorisers in LLVM. One does loops, the other acts on basic blocks. https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer. Both really like alignment metadata on pointers. They certainly have limitations but the hit rate is reasonably good.

JonChesterfield · 2024-03-26T08:46:44 1711442804

This turns out to be a lot of assembly macros to help write one x86 assembly. https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/x...

The sibling comment recommending compiler intrinsics is probably the best way to go for writing SIMD code. A mixture of `<i32 x 16>` style types and intrinsics to specify instructions is a solid 90% solution compared to assembly.

If you want that last 10%, I think macros are putting the emphasis in the wrong place. They're a somewhat easy way to build up a language abstraction which will work if held carefully, but I'm confident the dev experience using this abstraction when you write invalid code will be deeply confusing.

I would suggest to write a parser instead of the macros. That'll tell you clearly when the syntax is invalid (though possibly not with much precision) and it'll give you a place to put semantic analysis for where valid syntax encodes nonsense. Do the equivalent of the macro expansions on the parsed tree instead of on the text. Emit asm as the "back end".

anonymoushn · 2024-03-26T14:01:33 1711461693

So far intrinsics supplemented with functions that wrap single instructions cover my needs. The thing I occasionally want to be easier is manually configurable unrolling. For example if you have some kernel that processes the input in 3 streams, or that loads 5 registers worth of input from a single stream and processes all of those in the same way, the resulting code duplication is a useless place where typos can give you a bad time. A macro system slightly different from the one C actually has could let you get rid of the duplication and just change the constant to control the unrolling.

vkaku · 2024-03-26T12:14:39 1711455279

This ffmpeg stuff is great. Recently they highlighted the need [1] for next gen of low level developers, so this is a good opportunity for those who want to embrace this.

Getting closer to the machine is the best way to improve speed of today's software.

1 - https://twitter.com/FFmpeg/status/1772063313452466239?s=19

natsch · 2024-03-26T12:53:01 1711457581

While I agree with the sentiment, I also feel the reverse is true. A lot of companies are hiring specifically for AI/ML at the moment, because it's the current thing. Performance and efficiency of their applications are somehow seldom prioritized.

neonsunset · 2024-03-26T11:45:03 1711453503

Sad to not see C# being mentioned here. It is one of the most convenient languages to write bespoke SIMD code in today.

Examples:

- https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

- https://github.com/dotnet/runtime/blob/b432fc66fe59873b2b2ad...

- https://github.com/bepu/bepuphysics2/blob/05e3e8d33f5cbfd780...

- https://github.com/U8String/U8String/blob/main/Sources/U8Str...

Const-me · 2024-03-26T11:59:49 1711454389

Note assembly language is OS-specific because different OSes have different ABI. Unfortunately, this is very relevant to SIMD https://learn.microsoft.com/en-us/cpp/build/x64-software-con... And causes real-life bugs in the software with pieces written in assembly.

I think this makes assembly poor choice for writing software in modern ecosystem. Modern C and C++ compilers have SIMD intrinsics. SIMD intrinsics are OS agnostic, modern compilers follow these ABI conventions carefully.

SIMD intrinsics are ISA-specific just like assembly. Unlike higher-level abstraction libraries or automatic vectorizers, SIMD intrinsics allow to match performance of the assembly.

kierank · 2024-03-26T12:03:54 1711454634

x86inc.asm takes care of the different OS ABI differences for you.

p0nce · 2024-03-26T08:51:13 1711443073

Nowadays this is a fringe way to write SIMD code, if you use something like simd-everywhere you will be able to write SIMD code for ARM, x86, RISC-V... who share quite a bit of common ground. It's easy to think we can schedule instructions and spill registers better than compilers but only is some extreme cases. Assembly is also longer to write than intrinsics and harder to read, and doesn't get better over time as the compiler improves.

astrange · 2024-03-26T17:26:06 1711473966

> Assembly is also longer to write than intrinsics and harder to read, and doesn't get better over time as the compiler improves.

Intel intrinsics are actually kind of horrible to read, because they use Hungarian notation which is one of those strange galapagos Microsoft ideas. One reason we used assembly is that it's easier.

> It's easy to think we can schedule instructions and spill registers better than compilers but only is some extreme cases.

With GCC it used to happen every time. Modern compilers do it slightly less than every time, but they will still get the memory accesses wrong. (as I said in every other comment, C doesn't have enough information about memory alignment or aliasing.)

kookamamie · 2024-03-26T12:12:29 1711455149

Or, instead of this pretty arcane "templating" for different instruction sets, one could write some ISPC.

astrange · 2024-03-26T17:51:40 1711475500

ISPC is a better solution than autovectorization for sure, but it still wouldn't work well enough.

Glacia · 2024-03-26T09:34:37 1711445677

It's funny reading comments here, Hackernews bros really think they're smarter than the guys who made ffmpeg/x264/x265 etc. Dunning–Kruger effect in action.

JonChesterfield · 2024-03-26T11:32:35 1711452755

First you write in C. Then it's too slow, you write parts in asm. After a while you have a lot of asm. You lean harder on the macro processor. At some point you port to another processor and decide the macro processor can handle that. Bang, x86inc.asm.

That doesn't make the end point optimal. Nor does it mean it's what the authors would have done from a clean slate. At each step you take the sensible choice and after a long trek down the gradient you end up somewhere like this.

Given a desire to write something analogous to these codecs today, should you copy their development path? Should you try to copy the end result, maybe even using the same tools?

Your argument from authority amounts to "these guys are clever, you should imitate them". There are failure modes in that line of thinking which I hope the above makes clear.

astrange · 2024-03-26T17:22:04 1711473724

We aren't stupid. We designed x86inc to be like that for good reasons, from a clean slate, and if we didn't like it we would have done something else.

You haven't tested the alternatives - they're slow and don't work in this situation, mostly because C is not actually that low level when it comes to memory aliasing.

JonChesterfield · 2024-03-26T20:05:30 1711483530

Well that's much more interesting. Is there anything written publicly about the experience? Any tooling used to help get the implementation right beyond testing and writing it carefully?

I've found a little here https://ffmpeg.org/developer.html#SIMD_002fDSP-1

The context is I'm a compiler developer who really liked working side by side with old school assembly developers in a past role. I'm painfully aware that the tribal knowledge of building stuff out of asm is hard to find written down and always curious about the directions in which things like C can be extended to narrow the gap.

janwas · 2024-03-26T10:39:21 1711449561

FWIW I have been writing SIMD since 20+ years and worked on JPEG XL, which also contains a good bit of vector code.

BTW one anecdote: a colleague mentioned what should have been a quick 20 min patch to ffmpeg took a day because it was written in assembly.

kierank · 2024-03-26T11:19:54 1711451994

That "20 minute patch" will need to be maintained for decades to come in FFmpeg, long after a standalone JPEG-XL library. Potentially centuries as archives like the Library of Congress are storing FFmpeg. So that's why it's done in assembly, so it's maintainable with the rest of the code.

Glacia · 2024-03-26T11:13:10 1711451590

Here is the old archived blog post where x264 team answered in the comments why they do it that way. https://web.archive.org/web/20091223024333/http://x264dev.mu...

exDM69 · 2024-03-26T11:18:47 1711451927

This is dated to 2009. Probably sound advise back then.

Compilers are much much better with SIMD code than they were then. Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code (edit: when given SIMD code as input, see comment below).

I happen to know because this "hacker news bro" has been dealing with SIMD code for longer than that.

kierank · 2024-03-26T11:21:14 1711452074

FFmpeg code by definition is not "basic SIMD code". And it supports numerous other compilers other than LLVM.

jbk · 2024-03-26T11:36:43 1711453003

> Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code.

On dav1d, we see just a 800% increase… I know it’s negligible, but…

exDM69 · 2024-03-26T11:45:40 1711453540

Compared to what? Scalar loopy C code sure. The auto vectorization is not great.

But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.

Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.

And you get all of this for all CPU architectures and will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.

With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.

rbultje · 2024-03-26T12:33:33 1711456413

Blog author (and dav1d/ffmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked above. I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d. I analyzed the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for different types of things a compiler does (data loads, stack spills, constant loading, actual multiply/add math, addition of result to predictor, etc.). Conclusion: modern compilers still can't do what we can do by hand, the difference is up to 2x - this is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules if it's helpful, which a compiler cannot do. Is this hard? Yes. But at some scale, this is worth it. dav1d/FFmpeg are examples of such scale.

anonymoushn · 2024-03-26T14:15:13 1711462513

I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.

Instruction selection as you said is mostly absent. Compilers will not substitute or for blend or shift for shuffle even in cases where they are trivially equivalent, so the programmer has to know what execution ports are available anyway =/

jbk · 2024-03-26T11:52:43 1711453963

> Scalar loopy C code sure. The auto vectorization is not great.

Stop considering people as idiots.

People do that because it’s a LOT faster, not just a bit.

If you are so able, please show us your results. Dav1d is full open source, fully documented, and with quite simple C code.

Show your results.

Const-me · 2024-03-26T15:25:39 1711466739

> show us your results

Not GP but here’s an example where intrinsics outperformed assembly by an order of magnitude: https://news.ycombinator.com/item?id=36624240

They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly gonna substantially improve performance of my C++. The compiler did a decent job allocating these vector registers and the assembly code is not too bad, not much to improve.

It’s interesting how close your 800% to my 1000%. For this reason, I have a suspicion you tested the opposite, naïve C or C++ versus SIMD assembly. Or maybe you have tested automatically vectorized C or C++ code, automatic vectorizers often fail to deliver anything good.

Glacia · 2024-03-26T15:40:52 1711467652

So you took asm code that had no SIMD instructions in it, made your own version in c++ with intrinsics and figured out that, yes, SIMD is faster? Realy?

I think you're completely missing what are we talking about here.

Const-me · 2024-03-26T16:28:31 1711470511

> Realy?

No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.

I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their codes. However, writing assembly is not the best idea anymore.

Assembly is unreliable due to OS-specific shenanigans, result in bugs like that one: https://issues.chromium.org/issues/40185629

Assembly complicates builds because inline assembly is not available in all compilers, and for non-inline assembly every project uses a different version: YASM, NASM, MASM, etc.

jbk · 2024-03-26T18:20:39 1711477239

> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs

People in this thread, writing video codecs for years that you use daily tell you that, no, it’s a lot faster (10-20%), but you, who have done none of those, know better…

Const-me · 2024-03-26T18:57:44 1711479464

> writing video codecs for years

These people aren’t the only ones writing performance-critical SIMD code. I’ve been doing that for more than a decade now, even wrote articles on the subject like http://const.me/articles/simd/simd.pdf

> that you use daily

The video codecs I use daily are mostly implemented in hardware, not software.

> it’s a lot faster (10-20%)

Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.

BTW, it’s often possible to rework source code and/or adjust compiler options to improve performance of the machine code generated from SIMD intrinsics, diminishing these 10-20% to something like 1-2%.

Optimizations like that are obviously less reliable than using assembly, also relatively tricky to implement because compilers don’t expose enough knobs for low-level things like register allocation.

However, the approach still takes much less time than writing assembly. And it’s often good enough for many practical applications. Examples of these applications include Windows software shipped in binaries, and HPC or embedded where you can rely not just on a specific compiler version, but even on specific processor revision and OS build.

jbk · 2024-03-26T20:32:40 1711485160

> Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.

You cheery pick my comments and cannot be bothered reading.

We’re talking against fully optimized-autovec-all-llvm-options vs hand written asm. And yes, 800% is likely.

The 20% is intrinsics vs hand written.

> The video codecs I use daily are mostly implemented in hardware, not software.

Weirdly, I know a bit more about the transcoding pipelines of the video industry that you do. And it’s far from hardware decoding and encoding over there…

You know nothing about the subject you are talking about.

kierank · 2024-03-26T20:06:22 1711483582

v210_planar_pack_8_c: 2298.5

v210_planar_pack_8_ssse3: 402.5

v210_planar_pack_8_avx: 413.0

v210_planar_pack_8_avx2: 206.0

v210_planar_pack_8_avx512: 193.0

v210_planar_pack_8_avx512icl: 100.0

23x speedup. The compiler isn't going to come up with some of the trickery to make this function 23x faster.

800% is nothing.

Const-me · 2024-03-26T20:45:08 1711485908

You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.

Based on the performance numbers, whoever was writing that test neglected to implement manual vectorization for the C version. Which is the only reason why assembly is 23x faster for that test. If they rework their C version with the focus on performance i.e. using SIMD intrinsics, pretty sure the performance difference between C and assembly versions gonna be very unimpressive, like couple percent.

astrange · 2024-03-27T01:48:33 1711504113

The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.

The other ones aren't in asm merely because video codecs are "performance sensitive", it's because they're run in such specific contexts that optimizations work that can't be expressed portably in C+intrinsics across the supported compilers and OSes.

Const-me · 2024-03-27T13:36:01 1711546561

Yeah, it’s clear why you can’t have a single optimized C version.

However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?

SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.