> Scalar loopy C code sure. The auto vectorization is not great. Stop considerin...

Const-me · 2024-03-26T15:25:39 1711466739

> show us your results

Not GP but here’s an example where intrinsics outperformed assembly by an order of magnitude: https://news.ycombinator.com/item?id=36624240

They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly gonna substantially improve performance of my C++. The compiler did a decent job allocating these vector registers and the assembly code is not too bad, not much to improve.

It’s interesting how close your 800% to my 1000%. For this reason, I have a suspicion you tested the opposite, naïve C or C++ versus SIMD assembly. Or maybe you have tested automatically vectorized C or C++ code, automatic vectorizers often fail to deliver anything good.

Glacia · 2024-03-26T15:40:52 1711467652

So you took asm code that had no SIMD instructions in it, made your own version in c++ with intrinsics and figured out that, yes, SIMD is faster? Realy?

I think you're completely missing what are we talking about here.

Const-me · 2024-03-26T16:28:31 1711470511

> Realy?

No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.

I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their codes. However, writing assembly is not the best idea anymore.

Assembly is unreliable due to OS-specific shenanigans, result in bugs like that one: https://issues.chromium.org/issues/40185629

Assembly complicates builds because inline assembly is not available in all compilers, and for non-inline assembly every project uses a different version: YASM, NASM, MASM, etc.

jbk · 2024-03-26T18:20:39 1711477239

> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs

People in this thread, writing video codecs for years that you use daily tell you that, no, it’s a lot faster (10-20%), but you, who have done none of those, know better…

Const-me · 2024-03-26T18:57:44 1711479464

> writing video codecs for years

These people aren’t the only ones writing performance-critical SIMD code. I’ve been doing that for more than a decade now, even wrote articles on the subject like http://const.me/articles/simd/simd.pdf

> that you use daily

The video codecs I use daily are mostly implemented in hardware, not software.

> it’s a lot faster (10-20%)

Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.

BTW, it’s often possible to rework source code and/or adjust compiler options to improve performance of the machine code generated from SIMD intrinsics, diminishing these 10-20% to something like 1-2%.

Optimizations like that are obviously less reliable than using assembly, also relatively tricky to implement because compilers don’t expose enough knobs for low-level things like register allocation.

However, the approach still takes much less time than writing assembly. And it’s often good enough for many practical applications. Examples of these applications include Windows software shipped in binaries, and HPC or embedded where you can rely not just on a specific compiler version, but even on specific processor revision and OS build.

jbk · 2024-03-26T20:32:40 1711485160

> Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.

You cheery pick my comments and cannot be bothered reading.

We’re talking against fully optimized-autovec-all-llvm-options vs hand written asm. And yes, 800% is likely.

The 20% is intrinsics vs hand written.

> The video codecs I use daily are mostly implemented in hardware, not software.

Weirdly, I know a bit more about the transcoding pipelines of the video industry that you do. And it’s far from hardware decoding and encoding over there…

You know nothing about the subject you are talking about.

kierank · 2024-03-26T20:06:22 1711483582

v210_planar_pack_8_c: 2298.5

v210_planar_pack_8_ssse3: 402.5

v210_planar_pack_8_avx: 413.0

v210_planar_pack_8_avx2: 206.0

v210_planar_pack_8_avx512: 193.0

v210_planar_pack_8_avx512icl: 100.0

23x speedup. The compiler isn't going to come up with some of the trickery to make this function 23x faster.

800% is nothing.

Const-me · 2024-03-26T20:45:08 1711485908

You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.

Based on the performance numbers, whoever was writing that test neglected to implement manual vectorization for the C version. Which is the only reason why assembly is 23x faster for that test. If they rework their C version with the focus on performance i.e. using SIMD intrinsics, pretty sure the performance difference between C and assembly versions gonna be very unimpressive, like couple percent.

astrange · 2024-03-27T01:48:33 1711504113

The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.

The other ones aren't in asm merely because video codecs are "performance sensitive", it's because they're run in such specific contexts that optimizations work that can't be expressed portably in C+intrinsics across the supported compilers and OSes.

Const-me · 2024-03-27T13:36:01 1711546561

Yeah, it’s clear why you can’t have a single optimized C version.

However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?

SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.