Faster CRC32-C on x86

dragontamer · on Aug 1, 2022

In short: there are three 32-bit crc32 pipelines on modern Intel CPUs, but also a clmul (carry less multiplication) instruction in AVX on a separate pipeline.

clmul was designed as a more general crc32 accelerator for the SIMD instructions set. (It is sufficiently general to also do Reed Solomon and Elliptical Curves. ARM has an equivalent pmul instruction if you are curious).

The traditional methods are to either use crc32 instruction, or the CLMUL instruction.

This blog post uses both instructions for maximum speed. Modern processors can execute different pipelines in parallel. So by placing CLMUL and crc32 instructions next to each other, you get the parallel execution with high efficiency.

It is tricky to calculate crc32 in parallel using two different instructions / methodologies interleaved. But this blog post accomplishes that.

corsix · on Aug 1, 2022

The math is fiddly to get right, but (as the author) I'd suggest that the disadvantage is very tight coupling to the CPU implementation: the interleaving is based on the relative speeds of the two methodologies, so if the relative speeds of the two methodologies drastically changes on a future CPU implementation, this _particular_ interleaving could end up _slower_ than either methodology on its own.

dragontamer · on Aug 1, 2022

Indeed.

My personal thoughts is that we should design a CPU where these kinds of pipelines / executions are more explicit, and then write magic compilers that can pull parallelism out of our programs to be in the more explicit parallelism form that this new CPU would prefer. You'd still be tied to an architecture, but moving to a new architecture (ie: 2x SIMD pipelines in the future) would be as easy as recompiling, in theory.

Then I realized that I've reinvented VLIW / Intel Itanium. And that's a silly, silly place and we probably shouldn't go there again :-p

--------

The MIMD (multiple-instruction multiple data) abilities of modern CPUs are quite amazing in any case, and its always fun to take advantage of it. Even with a singular instruction stream like in this example, it is obvious that modern CPUs have gross parallelism at the instruction level.

Its a bit of a shame that these high-performance toys we write are kind of unsustainable... requiring in depth assembly knowledge and microarchitecture-specific concepts to optimize (that often become obsolete as these designs inevitably change every 5 years or so). Then again, its probably a good idea to practice writing code at this level to remind us that the modern CPU is in fact a machine with defined performance characteristics that we can take advantage of...

eklitzke · on Aug 1, 2022

A sufficiently smart compiler can obviously do any optimization that we can think of, but in practice the kind of optimization described in this article is very difficult to actually implement in an optimizing compiler. The CRC algorithm literally operates one bit at a time, in sequential order. As the article mentions, there are some properties of how the CRC math work that allow you to calculate the CRC from multiple chunks in a stream and then combine them in a specific way. This is the kind of thing that an optimizing compiler just isn't going to be able to figure out on its own given a naive CRC implementation because it's a math insight much moreso than it is a code generation trick.

In principle you could probably use some kind of symbolic math solving library to try to detect optimization opportunities like this in general code. In practice it just wouldn't be worth it, even at -O3, because it would add a ton of CPU and memory overhead to compilation and the optimization would be very rarely applicable in the first place.

dragontamer · on Aug 1, 2022

This "run two loops in parallel" pattern is an _incredibly_ common pattern for these "high speed benchmarks". A whole lot of different programs in the high-speed space (this CRC32, the lookup3 Jenkins Hash, My own AES random number generator, etc. etc.).

Instead of the programmer manually thinking of the "two independent pipelines", its quite possible that we can imagine a language where the two pipelines were programmed separately, and then a compiler merges them together knowing about the pipeline details (ex: Skylake is 1-per-clock throughput / 3-clock latency, AMD might be different like 1-per-clock throughput / 4-clock latency or something).

The programmer's job would be to "separate out the two independent threads of execution", while the compiler's job is to "merge them back together into one instruction stream".

Much like how SIMD-code is written today, the programmer is responsible for finding the parallelism. The compiler / machine is responsible for efficient execution.

--------

As it is, today's high speed programmers have to do both tasks simultaneously. We have a mental model of the internal registers, pipelines, throughput, latencies of different execution units, and manually schedule them to match our mental model. (But that mental model changes as new CPUs come out every few years).

The hard part is figuring out the parallelism. I don't think we have a language that describes this fine-grained parallelism though, in any case. Just a "what if we lived in a magically perfect world" kinda hypothetical here...

EDIT: Alternatively, you can "cut dependencies" and hope that the compiler discovers your (intended) low dependency chain. IE: Manually unroll loops and whatever, which works better in practice than you might think (and such manual unrolling often seems to trigger the AVX autovectorizer in my experience, if you're lucky).

retrac · on Aug 1, 2022

> a CPU where these kinds of pipelines / executions are more explicit

A VLIW-style architecture is found on signal processors and other dedicated number-crunchers, where people are actually going to write custom assembly routines fully exploiting the hardware. For example, the original NEC uPD7720 from 1980 has separate fields in its instruction word. You get two ALU operations (where the operands don't depend on each other) and an address operation per instruction, all performed in parallel in one clock cycle.

Intel would later take that idea for the i860 (long before Itanium) and the i860 can run in a mode where it always fetches two instructions per cycle, one integer and one floating point. Code generation with a compiler for this turned out to be very difficult to exploit.

Another ten years later, compiler technology has improved a lot, right? Maybe it'll work now. Cue Itanium. Ah. Maybe not. The idea is probably going to be revisited every decade or so until compiler technology has actually advanced enough to pull it off :P

323 · on Aug 1, 2022

Wasn't this the Gentoo philosophy?

You downloaded the system source code and recompiled everything targeted to your exact CPU specs.

dragontamer · on Aug 1, 2022

Well, its a bit different when you design a CPU around the Gentoo philosophy as opposed to an OS around the Gentoo philosophy.

When a CPU is designed around that, everyone needs to recompile each generation for maximum performance. The theory is that bytecode like Java recompiles efficiently though (or perhaps NVidia PTX bytecode, a SIMD bytecode that recompiles each NVidia GPU generation for a higher performance example with more support from the hardware).

------

Intel Itanium was a 200x era design that was supposed to be like this, but x86-64 from AMD ended up being faster in practice.

NVidia's PTX really changed things, as well as the habit of GPU programmers for writing very, very small "programs" (called kernels) that's managed by a separate chip (IE: cpu manages the kernels, compiles / calls them as appropriate, etc. etc.). It works out in GPU land, but maybe will never work in CPU land (unless CPU-land picks up upon this kernel-invoke abstraction? Intel ispc for example has the model, as does OpenMP target offload... and thread-pools and Go arguably have it as well)

R0b0t1 · on Aug 1, 2022

A CPU built around the Gentoo philosophy would look like https://github.com/SpinalHDL/VexRiscv ;). Don't want an MMU? Fine. Need a larger RAM interface? You got it. Barrel ALU for DSP? Sure.

Interpreted languages work by consolidating all of the optimization effort in the interpreter. This is similar to how CPUs work now, instead of extremely specific optimizations that are hard to create distributed among all code we use very general optimizations that push the limits of mathematics that is centralized in a CPU.

-----

Itanium had a lot of contemporary issues that made it not work. I would certainly blame Intel's business practices and reputation for a large part of it. There are likely niches for such processors. The VLIW is useful for DSP or graphics. Indeed, the only extant VLIW (that I know of) processor is the Russian Elbrus. I think the VLIW is only included to let them reuse a lot of the core logic of the CPU to drive a DSP engine, useful for radar and scientific simulation, though the sci sim would probably use commercial hardware which would be faster.

It works on GPUs because they're doing DSP, basically. We could have weirder topologies for GPUs however, like a massive string of ALUs driven off an embedded core, so you try to kachunk all your data in a single clock domain after configuring the ALU string.

Dylan16807 · on Aug 1, 2022

If you designed the instruction set around making parallelism and pipelines fast for the CPU to parse, in a way that encourages and accelerates out-of-order scheduling, you'd have some interesting potential in something that is very different from Itanium.

dragontamer · on Aug 1, 2022

There's some very interesting flags at the lowest level of NVidia SASS (machine code) that the PTX compiler adds.

https://arxiv.org/pdf/1804.06826.pdf

PDF Page 14 (physical page 12) shows that every instruction on NVidia Volta SASS has 4-bits of "Reuse flags", 6-bits of "wait barrier mask", 3-bits of "Read Barrier Index", 3-bits of "Write Barrier Index", 1-bit "Yield Flag" and 4-bits "Stall Cycles".

> Wait barrier mask; Read/Write barrier index. While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. Volta, Pascal and Maxwell use dependency barriers to track the completion of variable-latency instructions and resolve data hazards. When a variablelatency instruction writes to a register, the assembler associates it to one of the 6 available barriers by setting the corresponding write barrier number field. When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that barrier in the wait barrier mask. The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the wait barrier mask is a bitmask, not an index.

--------

It seems like NVidia has invented something very interesting here. I don't quite understand it myself, but its quite possibly what you're talking about.

__EVERY__ instruction, on NVidia machines Volta and newer (and one control-every 3 instructions on Pascal, and 7 on older Kepler systems). So these "bundles" could be seen as "better-Itanium" back in Kepler/Pascal, but perhaps the modern NVidia GPU core is fast enough to have such pre-compiled dynamic behavior / barrier interpretation every instruction these days.

It seems like a lot of instruction-bandwidth to eat up though, since each instruction on Volta is 128-bits / 16-bytes long since there's so much control information

hansvm · on Aug 3, 2022

This sort of thinking can unlock up to roughly a factor of two in a lot of architectures and a lot of operations (depending on cache friendliness, instruction decoding, and other such things that might interfere with apparent wins).

Classic example: Intel Haswell chips can only do addition on one of their execution units, so you double your throughput by doing a fused multiply-add on the other with a multiplicand of 1.

Classic example: For large enough matrix multiplications you can load your data in such a way that the problem really is CPU-bound, even if you have to load from disk. Double up your data size with an integer representation of the matrix and do some integer-backed emulated floating point instructions for a chunk of the matrix and floating point for the other. You reduce the factor in the O(n^(2.7..3)) and add some extra O(n^2) work and some extra space (assuming you don't waste too many instructions on full IEEE compliance and don't need it for that application).

And so on. It's a fun trick, and often the compiler isn't able to execute it effectively. The same idea applies with all but the most trivial loops; until recently (last 10 years or so?), gcc wouldn't interleave cumulative sums to reduce the data dependencies (just keep 4 running totals instead of 1, skip by 4 each loop, and merge the sums and any stragglers at the end -- distinct from loop unrolling in that the major gain is from better pipelining rather than just reduced loop overhead -- it still works with vectorized accumulators). Not that it matters if you're IO-bound (which you often are on that sort of problem), but for mediumish datasets the idea still applies.

323 · on Aug 1, 2022

This raises the question: how many 32-bit crc32 pipelines do modern AMD CPUs have?

corsix · on Aug 1, 2022

https://uops.info/html-instr/CRC32_R64_R64.html answers that for you. Zen2 and Zen3 same as Intel: latency 3, throughput 1. Older AMD chips less good.

hrgiger · on Aug 1, 2022

Also amd avx2 machines faster than intel one but the current benchmarks probably didnt have hardware on hand, googles highway numbers also intel based, I wanted to test also, i got machine but having laziest days :)

323 · on Aug 1, 2022

But the question was how many execution ports it has, not what's the latency/throughput of one port.

corsix · on Aug 1, 2022

Given latency 3 / throughput 1, the only reasonable implementations are: A) Three ports, each non-pipelined, taking 3 cycles B) One port, with three-cycle pipeline (each cycle, one instruction can enter the start of the pipeline, and anything in-progress moves forward one stage) Given that CRC isn't too hard to pipeline, and (B) requires less physical hardware, it is almost certainly (B).

MrFuchs · on Aug 1, 2022

From the port usage reported at https://uops.info/html-instr/CRC32_R64_R64.html, we can conclude (B) for the Intel microarchitectures. For AMD it's not entirely obvious, but I agree (B) appears more likely.

nynx · on Aug 1, 2022

I think it’s that there’s one crc32 pipeline, but it can have up to three ops running through it at once, at different stages. I could be misreading the article though!

MrFuchs · on Aug 2, 2022

Indeed, only one port supports the corresponding instruction on Intel CPUs.

jscipione · on Aug 1, 2022

Is there a license associated with this code that I could use to include in an MIT licensed open source project? There are a couple of file system drivers that could use a faster crc32 check.