Faster CRC32 on the Apple M1 (dougallj.wordpress.com)
343 points by panic on May 22, 2022 | 96 comments



I really like that the author gave some context at the top. So many times I struggle to read or realize the importance of a concept because there just isn't enough context for me to follow along.

And certainly not all blogs have to, but it's nice when they do.


So ARM64 has dedicated instructions for CRC32, but implementing it by hand using SIMD is still faster. Score another point for RISC.


No, the faster implementation uses another dedicated instruction, which happens to be more general than CRC32, i.e. the multiplication of polynomials having binary coefficients.

So this has little to do with RISC, except the general principle that the instructions that are used more frequently should be implemented to be faster, a principle that has been used by the M1 designers and by any other competent CPU designers.

In this case, ARM added the polynomial multiplication instruction a few years after Intel, with the same main purpose of accelerating authenticated encryption with AES. There is little doubt that ARM was inspired by the Intel Westmere new instructions (announced by Intel a few years before the Westmere launch in 2010).

The dedicated CRC32 instruction could have been made much faster, but the designers of the M1 core did not believe that this was worthwhile, because that instruction is not used often.

The polynomial multiplication is used by many more applications, because it can implement CRC computations based on any polynomial, not only the one specified for CRC32, and it can also be used in a great number of other algorithms that are based on the properties of the fields whose elements are polynomials with binary coefficients.

So it made sense to have a better implementation for the polynomial multiplication, which allows greater speeds in many algorithms, including the CRC computation.
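For readers unfamiliar with the operation, here is a minimal scalar C sketch of what that instruction (PMULL on ARM, PCLMULQDQ on x86) computes: the carry-less product of two polynomials with binary coefficients, where addition is XOR so nothing ever carries between bit positions. This only illustrates the semantics, not how the hardware implements it.

    #include <stdint.h>

    /* Carry-less multiply of two 64-bit polynomials over GF(2).
     * The 128-bit product is returned in out[0] (low) and out[1] (high). */
    static void clmul64(uint64_t a, uint64_t b, uint64_t out[2]) {
        uint64_t lo = 0, hi = 0;
        for (int i = 0; i < 64; i++) {
            if (b & (1ULL << i)) {
                lo ^= a << i;                 /* XOR instead of add: no carries */
                if (i) hi ^= a >> (64 - i);   /* bits that spill past bit 63 */
            }
        }
        out[0] = lo;
        out[1] = hi;
    }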


Amusingly the Rosetta runtime uses crc32x


That sounds like a point for RISC to me?

Ok maybe it is just a point against really complex instructions. There's clearly an optimum middle ground.


Mary Payne designed the VAX floating point POLY instruction at DEC.

But the microcode and hardware floating point implementations did it slightly differently. Then the MicroVAX dropped it, then picked it up again but wrong, then fixed it, then lost it again.

http://simh.trailing-edge.com/docs/vax_poly.pdf

https://documentation.help/VAX11/op_POLY.htm

https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...

>Multiply–accumulate operation

>The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.

https://news.ycombinator.com/item?id=20558618

>VAX was a crazy town ISA too, with stuff like single instruction polynomial evaluation.
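For context, a small C sketch of Horner's rule, the evaluation scheme POLY packed into one instruction (the real VAX instruction's coefficient-table layout and rounding behaviour are glossed over here):

    #include <stddef.h>

    /* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule:
     * one multiply-add step per coefficient, highest degree first. */
    static double poly_eval(double x, const double *c, size_t n) {
        double r = 0.0;
        while (n--)
            r = r * x + c[n];
        return r;
    }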


Complex instructions are often good ideas. They’re best at combining lots of bitshifting (hardware is good at that and it can factor things out) but even for memory ops it can be good (the HW can optimize them by knowing things like cache line sizes).

They get a bad rap because the only really complex ISA left is x86 and it just had especially bad ideas about which operations to use its shortest codes on. Nobody uses BOUND to the point some CPUs don’t even include it.

One point against them in SIMD is there definitely is an instruction explosion there, but I haven't seen a convincing better idea, and I think the RISC-V people's vector proposal is bad and shows they have serious knowing-what-they're-talking-about issues.


what's wrong with the riscv approach?


They want to go back to the 70s (not meant as an insult) and use Cray-style vector instructions instead of SIMD. Vector instructions are kind of like loops in one instruction; you set a vector length register and then all the vector instructions change to run on that much data.

That’s ok for big vectors you’d find in like scientific computations, but I believe it’s bad for anything I’ve ever written in SIMD. Games and multimedia usually use short vectors (say 4 ints) or mixed lengths (2 and 4 at once) and are pretty comfortable with how x86/ARM/PPC work.

Not saying it couldn’t be better, but the RISCV designers wrote an article about how their approach would be better basically entirely because they thought SIMD adds too many new instructions and isn’t aesthetically pretty enough. Which doesn’t matter.

Also I remember them calling SIMD something weird and politically incorrect in the article but can’t remember what it was…


Isn't there a sort of obvious "best of both worlds" by having a vector instruction ISA with a vector length register, plus the promise that a length >= 4 is always supported and good support for intra-group-of-4 shuffles?

Then you can do whatever you did for games and multimedia in the past, except that you can process N samples/pixels/vectors/whatever at once, where N = vector length / 4, and your code can automatically make use of chips that allow longer vectors without requiring a recompile.

Mind you, I don't know if that's the direction that the RISC-V people are taking. But it seems like a pretty obvious thing to do.


In my opinion "the best" would be to support only one, fixed, largest vector size and emulate legacy instructions for smaller sizes using microcode (this is slow, but requires minimum transistor count). There is no need to optimize for legacy software; instead it should be recompiled, and the compiler should generate versions of the code for all existing generations of CPUs.

This way the CPU design can be simplified and all the complexity moved into compilers.


The entire Wintel monopoly has been built on people NOT recompiling old software. They still want it to run as fast as currently possible.


That would be awful for power consumption and only works well if everyone compiles everything from source. Otherwise every binary is awful performance for almost everyone.


> everyone compiles everything from source

One thing I learned from the M1 transition is that people will do it if someone tells them to. I bought a Mac this weekend to do exactly that; lots of users complaining about lack of M1 support. Time to add it (in a way that I can test). I have no choice.


M1 has a very good x86 emulator for the moment. The users probably underestimate how good it is.

But GPU programs get compiled from source at runtime all the time, and sort of are all about vectors. (M1’s GPU doesn’t actually have vectors.)


Intel tried that (Itanium, with a variation of VLIW called EPIC). Didn't go so well. It was more like Epic Fail.

They poured a lot of money into the compiler. It didn't work out.

Again and again, complexity is proven cancerous. RISC is the way to go.


Itanium tried to move branch prediction from the CPU to the compiler. Plus its failure was not necessarily related to CPU design; it could well have been bad business execution.

The grandparent comment is about a compiler generating multiple code copies for different CPU architecture iterations, so that later CPUs don't have to support all legacy instructions, or can at least implement them via microcode emulation.


> and all the complexity moved into compilers.

Do not want. Especially seeing how hostile to programmers they've become.


The RISC-V design is the sensible one since it does not hardcode the vector register size; the other designs are idiotic since a new instruction set needs to be created every time the vector register size is increased (MMX->SSE2->AVX2->AVX-512) and you can't use shorter vector registers on lower-power CPUs without reducing application compatibility.

And it doesn't "loop" in one instruction, the vsetvl instruction will set the vector length to the _minimum_ of the vector register size and the amount of data left to process.
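As a rough illustration of those semantics, here is a strip-mining loop in plain C that mirrors what a vsetvl-based loop does; VLMAX and setvl() are stand-ins, real code would use the vsetvli instruction or the RVV intrinsics:

    #include <stddef.h>

    enum { VLMAX = 8 };  /* hypothetical hardware vector length, in elements */

    /* What vsetvl returns: min(remaining elements, hardware vector length). */
    static size_t setvl(size_t remaining) {
        return remaining < VLMAX ? remaining : VLMAX;
    }

    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
        while (n > 0) {
            size_t vl = setvl(n);            /* ~ vsetvli t0, a0, e32, m1 */
            for (size_t i = 0; i < vl; i++)  /* one vector instruction's worth */
                dst[i] = a[i] + b[i];
            dst += vl; a += vl; b += vl; n -= vl;
        }
    }

The same binary handles the tail automatically (the last trip just gets a smaller vl) and keeps working unchanged on hardware with larger vector registers.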


I have read about ideas to make forward-compatible vector instructions, but personally I don't see much value in such features.

If there appears a new CPU with a different vector size then you can just recompile the program and let the compiler use the new instructions. Therefore it doesn't make sense to sacrifice chip area or performance for this feature.

Regarding backward compatibility (running old software on a next generation CPU), I think that newer CPUs do not need to support obsolete instructions from previous generations. Instead they could just include a decoder that would (using microcode and therefore slowly) emulate old instructions. This allows users to run their package manager or recompile their software and switch to the new instructions. It is totally fine to completely redesign the ISA provided that there is a minimal converter for old instructions or at least support for a software-based translator like qemu.


How do you handle LMUL>1 without "looping" across the multiple destination registers?


Well, for example the Alibaba T-Head Xuantie C910 has 128 bit vector registers but a 256 bit vector ALU, so both LMUL=1 and LMUL=2 instructions execute in a single clock cycle. LMUL=8 instructions take 4 cycles to execute, but that's a 1024 bit vector right there.

Also, the person who said graphics uses 3 or 4 element vectors so you only need short vector registers didn't understand how a vector processor such as RISC-V will be programmed for that task. Usually you have thousands of XYZ or XYZW points that you want to transform by the same matrix or normalize or whatever. You don't load one point into one vector register. You load all the X coordinates from 4 or 8 or 64 or 256 points into one vector register, all the Y coordinates into another vector register, all the Z coordinates into another one. And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.


> And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.

What if you have a 4-vector and want its dot product? x86 SIMD has that. (It’s slow on AMD but was okay for me.)

Vector instructions could add that too, because they can add whatever they want. I’m just suspicious it would actually be efficient in hardware to be worth using.

ARM is developing SVE but is allowing hardware to require only powers of 2 vector sizes IIRC, which surely limits how you can use it.


The vector registers are powers of two in size, but both ARM SVE and RISC-V V allow you to conveniently process any vector shorter than the register size -- by setting the VL register in RISC-V or by using a vector mask in SVE (RISC-V also does vector element masking) -- or longer than the vector register (by looping, with the last loop possibly being smaller than the vector size)

"What if you have a 4-vector and want its dot product?"

Dot product with what?

Of course on RISC-V V you can load two 4-vectors into two vector registers (perhaps wasting most of the register if it has 256 or 512 or 1024 or more bits), multiply corresponding elements, then do an "add reduce" on the products.

But that's not very interesting. Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors? Load the 0th elements of 4 or 8 or 64 etc into register v0, the 1st elements into v1, 2nd into v2, 3rd into v3. Then the same for the 4-vectors you want to dot product them with into v4, v5, v6, v7. Then do v0 = v0*v4; v0 += v1*v5; v0 += v2*v6; v0 += v3*v7. And then store all the results from v0.

People used to short SIMD registers look at vector problems along the wrong axis.
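A plain-C sketch of that structure-of-arrays layout (hypothetical function name, no intrinsics); the loop body is exactly the per-element work that the v0 = v0*v4; v0 += v1*v5; ... sequence above performs across a whole register's worth of points at once:

    #include <stddef.h>

    /* Dot products of n pairs of 4-vectors, stored one component per array. */
    void dot4_batch(const float *x0, const float *y0, const float *z0, const float *w0,
                    const float *x1, const float *y1, const float *z1, const float *w1,
                    float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = x0[i]*x1[i] + y0[i]*y1[i] + z0[i]*z1[i] + w0[i]*w1[i];
    }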


> Dot product with what?

Meant to say “two 4-vectors”.

> Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors?

Unfortunately not. I was thinking of a raytracer there, and it doesn’t have any more data available. I could speculate some more data or rewrite the program entirely to get some more, but the OoO and multi-core CPU is a good fit for the simplest pixel at a time approach to raytracing.

In the other case I’d use SIMD for, video codecs, there is definitely not any more data available because it’s a decompression algorithm and so it’s maximally unpredictable what the next compressed bit is going to tell you to do.


Xuantie 910 has 2x128b ALUs, which is kinda my point that the decoding of RISC-V vector instructions to uops is more complex than it needs to be, because it depends on the value of user-settable registers. XT-910 also highlights that with their speculation of vsetvl, something no other vector ISA has to do to be performant.


While I agree that the RISC-V V extension is questionable, it does not necessarily make decoding into uops more complex. When the vector register is allocated there is a known vector register size, and we then have 3 ways to go:

1) Producing several uops for each cut in the vector register;

2) Producing a single uop with a size field; the vector unit knows how to split the operation;

3) Producing a single size-agnostic uop; the vector unit knows the size from the vector register metadata.

The worst is probably 1): it will stress the ROB, uop queue and the scheduler, and make the decoding way more complex. But solutions 2) and 3) keep the decoding simple and are much more resource efficient.

An implementation aspect that may be complicated is the vector register allocation, depending on the vector-unit microarchitecture (more specifically, on the vector register file microarchitecture).


To me, the much bigger problem with SIMD is that it is really hard to program for since vector sizes keep getting bigger, and it really doesn't look good at all for big-little designs which seem to be the future. Most compilers still aren't good at generating AVX-512 code, and very few applications use it because it requires optimization for a very small portion of the market. With a variable length vector, everyone gets good code.

Also, I'm not clear why you think riscv style vector instructions will perform worse on 2-4 length vectors.


> Also, I'm not clear why you think riscv style vector instructions will perform worse on 2-4 length vectors.

That assumes there's no switching cost to changing it and there won't be restrictions on the values it can be set to. Weird restrictions on memory loads are pretty traditional for SIMD instructions after all.

The switching cost is the biggest problem; this is exactly the kind of thing that x86 has specified, then found they can’t do without microcoding it or otherwise making it cripplingly inefficient on different HW generations.


> the much bigger problem with SIMD is that it is really hard to program for since vector sizes keep getting bigger

This can be fixed with an abstraction layer such as github.com/google/highway (disclosure: I am the main author). The programming model is variable-length vectors and we can map that both to SSE4/AVX2/AVX-512 and SVE/RVV. It seems to me that the actual pain (of large numbers of unique instructions in the ISA) is pushed to the CPU decoder, binutils, and abstraction layer.

> doesn't look good at all for big-little designs which seem to be the future

Why not? It's not clear to me why OSes shouldn't provide a "don't migrate me while I'm checking CPU capabilities and then running this SIMD/vector kernel".

> very few applications use it [AVX-512] because it requires optimization for a very small portion of the market.

Maybe so (though increasing), but the cost is also low - we literally just pay with binary size and compile time (for the extra codepath).


Compilers and languages have had a very long time to work on autovectorizing existing code for SIMD (54 years since ILLIAC, 25+ since SPARC VIS and MMX from Intel in 1994/1997). It mostly doesn't work for C/C++. The language has to support it; otherwise optimization opportunities will be constantly blocked by lack of information and/or the restrictively specified semantics of operations.

The potential of transparent SIMD use by compilers has apparently stayed marginal enough that no mainstream application languages in all that time have updated towards facilitating it.


I am not taking any hard stance on the usefulness of the specialized instruction, but M1 is a very wide and powerful core, so this won't be true everywhere.

The single instruction might also be more power-efficient and keep other resources free for other stuff.


Specifically, the M1's NEON ALUs can handle 64 bytes per cycle on the big cores. Most Cortex can only do 32 bytes per cycle (Cortex X1/X2 is the exception), and the lower end designs are only 16B/cycle. Plus the pmull+eor fusion on M1 increases effective throughput by 33%, and I don't know if that's implemented on any Cortex.

Without those, the NEON version would fall to the same throughput as one CRC32X per cycle, or half the throughput on cores with 2x64bit ALUs.


Also not taking hard stances, but both cases are suspect. Power efficiency, because being 3 times faster means you're done 3 times earlier. Keeping other resources free, because I suspect a CRC calculation is generally followed by an `if eq` statement. (Even with out-of-order or speculative execution this creates a bottleneck that is nice to remove 3x faster.)


If you’re writing optimized code, hardly ever would you evaluate one CRC check at a time. You would process them in chunks, as the OP stated, but would just let a compiler do the auto-vectorization.

This is even more true in the case of CRC, where there’s clearly almost always one branch that wins: this is perfect for branch prediction, which would mean the whole “if eq” condition is preemptively skipped.


The compiler probably isn’t going to be able to autovectorize a CRC unless you help it out.


I think power efficiency has a lot more variables now, so it is not easy to know if consumption is linear with time. CPUs now dynamically throttle themselves, plus Apple has advertised that its M1 cores are divided up between high-performance and high-efficiency cores, let alone how the underlying chip itself may consume power differently when implementing different instructions.

So for a hypothetical example, it could be that using general purpose SIMD triggers the system to throttle up the CPUs and/or move to the high performance CPUs, whereas the dedicated CRC instructions might exist on the high-efficiency cores and not trigger any throttling.

I've forgotten all my computer architecture theory, but if I look back at Ohm's law and look at power, the equation is P = I^2 • R. Handwaving from my forgotten theory a bit here, ramping up the CPUs increases current, and we see that it is a squared factor. So while cutting the time by, say, a factor of 3 does mean you are done 3 times faster (which is a linear component), you still have to contend with a squared component in current which may have been increased.

I have no clue if the M1 actually does any of this, but merely stating that it is not obvious what is happening in terms of power efficiency. We've seen other examples of this. For example, I've read that Intel's AVX family instruction generally increases the power consumption and frequency of when utilized, but non-obviously, it often runs at a lower frequency when in 256 or 512 wide forms compared to the lesser widths (which then requires more work on the developer to figure out what is the optimal performance path as wider isn't necessarily faster). And as another example, when Apple shipped 2 video cards in their Macbooks, some general purpose Mac desktop application developers who cared about battery life were tip-toeing around different high level Apple APIs (e.g. Cocoa, Core Animation, etc.) because some APIs under the hood automatically triggered the high performance GPU to switch on (and eat power), while these general purpose desktop applications didn't want or need the extra performance (at the cost of eating the user's battery).


> whereas the dedicated CRC instructions might exist on the high-efficiency cores

M1 has a heterogeneous ISA, FWIW.


Homogenous? The P and E cores have all the same instructions. You won’t get suddenly moved from one to the other or hit emulations.


Oops, yes, that's what I meant. Thanks for catching that!


It is not faster to use SIMD by hand - it is faster to use the vector unit alongside the integer unit, using both paths at the same time.


That's very shortsighted thinking. The dedicated instruction could be optimised by the hardware in a future revision to become much faster.


SIMD is a very powerful parallelization technique, with marvelous gains whenever I see it used. It seems like a fundamentally more efficient form of compute, but is very difficult to design algorithms for.

I'd argue against "SIMD" as being "RISC", since you need all sorts of complicated instructions (ex: gather/scatter) to really support the methodology well in practice.


But scatter/gather is a primitive operation for SIMD, so if you want a RISC-based version of it, that's exactly what you would provide. Having dedicated instructions for specific operations (whether for crc/aes/nnp or whatever) feels like a CISC-based approach, so I think I agree with the GP.

RISC vs CISC is about the simplicity of the instruction set, not about whether it's easy to use.


These days I'd argue RISC vs CISC is more about regularity and directness than the size of the ISA per se.

I'd argue AArch64 isn't particularly RISC by the standards of the past but it sets the bar and tone for RISC today.


And which SIMD instruction set should we be talking about? NEON-instructions or with the SVE instruction set?

And if we're talking about multiple instruction-sets designed for the same purpose, is this thing really RISC anymore? Or do you really mean "just not x86" when you say RISC ??


That depends on how precisely you define the purpose. NEON and SVE seem to be aimed at different intensities of work.


> But scatter/gather is a primitive operation for SIMD

Not in NEON, and therefore not in M1. AVX512 and SVE add scatter/gather instructions.

Intel/AMD's AVX has vgather instructions, but is missing vscatter until AVX512.

> Having dedicated instructions for specific operations (whether for crc/aes/nnp or whatever) feels like a CISC-based approach, so I think I agree with the GP.

Not only are there AES instructions on ARM, but there's also SHA-instructions. The mix-columns step of AES more or less demands dedicated hardware if you want high-speed today, so everybody implements that as a hardware specific instruction.


What RISC is about in today's context is somewhat hazy and there are lots of views and opinions about it. RISC was a branding of a shift that happened in the 70s and 80s.

It was a time of tight transistor budgets, and there was a make-or-break sweet spot of instruction set size and complexity that could be hardcoded in the ~100k transistors (vs microcoded[1], as was the norm).

IMO the fundamental idea was that you should optimize the instruction set for your applications to a certain extent, while still keeping it coherent for humans and stable across processor generations.

RISC still gives ease of use weight, without this your instruction set might be a cryptic machine learning box of mystery operations with bizarre temporal/hidden state semantics (think delay slots but much much worse) that gets rebooted on every CPU release.

[1] See eg this Motorola 68000 internals description on how the microcoded thing worked: http://www.easy68k.com/paulrsm/doc/dpbm68k1.htm


Similarly, I wish that on x86, REP STOSB was the fastest way to copy memory. Because it only takes a few bytes in the icache. But fast memcpys end up being hundreds of bytes, to work with larger words while handling start and end alignment.
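For reference, the string-copy idiom under discussion looks roughly like this (GCC/Clang inline asm, x86-64 only; REP MOVSB is the copy form, REP STOSB the fill form). A sketch, not a drop-in memcpy:

    #include <stddef.h>

    /* Copy n bytes using the x86 string-move instruction: a handful of bytes
     * of code, with the CPU free to choose the widest internal transfers. */
    static void *rep_movsb_copy(void *dst, const void *src, size_t n) {
        void *ret = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }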


The real problem (as with the CRC above) is that the fastest version for any given CPU may not be the fastest for any other. It's really short-sighted not to spend the area on some of these features (aes, crc32, memcpy), because invariably one ends up with a long-term optimization problem where in 5-10 years any given application has to run on one of a half dozen different CPUs while optimizing for -mtune=native, and it likely results in suboptimal perf on the latest CPUs, because the microarch designers can't be constrained to assuring that the newer version runs any given instruction sequence proportionally faster than the previous one (aka the overall perf may go up, but maybe something like the nontemporal store or the polynomial mul doesn't keep up).

And this is really the CISC vs RISC argument, and why all these RISC CPUs have these CISC-like instructions. If you want top perf in general code, you assure the rep sto and mov sequences (or whatever) run the fastest microcoded version possible on a given core. But Intel sorta messed this up in the P6->Nehalem timeframe (IIRC when they added the fast string flag) until they rediscovered this fact. IIRC Andy Glew admitted it was a bit of an oversight combined with a release/area issue on the original PPro they intended to fix, but then it took 10 years.


It still is in general situations (i.e. not the microbenchmarks where the ridiculously bloated unrolled "optimised" implementations may have a very slight edge.) I believe the Linux kernel uses it for this reason.


The kernel is a bit of a special case since very likely a syscall starts off with a cold I$, and also there's a lot of extra overhead if you insist on using SIMD registers.

In general I agree with you though, optimizing memcpy implementations only against microbenchmarks is dumb.


Also, using SIMD registers is (generally) forbidden in kernel code, which heavily narrows down the competitors to "rep stosb".


With ERMS it’s definitely not going to be slow, so it’s a good choice when you’re in a constrained environment (high instruction cache pressure, can’t use vector instructions).


It's not really a fair comparison. There is only one CRC32 unit, which means it can't make use of superscalar execution (at least if I understand the article correctly). If it had more CRC32 units, that would be the most efficient.


It's very impressive someone messing around for a few hours could get the m1 chip to more than 2x the performance. Easy gains like that really shouldn't be possible, assuming Apple's silicon team are competent. Maybe there's some hidden gotcha here?


This is way more common than you'd think, and it's not by accident. Engineering teams optimize the paths that are heavily used to get the biggest improvement across the platform as a whole. CRC32X is certainly not as heavily used as NEON and so if you're forced to decide between spending area on being able to fuse extra instructions for NEON and slightly improving throughput for CRC32X, the obvious choice is NEON. You see this way more obviously on Intel's x86-64 cores where many of the highly used instructions are fast path decoded but some of the weirder CISC instructions that nobody really uses are offloaded to very slow microcode.


I wonder -- could CRC32X be something that is also, specifically, not as interesting for Apple? They are mostly optimizing for desktop workloads. I wonder if worrying about checksumming, especially maximizing the throughput of checksum operations, is more of a server thing. (Like we have to checksum when we download things on desktop, but that's a one-off, and I guess things get checksummed in the filesystem, but even the nice NVMe drives are pretty slow from the CPU's point of view.)


MachO uses codesigning with adhoc signatures as a form of checksumming, and there’s also TCP and whatever drives do. So it’s the converse, it’s so common the dedicated hardware does it instead of the CPU. And it’s not all the same algorithm.


I think it's a little bit of that and also I suspect they have far more pressing concerns than CRC32X being relatively slow (it is still a throughput of one per clock which isn't at all bad). Branch prediction and prefetching seems to be the really important problem at least for Apple due to their very deep ROB [1]. A mispredicted branch being resolved late (i.e. a branch dependent on an outstanding DRAM fetch) can lead to hundreds of executed instructions being discarded (wasting tons and tons of power and cycles). I don't quite remember the exact figure, but I've heard a good metric in CPU arch is that about one of every six instructions is a control flow instruction in general purpose programs (i.e. non-scientific/ calculation heavy). Being just a little bit faster on CRC32X calculation may not have been worth it when they could spend that precious power budget elsewhere. It's really just design choices all the way down. They may very well be doing a lot of CRC32s but they're almost certainly doing more of everything else than CRC32s.

[1] https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


Intel's algorithm is very clever and would take a lot of space to implement in hardware. The implementation underlying the CRC32** instructions is probably some set of shift registers. That's a pretty good space/speed tradeoff to make.

My largely uninformed guess is that they added the instructions to get fast CRCs for the filesystem 'for free'. There aren't many other cases where software CRC can be a bottleneck that also use these polynomials.
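For comparison, the classic bit-at-a-time form that such a shift-register implementation corresponds to, written here for the reflected CRC-32 polynomial used by zlib (a sketch; real software implementations use tables or the folding techniques from the article):

    #include <stddef.h>
    #include <stdint.h>

    /* One bit of the message per shift of the CRC register. */
    uint32_t crc32_bitwise(uint32_t crc, const unsigned char *buf, size_t len) {
        crc = ~crc;
        while (len--) {
            crc ^= *buf++;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }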


CRC32X works on 8 bytes at a time and has a throughput of one invocation per cycle, whereas the SIMD operates on blocks in parallel (the chromium code does 64 bytes an iteration, with a lot of instruction-level parallelism too). Theoretically M1 could have thrown more silicon at it to allow more than one CRC32X invocation per cycle, but that's not very useful if you can achieve the same with SIMD anyway.
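A sketch of that serial pattern using the ARM ACLE intrinsics (assumes a compiler with CRC support enabled, e.g. -march=armv8-a+crc); each call depends on the previous CRC value, which is why the plain loop can't go faster than one CRC32X result per cycle no matter how wide the core is:

    #include <arm_acle.h>
    #include <stddef.h>
    #include <stdint.h>

    uint32_t crc32_hw(uint32_t crc, const unsigned char *p, size_t len) {
        crc = ~crc;
        while (len >= 8) {
            uint64_t v;
            __builtin_memcpy(&v, p, 8);   /* unaligned-safe 8-byte load */
            crc = __crc32d(crc, v);       /* CRC32X: 64 bits per instruction */
            p += 8; len -= 8;
        }
        while (len--)
            crc = __crc32b(crc, *p++);    /* byte-at-a-time tail */
        return ~crc;
    }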


> Easy gains like that really shouldn't be possible,

Easy gains are everywhere. The "gotcha", if you can call it that, is that optimizing particular operations comes with space tradeoffs that are more expensive when you do them in hardware.


I feel like CRC32 may be simple enough (and close enough to the kinds of operations, like adding and bit-shifting, that general-purpose CPUs are good at anyway) that perhaps it doesn't benefit as much from dedicated silicon as other algorithms would.


Saving it for M2 to have something to show off?


Could you combine both techniques to run both the SIMD version on some chunks and the crc32 instruction on other chunks, in parallel? Of course this would only work if they execute on different ports.


Yes, and ZLib provides an excellent example of how to do this (it's called braiding in the source code). There's an additional initialization and combination cost though. It doesn't really make sense for short messages.


Thanks for the pointer, will have to take a look!


Hmm, yeah, this might work out... Two SIMD uops process 16 bytes, so each SIMD uop is doing eight bytes of work - the same as CRC32X, but with more frontend pressure (and preferable because they can run on any of the four SIMD ports, not just the one distinct CRC32X port).

It gets a bit messy, and we can't expect a ton from this approach - the same loop with only the loads only runs at ~86GB/s, but it'd be worth a shot.


It seems like CRC inherently depends on results from earlier calculations, so it would be hard to parallelize like that. You could potentially do multiple independent CRC calculations in parallel, but then you're getting into more niche use cases.


Wrong. CRC is just polynomial division, which is simple to do in a divide and conquer fashion. It's pretty easy to derive CRC(A concat B) from CRC(A) and CRC(B). It needs a multiplication and a XOR.


ah, that's pretty cool! It looks like concatenation is an O(log n) operation involving appending a bunch of zeros rather than just a multiplication, though?


It depends.

((A(x) mod Q(x)) * (B(x) mod Q(x))) mod Q(x) = (A(x) * B(x)) mod Q(x)

If the chunk size N is known beforehand you can pre-calculate x^N mod Q(x), so appending N zeros will be an O(1) multiplication.

Only if the chunk size is not known do you have to calculate x^N mod Q(x) via modular exponentiation, which is O(log n). But you only need to do this once, and then you can reuse the value for all subsequent chunks.
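A hedged C sketch of that square-and-multiply step for the standard CRC-32 polynomial: it only shows computing x^(8*nbytes) mod Q(x) in O(log nbytes) carry-less multiplications, and ignores the bit-reflection and init/final-XOR conventions a real combine routine (such as zlib's crc32_combine) has to handle:

    #include <stdint.h>

    /* Carry-less (GF(2)) product of two 32-bit polynomials -> 64-bit result. */
    static uint64_t clmul32(uint32_t a, uint32_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++)
            if (b & (1u << i))
                r ^= (uint64_t)a << i;
        return r;
    }

    /* Reduce a 64-bit polynomial modulo Q(x) = x^32 + ... + 1 (0x104C11DB7). */
    static uint32_t polymod(uint64_t a) {
        const uint64_t Q = 0x104C11DB7ULL;
        for (int i = 63; i >= 32; i--)
            if (a & (1ULL << i))
                a ^= Q << (i - 32);
        return (uint32_t)a;
    }

    /* x^(8*nbytes) mod Q(x) by square-and-multiply: O(log nbytes) steps. */
    static uint32_t xpow8n_mod(uint64_t nbytes) {
        uint32_t result = 1;      /* the polynomial 1 */
        uint32_t base = 0x100;    /* the polynomial x^8 (one zero byte) */
        while (nbytes) {
            if (nbytes & 1)
                result = polymod(clmul32(result, base));
            base = polymod(clmul32(base, base));
            nbytes >>= 1;
        }
        return result;
    }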


It is very interesting that since the release of the M1 chip, CPU performance on Apple silicon has really come under a microscope. It leads me to ask:

Was Apple silicon always this best-in-class and we weren’t looking this closely as a community?


Apple's chips in their iPhones & iPads have been outperforming the competition (Qualcomm & Samsung) for a long time now in both power efficiency and performance. Apple has usually been around 2 yrs ahead of the competition. The Qualcomm Snapdragon 888 chip with 8 cores has a Geekbench multi-core score of 3592 [1]. The six-core Apple A15 Bionic scored 4673 [2]. The Apple chip has ~30% better multi-core performance with 25% fewer cores than the Qualcomm chip. In single-core performance the difference is even larger, with the Apple chips performing ~47% faster. You can get the same A15 Bionic in both the $429 iPhone SE and the $1100+ iPhone 13 Pro Max.

It hasn't really been a huge deal though, because people don't develop directly on an iPhone, so it doesn't affect their everyday productivity all that much. Also, phones reached the point of "fast enough" a few years ago; it's hard to tell the difference between an iPhone 13 Pro and an 11 Pro unless you use them side by side. But with the release of the M1 chip, people are getting the performance & energy efficiency gains in their everyday workflows.

[1]. https://browser.geekbench.com/android-benchmarks

[2]. https://browser.geekbench.com/ios-benchmarks


Phones also usually aren't designed to sustain a high CPU load for extended periods of time. Computers, on the other hand, are.


Everyone was looking, and it was well known that it was best in-class; two random examples: https://www.anandtech.com/show/7335/the-iphone-5s-review/4

https://twitter.com/codinghorror/status/912047023871860737


Yes, and anyone who had been reading AnandTech's excellent mobile reviews (by Andrei Frumusanu) knew what M1 would bring.


You couldn't buy one without a screen, or run arbitrary code on them.


I wonder how fast someone can get it to run on Intel's 12th generation Core CPUs.

It seems a good idea to start a Code Golf competition.


Likely as much noise as signal, but anyway:

I'm using an 8th(?) generation Intel, i7-8665U.

https://github.com/htot/crc32c has some interesting implementations of CRC32 algorithms of different speeds, the highest I see is (function, aligned, bytes, MiB/s) :

    crc32cIntelC     true 16 3907.613
    crc32cIntelC     true 64 15096.758
    crc32cIntelC     true 128 24692.803
    crc32cIntelC     true 192 22732.392
    crc32cIntelC     true 256 16233.397
    crc32cIntelC     true 288 16748.952
    crc32cIntelC     true 512 19862.039
    crc32cIntelC     true 1024 22373.350
    crc32cIntelC     true 1032 22482.031
    crc32cIntelC     true 4096 24690.531
    crc32cIntelC     true 8192 24992.827
So pushing 25GiB/s on a 3ish year old CPU.


FYI, CRC32C is not the same checksum as CRC32. CRC32C is used in iSCSI, btrfs, ext4. CRC32 is used in Ethernet, SATA, gzip. Intel's SSE4.2 provides an instruction implementing CRC32C but not CRC32. ARM defines instructions for both.

This article seems to miss that distinction but appears to be testing CRC32, so it's not quite correct to compare against something using Intel's CRC32C instruction.


Core i7-12700K:

  crc32cIntelC     true 64 26210.561
  crc32cIntelC     true 128 35870.309
  crc32cIntelC     true 192 36850.224
  crc32cIntelC     true 256 30343.690
  crc32cIntelC     true 288 30671.327
  crc32cIntelC     true 512 32443.251
  crc32cIntelC     true 1024 34654.719
  crc32cIntelC     true 1032 34265.440
  crc32cIntelC     true 4096 38111.089
  crc32cIntelC     true 8192 38634.925


I'm using an i7-4790 (7-year-old CPU) and the numbers are slightly better. Maybe because it's a desktop.

    crc32cIntelC        true    16  4025.334
    crc32cIntelC        true    64  15749.095
    crc32cIntelC        true    128 26608.064
    crc32cIntelC        true    192 25828.486
    crc32cIntelC        true    256 17448.436
    crc32cIntelC        true    288 18336.381
    crc32cIntelC        true    512 22635.590
    crc32cIntelC        true    1024    24654.248
    crc32cIntelC        true    1032    24180.107
    crc32cIntelC        true    4096    28251.903
    crc32cIntelC        true    8192    28768.134


I don't think they will hit 75GB/s because Intel doesn't make that kind of memory throughput available to a single core.


https://github.com/komrad36/CRC

I measured it on my computer to run at 32GB/s, i7 4770k.



It's always important to know which crc you are using.

Looking at https://developer.arm.com/documentation/ddi0596/2020-12/Base..., based upon the polynomial constant, the CRC32 class of instructions appear to calculate a CCITT 32 reversed polynomial. Are there any ARM developers who can help me out here? Does this apply in the same way to the M1?


The post glosses over it a bit - the CRC32X instruction always uses the common polynomial 0x04C11DB7 (matching zlib, commonly just called CRC-32), and there's a second instruction, CRC32CX, which is the same but uses the polynomial 0x1EDC6F41, known as CRC-32C (Castagnoli). The constants in the post are also for 0x04C11DB7, but the linked Intel article explains how they can be calculated for arbitrary polynomials, so the faster method is also generic, which is nice.


Thank you!

The Castagnoli polynomial, 0x1edc6f41, is used to compute a crc in Btrfs, Ext4, iSCSI and various other places.


Weird, I talked to the guy who invented the crypto scheme for ZIP and he said he invented the CRC algorithm for it as well. I wonder if there's more backstory there.

Edit: He invented the crypto not the CRC, which Phil Katz was already using.


That's a very strange claim. As far as I know, the same CRC has been used for ZIP since it was invented by the PKWARE guys in the 80s. Moreover, no one deeply understood the properties and tradeoffs of various CRCs the way we do now, so everyone used largely identical algorithms with only trivial variations. Phil Katz did the same and reused the same polynomial that was in ethernet and dozens of other standards, which in turn had originated from this 1975 report: https://apps.dtic.mil/sti/pdfs/ADA013939.pdf

He wasn't even the first to put a CRC in an archive format, as the predecessor format ARC had a CRC-16 doing the same thing.


Ah, OK, I looked in my email and it was the guy who invented the encryption scheme, but he said "Yes, I invented it [referring to the encryption]. It wasn't based on anything else, except that it used the same CRC he [Phil] was already using in zip." (As a historical note, he also said the crypto scheme was intended to be exportable, which at that time meant "intentionally weak".)


I think what OP meant to write: the zip encryption algorithm is a custom stream cypher that uses crc32 as the main building block.

(It's a very bad cypher, vulnerable to known plaintext and other attacks, don't use it for anything except light scrambling).


Nowadays most zip programs will default to using AES (in the way WinZip invented) instead of ZipCrypto.



