A correction to the history mentioned in the article: CMOV was not added by AMD around 2000.
CMOV was added by Intel in 1995, with the Pentium Pro. It was the most important addition the Pentium Pro made to the x86 ISA. So CMOV was supported by the Pentium Pro and its successors, like the Pentium II or Pentium III, but it was not supported by the Pentium, the Pentium with MMX, or the AMD K6 CPUs. AMD added CMOV starting with the Athlon, in 1999.
The Pentium Pro was the first Intel CPU with out-of-order execution and it did much more speculative execution than its predecessors, so it needed CMOV to limit the performance loss caused by branch mispredictions.
The most important bit of this is in the conclusion:
Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
Linus famously ranted about CMOV in https://yarchive.net/comp/linux/cmov.html (2007, so potentially more modern architectures are better at some of this) and he says:
> if you KNOW the branch is totally unpredictable, cmov is often good for performance. But a compiler almost never knows that.
As usual with optimizations like this you have to benchmark and even then if your sample isn't representative it might not mean much.
Notice how random_s95 is worse (not by much, but it's there) than fully random. random_s95 is 95% sorted data, but 5% unsorted, simulating a common "sort, do stuff, append, repeat" pattern we see in a lot of software.
In contrast the sorted cases are almost instant, and random_d20 (only 20 distinct values, chosen at random, but as a result the sorted output needs far fewer comparisons) is very fast.
When Linus made that comment cmov was like a 6 cycle latency. For the last decade it has been 1 cycle, and I don't think there is any scenario where cmov is now slower than a branch.
The problem is not the 1 cycle latency, but the data dependency on both values. A correctly-predicted branch cuts the dependency on the value that is not used.
I've definitely measured scenarios where cmov/branchless is slower than a branch for a given algorithm. Especially if the branchless version is doing a bit more work to avoid the branch.
It is both though. At 6+ cycles there are only a few places where CMOV is a win. At 1 cycle you can be more liberal with its use and the lack of dependency breaking is a tradeoff.
cmov turns a control dependency into a data dependency. It's still there, just in a different way. (It's like the Fourier transform of instructions - lolol.)
I haven't seen cmov lose to a conditional in a few years (on very recent Intel hardware). Maybe there might be some circumstances where it creates a loop-carried dependency that wrecks the optimizer, but I haven't seen one in a minute.
To be fair - cmov versions often involve some extra computation to get things into a form where cmov can be used, and I often just opt for the well-predicted conditional since those couple of extra instructions to set up the value to move can be avoided.
Good point. It makes me wonder if modern out-of-order processors can skip performing unused computations altogether, if their result registers are overwritten by other data later (in program order).
CPUs could in theory speculate CMOV, reintroducing prediction when predictable. But after Spectre, IIRC Intel now guarantees that CMOV is never speculated.
a cmov-based approach is necessarily slower than a branch-based approach that was correctly predicted, since cmov requires computing both branches then selecting the result at the end.
Per Agner Fog's instruction tables, Ice Lake shows a reg-to-reg cmov decoding to 1 uop with a latency of 1 and a throughput of 0.5. The *mont Atom cores all have latency 2. The P4 had a latency of 6, and AMD's K10 a latency of 4.
Linus was specifically ranting about compilers inserting CMOV. These days it is actually a pain to get GCC or clang to generate a CMOV when you specifically want it.
Intel was so embarrassed by the failure of itanium they invented a Time Machine and went back and added the instruction to a 1995 CPU. Deceptive and anti-competitive!
Rather than std::swap_if which is rather niche, I would prefer to see std::select(cond, if_true, if_false), with the guarantee that unlike the ternary operator it eagerly evaluates both arguments and selects between them branchlessly if possible.
Wouldn't the sensible lower-level primitive be Clang's __builtin_unpredictable? If you standardize that, then a library can easily build your select function on top of it. I guess GCC's __builtin_expect_with_probability with probability 0.5 should have the same meaning.
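Roughly what that layering might look like (a sketch only; select() here is the hypothetical function proposed above, not anything standard, and the hint is advisory, so a branchless form is encouraged rather than guaranteed):

    // A sketch of the proposed select(). Both arguments are evaluated by the
    // caller, and the hint tells the compiler the condition is a coin flip,
    // nudging it toward a branchless form; nothing here guarantees a cmov.
    inline long select(bool cond, long if_true, long if_false) {
    #if defined(__clang__)
        return __builtin_unpredictable(cond) ? if_true : if_false;
    #elif defined(__GNUC__)
        // GCC spelling of (roughly) the same hint: expect "true" with probability 0.5.
        return __builtin_expect_with_probability(cond, 1, 0.5) ? if_true : if_false;
    #else
        return cond ? if_true : if_false;   // no hint available
    #endif
    }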
Also the compiler vendors have seen this movie before and know how it ends. Too often a "performance hint" is garbage and ignoring it will improve performance.
Which is unfortunate because these are genuinely sometimes useful, but it's a "We can't have nice things" situation - for every expert who is carefully applying a hint to a hot path after careful reading of the manual for the specific CPU model they use, there are dozens of idiots who will copy-paste that hint believing that it's magic speed-up juice, never having measured and with no idea what it's really doing or why.
The intrinsics have the advantage that at least you're expressing what you meant, and it is obvious whose fault it is when that's slower because you screwed up - hints make it too easy for programmers to justify a tantrum when their "hint" isn't magic speed-up juice after all.
For what it's worth, I recommended against mentioning `unpredictable` anywhere in the name when we stabilize the intrinsic.
IMO the constant-timeness is much more important, e.g. in audio DSP code or for cryptography you really care that it's a `cmov`, regardless of the predictability of the predicate.
Compiler vendors are the ones who add both the intrinsics and attribute hints. I also don't see a reason why "programmers" who engage in wishful thinking would limit that to hints and not apply the same to intrinsics.
Communication is the reason. "Hints" are not good communication.
There's a real difference between the wishful thinking which leads to programmers writing incorrect hints and them writing code that outright says what it's doing and then being surprised it did what they asked.
This sort of use of turning conditionals into bitmasks and then shuffling masked values around was fairly common back in the day when writing code for SSE intrinsics.
You wouldn't want to have to check each lane individually since that would spoil the point of using SSE. So you'd write masking ops like this and keep things parallel, then judiciously use any() and all() type intrinsics to check whether the lanes could agree to skip certain blocks conditionally.
(This was before ISPC and before autovectorization to handle it for you.)
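Roughly what those masking ops looked like when written by hand, as a sketch with SSE2 intrinsics (not anyone's actual shipping code):

    #include <emmintrin.h>  // SSE2

    // For each 32-bit lane: out = (a < b) ? x : y, with no per-lane branches.
    static inline __m128i select_lanes(__m128i a, __m128i b, __m128i x, __m128i y) {
        __m128i mask = _mm_cmplt_epi32(a, b);            // all-ones where a < b, else zero
        return _mm_or_si128(_mm_and_si128(mask, x),      // keep x where the mask is set
                            _mm_andnot_si128(mask, y));  // keep y elsewhere
    }

On SSE4.1 and later the and/andnot/or trio collapses into a single _mm_blendv_epi8(y, x, mask), which is essentially the blend instruction discussed elsewhere in this thread.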
If you look at the assembly code produced, you'll see in lines 9-11, it's doing a similar thing as the swap_if() function: ANDing one value with the mask, ANDing the other with the negation of the mask, and then ORing them together. And it produced that automatically from the ternary in the loop.
They can also do things like automatically generate epilogues with a serial loop to handle the leftover items if the loop count isn't an even multiple of the SIMD width. That's something else that I remember writing by hand before autovectorizers.
That said, coaxing an autovectorizer to actually vectorize a loop can be an art in itself. They're notoriously finicky. My favorite used to be the one in the old ("classic") ICC compiler, where you could use compiler flags to get a nice optimization report explaining exactly why it chose what it did for each loop. You could also set #pragmas which would noisily warn or error out if the associated loop failed to vectorize for some reason.
(And of course, a lot of this stuff you might just use GPU compute for now.)
You can get gcc to generate the blend instruction in the example given - you just have to 'force' it to use SIMD more aggressively by, for example, using -march=native.
Assuming the mask is passed in, Godbolt puts OP’s swap_if at 26 instructions, versus the above swap_if at 17 instructions: https://godbolt.org/z/Wedco5hPv
You have the optimizer disabled and the functions are no-ops. Additionally, masking that way makes the XOR no longer a no-op, so only one masking operation is necessary -- but it seems that GCC already knows this trick:
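In source form, the single-mask version of the trick looks roughly like this (a sketch assuming 64-bit values and a condition that is exactly 0 or 1, not necessarily the article's exact swap_if):

    #include <cstdint>

    // When cond is false the mask is 0 and both XORs are no-ops;
    // when cond is true the mask is all-ones and a and b are exchanged.
    static inline void swap_if(bool cond, std::uint64_t& a, std::uint64_t& b) {
        std::uint64_t mask = -static_cast<std::uint64_t>(cond); // 0 or ~0
        std::uint64_t t = (a ^ b) & mask;                       // the single masking op
        a ^= t;
        b ^= t;
    }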
> The in-place version of swap is generally discouraged because compilers are smart
Isn’t that more because CPUs slow down when there are dependencies between instructions?
Compilers could (and may even do) fairly easily detect that pattern, like they do with some hand-written versions of popcnt (see for example https://langdev.stackexchange.com/questions/3942/what-are-th..., discussed here in https://news.ycombinator.com/item?id=40987123), and compile it to whatever is fastest on the target CPU in the given context (in some contexts it can be compiled to nothing, just changing the mapping between local variables and the registers they're stored in).
Will we see x86 or similar CPUs replace hot instruction blocks with current-processor-specific optimized code during runtime, similar to how certain JIT VMs do?
What I mean is, say you have a similar simple "x = (cond) ? a : b;" which the compiler has not translated to a CMOV.
If this is in a hot loop then the CPU could, in theory, notice that "it's just doing a conditional move, I can do the CMOV faster" and then translate those code bytes at that memory location to a CMOV instead (ie during decoding or something like that).
Not worth the complexity? I imagine it would slow down the decoder or wherever they insert this replacement logic. Or am I hopelessly out of date and they're already doing this?
I think it could only do that in spans of code where interrupts are disabled and purely between values already in registers. CPU state feels like it's in your control, but it's not at all in your control, and even if it was, multiple processes might exist and memory is never in your control.
Whether to use a branch or a conditional move isn’t really dependent on what the source is doing but how likely the branch is and its dependencies. Simplifying a bit, an explicit cmov is basically telling the computer that the condition is not easy to predict and to not bother. Modern processors will typically do the same analysis themselves and perform similarly on a branch with a slight overhead.
> Will we see x86 or similar CPUs replace hot instruction blocks with current-processor-specific optimized code during runtime, similar to how certain JIT VMs do?
Kinda, though Transmeta was built to translate everything. I must admit though I had forgotten some details, and you're right in that my idea is not unlike bolting the Code Morphing stuff to a regular x86 core.
> I call this bog-standard, but the quicksorts most easily found online are always oddly pessimized, both more complicated and slower. This one uses the “Lomuto partition”, which is simpler than Hoare’s.
The Lomuto partitioning is a naive vandalism of Hoare's original, which causes quicksort to have quadratic behavior on sorted input.
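For anyone who hasn't seen it, this is roughly the scheme being criticized (a sketch; the article's code may differ in details). The last element is used as the pivot, so on already-sorted input every partition is maximally lopsided and the total work becomes quadratic:

    #include <algorithm>
    #include <vector>

    // Lomuto partition: everything < pivot is moved to the front, pivot ends at i.
    static int lomuto_partition(std::vector<int>& v, int lo, int hi) {
        int pivot = v[hi];                  // last element as the pivot
        int i = lo;
        for (int j = lo; j < hi; ++j)
            if (v[j] < pivot)
                std::swap(v[i++], v[j]);
        std::swap(v[i], v[hi]);
        return i;
    }

    static void quicksort(std::vector<int>& v, int lo, int hi) {
        if (lo < hi) {
            int p = lomuto_partition(v, lo, hi);
            quicksort(v, lo, p - 1);        // on sorted input this side has hi - lo elements
            quicksort(v, p + 1, hi);        // and this side is empty, every time
        }
    }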
Fascinating! How does one learn to write C code that will make use of specific asm instructions?
simdjson is another example that comes to mind. The conceit of C is that you don't have control over the underlying machine instructions without inlining them yourself. So how do people write C code with CPU optimizations like this?
1) Use intrinsics, if your platform provides intrinsics for the instructions you want to use.
2) Use inline assembly (which I suppose doesn't technically count as writing C code, but is very much part of the story).
3) Write C code carefully against a chosen compiler with chosen optimization flags, and inspect the assembly. Especially for small chunks of code, a tool like godbolt.org is indispensable. Basically, you're writing the assembly you want the compiler to generate (either explicitly, or just in your head), then writing C code that seems likely to generate that sequence of instructions, then tweaking the code until the output you want is generated. If this is a super important optimization, it's also reasonable to add a build step that inspects the generated assembly and fails the build if it doesn't match the desired pattern; but in that case, writing inline assembly is usually easier.
Option 4 is sometimes compiler annotations like __builtin_expect_with_probability, which you can use to assign a branch a probability of 50%, which should coax it into picking cmov (same risk as 3, though, in that you need to inspect the compiler output).
Yep, there's an entire book to be written on techniques that you can use to express your understanding of code to the compiler so that the compiler's chosen implementation of it matches yours. Plenty of __builtins and __attributes__, and even just the ordering of if/else statements, grouping statements into local scopes with {}, using or not using a temporary variable...
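As a concrete illustration of option 2 above, forcing a cmov with GCC/Clang extended inline assembly might look something like this (x86-64 only; a sketch, and select_cmov is a made-up helper name):

    // result = cond ? if_true : if_false, forced through a cmov.
    static inline long select_cmov(long cond, long if_true, long if_false) {
        long result = if_false;
        __asm__("test %1, %1\n\t"
                "cmovne %2, %0"
                : "+r"(result)
                : "r"(cond), "r"(if_true)
                : "cc");
        return result;
    }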
I would say it is a moving target if the goal is to write C code that compiles to a seemingly magical CPU instruction. As others have pointed out, learning assembly and Compiler Explorer are useful things.
The gist is rather simple: assuming that a substantial share of your searches come back with no result, you can bias the binary search to jump farther than just half, improving runtime noticeably.
If your primary audience is other developers then it absolutely does not matter. In fact it’s a way to expose deep anger. All that matters is convenience and measurements just piss people off, because measuring anything is just too hard.
If your primary audience is business, especially transactional finance, then absolutely yes. Squeeze absolutely every millisecond out, measure absolutely everything in terms of transaction quantity, focus on concurrent operations.
Convenience is the kind of thing a man values when he isn't pushing against any fundamental limits imposed by the universe. You have to work on computing frontiers like LLMs where people are feeling a performance crunch. Measuring things is what separates software engineering from computer programming. Computer science degenerates into social science when your computer is 100x faster than the thing you're building, since then you have no need to measure things. Without extreme self-imposed discipline, your work in that kind of environment will spiral into bloated decadence, unless you move on to the next computing frontier.
Note that random data is not a common case for sorting algorithms. It'd be interesting to see how the numbers change on partially-, mostly-, and fully-sorted data.
> What can we conclude from this discovery? Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
it's true that it's not a common case, but it's probably the simplest objectively justifiable case; any particular case of the others requires more elaborate justification for privileging it. why 90% sorted instead of 80%, say? or why do we give 20% weighting to the randomized-data case and 80% weighting to the 90%-sorted case, rather than some other weighting?
and it does at least guarantee exploration of a good part of the algorithm's behavior space. i mean, bogosort is optimal on fully-sorted data, right? as long as you do the check for sortedness before the random shuffle instead of after it
it's important that your sort not explode on mostly-sorted or mostly-reverse-sorted data (myers's 'bog-standard quicksort' uses the last element as the pivot and consequently goes quadratic), but beyond that, the particular weighting of relative importance to assign to random input vs. 99% sorted vs. 90% sorted vs. 90% reverse sorted—that requires some application-specific justification
aside from the ones you mentioned, another interesting case is a large array of records that all have equal keys. median-of-three quicksort or random-pivot quicksort does fine on mostly- or fully-sorted data, but still sucks when everything is equal!
It’s weird to admit that this is an arbitrary dataset that doesn’t mimic real-world use cases, then admit that radix sort is far faster for this situation, then conclude that low-level optimizations are important and a cmov-wrapper should be added to the standard library to make programs faster. It’s a great post I just disagree with the conclusion.
The radix sort implementation is not optimal. Instead of sorting 1 digit at a time, it should be sorting 11 digits at a time to saturate the cache lines.
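For reference, a generic LSD radix sort where the digit width is a parameter might look like the sketch below; reading the "11 digits" above as 11 bits per pass (2048 buckets, three passes over 32-bit keys) is my assumption, not something stated in the article.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort of unsigned 32-bit keys, radix_bits bits per pass.
    void radix_sort_u32(std::vector<std::uint32_t>& v, unsigned radix_bits = 11) {
        const std::size_t buckets = std::size_t(1) << radix_bits;
        const std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);
        std::vector<std::uint32_t> tmp(v.size());
        std::vector<std::size_t> count(buckets);

        for (unsigned shift = 0; shift < 32; shift += radix_bits) {
            std::fill(count.begin(), count.end(), 0);
            for (std::uint32_t x : v)                    // histogram this digit
                ++count[(x >> shift) & mask];
            std::size_t sum = 0;                         // exclusive prefix sums -> bucket starts
            for (std::size_t& c : count) { std::size_t t = c; c = sum; sum += t; }
            for (std::uint32_t x : v)                    // stable scatter into tmp
                tmp[count[(x >> shift) & mask]++] = x;
            v.swap(tmp);
        }
    }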
Which of course makes it ironic that it says "We ought to be able to sort pretty fast, by now, using a Standard Library sort."
The C++ standard promised (since 1998) that the provided sorts are O(n log n).
But in fact popular C++ implementations ignored this and shipped quicksort - after all, it's "pretty fast" - but unfortunately this means that if bad guys can choose your input (or you get unlucky) you get O(n squared).
Gradually they fixed this, libc++ updated to Introsort (last century's best attempt to do this) in 2021. In response to a ticket opened by Orson Peters.
These days you're also starting to see a trend towards the Rust-style "fast sort which implements a wide contract" design even in C++ implementations. Rust's sort doesn't actually promise this wide contract arrangement, but it does promise safety and the wide contract turns out to be the simplest way to do that.
What I mean there by "wide contract" is that these sorts don't blow up if your ordering rule is nonsense. They can't "sort" successfully because your ordering rule is nonsense, if A > B && B > A then too bad, we can't sort A and B. But they can and do gracefully give up, while many older C++ stdlibs just cause uncontrolled mayhem in this case.
I'm pretty sure that introsort was already implemented in the original STL last century.
In fact the standard didn't require O(n log n) until C++11, as all implementations were already using introsort, so allowing the previously more permissive O(n^2) bound was no longer necessary.
I don't know which is the "original STL" in this context. Alexander Stepanov's proposal of a standard template library for C++ pre-dates Musser's invention of introspective sorting (introsort). Stepanov is a smart guy but he's also somewhat famous so I'd guess if he invented it first we'd know that.
As I said, Orson filed a bug in LLVM bugzilla for libc++ going quadratic for adversarial input in 2014, and it was eventually fixed (by using introsort) in 2021.
There's a fun sideshow in that ticket. Remember, Orson actually wrote what was probably the best general-purpose unstable sort at the time - but all he's done is open a ticket saying libc++ doesn't meet the minimum requirements. After a while, with the team having dragged their feet and still not even landed the patch to just use introsort, somebody pipes up to report that they have benchmarked sorts and actually libc++ is faster than other popular C++ sorts (in their benchmark). So Orson politely asks them to try PDQsort. That's the first time he explicitly mentions his Pattern-Defeating Quicksort in the ticket, and maybe a harried LLVM bug handler wouldn't notice who filed the bug until then. Just another name.
But it's OK, the LLVM devs are unperturbed and ignore him, focusing instead on the minutiae of the performance numbers for their known-defective unstable sort. Years later somebody eventually adds introsort to libc++ and closes the defect ticket. Nobody answers Orson's question; the answer, of course, is "Oh, yours is much faster".
Musser and Stepanov were working together on the STL. I think they were both at SGI at that time, and introsort was integrated into the SGI STL (around the same time the STL was standardized, I guess introsort was too new to require it in C++98); from there introsort made it into most other standard library implementations which are directly derived from SGI STL.
libc++ is notable for being one of the few from-scratch implementations.