Hand Coded Assembly Beats Intrinsics in Speed and Simplicity

pandaman · on Oct 25, 2014

I wonder if it's already the time to put to rest the myth that the human-coded assembly for a modern CPU is slower than what a compiler generates?

As far as I know it's been semi-true around Pentium time. On that asymetrical dual pipeline architecture with many parring exceptions and some microcoded instructions in the mix the typical hand-optimized code for the previous generation CPUs would likely under-utilize the second pipe and hit some microcode for a good measure. However, even the Pentium Pro already had a very robust front end that did not require non-intuitive op rearrangement to utilize multiple exec units.

Saying that you cannot hand code better than the compiler for a non-Pentium CPU nowadays sounds similar to saying that if you use too many set bits in your data then the punch cards will warp and tear apart from too many holes.

userbinator · on Oct 26, 2014

Speaking from experience, this myth was never true from 8086 up through the 486; the Pentium (P5) might've been slightly harder to optimise for, but a human could still easily beat the compilers of the time. And of course, it is not true today either.

Around the time of the P6 (Pentium Pro/II/III) was when I think things really improved - out-of-order execution/speculation/wide decoders meant instruction ordering had little significance, and so getting more performance was more about reducing the number of instructions needed - something compilers tended to not do well at.

I believe the myth came from the early days of the RISC fad, where CPUs were designed specifically so that compilers could easily generate code for them and make human-coded Asm "obsolete"; although to me this appears to be more of a "take all the advantage away from humans, making it just as hard for them to optimise the code as the compiler"... I have only a little experience with writing Asm for MIPS and ARM, but even from that I can see they do not have anywhere near the sort of "room for optimisation" that x86 provides.

Another myth that probably needs to die now is "microcoded instructions are slow"; in a recent article on HN about using the old BCD instructions which were commonly thought to be much slower than an equivalent sequence of simpler instructions (https://news.ycombinator.com/item?id=8477254 ), I did a quick benchmark and found they were basically the same speed, but much shorter. The one caveat is that currently these complex instructions can only be decoded by one of the decoders, so having too many of them close together can slow things down if the other decoders can't find a simpler non-dependent instruction to decode; but using them sparingly could be good for making code smaller (and thus consume less cache) without any speed penalty, or possibly increase speed overall due to reduced cache misses.

yohanatan · on Oct 25, 2014

> Saying that you cannot hand code better than the compiler for a non-Pentium CPU nowadays sounds similar to saying that if you use too many set bits in your data then the punch cards will warp and tear apart from too many holes.

If one prefers a style with more set bits than not, looks like there could easily be a physical switch on the machine to inform it to globally flip the bits.

sharpneli · on Oct 25, 2014

"Well, at least in clang (3.4). Gcc (4.8.2) is too smart to fall for this separate destination thing and “optimizes” the code back to something like our original version."

This resonated quite heavily with me. I've spent rather large amount of time figuring out how to trick various compilers. This is especially critical in cases where there is no inline assembly, such as shaders. It's really sad how in some cases you can 'optimize' code by hiding information. In what sane world code runs faster if you make something not a compile time constant as an example?

I really wish that every compiler would have "Shut up! I know what I'm doing!" mode where it would basically do what you say as close as possible. 1:1 mapping on registers to certain variables and the few temp regs required for evaluating expressions.

This is not needed often but it's absolutely mandatory for going from 40-50% of theoretical speed to >90%. And in games and other soft real time applications this can spell the difference between 30 and 60 fps.

daniel-cussen · on Oct 26, 2014

> I really wish that every compiler would have "Shut up! I know what I'm doing!" mode where it would basically do what you say as close as possible. 1:1 mapping on registers to certain variables and the few temp regs required for evaluating expressions.

Seconded. When I'm testing out algorithms the compiler hasn't seen before, it takes too much work that amounts to telling the compiler "STFU and do as I say."

stephencanon · on Oct 26, 2014

That mode is called "assembly". I use it all the time.

nkurz · on Oct 26, 2014

It's a very useful mode, but assuming you don't write the entire program in assembly, what's the best way you've found to integrate the assembly with the non-assembly code? Is there any Windows/Mac/Linux portable and self-contained way of doing it short of writing assembly functions in separate files and distributing NASM along with the source? Does even that work?

J_Darnley · on Oct 26, 2014

Why would you distribute NASM, or any other assembler, with your own code? You list it as a requirement to build your code. Would you also distribute GCC?

nkurz · on Oct 27, 2014

Logically this makes fine sense, but for whatever reason, I don't think I've ever encountered a popular open source program that required a specific assembler other than 'gas' unless the entire codebase was written for that assembler. I don't think I've encountered it in academic code either.

But for whatever reason I've encountered a few that distribute NASM or another assembler. I don't know exactly why this would be. I'm trying to find out. Do you have examples where requiring NASM (or another assembler) as an external dependency has worked for projects in the past? I'm sure they must exist --- I just don't know of any.

J_Darnley · on Oct 27, 2014

FFmpeg. It requires NASM or YASM if you don't want to cripple it by disabling the external assembly.

sharpneli · on Oct 26, 2014

I especially mentioned in the the case where assembly is not available.

Examples: Shaders for graphics. OpenCL kernel language. Compute shaders.

It's the difference between getting 5TFlops or 2TFlops out of your GPU.

nkurz · on Oct 25, 2014

I feel like the problem isn't really with intrinsics. Rather, the problem is that compilers just aren't very good at optimizing for the last 30-50% of performance. One notices this when using intrinsics because if you are using them you probably are concentrating on performance. But if you were to look closely at the non-intrinsic code, I think you'd see a lot of the same problems: bad register allocation leading to unnecessary moves and spills, failure to take advantage of flags already set by arithmetic and logic operations, and awkward code layout leading to unnecessary jumps and branches.

Obviously, this isn't because compiler writers are stupid or ignorant. Instead, it's just a really difficult problem to optimize for one target in a way that doesn't hurt performance on another. Still, I think there's a lot of room for improvement, if only because comparing the output from GCC, CLang, and ICC often shows that one compiler does a much better job than the others[1]. But even once you know (definitively) that the code could be faster, it's often really tough to convince a given compiler to generate the code that you want. And coming up with a single construct that works well for multiple compilers and remains readable is often impossible.

I'd love it if there was a good way to "lock down" the output for a given function, without needing to depend on the whims of individual compilers. Other than distributing only pre-compiled binaries, there doesn't seem to be a great way to do this. Looking at Dan's example code, he has a solution that's twice as fast as the non-inline-ASM alternatives. But it's still depending on the compiler to do register allocation, and adding a global variable somewhere else in the program might well cause the whole thing to blow up, spilling to stack and triggering the popcnt false dependency that he's working hard to avoid.

What's the best way to do this --- to keep optimized code optimized across the mix of compilers in the current ecosystem?

[1] I'm Linux based, but I probably should start adding MSVC to the mix. I'm sure there are cases where it does something better than the others. Is there a good way to do this? Perhaps an equivalent to http://gcc.godbolt.org/ for MSVC?

WalterBright · on Oct 25, 2014

Generating optimal code is much like the travelling salesman problem. Optimization algorithms will lead the result to a local minima, while missing a global minima.

userbinator · on Oct 26, 2014

Optimization algorithms will lead the result to a local minima, while missing a global minima.

I think this is the one of the huge differences between humans and compilers that leads to suboptimal code - the compiler is completely unaware of what the programmer intended in the HLL source, so the best it can do is to generate code which it "knows" is the fastest way to implement exactly those statements; on the other hand, if you are a human writing an algorithm in Asm, you probably have things like instruction selection and register allocation already in mind and are willing to slightly change the "semantics" of your algorithm if it results in e.g. a better use of registers or the possibility to use a faster/smaller instruction.

In other words, the compiler starts with something horribly unoptimised but following the semantics of the source language, and tries to make that as fast/small as it can while still maintaining the same semantics - and so this is the "local minima" it can reach. The human starts with a concept of an algorithm and its intended result, and everything in-between is flexible, so this is a more "global" view of the problem of code generation. The human isn't restricted by the semantics of the HLL, but can fully express the intended algorithm by exploiting the full capabilities of the processor.

WalterBright · on Oct 26, 2014

The human also knows things that the compiler doesn't - such as which branches of the code are speed critical and which aren't. This matters for things like where precious registers get allocated.

Ono-Sendai · on Oct 26, 2014

If the function is defined on a finite domain, you can (in theory) find the fastest algorithm for it: http://www.forwardscattering.org/post/13 In practice, something like hill climbing will be done, as you said. Actually it's not even hill climbing, since some of the changes will be pessimisations. (as per the original post)

WalterBright · on Oct 26, 2014

Many optimizations in isolation are pessimizations, they only produce net improvement when used in combination with other optimizations.

daniel-cussen · on Oct 26, 2014

> But even once you know (definitively) that the code could be faster, it's often really tough to convince a given compiler to generate the code that you want. And coming up with a single construct that works well for multiple compilers and remains readable is often impossible.

I once struggled with GCC to output the code I wanted (I wanted indirect unconditional jumps) and I finally found that the way to tell GCC (and perhaps other compilers) to do this is to use goto's (of the "considered harmful variety") with function pointers. And it was, IIRC, twice as fast, and didn't involve any assembler.

I think the deeper problem with compilers is they target code and algorithms that are already known. If you compile anything it hasn't seen before, you're really throwing it a curveball.

dalke · on Oct 26, 2014

> "For example, as of this writing, the first two google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel’s hardware popcnt instruction is slower than a sofware implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction."

I believe that my benchmark is one of those top hits. My description is at http://www.dalkescientific.com/writings/diary/archive/2011/1... . I wrote "My answer is that if you have a chip with the POPCNT instruction built-in then use it. I still don't have one of those chips, but I know someone who does, who has given me some numbers. "

My own code's logic is "if POPCNT exists then use it, otherwise test one of a few possibilities to find the fastest, since the best choice depends on the hardware."

I now have a machine with a hardware POPCNT, and a version with inline assembly. I should rerun the numbers...

Alupis · on Oct 25, 2014

I seem to recall that Clang consistently produces slower executable binaries than gcc does, although Clang compiles quicker. (The reason a lot of people use Clang for development but use gcc for the final production build)

Perhaps the assembly it's producing isn't quite as optimized as it could be if it were carefully hand crafted.

stephencanon · on Oct 26, 2014

That may have been true in the past. All the data I've seen in the past few years has GCC generating faster code for some benchmarks, and clang generating faster code for others, with no really consistent pattern (this is by no means an exhaustive sample, but it is certainly large enough to be vaguely representative).

Interestingly (to me), I am seeing more and more examples where clang and gcc are beating icc in performance.

Alupis · on Oct 26, 2014

> Interestingly (to me), I am seeing more and more examples where clang and gcc are beating icc in performance.

Intel has finite resources (as much as Intel can realistically dedicate to the project), where-as both Clang and GCC projects have theoretically unlimited resources (as much as companies the world-over can dedicate to the project). Doesn't surprise me actually. Most say gcc version 4.7.1 overtook icc in performance by about 15% (and 4.7.1 is ancient now).

alexdowad · on Oct 25, 2014

Interesting to see this on HN, just a couple months after I ran into the same problem with GCC's intrinsic popcount myself on a project! I found that I needed to use an "-march" argument for GCC to actually emit POPCNT instructions.

pja · on Oct 25, 2014

That's unsurprising, since without -march, gcc will emit code for a generic cpu that doesn't have a popcnt instruction. Early x86_64 cpus didn't have popcnt, so neither will the code that gcc emits & it's the same for any other architecture.

If you want to use instructions that are only available on or after a specific cpu generation, then you need to tell your compiler to target that cpu's instruction set.

wolf550e · on Oct 26, 2014

Isn't gcc's default march=native so it would emit popcount if the host supports it?

gsg · on Oct 26, 2014

No, the default produces binaries that are compatible with as many machines as possible.

wolf550e · on Oct 26, 2014

You're right [0], but I couldn't find which march value [1] corresponds to the settings used for "x86_64-linux-gnu".

EDIT: running "echo | gcc -v -E - 2>&1 | grep cc1" shows the command line, which says "-mtune=generic -march=x86-64". I do not know what "-march=x86-64" implies, except SSE2.

0 - https://gcc.gnu.org/ml/gcc-help/2012-12/msg00074.html

1 - https://gcc.gnu.org/onlinedocs/gcc/i386-and-x86-64-Options.h...

throwawayaway · on Oct 25, 2014

finally i think this article and comment thread will settle this contentious debate - definitively, and we can all move on - with the decision made once and for all future possible cases and desired outcomes.