Some comments about Duff's Device from a compiler writer:
The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph. In simple terms, this is a cycle in the graph that has two entry points from outside the cycle, and as a result, it is no longer considered a loop. This means you're going to disable any loop optimization--so Duff's Device is only worth it if you're going to write the optimal loop form already, and the compiler isn't going to arrive at that form.
For the latter part, optimization level matters: -Og and -O1 won't run loop optimizations, and -Os and -Oz might not. But if you're compiling with -Og or -O1, you don't actually care about performance in the first place. And with -Os or -Oz, the loop unrolling implied is of questionable utility (since you're increasing code size). If you really want it anyways... just add a pragma. All the major compilers support a loop unrolling pragma.
The benchmarks introduced to justify using duff's device are... bad. In the original use, the loop was basically doing this:
void duffs(volatile int *x, int *y, int N) {
for (int i = 0; i < N; i++)
*x = *y++;
}
Here, what's being done instead is:
void bad_duffs(int *x, int y, int N) {
for (int i = 0; i < N; i++)
*x = y++;
}
That doesn't seem like much, but it makes the code optimizable to x = N (N + 1) / 2; if the compiler can work out that x cannot point to y or N, and that y starts out at 0.
The last example makes it even worse by doing ++*x; instead, which means it's entirely down to if the compiler can compute N in the first place. But to justify the use, the N is made obtusely hard to compute, and the "Duff's Device" version doesn't even increment it all the times it's supposed to.
Correct. Duff's device is for MMIO which means it needs to do explicit memory writes. We are not doing arithmetic here, we're doing I/O. In C (and C++) you express that by labelling the variable volatile.
Today MMIO is very slow and so you just don't need unrolling. Or I guess from another perspective, MMIO isn't as much faster as absolutely everything else is now, and so again you don't need unrolling. Your CPU can do some arithmetic, make a conditional jump, and still schedule the next MMIO write in plenty of time. Therefore Duff's device is now irrelevant.
The ++*x is completely toxic in the context Tom Duff wrote this - where we want actual writes, because then it actually amounts to this:
int tmp = read_mmio();
tmp++;
write_mmio(tmp);
And immediately you should be filled with dread, who said we could read this MMIO register? What happen if the contents of the MMIO register change while we're twiddling tmp? This all seems like a very bad idea.
What's more curious than the irreducability issue is, how on earth does the syntax ever get accepted by the parser?
I mean the lex/yacc syntax has to reflect the nestability of the while/switch [1], ditto if you do recursive descent, so how is it possible this even parses?
Because the switch statement is not multiple conditions with a funny syntax, it is a computed goto with funny labels. Those labels are permitted to be anywhere at all inside its associated statement, including perverse things like
switch (x) default: {
if (y) case 1: case 2: return;
/* more stuff with more labels */
}
Ugh, very simply? Any STATEMENT is allowed to be, among other things, a LABELED_STATEMENT which is (<ident> ":" STATEMENT | "case" <constexpr> ":" STATEMENT | "default" ":" STATEMENT). Then in the semantic pass it's checked that labeled statements starting with case/default must be inside some switch. That's it.
This is a good question. The grammar of the C switch statement is not context-free. In practice this is easiest to resolve by just allowing case labels (which in the end are just decorated goto labels!) to appear anywhere, when it comes to parsing, and only check the existence of an enclosing switch during semantic analysis, which is necessarily context-sensitive anyway.
Where’s the problem for context-freeness here? Just have two copies of every statement production, one that doesn’t allow for case labels and one that does, have function bodies start as the former and switch bodies unconditionally switch to the latter. It’d be thoroughly unpleasant to deal with in an implementation and not very helpful for the eventual user, but as a pure existence proof I don’t see any problems.
Yes, that's true. Maybe I should say that it's effectively context-sensitive. In any case, the grammar as specified by the standard is context-free and the restriction of where case labels can occur is a separate semantic requirement.
>> The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph.
The technique is considered obsolete with todays compilers specifically because they can do loop optimizations like this without writing strange C code.
OTOH you never know what those wacky compiler guys are going to do. For example, it seems GCC hasn't been doing any vectorization on x86-64 with standard -O2 even though the ISA has supported SSE2 as a minimum from day one. This is being fixed (to some extent) in the next release.
I think the author is making a fairly critical mistake here. If I understand correctly, they missed the fact that the -O2 and -O3 results are several orders of magnitude faster... Likely meaning the compiler optimized away the whole loop in those cases, rendering their results useless.
Edit: Yeah, looking at the godbolt link they provided, at least the execute_loop_basic function does essentially no work.
The question posed early in the blog was whether Duff's Device is still relevant in 2021. The fact that the compiler can completely elide the loop without it but cannot with it is a pretty good indicator that it's likely to cause more harm than good.
It might prevent inlining of a function call in the loop body, which would make a loop much, much slower to execute. It will make the loop body larger (even without inlining) and likely cause an increase in cache misses. It will be unpredictable based on build: an unrelated change somewhere else might misalign it with a cache line. Different CPUs might have different cache behaviours and perform differently.
The right view on this sort of thing hasn't changed for decades: don't optimize prematurely. Trust your compiler, check it with profiling, look first for algorithmic improvements because you can gain FAR more converting an O(n^2) algorithm to an O(nlogn) than you can by unrolling a loop. Duff's Device is obsolete in all but a vanishingly tiny number of edge cases.
Yes, they still are useless, at least for the -O3 and -O4 sections.
The compiler removing the loop is due to a bad test setup in which the function in the loop does nothing and compiler can figure that out and skip the whole calculation. In a real scenario the function in the loop would (presumably) actually do some work so it could not be elided. Also, note that the compiler does this for BOTH the Duff's device version and the non Duff's device version.
Also, I'm neither attacking nor defending Duff's device. I'm just saying that we can't draw any useful conclusions from this article because of poor methodology.
For trivial loop, Both GCC and clang optimize away the entire BasicLoop, but clang failed to optimize away the entire DuffDevice, while GCC did. Thus, the only thing doing the work is clang's Duff Device.
But even then, both GCC and clang optimize the BasicLoop function itself into just single add operation (data += loop_size) while it actually loops on the duff device function (though as mentioned, these functions weren't called from the benchmark)
The thing is duffs device is about IO, so the loop cannot be optimized away as it isn't a pure function. The example code though is a pure function and so the compile is justified in optimizing it away.
We have long known that in general compilers are better at optimization than humans so you should write readable code until a profiler shows otherwise. However there are cases where the compiler doesn't make an optimization. There are sometimes corner cases that the compiler can't figure out doesn't apply to you, so manually doing the optimization might work. These cases need to be re-evaluated with every CPU change, and every compiler upgrade though.
Nothing above says if duff's device applies to not though. Duff's device is very likely to hit a corner cases where the compiler isn't sure if it can safely apply optimizations so it won't. As such I wouldn't be surprised if it is still relevant. (though mostly on embedded systems where we don't have nearly as fast of CPUs as more common computers)
I'm a prolific user of Duff's device for poor man's coroutines in C, but in non-toy applications I do try to isolate the behavior to discrete functional tasks and entirely behind a simple, conventional API. For example, implementing a single-function iterator using a generator pattern: https://github.com/wahern/lunix/blob/36c2107/src/unix-getopt... IOW, I avoid the urge to encapsulate the magic in a library abstraction so it can be reused (and abused) everywhere. There's no spooky action at a distance--you can figure out the behavior from the surrounding code, without having to pour through internal library code (even just hiding stuff in a header can be confusing). The details remain entirely irrelevant to callers. And I minimize opportunities for doing things I might regret (and have regretted), like implementing and relying upon a bespoke threading framework =)
Context here is embedded: I wrote an entire I2C driver on AVR using protothreads. This was so I could just call a single update function in my main loop and it would handle all the various delays.
Do not recommend.
If I had my time again, I would do it with a explicit state machine that runs in the I2C interrupt handler, and a submission and response queue accessed by the main app loop (the equivalent of "userland" for non-embedded folks).
One cool use I saw of it was in the instruction decoder for bochs. Duffs device is really nice for decoding instruction sets that are not necessarily regular in size (such as x86).
My fondest memory of Duff's device is implementing it for performance purposes in Java and watching the unexpected Havoc it wrought on Java decompilers.
A few lines of otherwise easily decompile-able Java turned into hundreds of lines of dealing with "broken" control flow (if they didn't just crash outright).
The loop un-rollers in the old Intel high performance C compiler used to emit code like this.
It was sometimes pure joy telling it to emit the assembly language and looking at the code to work out what it was doing. Intel must have had, and still probably has, some awesomely clever people in the basement writing that stuff.
I still hope they'll open source it one day. I never really got to use it in anger, and I bet that when you use PGO all the compilers end up pretty similar in the "real world" but it would be a shame for all the work to be lost
Vectorizing by hand is not very difficult, but even for moderately complex loops it gets tiresome very quickly. And that is before considering loop peeling for vector alignment, adding both vector and scalar remainder loops and loop multiversioning.
This is really something you want your compiler to take care of. And that indeed is something icc excels at. You just don't have to care about code size, because that _will_ grow, big time.
And no, Intel's new LLVM-based icx compiler is not at the same level yet as the (now 'classic') icc compiler.
It's still better, but the gap's closed quite a bit in my experience.
It also still seems better than clang/gcc at working out whether inlining is worth it or not.
But it's also still quite buggy (it used to be as well, the number of work-arounds/#ifdefs needed to get some cross-platform applications built used to be quite annoying. In some cases, we couldn't even use exactly the same version of the compiler per platform, we had to split versions per platform!).
You'd think that if anyone had motivation to open source a high performance compiler it would be the company making the chips. Anyone know why they don't?
I think, also based on the previous comment, it is still better at some things than the open source alternatives. If you really need it for the specific situations it excels in, I guess it makes sense to pay for it.
There's nothing preventing you from using their compiler with other people's chips right now.
Presumably intel only optimized for intel. Wouldn't the best case scenario be if everyone copied the intel optimizations leaving other chips in the dust?
> There's nothing preventing you from using their compiler with other people's chips right now.
Using an other ISA than x86 is something that prevents this compiler from being used for other chips.
The intermediate representation on which these optimizations are made is probably not related to x86.
Also, such compilers usually use models of machine resources and latencies to perform better scheduling.
But latency and resource models are micro-architecture dependent, so using it on a chip with the same architecture but different micro-architecture would produce non-optimal scheduling
If it was open source then the optimisations would be swiftly taken and adapted to other people's chips and Intel chips would no longer have an advantage.
As mentioned by at least one other comment (buried in a thread), in reality, don't do this. If you care about the performance of the loop /that/ much, just hand code the vectorization/assembly yourself. Use some sort of profiler to make sure it hits peak throughput and declare victory.
> As opposed to writing worse performing and less readable code?
I can count the number of coworkers I have that have experience with inline assembly on one hand (it is less than 1). Also the first reaction I usually get to vector intrinsics are questions about the wtfness of shuffle instructions, no you can't make that readable without sacrificing performance.
> Then just write it in the simplest way and hope autovectorization will help you.
Spoiler: It wont't. Compilers often don't have the context and some of the biggest hot spots I had to deal with simply used the single value versions of vector instructions.
Or rather: If that had worked I wouldn't be there hand optimizing the code.
Duffs device wasn't about computation, it was about writing bytes to a serial port. That is constantly writing to one memory location. Just a bit of being a pendant that you.should expect.
> How come that for the same code, Clang shows totally different results than GCC and shows that the basic loop is thirty thousand times faster with -O2 and -O3?
Because clang turns it into `return data+loop_size;`.
In reality of course if you're considering Duff's device, your operations should be much heavier than any of the operations the author tested. And yes, those cases still exist, but people usually consider it cheaper to perpetually throw more compute resources at the problem.
If your operations are much heavier, then the cost of the loop condition becomes negligible. And if they're not as heavy as you thought (because the compiler's better at logic puzzles than you!), then you've wasted a whole lot more energy than those loop conditions would have.
It depends what operation is much heavier. Remember that the loop condition itself can be heavy. And that compilers can't optimize away things like disk or network access time.
The best approach is to generate custom code, with compile-time constant loop roll count and compile-time constant tail iteration count, like NN-512 (https://NN-512.com) does for almost every loop.
Write your code generator in a developer-efficient, compiler-oriented language like Go. Use it to generate simple C99 that GCC can optimize to produce machine-efficient object code.
This is how you get truly great object code from GCC.
The loop size is all that matters. The thing that's being optimized is just:
for (int i = 0; i < loop_size; i++) {
some_work();
}
Whatever some_work may be you want to unroll the loop to avoid the jumps. It could be an operation like:
c[i] = a[i] + b[i]
In which case you'd do like you suggest. But it could also just be some repeated operation on an entire data set, like in a simulation where you want to run loop_size iterations.
>There needs to be a dependency on the index for duff's device to work right?
For it to work "right" the work inside the loop has to be quite minor. The point being that it saves on cpu cycles by reducing the number of times it does i < loop_size, i++, and the number of times it has to branch back to the beginning of the loop. It's almost always some kind of memory copy, or a simple read, lookup, write in which the inside of the loop will complete in a few cpu cycles.
If the work inside the loop is 100x the cost of implementing the loop there is no point.
I used something similar to Duff’s device to implement coroutines in C++ (pre-C++20). With some macros and using lambda function the result almost feels like part of the language.
No UB. Static analysis will keep you from doing things like using values in local variables across yields (use lambda captures instead).
Just avoid doing two CO_YIELD() in the same line because this macro uses __LINE__ for the switch case.
Many years ago when I was in high school, a friend of mine told me about Duff’s device. I filed it away in the back of my mind, excited that one day when I was a professional programmer I could pull this trick out of my hat. Looks like that day may never come.
The code, as written, will execute the computation 4 times when passed a loop_size of 0, making it less efficient than the naive code, but also wrong for that case.
The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph. In simple terms, this is a cycle in the graph that has two entry points from outside the cycle, and as a result, it is no longer considered a loop. This means you're going to disable any loop optimization--so Duff's Device is only worth it if you're going to write the optimal loop form already, and the compiler isn't going to arrive at that form.
For the latter part, optimization level matters: -Og and -O1 won't run loop optimizations, and -Os and -Oz might not. But if you're compiling with -Og or -O1, you don't actually care about performance in the first place. And with -Os or -Oz, the loop unrolling implied is of questionable utility (since you're increasing code size). If you really want it anyways... just add a pragma. All the major compilers support a loop unrolling pragma.
The benchmarks introduced to justify using duff's device are... bad. In the original use, the loop was basically doing this:
Here, what's being done instead is: That doesn't seem like much, but it makes the code optimizable to x = N (N + 1) / 2; if the compiler can work out that x cannot point to y or N, and that y starts out at 0.The last example makes it even worse by doing ++*x; instead, which means it's entirely down to if the compiler can compute N in the first place. But to justify the use, the N is made obtusely hard to compute, and the "Duff's Device" version doesn't even increment it all the times it's supposed to.