Some comments about Duff's Device from a compiler writer:
The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph. In simple terms, this is a cycle in the graph that has two entry points from outside the cycle, and as a result, it is no longer considered a loop. This means you're going to disable any loop optimization--so Duff's Device is only worth it if you're going to write the optimal loop form already, and the compiler isn't going to arrive at that form.
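For reference, here's a modernized sketch of the device itself (Duff's original was pre-ANSI C and predates volatile; count is assumed positive). The switch jumps straight into the middle of the do/while body, and that jump is the second entry point into the cycle:

    void send(volatile short *to, short *from, int count) {
        int n = (count + 7) / 8;   /* number of trips through the loop */
        switch (count % 8) {       /* jump into the unrolled body */
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }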
As for whether the compiler will arrive at that form on its own, optimization level matters: -Og and -O1 won't run loop optimizations, and -Os and -Oz might not. But if you're compiling with -Og or -O1, you don't actually care about performance in the first place. And with -Os or -Oz, the loop unrolling implied is of questionable utility (since it increases code size). If you really want it anyway... just add a pragma. All the major compilers support a loop-unrolling pragma.
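For instance, with GCC 8+ (#pragma GCC unroll) or Clang (#pragma clang loop unroll_count), something like this sketch asks for the unrolling without contorting the control flow; the function and its body here are just illustrative:

    void copy_out(volatile int *to, const int *from, int n) {
        /* Clang spelling: #pragma clang loop unroll_count(8) */
        #pragma GCC unroll 8
        for (int i = 0; i < n; i++)
            *to = from[i];
    }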
The benchmarks introduced to justify using Duff's device are... bad. In the original use, the loop was basically doing this:
    void duffs(volatile int *x, int *y, int N) {
        for (int i = 0; i < N; i++)
            *x = *y++;
    }
Here, what's being done instead is:
    void bad_duffs(int *x, int y, int N) {
        for (int i = 0; i < N; i++)
            *x += y++;
    }
That doesn't seem like much, but it makes the whole loop optimizable to the closed form *x += N * (N - 1) / 2; if the compiler can work out that x cannot point to y or N, and that y starts out at 0.
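Concretely, the loop can fold to a single closed-form store, something like this sketch (names mine; ignoring overflow):

    void bad_duffs_folded(int *x, int y, int N) {
        /* *x += y + (y+1) + ... + (y+N-1), computed in closed form */
        if (N > 0)
            *x += N * y + N * (N - 1) / 2;
    }

Once that happens, the benchmark isn't measuring the loop at all.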
The last example makes it even worse by doing ++*x; instead, which means it comes down entirely to whether the compiler can compute N in the first place. But to justify the technique, N is made obtusely hard to compute, and the "Duff's Device" version doesn't even increment it as many times as it's supposed to.
Correct. Duff's device is for MMIO which means it needs to do explicit memory writes. We are not doing arithmetic here, we're doing I/O. In C (and C++) you express that by labelling the variable volatile.
Today MMIO is very slow, so you just don't need unrolling. Or, from another perspective, everything else has gotten so much faster relative to MMIO that, again, you don't need unrolling. Your CPU can do some arithmetic, make a conditional jump, and still schedule the next MMIO write in plenty of time. Therefore Duff's device is now irrelevant.
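For contrast, the pattern the device was built for looks like this sketch (the register name and address are made up): each assignment through the volatile pointer is a store the compiler must actually emit, in order, once per byte.

    #define UART_TX ((volatile unsigned char *)0x10000000u)  /* hypothetical MMIO register */

    void uart_send(const unsigned char *buf, int n) {
        for (int i = 0; i < n; i++)
            *UART_TX = buf[i];  /* one real store per byte; not elided or merged */
    }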
The ++*x is completely toxic in the context Tom Duff wrote this for, where we want actual writes, because then it actually amounts to this:
    int tmp = read_mmio();
    tmp++;
    write_mmio(tmp);
And immediately you should be filled with dread: who said we could read this MMIO register? What happens if the contents of the MMIO register change while we're twiddling tmp? This all seems like a very bad idea.
What's more curious than the irreducibility issue is: how on earth does the syntax ever get accepted by the parser?
I mean, the lex/yacc grammar has to reflect the nestability of the while/switch [1], and ditto if you do recursive descent, so how is it possible this even parses?
Because the switch statement is not multiple conditions with a funny syntax, it is a computed goto with funny labels. Those labels are permitted to be anywhere at all inside its associated statement, including perverse things like
    switch (x) default: {
        if (y) case 1: case 2: return;
        /* more stuff with more labels */
    }
Ugh, very simply? Any STATEMENT is allowed to be, among other things, a LABELED_STATEMENT which is (<ident> ":" STATEMENT | "case" <constexpr> ":" STATEMENT | "default" ":" STATEMENT). Then in the semantic pass it's checked that labeled statements starting with case/default must be inside some switch. That's it.
This is a good question. The grammar of the C switch statement is not context-free. In practice this is easiest to resolve by just allowing case labels (which in the end are just decorated goto labels!) to appear anywhere, when it comes to parsing, and only check the existence of an enclosing switch during semantic analysis, which is necessarily context-sensitive anyway.
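A tiny illustration of that split (GCC's wording, roughly, is "case label not within a switch statement"):

    void f(int x) {
        case 1: ;   /* parses fine as a labeled statement; it's the semantic
                       pass that rejects it for lacking an enclosing switch */
    }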
Where’s the problem for context-freeness here? Just have two copies of every statement production, one that doesn’t allow for case labels and one that does, have function bodies start as the former and switch bodies unconditionally switch to the latter. It’d be thoroughly unpleasant to deal with in an implementation and not very helpful for the eventual user, but as a pure existence proof I don’t see any problems.
Yes, that's true. Maybe I should say that it's effectively context-sensitive. In any case, the grammar as specified by the standard is context-free and the restriction of where case labels can occur is a separate semantic requirement.
>> The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph.
The technique is considered obsolete with today's compilers specifically because they can do loop optimizations like this without you having to write strange C code.
OTOH you never know what those wacky compiler guys are going to do. For example, it seems GCC hasn't been doing any vectorization on x86-64 at plain -O2, even though the ISA has required SSE2 as a baseline from day one. This is being fixed (to some extent) in the next release.
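If you want to check what your GCC does, the flags below are real (the -O2 default changed in GCC 12); the kernel is just a stand-in:

    /* gcc -O2 -fopt-info-vec -c saxpy.c
     *   older GCC: silence, i.e. no vectorization at -O2
     * gcc -O2 -ftree-vectorize -fopt-info-vec -c saxpy.c
     *   should report the loop below as vectorized */
    void saxpy(float * restrict y, const float * restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }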