The idea is to minimize the number of “am I done?” checks. If you have to copy 100 bytes, a naive approach would be to just do `while (cur < last) { dst[cur] = src[cur++]; }`, but that checks cur < last every single byte.
To make it so you copy multiple bytes per check. But that only works if you have an exact multiple of 8 (in this case) bytes to copy.
The cleverness of duff’s device is in being able to write essentially the above “unrolled” loop, but with an embedded switch statement that skips the correct number of byte copies when the remaining bytes are not a multiple of 8. But you only need to do that once… after you skip the right number of bytes (via the embedded switch statement) the number of remaining bytes is a multiple of 8 and now you’re just doing a normal loop.
C and C++ have implicit fall through for switch statements. So without a “break”, each case will execute after the one chosen.
Most control structures compile down to labels and jumps under the hood. That “while” is jumping to that “do” regardless of whether or not we’ve “executed” that line.
Combine those two facts and you have the what. If you go to “case 5”, you’ll execute cases 5, 4, 3, 2, and 1; then jump up to the “do” to continue looping as normal.
As to the why: “loop unrolling”. A technique to eke out performance when things are truly tight.
Executing a loop is work. There are things to add or subtract, comparisons, etc. If you know a loop will execute 4 times, it can be faster to just repeat the statement 4 times.
If your statement is repeated more than that, it can get unwieldy. But you can reduce the number of loops if you know there’s a number you can divide by. If you have to perform 15 loops, you can loop 5 times, repeating your statement 3 times in the loop.
Duff’s device is a way to do this for loops of unknown sizes. The first iteration of the loop essentially begins in the middle of the loop, then each subsequent iteration is done in groups of 8.
It’s old-school “clever” code. And back in the day, it was a way to save clock cycles when that mattered more. I wouldn’t do this today unless everything else has been tried
There are some other good explanations in the other comments, but I suspect you might want a balance between more detail and the deluge of information in the wikipedia page for loop unrolling.
You might first write something like memcpy as:
void memcpy(char* from, char* to, int n) {
if (n == 0) return;
do {
*from++ = *to++;
} while (n-- > 0);
}
(Yes, this is oversimplified, and also you'd probably have size_t instead of int as the type for n, but it's simplified for explanatory purposes.)
Consider that the loop boils down to something like:
loop:
*from++ = *to++;
if (n-- > 0) goto loop;
Each iteration of the loop corresponds to a single memory write and a single check of the loop condition (branches are generally expensive). If you unuroll your loop once, you might write something:
int iterations = n / 2;
if (n % 2 == 1) {
*from++ = *to++;
n--;
}
do {
*from++ = *to++;
*from++ = *to++;
} while (iterations-- > 0);
This gets you two memory writes for a single check of the loop condition! That said, since you're advancing by two bytes each time, you need to handle the case where you have an odd number of overall bytes to copy.
You can pretty easily show that this code is equivalent to:
int iterations = n / 2;
if (n % 2 == 1) {
goto one;
}
do {
*from++ = *to++;
one:
*from++ = *to++;
} while (iterations-- > 0);
which in turn is not that far from Duff's device with 2x-loop unrolling:
int iterations = n / 2;
switch (n % 2) {
case 0: do { *from++ = *to++;
case 1: *from++ = *to++;
} while (iterations-- > 0);
}
(The core observation here is that ifs, switches, whiles, and other conditional constructs are all conditional jumps under the hood.)
And actually before that, are you familiar with switch statements and the difference between having breaks in the switch cases and falling through from one switch case to the next?