For the trivial loop, both GCC and Clang optimize away the BasicLoop call entirely. GCC also eliminates the DuffDevice call, but Clang fails to. Thus, the only code doing any real work in the benchmark is Clang's Duff's device.
But even then, looking at the functions themselves, both GCC and Clang reduce the body of BasicLoop to a single add operation (data += loop_size), while the DuffDevice function is still compiled as an actual loop (though, as mentioned, these functions weren't called from the benchmark).
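For reference, here is a minimal sketch of what the two functions might look like. The benchmark source isn't reproduced here, so the signatures, the `std::uint64_t` counter, and the unroll factor of 4 are all assumptions made for illustration; only the names BasicLoop and DuffDevice and the `data += loop_size` behavior come from the discussion above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction of the trivial loop: one increment per iteration.
// An optimizer can typically recognize this as data += loop_size.
void BasicLoop(std::uint64_t& data, std::size_t loop_size) {
    for (std::size_t i = 0; i < loop_size; ++i) {
        ++data;
    }
}

// Hypothetical Duff's device with an assumed unroll factor of 4.
// The switch jumps into the middle of the do-while body to handle the
// remainder, then the loop runs the remaining full passes.
void DuffDevice(std::uint64_t& data, std::size_t loop_size) {
    if (loop_size == 0) {
        return;  // the device below assumes at least one iteration
    }
    std::size_t passes = (loop_size + 3) / 4;  // number of trips through the do-while
    switch (loop_size % 4) {
    case 0: do { ++data;  // falls through
    case 3:      ++data;  // falls through
    case 2:      ++data;  // falls through
    case 1:      ++data;
            } while (--passes > 0);
    }
}
```

Both functions compute the same result, so any difference in the generated code comes down to how well the compiler sees through the control flow. Compiling something like this with optimizations enabled (for example `-O2`) and inspecting the assembly is one way to check the behavior described above, though the exact output will depend on the compiler version and flags.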