And I benchmarked my safeclib implementation of memcpy_s with a good compiler, clang >4, beating the hand-optimized and unsafe glibc assembler version by ~5%. gcc stinks there. https://github.com/rurban/safeclib/blob/master/tests/perf_me...
Let the compiler optimize it.
You don't see the point. The point is to write proper loops in C which the compiler can optimize. This is the fastest way and benefits all platforms. Special cases which are actually slower should be removed.
Compilers which are not able to optimize proper loop patterns should be fixed and documented as such. Haven't checked gcc-8 yet, but older versions should not be used for such core functions. libc should be compiled with clang.
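To make "proper loop" concrete, here is a minimal sketch of the kind of plain byte-copy loop an optimizer can work with. It is not the safeclib code; the name copy_bytes is made up for illustration, and the restrict qualifiers assume the buffers don't overlap.

```c
#include <stddef.h>

/* Hypothetical helper, not part of safeclib or glibc.
 * A plain counted loop with no special cases: the trip count is known
 * on entry and restrict promises no overlap, which is what lets a
 * compiler vectorize it or recognize it as a copy idiom. */
void *copy_bytes(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];

    return dst;
}
```

Depending on compiler and flags, a loop like this may be vectorized or even turned into a call to memcpy, so look at the generated assembly before drawing conclusions from a benchmark.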
On my machine, the following takes about 161 ms (that's the best of 5 timings on a non-quiet machine):
The author's for-loop is optimistic, and doesn't short-circuit exit on mismatch, so I'll replace it with a non-branching equivalent:

That takes 310 ms (compiled with Apple LLVM version 8.0.0, using -O3). About 1/2 the performance of built-in.

Replace the int with a long and it's 200 ms. Manually unroll by 4 and it's 175 ms. My final code is:
That's only somewhat slower than the memcmp, not the 10x that the author reports. Of course, memcmp does short-circuit evaluation, and I know the size is a multiple of sizeof(long), so this test case is biased in favor of my code.
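For readers who want to try this themselves, here is a rough sketch of the two ideas described above: a non-branching equality check, and a long-at-a-time version unrolled by 4. This is not the code that was timed. The names equal_bytes, equal_words_unrolled and load_word are hypothetical, and unlike the test case above the second function does not assume the size is a multiple of sizeof(long); it handles the leftover bytes in a tail loop.

```c
#include <stddef.h>
#include <string.h>

/* Non-branching equality check: OR together the XOR of every byte pair
 * instead of exiting at the first mismatch. Returns 1 if equal, else 0. */
int equal_bytes(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    unsigned char diff = 0;

    for (size_t i = 0; i < n; i++)
        diff |= (unsigned char)(p[i] ^ q[i]);

    return diff == 0;
}

/* Load one unsigned long through memcpy, which avoids alignment and
 * aliasing problems; compilers typically reduce this to a single load. */
static unsigned long load_word(const unsigned char *p)
{
    unsigned long w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Same idea, one long at a time and manually unrolled by 4, with four
 * independent accumulators. Leftover bytes go through a byte loop. */
int equal_words_unrolled(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    const size_t w = sizeof(unsigned long);
    unsigned long d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    unsigned char tail = 0;
    size_t i = 0;

    for (; i + 4 * w <= n; i += 4 * w) {
        d0 |= load_word(p + i)         ^ load_word(q + i);
        d1 |= load_word(p + i + w)     ^ load_word(q + i + w);
        d2 |= load_word(p + i + 2 * w) ^ load_word(q + i + 2 * w);
        d3 |= load_word(p + i + 3 * w) ^ load_word(q + i + 3 * w);
    }
    for (; i < n; i++)
        tail |= (unsigned char)(p[i] ^ q[i]);

    return ((d0 | d1 | d2 | d3) == 0) & (tail == 0);
}
```

Both functions only answer "equal or not"; they don't give the ordering that memcmp returns, and they always read the whole buffer, so they only compete with memcmp when the inputs are mostly equal.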
I would, of course, always favor the tested, often optimized library code over rolling my own unless there are good reasons otherwise.