EDIT: after I wrote the below I realized I could use automatic multi-versioning solely to configure the individual functions, along with a stub function indicating "was compiled for this arch?" I think that might be more effective should I need to revisit how I support multiple processor architecture dispatch. I will still need the code generation step.

Automatic multi-versioning doesn't handle what I needed, at least not when I started.

I needed a fast way to compute the popcount.

10 years ago, before most machines supported POPCNT, I implemented a variety of popcount algorithms (see https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... ) and found that the fastest version depended on more than just the CPU instruction set.

I ended up running some timings during startup to figure out the fastest version appropriate to the given hardware, with the option to override it (via environment variables) for things like benchmark comparisons. I used it to generate that table I linked to.
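In outline, the startup selection looks something like this (a simplified sketch, not my actual code; the candidate names and the POPCOUNT_IMPL environment variable are made up for illustration):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    typedef uint64_t (*popcount_1024_fn)(const uint64_t *words);

    /* candidate implementations, defined elsewhere */
    uint64_t popcount_1024_builtin(const uint64_t *words);
    uint64_t popcount_1024_popcnt_asm(const uint64_t *words);
    uint64_t popcount_1024_avx2(const uint64_t *words);

    static const struct { const char *name; popcount_1024_fn fn; } candidates[] = {
        { "builtin", popcount_1024_builtin },
        { "popcnt",  popcount_1024_popcnt_asm },
        { "avx2",    popcount_1024_avx2 },
    };
    #define NUM_CANDIDATES (sizeof(candidates) / sizeof(candidates[0]))

    /* time one candidate over a scratch 1024-bit buffer */
    static double time_candidate(popcount_1024_fn fn, const uint64_t *buf, int reps) {
        struct timespec t0, t1;
        volatile uint64_t sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            sink += fn(buf);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    popcount_1024_fn select_popcount_1024(void) {
        static uint64_t scratch[16];                  /* 1024 bits */
        const char *forced = getenv("POPCOUNT_IMPL"); /* override for benchmark runs */
        size_t best = 0;
        double best_time = -1.0;

        __builtin_cpu_init();
        for (size_t i = 0; i < NUM_CANDIDATES; i++) {
            if (forced != NULL) {
                if (strcmp(forced, candidates[i].name) == 0)
                    return candidates[i].fn;
                continue;
            }
            /* skip implementations this CPU can't execute */
            if (strcmp(candidates[i].name, "avx2") == 0 && !__builtin_cpu_supports("avx2"))
                continue;
            if (strcmp(candidates[i].name, "popcnt") == 0 && !__builtin_cpu_supports("popcnt"))
                continue;
            double t = time_candidate(candidates[i].fn, scratch, 100000);
            if (best_time < 0.0 || t < best_time) {
                best_time = t;
                best = i;
            }
        }
        return candidates[best].fn;
    }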

Function multi-versioning - which I only learned about a few months ago - isn't meant to handle that flexibility, to my understanding.
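For comparison, this is roughly all that FMV-style dispatch gives you (a sketch using GCC/Clang's target_clones): one source body, several ISA-specific clones, and a load-time resolver driven purely by CPU feature bits, with no notion of measured speed:

    #include <stdint.h>

    /* One body; the compiler emits default, POPCNT, and AVX2 clones plus a
     * resolver that picks among them at load time based only on what the
     * CPU reports it supports. */
    __attribute__((target_clones("default", "popcnt", "avx2")))
    uint64_t popcount_64(uint64_t x) {
        return (uint64_t)__builtin_popcountll(x);
    }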

I still have one code path which uses the __builtin_popcountll intrinsic and another which uses inline POPCNT assembly, so I can identify when it's no longer useful to keep the inline assembly.
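Roughly these two scalar paths (a sketch with hypothetical names; the input is a 1024-bit fingerprint, i.e. sixteen 64-bit words). With -mpopcnt the builtin should compile down to the same instruction, which is the comparison that tells me when the inline assembly has stopped paying for itself:

    #include <stdint.h>

    /* scalar path 1: let the compiler pick the instruction */
    uint64_t popcount_1024_builtin(const uint64_t *words) {
        uint64_t total = 0;
        for (int i = 0; i < 16; i++)
            total += (uint64_t)__builtin_popcountll(words[i]);
        return total;
    }

    /* scalar path 2: force the POPCNT instruction regardless of compile flags */
    uint64_t popcount_1024_popcnt_asm(const uint64_t *words) {
        uint64_t total = 0;
        for (int i = 0; i < 16; i++) {
            uint64_t count;
            __asm__ ("popcnt %1, %0" : "=r"(count) : "r"(words[i]));
            total += count;
        }
        return total;
    }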

(Though I use AVX2 if available, I've also read that some AMD processors have several POPCNT execution ports, so POPCNT may be faster than AVX2 for my 1024-bit popcount case. I have the run-time option to choose which to use, if I ever get access to those processors.)

Furthermore, my code generation has one code path for single-threaded use and one for OpenMP, because I found that single-threaded code built with OpenMP was slower than single-threaded code built without it, and the OpenMP build would crash in multithreaded macOS programs, due to conflicts between gcc's OpenMP implementation and Apple's POSIX threads implementation.
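In outline, the two generated variants of a batch loop look something like this (a sketch; the names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t popcount_1024(const uint64_t *words);   /* whichever kernel was selected */

    /* single-threaded variant: no OpenMP runtime involved at all */
    void popcounts_serial(const uint64_t *fps, size_t n, uint64_t *counts) {
        for (size_t i = 0; i < n; i++)
            counts[i] = popcount_1024(fps + i * 16);
    }

    /* OpenMP variant: compiled with -fopenmp, used only when it's safe
     * and there's enough work to split across threads */
    void popcounts_openmp(const uint64_t *fps, size_t n, uint64_t *counts) {
        #pragma omp parallel for schedule(static)
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
            counts[i] = popcount_1024(fps + (size_t)i * 16);
    }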

The AVX2 popcount is from Muła, Kurz, and Lemire, https://academic.oup.com/comjnl/article-abstract/61/1/111/38... , with manually added prefetch instructions (implemented by Kurz). It does not appear that SIMD Everywhere is the right route for me.
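For reference, the heart of the lookup-based AVX2 approach from that paper is along these lines (a simplified sketch of the nibble-lookup step only, without the Harley-Seal accumulation or the added prefetching):

    #include <immintrin.h>
    #include <stdint.h>

    /* popcount of one 256-bit lane via a 16-entry nibble lookup table */
    static inline __m256i popcount_256(__m256i v) {
        const __m256i lookup = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        const __m256i lo = _mm256_and_si256(v, low_mask);
        const __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        const __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                            _mm256_shuffle_epi8(lookup, hi));
        /* horizontal sums of each group of 8 bytes -> four 64-bit partial counts */
        return _mm256_sad_epu8(cnt, _mm256_setzero_si256());
    }

    /* 1024-bit popcount = four 256-bit lanes */
    uint64_t popcount_1024_avx2(const uint64_t *words) {
        const __m256i *v = (const __m256i *)words;
        __m256i acc = popcount_256(_mm256_loadu_si256(v + 0));
        acc = _mm256_add_epi64(acc, popcount_256(_mm256_loadu_si256(v + 1)));
        acc = _mm256_add_epi64(acc, popcount_256(_mm256_loadu_si256(v + 2)));
        acc = _mm256_add_epi64(acc, popcount_256(_mm256_loadu_si256(v + 3)));
        uint64_t lanes[4];
        _mm256_storeu_si256((__m256i *)lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }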




If you implement your own ifunc instead of using the compiler-supplied FMV ifunc, you could do your benchmarks from your custom ifunc that runs before the program main() and choose the fastest function pointer that way. I don't think FMV can currently do that automatically; in theory it could, but that would require additional modifications to GCC/LLVM. From the sounds of it, running an ifunc might be too early for you though, if you have to init OpenMP or something non-stateless before benchmarking.
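Something along these lines (a rough sketch for GNU toolchains on ELF targets; the function names are placeholders):

    #include <stdint.h>

    uint64_t popcount_1024_scalar(const uint64_t *words);  /* placeholder implementations */
    uint64_t popcount_1024_avx2(const uint64_t *words);

    /* The resolver runs while the dynamic loader is resolving symbols,
     * before main(). It has to be self-contained: no OpenMP init, and be
     * careful with libc calls this early. A timing loop over the
     * candidates could go here in place of the feature check. */
    static uint64_t (*resolve_popcount_1024(void))(const uint64_t *) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2"))
            return popcount_1024_avx2;
        return popcount_1024_scalar;
    }

    /* the public symbol binds to whatever the resolver returned */
    uint64_t popcount_1024(const uint64_t *words)
        __attribute__((ifunc("resolve_popcount_1024")));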

SIMD Everywhere is for a totally different situation: it's for when you want to automatically port your AVX2 code to ARM NEON etc. without having to manually rewrite the AVX2 intrinsics as ARM ones.
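That is, you keep the intrinsics and swap the header, roughly (header and macro names are SIMDe's):

    #define SIMDE_ENABLE_NATIVE_ALIASES   /* expose the usual _mm256_* names */
    #include <simde/x86/avx2.h>           /* instead of <immintrin.h> */
    #include <stdint.h>

    /* Existing AVX2-intrinsic code compiles unchanged; on ARM, SIMDe maps
     * these calls onto NEON (or scalar) equivalents. */
    __m256i load_fingerprint_chunk(const uint64_t *words) {
        return _mm256_loadu_si256((const __m256i *)words);
    }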



