The last time I did anything like this (it was for CPU linear algebra code designed to run in very heterogeneous clusters), I first came up with a parameterization that approximated how I'd expect an algorithm to perform. Then, once for each hardware combination, you sweep through the possible parameterization space. I used log-scaled quantization to make it cheap to index into an array of function pointers based on input specifics.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.