Yes, but a direct `exp` implementation is only about 10-20 FMAs, depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
Yes, I have an AVX512 double-precision exp implementation that does this, thanks to `vpermi2pd`.
This approach was also recommended by the Intel optimization manual -- a great resource.
I just went with straight math for single-precision, though.
I wrote about the kinds of things that are possible with LUTs a while back: https://darkcephas.blogspot.com/2018/10/validating-utf8-stri...