One can vectorize LUTs as well https://www.intel.com/content/www/us/en/docs/intr...

boulos · 2024-05-16T03:25:30 1715829930

Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.

janwas · 2024-05-16T10:17:51 1715854671

With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)

celrod · 2024-05-16T10:39:51 1715855991

Yes, I have an AVX-512 double-precision `exp` implementation that does this thanks to `vpermi2pd`. This approach was also recommended by the Intel optimization manual -- a great resource.

I just went with straight math for single-precision, though.