
One can vectorize LUTs as well https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

I wrote about the kinds of things that are possible with LUTs a while back: https://darkcephas.blogspot.com/2018/10/validating-utf8-stri...




Yes, but a direct `exp` implementation is only about 10-20 FMAs, depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
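
To make the "straight math" idea concrete, here is a minimal scalar sketch: range reduction followed by a Horner-evaluated polynomial. The name `exp_poly`, the degree (7), and the use of plain Taylor coefficients are assumptions for illustration; a tuned implementation would more likely use minimax coefficients and a more careful reduction. Each Horner step is one multiply-add, so it would map to one FMA per vector in SIMD code.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients 1/k! for k = 7..0, highest degree first for Horner's rule.
# (Assumed for illustration; real code would likely use minimax coefficients.)
COEFFS = [1.0 / math.factorial(k) for k in range(7, -1, -1)]

def exp_poly(x: float) -> float:
    """Approximate e**x to roughly single-precision accuracy."""
    # Range reduction: write x = k*ln(2) + r with |r| <= ln(2)/2.
    k = round(x / LN2)
    r = x - k * LN2
    # Horner's rule: every iteration is a single multiply-add (one FMA in SIMD).
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * r + c
    # Scale by 2**k; SIMD code would typically adjust the exponent bits instead.
    return math.ldexp(acc, k)
```

With degree 7 the polynomial costs seven multiply-adds, and the reduction adds a few more operations, which is in line with the 10-20 FMA estimate above.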


With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)


Yes, I have an AVX-512 double-precision exp implementation that does this thanks to vpermi2pd. This approach was also recommended by the Intel optimization manual, which is a great resource.

I just went with straight math for single-precision, though.
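
The implementation mentioned above isn't shown, so the following is only a scalar sketch of the general table-plus-polynomial technique, not that code. It assumes a 16-entry table of 2^(j/16), which is the number of doubles a two-source permute over two 512-bit registers can index, and a degree-6 Taylor polynomial for the small remainder. The names `exp_table`, `TABLE`, and the chosen degree are hypothetical.

```python
import math

TABLE_BITS = 4                       # 16 entries (assumed: two registers of 8 doubles)
TABLE_SIZE = 1 << TABLE_BITS
# TABLE[j] = 2**(j/16); SIMD code would select these with an in-register permute.
TABLE = [2.0 ** (j / TABLE_SIZE) for j in range(TABLE_SIZE)]
LN2_OVER_N = math.log(2.0) / TABLE_SIZE
# Taylor coefficients 1/k! for k = 6..0, highest degree first (illustrative choice).
COEFFS = [1.0 / math.factorial(k) for k in range(6, -1, -1)]

def exp_table(x: float) -> float:
    """Approximate e**x as 2**k * TABLE[j] * p(r)."""
    # Reduce: x = (16*k + j) * ln(2)/16 + r, with |r| <= ln(2)/32.
    n = round(x / LN2_OVER_N)
    r = x - n * LN2_OVER_N
    k = n >> TABLE_BITS              # floor division by 16, also for negative n
    j = n & (TABLE_SIZE - 1)         # table index in 0..15
    # The remainder is tiny, so a short polynomial suffices (Horner's rule).
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * r + c
    return math.ldexp(TABLE[j] * acc, k)
```

The table shrinks the reduced argument by a factor of 16, which is what lets a shorter polynomial reach higher accuracy than the table-free version would at the same degree.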



