
Specifically, the M1's NEON ALUs can handle 64 bytes per cycle on the big cores. Most Cortex cores can only do 32 bytes per cycle (the Cortex-X1/X2 are the exception), and the lower-end designs manage only 16 B/cycle. On top of that, the pmull+eor fusion on M1 increases effective throughput by 33%, and I don't know whether that's implemented on any Cortex core.

Without those two advantages, the NEON version would fall to the same throughput as one CRC32X per cycle, or to half that on cores with 2x64-bit ALUs.
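
For anyone who wants to check the arithmetic, here's a rough back-of-envelope sketch. It assumes (my assumption, not something measured) that one 16-byte fold step costs 2 PMULL + 2 EOR, i.e. 4 vector ops, and that fusion merges one PMULL+EOR pair so only 3 ops issue. The function name and constants are made up for illustration. Under that model the numbers above fall out:

```python
from fractions import Fraction

CRC32X_BYTES_PER_CYCLE = 8   # one CRC32X per cycle, 8 bytes each
FOLD_BYTES = 16              # one fold step consumes one 128-bit vector


def neon_bytes_per_cycle(vector_bytes_per_cycle, fused=False):
    """Estimate input bytes folded per cycle by a PMULL-based CRC loop.

    Assumed cost model (hypothetical): a fold step is 2 PMULL + 2 EOR
    = 4 vector ops; with PMULL+EOR fusion one pair merges, leaving 3.
    """
    ops_per_fold = 3 if fused else 4
    # How many 128-bit-equivalent vector ops the core issues per cycle.
    ops_per_cycle = Fraction(vector_bytes_per_cycle, 16)
    return ops_per_cycle / ops_per_fold * FOLD_BYTES


# 64 B/cycle core (M1 big core per the comment above), with and without fusion
m1_fused = neon_bytes_per_cycle(64, fused=True)
m1_plain = neon_bytes_per_cycle(64)
print(float(m1_fused), float(m1_plain), float(m1_fused / m1_plain))

# 32 B/cycle core without fusion: same as one CRC32X per cycle
print(neon_bytes_per_cycle(32), CRC32X_BYTES_PER_CYCLE)

# 16 B/cycle core (2x64-bit ALUs) without fusion: half of that
print(neon_bytes_per_cycle(16))
```

This ignores load bandwidth, latency chains, and the final reduction, so treat it as a sanity check on the ratios rather than a performance prediction.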



