Specifically, the M1's NEON ALUs can handle 64 bytes per cycle on the big cores. Most Cortex cores can only do 32 bytes per cycle (Cortex-X1/X2 are the exception), and the lower-end designs only manage 16 bytes per cycle. On top of that, the pmull+eor fusion on M1 increases effective throughput by 33%, and I don't know whether that's implemented on any Cortex core.

Without those two advantages, the NEON version would fall to the same throughput as one CRC32X per cycle, or half that on cores with 2x64-bit ALUs.
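
To make the arithmetic explicit, here's a rough back-of-the-envelope sketch. It assumes (my assumption, not something measured here) that each 16-byte fold step costs four SIMD ops, two pmull plus two eor, and that fusion merges one pmull+eor pair so the step costs three. The function and constant names are made up for illustration.

```python
from fractions import Fraction

FOLD_BYTES = 16          # one 128-bit vector consumed per fold step
OPS_UNFUSED = 4          # assumed: 2x pmull + 2x eor per fold step
OPS_FUSED = 3            # assumed: one pmull+eor pair fuses into a single op
CRC32X_BYTES = 8         # CRC32X consumes 8 bytes per instruction


def fold_bytes_per_cycle(simd_bytes_per_cycle, ops_per_fold):
    """Peak bytes/cycle of the folding loop if SIMD issue width is the only limit."""
    # How many 128-bit ops the core can issue per cycle.
    vector_ops_per_cycle = Fraction(simd_bytes_per_cycle, FOLD_BYTES)
    # Each fold step retires FOLD_BYTES of input and costs ops_per_fold ops.
    return vector_ops_per_cycle / ops_per_fold * FOLD_BYTES


m1_fused = fold_bytes_per_cycle(64, OPS_FUSED)      # 64 B/cycle with fusion
m1_unfused = fold_bytes_per_cycle(64, OPS_UNFUSED)  # same width, no fusion
typical = fold_bytes_per_cycle(32, OPS_UNFUSED)     # 32 B/cycle, no fusion
low_end = fold_bytes_per_cycle(16, OPS_UNFUSED)     # 2x64-bit ALUs, no fusion

for name, value in [("M1 fused", m1_fused), ("M1 unfused", m1_unfused),
                    ("32 B/cycle core", typical), ("16 B/cycle core", low_end)]:
    print(f"{name}: {float(value):.2f} bytes/cycle")
```

Under those assumptions the 32 B/cycle case lands on 8 bytes per cycle, which is exactly one CRC32X per cycle, the 16 B/cycle case is half that, and going from four ops to three is where the 33% comes from. Real cores have other limits (load bandwidth, latency chains), so treat these as upper bounds.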