
Hmm, yeah, this might work out... Two SIMD uops process 16 bytes, so each SIMD uop is doing eight bytes of work, the same as CRC32X, but with more frontend pressure. The SIMD version is still preferable because its uops can run on any of the four SIMD ports, not just the single CRC32X port.
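To put rough numbers on that, here's a back-of-the-envelope sketch of the port math. The byte and port counts are the ones quoted above. The function name, the "one uop per port per cycle" model, and treating CRC32X as a single uop are my own simplifying assumptions, and the model ignores frontend and load limits entirely.

```python
# Rough port-throughput model. Figures come from the comment above;
# bytes_per_cycle() and its one-uop-per-port-per-cycle model are
# illustrative assumptions, not measurements.

def bytes_per_cycle(bytes_per_uop: float, ports: int) -> float:
    """Peak bytes/cycle if every eligible port issues one uop each cycle."""
    return bytes_per_uop * ports

simd_bytes_per_uop = 16 / 2   # two SIMD uops cover 16 bytes
crc32x_bytes_per_uop = 8      # CRC32X consumes one 64-bit word (assumed one uop)

simd_peak = bytes_per_cycle(simd_bytes_per_uop, ports=4)   # four SIMD ports
crc_peak = bytes_per_cycle(crc32x_bytes_per_uop, ports=1)  # one CRC32X port

print(f"SIMD peak:   {simd_peak} bytes/cycle")
print(f"CRC32X peak: {crc_peak} bytes/cycle")
```

So the per-uop work is identical and the only win on paper is the port count, which is exactly why the extra frontend pressure and the load bandwidth end up deciding whether it pays off.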

It gets a bit messy, and we can't expect a ton from this approach: the same loop with only the loads runs at just ~86GB/s. Still, it'd be worth a shot.



