Hmm, yeah, this might work out... Two SIMD uops process 16 bytes, so each SIMD uop is doing eight bytes of work - the same as CRC32X, but with more frontend pressure (and preferable because they can run on any of the four SIMD ports, not just the one distinct CRC32X port).
It gets a bit messy, and we can't expect a ton from this approach - the same loop with nothing but the loads in it only runs at ~86GB/s - but it'd be worth a shot.
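
To put rough numbers on that, here's a back-of-the-envelope sketch of the port arithmetic. It's only an upper bound: it assumes every port can issue one uop per cycle and ignores frontend and latency limits entirely. The clock speed is a made-up placeholder rather than a measured value, and the function names are just for illustration.

```python
def peak_bytes_per_cycle(bytes_per_group, uops_per_group, ports):
    """Port-limited upper bound, ASSUMING each port issues one uop per cycle."""
    bytes_per_uop = bytes_per_group / uops_per_group
    return bytes_per_uop * ports


def capped_gb_per_s(bytes_per_cycle, clock_ghz, load_ceiling_gb_s=86.0):
    """Convert to GB/s and cap at the ~86GB/s load-only figure from the text."""
    return min(bytes_per_cycle * clock_ghz, load_ceiling_gb_s)


# Figures from the text: two SIMD uops per 16 bytes, spread over four SIMD ports;
# one CRC32X uop per 8 bytes, on a single port.
simd = peak_bytes_per_cycle(16, 2, 4)
crc32x = peak_bytes_per_cycle(8, 1, 1)

ASSUMED_CLOCK_GHZ = 3.0  # hypothetical placeholder, not a measurement

print("SIMD   bytes/cycle:", simd)
print("CRC32X bytes/cycle:", crc32x)
print("SIMD   GB/s (capped):", capped_gb_per_s(simd, ASSUMED_CLOCK_GHZ))
print("CRC32X GB/s (capped):", capped_gb_per_s(crc32x, ASSUMED_CLOCK_GHZ))
```

Under those assumptions the SIMD path has about four times the port-limited headroom of CRC32X alone, but the load-only ceiling is what ends up binding, which is why expectations should stay modest.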