Intel's algorithm is very clever and would take a lot of space to implement in hardware. The implementation underlying the CRC32 instructions is probably some set of shift registers. That's a pretty good space/speed tradeoff to make.
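
To make the shift-register idea concrete, here is a rough software sketch of a bit-at-a-time CRC. It is only an illustration of the general technique, not a description of what the silicon actually does. I'm assuming the CRC-32C (Castagnoli) polynomial, which is the one the SSE4.2 CRC32 instruction uses, plus the usual software convention of inverting the register before and after. The name `crc32c_bitwise` is made up for this example.

```python
# Bit-reversed form of the CRC-32C (Castagnoli) polynomial 0x1EDC6F41.
# Assumption: this is the polynomial of interest; the comment above does not name it.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c_bitwise(data: bytes, crc: int = 0) -> int:
    """Compute CRC-32C one bit at a time, like a linear feedback shift register.

    `crc` lets a previous result be passed back in to continue over more data.
    """
    crc ^= 0xFFFFFFFF  # conventional initial inversion
    for byte in data:
        crc ^= byte  # feed the next 8 input bits into the low end of the register
        for _ in range(8):
            if crc & 1:
                # A 1 bit falls out of the register: shift and apply the feedback taps.
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # conventional final inversion


if __name__ == "__main__":
    # "123456789" is the standard check input for CRC catalogues.
    print(hex(crc32c_bitwise(b"123456789")))  # 0xe3069283
```

Each inner-loop step is one shift plus a conditional XOR, which is why a small shift-register circuit is a plausible hardware implementation. Table-driven software versions trade memory for speed by handling a byte or more per step.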

My largely uninformed guess is that they added the instructions to get fast CRCs for the filesystem 'for free'. There aren't many other cases where software CRC can be a bottleneck that also use these polynomials.