Thanks for testing it. You inspired me to try as well, and I've found that Intel...

Ruud-v-A · on Dec 1, 2016

Actually, the coefficients are only 16 bits; they are just widened before the loop. AVX2 and later are not widely supported, but there is pmuldq in SSE4.1, which is appropriate here. It takes 32-bit operands and produces a 64-bit result. Because its result is 64 bits wide (as it should be, by the way, because truncating to 32 bits will overflow on real-world FLAC files), it can only do two elements per operation. A few of the adds can be vectorised too, but you still need a horizontal add to get the final value. I think it is possible to get a small speedup, but definitely not the 4× or 8× that you might hope for with SSE/AVX. At this point I would say that this is a pretty advanced optimisation, not your average "vectorise all operations in the loop, increment the counter by 4/8 at a time" transformation that a compiler can do easily. I am impressed that the Intel compiler can actually vectorise things here.
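
To make the pmuldq idea concrete, here is a rough sketch of what such a loop could look like with the `_mm_mul_epi32` intrinsic from `std::arch`. It is not the decoder's actual code: the function names (`predict_scalar`, `predict_sse41`, `predict`) and the plain `i32` slices are made up for illustration, and I have not benchmarked it. Each iteration does two widening multiplies, accumulates into two 64-bit lanes, and the horizontal add happens once at the end.

```rust
/// Scalar reference: every product is widened to 64 bits before summing.
fn predict_scalar(coefficients: &[i32], samples: &[i32]) -> i64 {
    coefficients
        .iter()
        .zip(samples)
        .map(|(&c, &s)| c as i64 * s as i64)
        .sum()
}

/// Hypothetical SSE4.1 variant: two 32x32 -> 64 bit multiplies per step.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn predict_sse41(coefficients: &[i32], samples: &[i32]) -> i64 {
    use std::arch::x86_64::*;

    let n = coefficients.len().min(samples.len());
    let mut acc = _mm_setzero_si128(); // two 64-bit partial sums
    let mut i = 0;
    while i + 2 <= n {
        // pmuldq reads 32-bit elements 0 and 2 of each operand, so the two
        // values are placed there and elements 1 and 3 are left as zero.
        let c = _mm_set_epi32(0, coefficients[i + 1], 0, coefficients[i]);
        let s = _mm_set_epi32(0, samples[i + 1], 0, samples[i]);
        acc = _mm_add_epi64(acc, _mm_mul_epi32(c, s));
        i += 2;
    }

    // Horizontal add of the two 64-bit lanes.
    let lo = _mm_cvtsi128_si64(acc);
    let hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(acc, acc));
    let mut sum = lo.wrapping_add(hi);

    // Leftover element when the length is odd.
    if i < n {
        sum = sum.wrapping_add(coefficients[i] as i64 * samples[i] as i64);
    }
    sum
}

/// Uses the SSE4.1 path when the CPU supports it, otherwise the scalar loop.
fn predict(coefficients: &[i32], samples: &[i32]) -> i64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            return unsafe { predict_sse41(coefficients, samples) };
        }
    }
    predict_scalar(coefficients, samples)
}
```

Building the operands with `_mm_set_epi32` on every iteration is probably not how you would do it for real (you would load and shuffle instead), but it shows why the gain is limited: only two products per instruction, plus the extra work to get the values into the right lanes.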