Thanks for testing it. You inspired me to try as well, and I've found that ICC 16 and 17 will vectorize it if you put "#pragma vector always" before the inner loop. Without it, the compiler thinks that the write to buffer[i] might interfere with the reads of buffer[i + j]. I think this is because it's trying to "jam" the inner and outer loop.
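
For anyone following along, here is a minimal sketch of where such a pragma would go. The function name `predict`, the indexing (reading the `order` samples that precede `buffer[i]`), and the 32-bit accumulator are assumptions made for illustration, not the loop from the original example. The pragma is ICC-specific, so it is guarded here and other compilers just see a plain loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predictor loop: each sample gets a weighted sum of the
 * `order` samples before it added on. The inner loop only reads from
 * buffer; the write to buffer[i] happens after the inner loop finishes. */
void predict(int32_t *buffer, size_t len,
             const int32_t *coef, size_t order, int shift)
{
    for (size_t i = order; i < len; i++) {
        /* Assumption: a 32-bit accumulator, which can overflow on real data. */
        int32_t sum = 0;
#ifdef __INTEL_COMPILER
#pragma vector always
#endif
        for (size_t j = 0; j < order; j++)
            sum += coef[j] * buffer[i - order + j];
        /* Right-shifting a negative value is implementation-defined in C;
         * mainstream compilers use an arithmetic shift. */
        buffer[i] += sum >> shift;
    }
}
```

Note that the outer loop still carries a real dependency (each output feeds later predictions), so only the inner reduction is a candidate for vectorization in this sketch.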

In addition, I discovered a couple of other issues. I was embarrassed to see that I'd messed up the loop bounds in my example. But the bigger one is that AVX2 and earlier do not have a 64-bit x 64-bit -> 64-bit vector multiplication instruction. VPMULLQ was only added with AVX-512 (the AVX-512DQ subset), possibly because it requires a 512-bit intermediate.

This means that with 64-bit coefficients, the vector approach has to do the multiplications twice, and shift, and add. This is likely going to be slower than the scalar approach. But if you were able to switch to 32-bit (or floating point) coefficients, I think the vectorized solution looks pretty good. ICC estimates a 1.4x speedup, but I think that gets better with longer buffers.
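
To make the "multiply, shift, add" idea concrete, here is a scalar sketch of the arithmetic such an emulation relies on: the low 64 bits of a product rebuilt from 32-bit x 32-bit -> 64-bit partial products. The function name `mul64_from_32bit_parts` is made up for illustration, and the exact instruction sequence a compiler emits for the vector case may well differ from this.

```c
#include <stdint.h>

/* Low 64 bits of a * b, using only 32x32 -> 64 bit multiplications.
 * With a = a_hi * 2^32 + a_lo and b = b_hi * 2^32 + b_lo:
 *   a * b = a_lo*b_lo + (a_lo*b_hi + a_hi*b_lo) * 2^32 + a_hi*b_hi * 2^64
 * The last term vanishes modulo 2^64. */
uint64_t mul64_from_32bit_parts(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t low = a_lo * b_lo;
    /* May wrap modulo 2^64, which is harmless: only the low 32 bits
     * of the cross terms survive the shift below. */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;

    return low + (cross << 32);
}
```

Because the result is taken modulo 2^64, the same bit pattern is correct for signed operands in two's complement as well.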

I put the version with int32_t coefficients up here: https://godbolt.org/g/GXNVe2




Actually, the coefficients are only 16 bits wide; they are just widened before the loop. AVX2 and later are not widely supported, but SSE4.1 has pmuldq, which is appropriate here. It takes 32-bit operands and produces a 64-bit result. Because the result is 64 bits wide (as it should be, because truncating to 32 bits will overflow on real-world FLAC files), it can only handle two elements per operation. A few of the adds can then be vectorised too, but you still need a horizontal add to get the final value. I think a small speedup is possible, but definitely not the 4× or 8× that you might hope for with SSE/AVX. At this point I would say this is a pretty advanced optimisation, not your average "vectorise all operations in the loop, increment the counter by 4/8 at a time" transformation that a compiler can do easily. I am impressed that the Intel compiler can actually vectorise things here.
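
A rough sketch of what that could look like with intrinsics, in case it helps: `_mm_mul_epi32` is the intrinsic for pmuldq, and the function name `dot_i32_to_i64` plus the overall structure are assumptions for illustration, not code from any real decoder. It only uses the SSE4.1 path when the compiler is targeting SSE4.1 (for example with `-msse4.1`) and otherwise falls back to a plain scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Sum of coef[j] * samples[j], with every product and the running sum
 * kept in 64 bits so nothing is truncated to 32 bits. */
int64_t dot_i32_to_i64(const int32_t *coef, const int32_t *samples, size_t n)
{
    int64_t sum = 0;
    size_t j = 0;
#if defined(__SSE4_1__)
    __m128i acc = _mm_setzero_si128();
    for (; j + 2 <= n; j += 2) {
        /* Load two 32-bit values and sign-extend them into two 64-bit lanes. */
        __m128i c = _mm_cvtepi32_epi64(
            _mm_loadl_epi64((const __m128i *)(coef + j)));
        __m128i s = _mm_cvtepi32_epi64(
            _mm_loadl_epi64((const __m128i *)(samples + j)));
        /* pmuldq: signed 32x32 -> 64 multiply on the low half of each lane,
         * so only two products per instruction. */
        acc = _mm_add_epi64(acc, _mm_mul_epi32(c, s));
    }
    /* Horizontal add of the two 64-bit lanes. */
    int64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    sum = lanes[0] + lanes[1];
#endif
    /* Scalar tail (or the whole loop when SSE4.1 is not enabled). */
    for (; j < n; j++)
        sum += (int64_t)coef[j] * samples[j];
    return sum;
}
```

With only two lanes per multiply and a horizontal add at the end, a large speedup seems unlikely for short coefficient arrays, which matches the expectation above.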



