If you're using AVX instructions, make sure to insert a vzeroupper after you're done; otherwise you take a big speed hit when transitioning between AVX and legacy (non-VEX) SSE code. The compiler will generate these automatically for AVX instructions it is emitting:
Little-known fact about AVX: when a core runs AVX instructions, Intel processors may reduce the core frequency by a few percent for about a millisecond to stay within power and thermal limits (I guess because they need to turn on an otherwise unused functional unit).
As a result, if you sprinkle some AVX instructions in your code, but not enough to get a significant efficiency win, you might actually end up reducing your capacity (and hurting latency).
"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."
...
"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."
https://godbolt.org/g/jNXJyA
Here's an intrinsic for it:
http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_c...
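For hand-written intrinsics, a minimal sketch of where the call might go, assuming the intrinsic in question is `_mm256_zeroupper()` from `<immintrin.h>`. The helper names `sum_avx` and `sum_floats` are made up for illustration, and `__attribute__((target("avx")))` and `__builtin_cpu_supports` are GCC/Clang extensions, used here only so the file builds without `-mavx`:

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: sums n floats with 256-bit AVX adds. */
__attribute__((target("avx")))
static float sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration, unaligned loads. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Done with 256-bit work: clear the upper halves of the YMM
       registers so later legacy SSE code does not pay a transition
       penalty. Compilers usually emit this on their own at the end of
       an AVX function, so the explicit call is mostly a safety net. */
    _mm256_zeroupper();

    /* Horizontal sum of the 8 lanes, then the scalar tail. */
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    for (; i < n; ++i)
        total += x[i];
    return total;
}
#endif

/* Hypothetical entry point: uses AVX when the CPU reports it,
   otherwise falls back to a plain scalar loop. */
float sum_floats(const float *x, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx"))
        return sum_avx(x, n);
#endif
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

Calling `sum_floats(buf, len)` from code built without AVX flags is the case where the vzeroupper matters most, since the caller is likely to be using legacy SSE encodings.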