The second example is just a benchmark of tzcnt, added in BMI1. It's a very spec...

BeeOnRope · on Dec 6, 2019

Zen2 is on uops.info: tzcnt is 2L0.5T on Zen (2 cycles latency, 0.5 cycles reciprocal throughput) and 3L1T on Intel, so a slight theoretical edge for AMD (2 vs 1 uops tho).

That said, I don't agree it's a tzcnt benchmark - there are about 9 instructions, only one of which is tzcnt. I'm not sure why Zen2 is worse here.
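
For readers who have not seen this kind of loop, here is a rough sketch of a set-bit extraction routine in which the trailing-zero count is only one of several instructions per iteration. This is an assumption about the general shape of such a benchmark, not the actual code under discussion; `decode_word` and `decode_bitset` are hypothetical names, and `__builtin_ctzll` is a GCC/Clang builtin that those compilers can lower to tzcnt when BMI1 is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the index of every set bit in `word`
   (offset by `base`) to `out` and returns how many were written.
   `out` must have room for up to 64 entries. */
static size_t decode_word(uint64_t word, uint32_t base, uint32_t *out)
{
    assert(out != NULL);
    size_t n = 0;
    while (word != 0) {          /* exit depends on the data in `word` */
        /* Count trailing zeros; undefined for 0, which the loop excludes. */
        unsigned tz = (unsigned)__builtin_ctzll(word);
        out[n++] = base + tz;    /* store the bit position */
        word &= word - 1;        /* clear the lowest set bit */
    }
    return n;
}

/* Hypothetical driver: decodes an array of 64-bit words in order.
   `out` must have room for 64 * nwords entries. */
static size_t decode_bitset(const uint64_t *words, size_t nwords, uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < nwords; i++)
        n += decode_word(words[i], (uint32_t)(i * 64), out + n);
    return n;
}
```

Besides the trailing-zero count, each iteration does a store, an add, a bit clear and a compare-and-branch, and the inner loop's trip count equals the number of set bits in the word, so how predictable the exit branch is depends on the input data.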

reitzensteinm · on Dec 6, 2019

You're right, I messed that up (though I'll leave it for posterity). I went into it with a bias thinking BMI was slow on Zen, since PDEP is 18 cycles vs 1 on Skylake, much to my disappointment back in the day.

After reviewing the example again, there's no obvious reason why Zen 2 is slower, although it's likely a rare edge case. Too bad there's nothing decent like VTune on AMD platforms.

I remember one session where my choice of temporary register significantly impacted throughput while implementing an unrolled int[] hash fn on my Kaby Lake processor. I never figured out exactly why, but sharp edges do exist even on Intel chips.

BeeOnRope · on Dec 6, 2019

This benchmark heavily stresses branch misprediction recovery, so that could be worse on Zen.

Also, I could not reproduce Daniel's results: I got IPC of 1.77 (SKX) or 2.00 (SKL) compared to Daniel's reported 2.80 (SKL, I think), so Intel still better but by a smaller margin. Waiting for clarification on that one.