The second example is just a benchmark of tzcnt, added in BMI1. It's a very specific and very bizarre benchmark to do when you could just look up the reciprocal throughput (unfortunately Zen 2 has not yet been added).
You're right, I messed that up (though I'll leave it for posterity). I went into it with a bias thinking BMI was slow on Zen, since PDEP is 18 cycles vs 1 on Skylake, much to my disappointment back in the day.
After reviewing the example again, there's no obvious reason why Zen 2 is slower, although it's likely a rare edge case. Too bad there's nothing decent like VTune on AMD platforms.
I remember one session where my choice of temporary register significantly impacted throughput while implementing an unrolled int[] hash fn on my Kaby Lake processor. I never figured out exactly why, but sharp edges do exist even on Intel chips.
This benchmark heavily stresses branch misprediction recovery, so that could be worse on Zen.
Also, I could not reproduce Daniel's results: I got IPC of 1.77 (SKX) or 2.00 (SKL) compared to Daniel's reported 2.80 (SKL, I think), so Intel still better but by a smaller margin. Waiting for clarification on that one.
https://www.agner.org/optimize/instruction_tables.pdf
Edit: This is wrong as BeeOnRope points out below.
The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of the areas where Zen 1 was very weak is a good thing.