
The second example is just a benchmark of tzcnt, which was added in BMI1. It's a very specific and rather bizarre thing to benchmark when you could just look up the reciprocal throughput (unfortunately, Zen 2 has not been added yet).

https://www.agner.org/optimize/instruction_tables.pdf

Edit: This is wrong as BeeOnRope points out below.

The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of the areas where Zen 1 was very weak is a good thing.




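Zen 2 is on uops.info: tzcnt is 2 cycles latency / 0.5 cycles reciprocal throughput on Zen and 3 cycles latency / 1 cycle reciprocal throughput on Intel, so a slight theoretical edge for AMD (2 vs 1 uops, though).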

That said, I don't agree it's a tzcnt benchmark: there are about 9 instructions, only one of which is tzcnt. I'm not sure why Zen 2 is worse here.
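For readers who have not seen this kind of loop, here is a rough sketch of the general shape I have in mind. It is an assumption about what such a benchmark looks like, not the actual code from the post: the function name `decode_set_bits` and its signature are made up for illustration, and `__builtin_ctzll` is the GCC/Clang builtin, which typically compiles to tzcnt when built with `-mbmi`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original benchmark): writes the index of
   every set bit in words[0..n) to out and returns how many were written.
   out must have room for up to 64 * n entries. */
static size_t decode_set_bits(const uint64_t *words, size_t n, uint32_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = words[i];
        /* The trip count of this loop depends on the data (the popcount of
           w), so the exit branch is hard to predict on random input. */
        while (w != 0) {
            /* Count trailing zeros; the one tzcnt in the loop body. */
            unsigned bit = (unsigned)__builtin_ctzll(w);
            out[count++] = (uint32_t)(i * 64 + bit);
            w &= w - 1; /* clear the lowest set bit */
        }
    }
    return count;
}
```

The tzcnt is one instruction among the index arithmetic, the store, the bit clear and the loop branch, which is why instruction tables for tzcnt alone probably do not explain the result.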


You're right, I messed that up (though I'll leave it for posterity). I went into it with a bias thinking BMI was slow on Zen, since PDEP is 18 cycles vs 1 on Skylake, much to my disappointment back in the day.

After reviewing the example again, there's no obvious reason why Zen 2 is slower, although it's likely a rare edge case. Too bad there's nothing decent like VTune on AMD platforms.

I remember one session where my choice of temporary register significantly impacted throughput while implementing an unrolled int[] hash fn on my Kaby Lake processor. I never figured out exactly why, but sharp edges do exist even on Intel chips.


This benchmark heavily stresses branch misprediction recovery, so that could be worse on Zen.

Also, I could not reproduce Daniel's results: I got IPC of 1.77 (SKX) or 2.00 (SKL) compared to Daniel's reported 2.80 (SKL, I think), so Intel still better but by a smaller margin. Waiting for clarification on that one.



