Instructions per cycle: AMD Zen 2 versus Intel (lemire.me)
108 points by another on Dec 6, 2019 | 74 comments



In case anyone is not aware: This is a very small sample of microbenchmarks. When benchmarking very simple tasks like these, performance tends to vary wildly between architectures.

For instance, instructions are assigned to one of a handful of ports when executed; certain instructions may only be assigned to certain ports, and which ports an instruction may be assigned to differs between architectures. If an inner loop uses only a few different instructions, one architecture may be unlucky in that most of those instructions need the same ports, and so it executes fewer instructions overall.

For real benchmarking use lots of different complicated jobs. It is not perfect, but it is the best way we have of comparing different processors head to head.


Indeed. Back in 1999 the AMD K7 was a full 3 times faster than Intel on microbenchmarks measuring the performance of ROR/ROL instructions, because the throughput per clock of these rotate instructions was exactly 3 times higher than on Intel. Obviously this did not mean that AMD was 3 times faster than Intel.

Picking 1 or 2 random microbenchmarks like the blog post author did is not useful for characterizing overall performance across all real-world workloads. If he had picked different ones, they might have shown AMD as twice as fast as Intel.


Examples like that still exist: AMD popcnt throughput is 4x Intel's, for example (4/cycle vs 1).


The author appears to be benchmarking the specific operations that bottleneck their json parsing library when running on an Intel chip, which seems reasonable, on its face. It can fail if the library is limited by a different set of operations, on the different machine. But that is unlikely if the specific operations tested are slower.


Assuming both Intel and AMD implement performance monitors the same way (i.e. the same notion of instructions executed, which may be hard to measure with speculative execution), the comparison is still flawed because it doesn’t matter if Intel can do more instructions per cycle if AMD can produce more cycles in a span of wall time.

> However, it is not clear whether these reports are genuinely based on measures of instruction per cycle. Rather it appears that they are measures of the amount of work done per unit of time normalized by processor frequency.

That’s precisely why nobody really uses IPC as a way to compare processors. “How much work is done per unit of time” is a much better measurement, and I guess for historical reasons people conflate it with IPC.

But real textbook IPC is useless for comparison.


I think it would have been useful if the author benchmarked the actual time taken to parse a large json file, and did a sanity check to make sure the time difference made sense with ipc/clock factored in.


> But real textbook IPC is useless for comparison.

It's useful for comparing architectures and the implementation thereof, to gauge the potential of one line of processors over the other.

I agree that for the customer it's not the right thing to be looking for.


It's really not useful for gauging potential. There are tradeoffs in how deeply you pipeline your architecture: shorter pipeline stages tend to give higher clock rates, while longer pipeline stages give higher IPC, for instance. It's pretty easy to make a design with an IPC that'll blow everything else out of the water if it only needs to hit 100 MHz. For instance, the slower a clock cycle is, the larger you can make your caches and the fewer clock cycles it takes to read from them.

Also, on real world benchmarks that don't fit neatly in cache, for a given chip IPC will tend to increase as you underclock it because that will cause memory latency to go down.


Note that the chip with the higher specific frequency in this test, and the higher max frequency across the product line (Skylake+), gets a higher IPC here, so this kind of tradeoff isn't the obvious cause of the results here.


IPC is _usually_ a good measure for the last phase of optimization. But it is only the local Δ that is meaningful, comparing IPC across different vendors is only useful as a gross measure.


It's not even useful as a gross measure, unfortunately. Too many moving parts in the way.

Say, if you used IPC alone then you'd probably pick the latest Apple ARM CPU. Except it cannot clock as high in any of its subunits as top AMD and Intel parts, its cache is slower, and its memory bandwidth is abysmal in comparison.

Performance in seconds or performance per watt (unit is 1/(W*s)) in the workload you want to run is useful.

You cannot even easily estimate anything using microbenchmarks anymore, since per-unit local clocking expanded in x86... (AMD in Zen+, expanded in Zen 2; most ARM mobile CPUs; Intel since Broadwell-E, expanded in Skylake.)

You get traps such as going for AVX and locally overheating the CPU, where the SSE2 equivalent would be faster in real life. It's all funny business.

IPC also heavily favors RISC instead of SIMD, likewise is biased against multicore. (Though not as much.) What counts as an instruction anyway?


There's no "gauging potential". Would you suddenly go with OICC if it has extremely high IPC? How about old Core instead of new Skylake? Oh shoot, there is no potential in Core if it's not being made!

Even different Zen 2 CPUs have varied performance properties not just due to cores, but due to CCX count.

There is exactly one use for such microbenchmarks, and that's optimizing compilers.

Even if there were multiple implementations?

Also remember that x86-64, unlike x86, is not closed, and unlike POWER, RISC-V, ARM or MIPS, it is not actually well defined.

If AMD suddenly adds a new but useful instruction set like they did with 3DNOW in ancient times, or accelerate something reasonably common that way, say add a special SIMD conditional, where do you even start in comparison? What if Intel actually does add a useful FPGA programmable computing capability as promised or enhanced DMA?


I don’t see how it is flawed. The article doesn’t discuss whether the AMD CPU is faster than the Intel CPU, it discusses the claim "that the most recent AMD processors surpass Intel in terms of instructions per cycle” (https://www.guru3d.com/articles_pages/amd_ryzen_7_3800x_revi...)

And IPC, IMO, is a better measurement for a chip’s design than pure speed, as it removes the “but how good a process do you have access to” from the equation.


The article gives 2 benchmarks; I am pretty sure it is easy to mash up another benchmark with totally opposite results (e.g. a subset of SPECint). I found the author's inclusion of an obviously skewed example as proof a little disingenuous as well.

Having said that, in general Intel still holds a slight edge on pure IPC. However, considering the terrible track record of security issues and the abysmal price/performance ratio, a slight edge on IPC can be ignored, and I would not consider Intel for most workloads at the moment. Above all, actual application benchmarks trump any IPC microbenchmark.


Some more realistic single core workloads at same frequencies: 3900x vs 9900k https://hothardware.com/reviews/amd-ryzen-9-3900x-vs-core-i9...


In this case the frequencies are similar, and so wall-clock time reflects the IPC difference (also, the two CPUs take the same code path, so the "I" is the same in this case, which isn't always true).


But on these processors, I believe the frequency is rarely sustained right? Due to thermal throttling and other factors.


That only really happens on laptops, which can't dissipate as much heat as desktop systems due to size constraints. On a desktop, if you're using even AMD's stock cooler, you won't thermal throttle. That is, if you don't overclock.


Modern processors with boost configurations are rather complicated about "thermally throttling". These days with AMD's stock coolers you will be able to at least get the sticker speed on the CPU even at 100% load for a sustained time. Chances are, you'll actually get some % more speed than the sticker as it will usually continue to boost as long as power delivery and temperatures are stable. So even with an entirely stock configuration, a better motherboard and cooling system will overall net you more performance. This is without doing any traditional "overclocking" and just going with the settings designated with the CPU and motherboard. This same idea also applies to most of Intel's parts as well.


It's not about throttling. What will happen is that the CPU won't automatically clock up dynamically as much if you have worse cooling.

They behave like GPUs more and more with regards to clocks.


That's the same thing. Intel calls their version a dynamic boost so that some of their measurements, like TDP, are quoted at lower clocks. Both vendors' CPUs end up scaling their clocks over a wide range.


>They behave like GPUs more and more with regards to clocks.

I think it's the other way around? CPUs had "boost" before GPUs.


Could be that one process spends much less time sleeping for IO thus still having the same wall clock time.

In this case, there's probably only memory IO which (afaik) cannot put a process to sleep.


There is no IO, and it is not memory intensive.


"the comparison is still flawed because it doesn’t matter if Intel can do more instruction per cycle if AMD can produce more cycles in a span of wall time."

The reason Intel had the "per core" superiority crown for years is that it had a better IPC performance due to design efficiency. Both manufacturers are pushing against the same frequency ceiling, so if you went AMD you had to significantly increase the core count to catch up, and could never match the still important single-thread performance.

We know from large scale, comprehensive benchmarks that AMD has massively picked up the pace and is neck and neck with Intel. At the same processor speed it matches the best Intel processors.

But yeah, this article is just terrible. Not just tiny, minuscule, extremely myopic benchmarks, but then a gross over-reach with conclusions. And in the way that ignorance begets ignorance, the fact that it's trending on a couple of social news sites means that now Google is surfacing it as canonical information when it's just a junk, extremely lazy analysis.


He ran a few basic tests, and showed the results. Where was the "gross over-reach"? The article ends with a "your mileage may vary" disclaimer.


"So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!"

That is most certainly an overreach. An extraordinary overreach. Worse, it's absurdly using an AVX2 codebase, optimized for Westmere, as the baseline for "IPC" testing? The premise itself borders on gross negligence.

IPC as a generalized concept is a broad, general purpose set of instructions, not an absurdly narrow test.

Saying "Intel is faster at AVX512" is going to surprise exactly no one, and also happens to be irrelevant for the overwhelming majority of users and uses.

The microbenchmarking thing has gone on for years, and at this point anyone who has paid any attention is rightly cautious when stomping their feet and making declarations, because usually they're just pouring noise into the mix. Lazily running a couple of tiny tests is not the rigour to avoid deserved criticism.


I'm not sure if you were implying it or just using it as example of another type of unhelpful claim, but this test does not involve AVX-512.

I agree using Westmere isn't necessarily the best approach, but there is no difference in this case with either -march=native or -march=znver1.

The loop is small and simple, with only 9 instructions and compiles more or less the same regardless of march setting (I observed some basically no-op changes such as a mov and blsr swapping places). Here's the assembly (for the second test, with the bigger IPC gap):

    top:
    tzcnt  r8,rcx
    add    r8d,edx
    mov    DWORD PTR [rdi+rax*4],r8d
    mov    eax,DWORD PTR [rsi]
    inc    eax
    blsr   rcx,rcx
    mov    DWORD PTR [rsi],eax
    jne    .top


"I'm not sure if you were implying it or just using it as example of another type of unhelpful claim, but this test does not involve AVX-512."

Even worse! Is this a defense, because it's remarkably unhelpful as one.

The blog post was clearly a cry for attention for some project -- let's just use some clickbait IPC claims to gain it -- and continually alluded to a whole project -- an extreme niche project that still wouldn't have any relevance. But instead it's a meaningless, completely misrepresentative micro-loop.


My read is different than yours.

I think Daniel uses those examples because they are actual examples from projects that he is or has been working on, and he's familiar with them and actually cares about them, and because it's at least a notch more realistic than something totally synthetic.

It seems like a very roundabout thing to use as a cry for attention for SIMDjson (the project I assume you are talking about), and I don't believe that's the purpose. I see no problem in linking the project.

Picking two random benchmarks and trying to extract any kind of more general IPC claim is not on solid ground, but I'm pretty sure Daniel will say he's not doing that: he's only sharing these two specific results. That's a style that reoccurs across several entries in that blog, however, so if it triggers (as it has me on occasion) you might want to look elsewhere.


Doesn't that sentence refer only to the table above, measuring "bitset decoding" with a basic decoder, comparing 1.4 to 2.1 IPC?

It would help if the blog post had some headings to separate the benchmarks and summary.


A plain reading indicates that yes, he's only referring to the last benchmark, which showed the 2/3 disparity.


A plain reading indicates that such is irrelevant, because these are the two tiny cases that he selectively chose to demonstrate the "IPC gap" of AMD. If some AMD booster posted hand-selected micro-benchmarks that gave AMD a lead, and boasted with exclamations and pejoratives how terrible the alternative is, we would rightly question it. This deserves no more.

And to the other defense of "Well there are AMD people claiming the same in reverse, so that legitimizes this", I've seen exactly zero of those posts on here. None. They would be laughed off the site.

What we do have is that traditionally, at a given frequency, per core AMD has long trailed on major benchmarks of significant, user-realistic loads. This is the first generation in a long time where it actually doesn't, and where you don't need additional cores to make up the gap.


I feel like you are intentionally being thick in order to get mad at me.

I am only talking specifically about the 2/3 claim at the bottom of the article, which, for the avoidance of doubt, is simply a summary of the final measurement made in the article, i.e., the result of dividing 1.4 by 2.1. I know this because of its positioning in the article, because the numbers line up, because a different IPC percentage is given for the earlier measurement, and because an earlier version of the post, with different results for the last experiment (with IPC of 2.8 and 1.4), showed a different ratio (50%).

How you are somehow interpreting a small clarification of the one line being discussed as a wide-ranging defense of the article, I'm not sure. My broader thoughts are available here [1] and in the comments on the article.

---

[1] https://news.ycombinator.com/item?id=21724780


> surfacing it as canonical information when it's just a junk, extremely lazy analysis.

Isn't that what the Internet is for?


It’s depressing how many comments here are quick to dismiss the benchmarking/article. Yes, yes, memory bandwidth, I/O, and cache hierarchies are all important, but Daniel Lemire is one of the top people in the world when it comes to optimizing algorithms for modern CPUs. Do you like search engines? Lemire has made them significantly faster. He is often able to take code/algorithms that already seem fast, and make them much faster. He’s recently branched out beyond search engine core algorithms into some aspects of string processing (base64, UTF-8 validation, JSON parsing).

In this blog post, he’s paying attention to IPC because he’s typically working with inner loops where the data’s being delivered from RAM to L1 as efficiently as possible.


I have plenty of respect for Daniel (and you can even find me below in this discussion defending some aspects of this test), but I too find some fault with this article.

The main problem I have is that the claim in dispute seems to be that Zen 2 has comparable (perhaps slightly higher) IPC to Skylake, and then Daniel picks out two benchmarks and shows that Skylake has higher IPC than Zen 2... proving what exactly?

Contradicting people who said that Zen 2 had a higher IPC on every benchmark? Yes, those people were wrong, but it's easy to prove a point if you pick an argument almost no one was making in the first place.

In that same (second) benchmark he selected the "basic_decoder" sub-benchmark, but there is also another sub-benchmark, "bogus", which tests empty-function calling time, and in this case I measure a reversed scenario: Intel at IPC 2.25 and AMD at 3.43. So should we now say that Intel IPC is "quite poor"?


Ha, I’m not referencing _your_ comments here, and I am curious about how you couldn’t reproduce his results; he’s quick to publish and seems happy to correct so we’ll see. I’m referencing more the other comments here saying things about the benchmarks not being realistic because good benchmarks need to have a mix of tasks, like memory and I/O—this ain’t a Phoronix post, folks.

I started reading through your blog last night. I’m slowly trying to learn how to go from being a programmer who doesn’t write slow code to one who writes fast code, so I'm absorbing a lot about vectorization, ILP, etc.


He's corrected the results (possibly even before I wrote the post you responded to this AM): they originally showed Intel at 2.8 IPC in the second table, they now show 2.1.

I measured 2.0, but I guess Daniel is using Docker with a slightly different compiler version, so I think the gap is sufficiently small that we can declare it "close enough". I also measured quite different numbers for SKL (2.0) vs SKX (1.7), which is quite odd given the non-memory-intensive behavior of the test: in that scenario, I'd expect SKL and SKX to perform identically.


>The main problem I have is that the claim in dispute seems to be that Zen 2 has comparable (perhaps slightly higher) IPC to Skylake, and then Daniel picks out two benchmarks and shows that Skylake has higher IPC than Zen 2.

Exactly my thoughts.

Even sites that are casually biased towards AMD never claimed Zen 2 has the same or higher IPC than Intel.


The second example is just a benchmark of tzcnt, added in BMI1. It's a very specific and very bizarre benchmark to do when you could just look up the reciprocal throughput (unfortunately Zen 2 has not yet been added).

https://www.agner.org/optimize/instruction_tables.pdf

Edit: This is wrong as BeeOnRope points out below.

The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of the areas where Zen 1 was very weak is a good thing.


Zen2 is on uops.info, it's 2L0.5T on Zen, 3L1T on Intel, so slight theoretical edge for AMD (2 vs 1 uops tho).

That said, I don't agree it's a tzcnt benchmark - there are about 9 instructions, only one of which is tzcnt. I'm not sure why Zen 2 is worse here.


You're right, I messed that up (though I'll leave it for posterity). I went into it with a bias thinking BMI was slow on Zen, since PDEP is 18 cycles vs 1 on Skylake, much to my disappointment back in the day.

After reviewing the example again, there's no obvious reason why Zen 2 is slower, although it's likely a rare edge case. Too bad there's nothing decent like VTune on AMD platforms.

I remember one session where my choice of temporary register significantly impacted throughput while implementing an unrolled int[] hash fn on my Kaby Lake processor. I never figured out exactly why, but sharp edges do exist even on Intel chips.


This benchmark heavily stresses branch misprediction recovery, so that could be worse on Zen.

Also, I could not reproduce Daniel's results: I got IPC of 1.77 (SKX) or 2.00 (SKL) compared to Daniel's reported 2.80 (SKL, I think), so Intel still better but by a smaller margin. Waiting for clarification on that one.


I think the only real way to compare IPC is to actually talk to the architects. Trying to write microbenchmarks is a fool's errand when you aren't aware of how the CPU processes the instructions you give it. Are you actually stressing the FPU, or is the CPU speculatively executing and branch-predicting around the workload (common for micro loops)? If it is, is that what you meant to test? Are you trying to compare like for like (in which case you have to write assembly), or are you trying to write performance benchmarks (in which case the only meaningful metric is CPU time)?

This is an interesting idea, but I'm not sure how you could derive meaning from comparing two vastly different architectures at such a high level.


Useless, strictly academic interest.

There is more than execution ports in design of processors. Not every task can be SIMD optimized to extent of approaching theoretical IPC limits, most will be bottlenecked by memory access or even IO.

I prefer the "fake" but real-world IPC. Same clocks, same real world task, measure time to finish.


I think this was more of a response to the linked benchmark at guru3d which said:

> Instructions per cycle (IPC)

> For many people, this is the holy grail of CPU measurements in terms of how fast an architecture per core really is.

Based on his work with simdjson, professor Lemire seems to be quite aware of microbenchmarks being problematic. But general articles out here and on HN are proclaiming Intel is doomed and can never recover, due to mitigations/lack of cores/lack of chiplets. Those concerns have yet to be reflected in the stock price.


Intel are behind. They have a pretty big cash buffer and a solid sales channel, as well as being pretty entrenched in OEMs. So they are a very long way from being doomed, even if it takes them a long time to turn the ship around (like 00's Microsoft).


Omar Bradley once said “Amateurs talk strategy. Professionals talk logistics.” I'd say that in CPU design amateurs talk about execution resources but professionals talk about cache hierarchies. But that's too awkward to make a good quip.


I think recommending that people prefer {insert your favourite benchmark here} is very bad advice, and disproving your claim that Lemire's benchmarks are useless because YOU don't care about them is as simple as showing that they are useful for Lemire, which is something this post shows.

If you care enough about a particular CPU to do benchmarks, you should benchmark what YOU care about.

Lemire's job is to improve the implementation of particular algorithms to make optimal use of the hardware. Knowing the different theoretical hardware limits tells you how good an implementation is doing along different axes, and benchmarking those limits is a critical part of doing Lemire's job correctly.

You probably have a different use case for computers than Lemire, and it is therefore completely reasonable for you to care about different benchmarks.


IPC microbenchmarks do not properly reflect the complex workloads running on post-Zen 2 microarchitectures. Zen 2 upends microarchitectural assumptions enough to warrant a different metric.

IPC microbenchmarks, in my experience, tend to benchmark best-case scenarios, and that is probably the exception rather than the rule for application workloads on modern microarchitectures. Case in point: microbenchmarks showed significant IPC improvements for Zen 2 over Skylake, yet on our application workload (CPU/data bound), Skylake held up neck and neck.

The more appropriate benchmarking metric for post-Zen2 processors is CPI [0].

[0] https://john.e-wilkes.com/papers/2013-EuroSys-CPI2.pdf


But isn't CPI just the reverse of IPC? Doesn't CPI just re-express the same score in a different range?


Heh, I'm curious whether he enabled the mitigations for all the side-channel flaws on the Intel processors.


The mitigations don't affect CPU bound benchmarks [1] which don't call into the kernel or use specific user-space mitigations, so it won't matter here.

[1] There are some rare exceptions, such as https://travisdowns.github.io/blog/2019/03/19/random-writes-... , but it is unlikely to matter here.


That may have been true, but it is rather dramatically false with the new JCC erratum workaround.

It’s also false if you’re using a hypervisor that mitigates the iTLB multihit issue.


Good point, I forgot about that one, although here there is only a single hot loop with one jump so a high chance the crossing doesn't occur, and even if it does the IPC is low enough the legacy decoder probably does OK (although it adds a cycle or two to misprediction recovery, which matters here).

So it's something worth checking.

No hypervisor involved.


Smt on/off has a large effect.


It's a single threaded test, so I don't think that matters here.


True. But generally it affects cpu bound benchmarks.


SMT off might mean not enough spare threads to run the OS telemetry.


This is Linux but I don't think that would be true even on Windows.


While only being part of the performance equation, analyzing IPC can be quite interesting in understanding the design of the processor and how performance might be achieved.

One thing itches me about the presented comparison: it runs very few benchmarks, all generated with the same compiler. For a thorough IPC analysis, shouldn't the tests rather be programmed in assembly, to exclude any influence from the compiler choice? Also, a wider range of algorithms should probably be checked, as IPC on modern processors depends less on how many cycles a certain instruction takes (you should be able to find that in the manuals) than on how well multiple components of the processor can be utilized at the same time. Which depends heavily on the actual program being run.


I'm rather surprised at the claim that "but it might easily execute 7 billion instructions per second on a single core". I'd even question it except the author's an expert.

If you can keep it fed then ok but one cache miss to main mem, either instruction or data, will allow the instruction buffers to completely empty and stay empty for quite a long time. I don't think you can control placement to reasonably assure cache hits always for anything but the most trivial code, am I missing something?

Also if you could keep a consistent throughput like this I wonder if thermal throttling might have to kick in. I mean you're doing a lot of work...


I can't find it now, but in a recent article I read that it is useful to have an idea of the upper-bound abilities of an arch+algorithm, so that you 'know' what you're aiming for, even if it might not be attainable in practice without huge human effort or decades of superoptimizer effort... Yes, if your algorithm reaches for cold data, you'll get hit. Can you get around that? Do you really need to hit the cache when you're computing the seven-billionth decimal of pi or factoring numbers? This work is quite interesting, if only for compilers or superoptimizers.


I wonder how reliable are these Linux syscalls?

Found this http://manpages.ubuntu.com/manpages/trusty/man2/perf_event_o... and that article doesn't instill much confidence in the reliability of these counters. Comment for CPU_CYCLES says "Be wary of what happens during CPU frequency scaling", comment for INSTRUCTIONS says "these can be affected by various issues, most notably hardware interrupt counts", BRANCH_INSTRUCTIONS says "Prior to Linux 2.6.34, this used the wrong event on AMD processors" and so on.

If I wanted to measure what OP was measuring, I would disable frequency scaling (probably doable on overclocker-targeted motherboards; a search also finds some utilities which claim to do that, both Windows and Linux ones), measure time, then multiply by the fixed frequency to get cycles.


CPU_CYCLES counts cycles. This means that the time per cycle varies with frequency. If you're trying to see how many cycles something that fits in L1 takes, CPU_CYCLES is the right thing to measure.


Parent is pointing to documentation suggesting that it's measuring time and dividing it by frequency, and perhaps not perfectly in the case of dynamic scaling. They seem aware of what CPU_CYCLES is supposed to do.


The documentation is not the best. CPU_CYCLES is genuinely counting cycles.

perf is all about reading actual hardware counters. It's awesome for this. There is essentially nothing made up about perf's output, except to the extent that the hardware itself reports inexact output. (For example, perf annotate may attribute events to an instruction near the instruction in question on older hardware, because older hardware has a small amount of skew when sampling.)


In more comprehensive single thread benchmarks (single thread POV Ray) Intel can still beat Zen 2 architecture sometimes. This test seems to indicate the reason why.


ITT: AMD apologists.

Sorry guys but Intel is still king of single core performance. But that's not a problem because I'm sure by 2050 most desktop applications and games will correctly make use of many cores, then AMD will reign


Worth responding to a blatant troll to point out that it's not about raw performance but performance per price for 99% of uses.


Or performance per watt in server land, which is a metric Zen 2 dominates. Very few applications truly care about maxing performance at all costs.


But, indeed, some do. They will provide as much power and as much cooling as they need to get that performance.


By 2050 we might well be more concerned with which rocks can be slung farther.

Assuming civilization will survive until then, given current political trends, is rash.



