> Zen 4 is AMD's first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It's not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can't think of any other reason AMD would mess with Zen 4's frontend this far into the core's lifecycle.
Indeed, it might be the case that there’s more than that disabled, since the numbers are somewhat surprising:
> Still, the Cyberpunk 2077 data bothers me. Performance counters also indicate higher average IPC with the loop buffer enabled when the game is running on the VCache die. Specifically, it averages 1.25 IPC with the loop buffer on, and 1.07 IPC with the loop buffer disabled. And, there is a tiny performance dip on the new BIOS.
Smells of microcode mitigations if you ask me, but naturally let’s wait for the CVE.
Quietly disabling it is also a big risk, because you're signalling that in all probability you were aware of the severity of the issue, enough so that you took steps to patch it.
If you don't disclose the vulnerability then affected parties cannot start taking countermeasures, except out of sheer paranoia.
Disclosing a vulnerability is a way to shift liability onto the end user. You didn't update? Then don't complain. Only rarely do disclosures lead to product liability. I don't remember this (liability) happening with Meltdown and Spectre either. So I wouldn't assume this is AMD being secretive.
Please don't post duplicate comments like this. Your first comment (https://news.ycombinator.com/item?id=42287118) was fine but spamming a thread with copy-and-pasted comments just hurts the signal to noise ratio.
Any threaded discussion carries the risk of different subthreads ending up in the same place. The simplest solution is to just not post twice, and trust that the reader can read the rest of the thread; HN threads usually don't get long enough for good comments to get too buried, and not duplicating comments helps avoid that problem. If there's something slightly different, it may be worth linking to another comment in a different subthread and adding a few sentences to cover the differences. Copying a whole comment is never a good answer, and re-wording it to obscure the fact that it's not saying anything new is also bad. New comments should have something new to say.
I would get confused handling follow-ups to both copies.
I have enough trouble if someone responds to my responses in a tone similar to GP and I end up treating them like the same person (e.g., GP makes a jab and now I’m snarky or call out the wrong person). Especially if I have to step away to deal with life.
And, just like that, you turned the rest of this thread into a meta discussion about HN rather than about the topic. It's ironic, because that really hurt the SNR more than a duplicated, but on-topic, comment.
The countermeasure is to disable the loop buffer. Everyone who wants to protect themselves from the unknown vulnerability should disable the loop buffer. Once everyone's done that or had a reasonable opportunity to do that, it can be safely published.
There's no real impetus except paranoia if the change is unannounced. You don't have to detail the vulnerability, just inform people that somewhere, one exists, and that this is in fact a countermeasure. Without doing that, you don't shift liability, you don't actually get people out of harm's way, you don't really benefit at all.
The problem is that we're more or less stuck with this class of problem unless we end up with something that looks like a Xeon Phi without shared resources and run calculations on many, many truly independent cores, or we accept that the worst and best case performance cases are identical (which I don't foresee anyone really agreeing to).
Or, framed differently, if Intel or AMD announced a new gamer CPU tomorrow that was 3x faster in most games but utterly unsafe against all Meltdown/Spectre-class vulns, how fast do you think they'd sell out?
Larrabee was fun to program, but I think it'd have an even worse time hardening against memory sideband effects: the barrel processor (which was necessary to have anything like reasonable performance) was humorously easy to use for cross-process exfiltration. Like... it was so easy, we actually used it as an IPC mechanism.
Now you’re asking me technical details from more than a decade ago. My recollection is that you could map one of the caches between cores — there were uncached-write-through instructions. By reverse engineering the cache’s hash, you could write to a specific cache-line; the uc-write would push it up into the correct line and the “other core” could snoop that line from its side with a lazy read-and-clear. The whole thing was janky-AF, but way the hell faster than sending a message around the ring. (My recollection was that the three interlocking rings could make the longest-range message take hundreds of cycles.)
Sure, absolutely, there's large numbers of additional classes of side effects you would need to harden against if you wanted to eliminate everything, I was mostly thinking specifically of something with an enormous number of cores without the 4-way SMT as a high-level description.
I was always morbidly curious about programming those, but never to the point of actually buying one, and in a past life, when we had a few of the cards in my office, I always had more things to do in the day than time.
"if Intel or AMD announced a new gamer CPU tomorrow that was 3x faster in most games but utterly unsafe against all Meltdown/Spectre-class vulns, how fast do you think they'd sell out"
Well, many people have gaming computers that they won't use for anything serious. So I would also buy it. And in restricted gaming consoles, I suppose the risk is not too high?
Also, many games today outright install rootkits to monitor your memory (see [1]) - some Heartbleed is so far down the line of credible threats on a gaming machine that it's outright ludicrous to trade off performance for it.
They're a pain in the ass all around. Spectre allowed you to read everything paged in (including kernel memory) from JS in the browser.
To mitigate it, browsers did a bunch of hacks, including nerfing precision on all timer APIs and disabling shared memory, because you need an accurate timer for the exploit - to this day performance.now() rounds to 1 ms on Firefox and 0.1 ms on Chrome.
This 1 ms rounding funnily is a headache for me right as we speak. On, say, a 240 Hz monitor, for video games you need to render a frame every ~4.17 ms -- 1 ms precision is not enough for an accurate ticker -- even if you render your frames on time, the result can't be perfectly smooth as the browser doesn't give an accurate enough timer by which to advance your physics every frame.
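To make the timer problem concrete, here is a rough sketch (in Python rather than JavaScript) of what a physics loop sees when ideal 240 Hz frame times pass through a coarse timer. The `quantize_us` helper is hypothetical and assumes the timer rounds down to its resolution; real browsers may round differently or add jitter.

```python
def quantize_us(t_us: int, resolution_us: int) -> int:
    """Round a timestamp down to the timer resolution (assumed behaviour)."""
    return (t_us // resolution_us) * resolution_us


def frame_deltas_us(refresh_hz: int, resolution_us: int, frames: int) -> list[int]:
    """Per-frame time deltas, in microseconds, as reported by a quantized timer.

    Frames are assumed to arrive exactly on time; only the timer is imperfect.
    Integer microseconds are used to avoid floating-point rounding surprises.
    """
    stamps = [
        quantize_us(i * 1_000_000 // refresh_hz, resolution_us)
        for i in range(frames + 1)
    ]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# One second of perfectly paced 240 Hz frames, seen through a 1 ms timer
# and through a 0.1 ms timer.
coarse = frame_deltas_us(240, 1000, 240)
fine = frame_deltas_us(240, 100, 240)
print(sorted(set(coarse)), sorted(set(fine)))
```

Under these assumptions the 1 ms timer reports steps of 4 ms and 5 ms instead of a steady ~4.17 ms, so a simulation advanced by the raw delta alternates between running slow and jumping ahead by roughly 20%.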
Isn't it rather about data leaks between any two processes? Whether those two processes belong to different users is a detail of the threat model and the OS's security model. In a console it could well be about data leaks between a game with code-injection vulnerability and the OS or DRM system.
We already have heterogeneous cores these days, with E and P, and we have a ton of them as they take little space on the die relative to cache. The solution, it seems to me, is to have most cores go brrrrrr and a few that are secure.
Given that we have effectively two browser platforms (Chromium and Firefox) and two operating systems to contend with (Linux and Windows), it seems entirely tractable to get the security sensitive threads scheduled to the "S cores".
Also all the TLS, SSH, Wireguard and other encryption, anything with long-persisted secret information. Everything else, even secret (like displayed OTP codes) is likely too fleeting for a snooping attack to be able to find and exfiltrate it, even if an exfiltration channel remains. Until a better exfiltration method is found, of course :-(
I think we're headed towards the future of many highly insulated computing nodes that share little if anything. Maybe they'd have a faster way to communicate, e.g. by remapping fast cache-like memory between cores, but that memory would never be uncontrollably shared the way cache lines are now.
That's a secure enclave aka secure element aka TPM. Once you start wanting security you usually think up enough other features (voltage glitching prevention, memory encryption) that it's worth moving it off the CPU.
Eh, the TPM is a hell of a lot less functional than the security processor on a modern Arm board. You can seal and unseal based on system state, but once things are unsealed, it's just in memory.
I agree at a gut / instinct level with that thought.
SINGLE thread best and worst case have to be the same to avoid speculation...
However, threads from completely unrelated domains could be run instead, if ready. Most likely the 'next' thread on the same unit, and worry about repacking free slots the next time the scheduler runs.
++ Added ++
It might be possible to have operations that don't cross security boundaries have different performance as operations within a program's space.
An 'enhanced' level of protection for threads running a VM like guest code segment (such as browsers) might also be offered that avoids higher speculation operations.
Any operation similar to a segmentation fault relative to that thread's allowed memory accesses could result in forfeit of its timeslice. That would only leak what it should already know anyway: what memory it's allowed to access, not the content of other memory segments.
Itanium allegedly was free from branch prediction issues but I suspect cache behavior still might have been an issue. Unfortunately it's also dead as a doornail.
>if Intel or AMD announced a new gamer CPU tomorrow that was 3x faster in most games but utterly unsafe against all Meltdown/Spectre-class vulns, how fast do you think they'd sell out?
I do realize that gamers aren't the most logical bunch, but aren't most games GPU-bound nowadays?
Not a gamer but I would guess it depends on the graphics settings. At lower resolutions, and with less lighting features, etc. one can probably turn a GPU bound game into a CPU bound game.
Also, a good chunk of these vulnerabilities (Retbleed, Downfall, Rowhammer, there's probably a few I'm forgetting) are either theoretical, lab-only or spear exploits that require a lot of setup. And then the leaking info from something like Retbleed mostly applies to shared machines like in cloud infrastructure.
Which makes it kind of terrible that the kernel has these mitigations turned on by default, stealing somewhere in the neighborhood of 20-60% of performance on older gen hardware, just because the kernel has to roll with "one size fits all" defaults.
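For anyone who wants to see what their own kernel has actually switched on before arguing about defaults, here is a minimal sketch. It assumes a Linux system that exposes the per-vulnerability status files under `/sys/devices/system/cpu/vulnerabilities`; the `mitigation_status` helper name is made up for illustration.

```python
from pathlib import Path


def mitigation_status(
    root: str = "/sys/devices/system/cpu/vulnerabilities",
) -> dict[str, str]:
    """Map each vulnerability name to the kernel's one-line status string.

    Returns an empty dict when the directory does not exist
    (non-Linux systems, or kernels without this sysfs interface).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(base.iterdir())
        if entry.is_file()
    }


for name, status in mitigation_status().items():
    print(f"{name:28s} {status}")
```

Each line reads something like "Not affected", "Vulnerable" or "Mitigation: ...", which at least tells you which mitigations apply to your CPU before you consider a boot parameter such as `mitigations=off`.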
I don’t think you are thinking of this right. One bit of leakage makes it half as hard to break encryption via brute force. It’s a serious problem. The defaults are justified.
I think things will only shift once we have systems that ship with full sandboxes that are minimally optimized and fully isolated. Until then we are forced to assume the worst.
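The "one bit halves the work" arithmetic can be sketched as follows. This is an idealized model: it assumes the leaked bits are bits of the key itself and are known with certainty, which real side channels rarely guarantee. The function name is just for illustration.

```python
def remaining_keyspace(key_bits: int, leaked_bits: int) -> int:
    """Number of candidate keys left to brute-force after a clean leak."""
    if not 0 <= leaked_bits <= key_bits:
        raise ValueError("leaked_bits must be between 0 and key_bits")
    return 2 ** (key_bits - leaked_bits)


full = remaining_keyspace(128, 0)
one_bit = remaining_keyspace(128, 1)
print(full // one_bit)  # ratio of work before vs. after leaking one bit
```

Every additional clean bit halves the search again, so the concern is less about a single bit and more about a channel that keeps leaking.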
> I don’t think you are thinking of this right. One bit of leakage makes it half as hard to break encryption via brute force.
The problem is that you need to execute on the system, then need to know which application you’re targeting, then figure out the timings, and even then you’re not certain you are getting the bits you want.
Enabling mitigations for servers? Sure. Cloud servers? Definitely. High profile targets? Go for it.
The current defaults are like foisting iOS's “Lockdown Mode” on all users by default and then expecting them to figure out how to turn it off, except you have to do it by connecting it to your Mac/PC and punching in a bunch of terminal commands.
Then again, almost all kernel settings are server-optimal (and even then, 90s server optimal). There honestly should be some serious effort to modernize the defaults for reasonably modern servers, and then also have a separate kernel for desktops (akin to CachyOS, just more upstream).
Maybe so, but I think most users are more likely to under-estimate their security sensitivity than to over-estimate it. On top of that, security profiles can change and perhaps people won’t remember to update their settings to meet their current security needs.
These defaults are needed and if the loss is so massive we should be willing to embrace less programmable but more secure options.
I imagine this is more of a functional issue. i.e., the loop buffer caused corruption of the instruction stream under some weird specific circumstances. Spectre and Meltdown are not functional issues but rather just side channel issues.
This should be fun, however, for someone with enough time to chase down and try and find the bug. Depending on the consequences of the bug and the conditions under which it hits, maybe you could even write an exploit (either going from JavaScript to the browser or from user mode to the kernel) with it :) Though, I strongly suspect that reverse engineering and weaponizing the bug without any insider knowledge will be exceedingly difficult. And, anyways, there's also a decent chance this issue just leads to a hang/livelock/MCE which would make it pointless to exploit.
It depends on the severity of the problem, and the impact on the customers already using these systems. It may be more economical for the customer to apply a patch and lose a few percent of peak performance than to put thousands of boxes offline and schedule personnel to swap CPUs. This is to say nothing of the hassle of bringing your new laptop to a service center, and taking a replacement, or waiting if your exact configuration is unavailable at the moment.
For most of these vulnerabilities the risk is low, but keep in mind that your web browser runs random untrusted code from all over the Internet in a VM with a JIT compiler. This means you can't rule out the possibility that someone will figure out a way to exploit this over the web reliably, which would be catastrophic.
Right, but there's presumably little data to worry about exfiltrating from, say, my xbox. Even stuff like billing info doesn't need to be stored locally.
I am not convinced either but I am willing to bet some software is adversarial and will try to exfiltrate data. F.ex. many people look suspiciously at Zoom and Chrome.
So as long as stuff is not perfectly isolated from each other, there's always room for a bad actor to snoop on stuff.
This might come as a shock, but I can assure you that the people designing high end microprocessors have probably forgotten more about these topics than most of the people here have ever known.
The article seems to suggest that the loop buffer provides no performance benefit and no power benefit.
If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".
I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.
In both cases, the project shouldn't have shipped.
> but was shipped anyway, possibly so someone could save face
Was shipped anyway because it can be disabled with a firmware update and because drastically altering physical hardware layouts mid design was likely to have worse impacts.
What you describe would be shipped physically but disabled, and that certainly happens a lot. For exactly those reasons. What GP described was shipped not only physically present but also not even disabled, because politics. That would be a very different thing.
No kidding. I was adjacent to a tape out w some last minute tweaks - ugh. The problem is the current cycle time is very slow and costly and u spend as much time validating things as you do designing. It’s not programming.
Once interviewed at a place which made sensors that were used a lot in the oil industry. Once you put a sensor on the bottom of the ocean 100+ meters (300+ feet) down, it's not getting serviced any time soon.
They showed me the facilities, and the vast majority was taken up by testing and validation rigs. The sensors would go through many stages, taking several weeks.
The final stage had an adjacent room with a viewing window and a nice couch, so a representative for the client could watch the final tests before bringing the sensors back.
Quite the opposite to the "just publish a patch" mentality that's so prevalent these days.
If you work on a critical piece of software (especially one you can't update later), you absolutely can spend way more time validating than you do writing code.
The ease of pushing updates encourages lazy coding.
> The ease of pushing updates encourages lazy coding.
Certainly in some cases, but in others, it just shifts the economics: Obviously, fault tolerance can be laborious and time consuming, and that time and labor is taken from something else. When the natures of your dev and distribution pipelines render faults less disruptive, and you have a good foundational codebase and code review process that pay attention to security and core stability, quickly creating 3 working features can be much, much more valuable than making sure 1 working feature will never ever generate a support ticket.
Even for software it’s often risky to remove code once it’s in there. Lots of software products are shipped with tons of unused code and assets because no one’s got time to validate nothing’s gonna go wrong when you remove them. Check out some game teardowns, they often have dead assets from years ago, sometimes even completely unrelated things from the studio’s past projects.
The article also mentions they had trouble measuring power usage in general so we can't necessarily (and, really, shouldn't) conclude that it has no impact whatsoever. I highly doubt that AMD's engineering teams are so unprincipled as to allow people to add HW features with no value (why would you dedicate area and power to a feature which doesn't do anything?), and so I'm inclined to give them the benefit of the doubt here and assume that Chips 'n Cheese simply couldn't measure the impact.
Note - I saw the article through from start to finish. For power measurements I modified my memory bandwidth test to read AMD's core energy status MSR, and modified the instruction bandwidth testing part to create a loop within the test array. (https://github.com/clamchowder/Microbenchmarks/commit/6942ab...)
Remember most of the technical analysis on Chips and Cheese is a one person effort, and I simply don't have infinite free time or equipment to dig deeper into power. That's why I wrote "Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out."
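For readers curious what reading that energy counter involves, here is a rough Linux-side sketch. It is not the code from the linked commit. The MSR addresses are the ones the Linux kernel names `MSR_AMD_RAPL_POWER_UNIT` and `MSR_AMD_CORE_ENERGY_STATUS`; treating them as valid for a given Zen 4 part, the bit layout of the unit register, and the helper names are all assumptions here. Reading requires root and the `msr` kernel module.

```python
import os
import struct

# Addresses as named in the Linux kernel's msr-index.h (assumed to apply here).
MSR_AMD_RAPL_POWER_UNIT = 0xC0010299
MSR_AMD_CORE_ENERGY_STATUS = 0xC001029A


def read_msr(cpu: int, msr: int) -> int:
    """Read one 64-bit MSR through /dev/cpu/<n>/msr (root + `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def joules_per_count(power_unit_raw: int) -> float:
    """Bits 12:8 of the unit register hold ESU; one count is 1 / 2**ESU joules."""
    esu = (power_unit_raw >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(before: int, after: int, unit_j: float) -> float:
    """Energy between two samples of the 32-bit counter, tolerating one wrap."""
    return ((after - before) & 0xFFFFFFFF) * unit_j
```

A measurement would sample `read_msr(cpu, MSR_AMD_CORE_ENERGY_STATUS)` before and after the test loop and divide the resulting joules by elapsed time (for watts) or by retired instructions (for energy per instruction).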
Sorry, I totally didn't mean this as a slight to your work--I've been a fan for quite a while :)
More so that estimating power when you don't have access to post synthesis simulations or internal gas gauges is very hard. For something so small, I can easily see this being a massive pain to measure in the field and the kind of thing that would easily vanish into the noise on a real system.
But in the absence of any clear answer, I do think it's reasonable to assume that the feature does in fact have the power advantages AMD intended, even if small.
> engineering teams are so unprincipled as to allow people to add HW features with no value
This is often pretty common, as the performance characteristics are often unknown until late in the hardware design cycle - it would be "easy" if each cycle was just changing that single unit with everything else static, but that isn't the case as everything is changing around it. And then by the time you've got everything together complete enough to actually test end-to-end pipeline performance, removing things is often the riskier choice.
And that's before you even get to the point of low-level implementation/layout/node specific optimizations, which can then again have somewhat unexpected results on frequency and power metrics.
Working at.. a very popular HW company.. I'll say that we(the SW folks) are currently obsessed with 'doing something' even if the thing we're doing hasn't fully been proven to have benefits outside of some narrow use cases or targeted benchmarks. It's very frustrating, but no one wants to put the time in to do the research up front. It's easier to just move forward with a new project because upper management stays happy and doesn't ask questions.
Is it that expectation of major updates coming in at a fixed cycle? Not only expected by upper management but also by end users? That's a difficult trap to get out of.
I wonder if that will be the key benefit of Google's switch to two "major" Android releases each year: it will get people used to nothing newsworthy happening within a version increment. And I also wonder if that's intentional, and my guess is not the tiniest bit.
Yeah, we've made great progress and folks are used to it. Now we've got to deliver but most of the low-hanging fruit has been picked(some of it also incurred tech debt).
Do you have new software managers/directors who are encouraging such behavior? From my experience, new leaders tend to lean on these tactics to grab power.
Strangely no. Our management hasn't really changed in several years. Expectations have risen though and we've picked a lot of the low-hanging fruit. We also failed to invest in our staffing and so we don't have enough experienced devs to actually do the work now.
Well, the other possibility is that the power benchmarks are accurate: the buffer did save power, but then they figured out an even better optimization at the microcode level that would make the regular path save even more power, so the buffer actually became a power hog.
>> when the project is done, there are more lines of code and performance is worse
There is an added benefit though - that the new programmers now are fluent in the code base. That benefit might be worth more than LOCs or performance.
"The article seems to suggest that the loop buffer provides no performance benefit and no power benefit."
It tests the performance benefit hypothesis in different scenarios and does not find evidence that supports it. It makes one best effort attempt to test the power benefit hypothesis and concludes it with: "Results make no sense."
I think the real take-away is that performance measurements without considering power tell only half the story. We came a long way when it comes to the performance measurement half but power measurement is still hard. We should work on that.
Tell that to the shareholders. As a public company, they can very quickly lose enormous amounts of money by being behind or below on just about anything.
Someone elsewhere quotes a game specific benchmark of about 15%. Which will mostly matter when your FPS starts to make game play difficult.
There will be a certain number of people who will delay an upgrade a bit more because the new machines don’t have enough extra oomph to warrant it. Little’s Law can apply to finance when it’s interval between purchases.
For me the most interesting paragraph in the article is:
> Perhaps the best way of looking at Zen 4's loop buffer is that it signals the company has engineering bandwidth to go try things. Maybe it didn't go anywhere this time. But letting engineers experiment with a low risk, low impact feature is a great way to build confidence. I look forward to seeing more of that confidence in the future.
> Strangely, the game sees a 5% performance loss with the loop buffer disabled when pinned to the non-VCache die. I have no explanation for this, […]
With more detailed power measurements, it could be possible to determine if this is thermal/power budget related? It does sound like the feature was intended to conserve power…
He didn’t provide enough detail here. The second CCD on a Ryzen chip is not as well binned as the first one, even on non-X3D chips. Also, EVERY chip is different.
Most of the cores on CCD0 of my non-X3D chip hit 5.6-5.75ghz. CCD 1 has cores topping out at 5.4-5.5ghz.
V-Cache chips for Zen 4 have a huge clock penalty, however the Cache more than makes up for it.
Did he test CCD1 on the same chip with both the feature disabled and enabled? Did he attempt to isolate other changes like security fixes as well? He admitted “no” in his article.
The only proper way to test would be to find a way to disable the feature on a bios that has it enabled and test both scenarios across the same chip, and even then the result may still not be accurate due to other possible branch conditions. A full performance profile could bring accuracy, but I suspect only an AMD engineer could do that…
He mentioned that it was disabled somewhere between the two UEFI versions he tested. Presumably there are other changes included, so his measurements are not strict A/B testing.
It sounds to me like it was too small to make any real difference except in very specific scenarios and a larger one would have been too expensive to implement compared to the benefit.
That being said, some workloads will see a small regression, however AMD has made some small performance improvements since launch.
They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.
Them *quietly* disabling a feature that few users will notice yet complicates the frontend suggests they pulled this chicken bit because they wanted to avoid or delay disclosing a hardware bug to the general public, but already push the mitigation. Fucking vendors! Will they ever learn? sigh
Much more importantly they fixed the MMU support. The original 68000 lost some state required to recover from a page fault; the workaround was ugly and expensive: run two CPUs "time shifted" by one cycle and inject a recoverable interrupt on the second CPU. Apparently it was still cheaper than the alternatives at the time if you wanted a CPU with MMU, a 32 bit ISA and a 24 bit address bus. Must have been a wild time.
> run two CPUs "time shifted" by one cycle and inject a recoverable interrupt on the second CPU.
That's not quite how it was implemented.
Instead, the second 68000 was halted and disconnected from the bus until the first 68000 (the executor) triggered a fault. Then the first 68000 would be held in halt, disconnected from the bus, and the second 68000 (the fixer) would take over the bus to run the fault handler code.
After the fault had been handled, the first 68000 could be released from halt and it would resume execution of the instruction, with all state intact.
As for the cost of a second 68000, extra logic and larger PCBs? Well, the cost of the Motorola 68451 MMU (or equivalent) absolutely dwarfed the cost of everything else, so adding a second CPU really wasn't a big deal.
Technically it didn't need to be another 68000, any CPU would do. But it's simpler to use a single ISA.
While this executor + fixer setup does work for most usecases, it's still impossible to recover the state. The relevant state is simply held in the halted 68000.
Which means, the only thing you can do is handle the fault and resume. If you need to page something in from disk, userspace is entirely blocked until the IO request completes. You can't go and run another process that isn't waiting for IO.
I suspect it also makes it impossible to correctly implement POSIX segfault signal handlers. If you try to run it on the executor, then the state is cleared and it's not valid to return from the signal handler anymore.
If you run the handler on the fixer instead, then you are running in a context without page faults, which would be disastrous if the segfault handler accesses code or data that has been paged out. And now the segfault handler wouldn't have access to any of the executor CPU's state.
------
So there is merit to the idea of running two 68000s in lockstep. That would theoretically allow you to recover the full state.
But there is a problem: It's not enough to run the second 68000 one cycle behind.
You need to run it one instruction behind, putting all memory read data and wait-states into a FIFO for the second 68000 to consume. And 68000 instructions have variable execution time, so I guess the delay needs to be the length of the longest possible instruction (which is something like 60 cycles).
But what about pipelining? That's the whole reason why you can't recover the state in the first place. I'm not sure, but it might be necessary to run an entire 4 instructions behind, which would mean something like 240 cycles buffered in that FIFO.
This also means your fault handler is now running way too soon. You will need to emulate 240 cycles worth of instructions in software until you find the one which triggered the page fault.
I think such an approach is possible, but it really doesn't seem sane.
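As a toy model of that delayed-shadow idea (not how any real workstation did it), the FIFO can be sketched as a simple delay line. The 60-cycle and 4-instruction figures are the rough guesses from the text above, and `run_with_shadow` is a made-up name; each "event" stands in for one cycle's worth of bus read data and wait states.

```python
from collections import deque

LONGEST_INSTRUCTION_CYCLES = 60   # rough guess quoted above
INSTRUCTIONS_BEHIND = 4           # rough guess quoted above
DELAY = LONGEST_INSTRUCTION_CYCLES * INSTRUCTIONS_BEHIND  # 240 cycles


def run_with_shadow(total_cycles: int, delay: int, fault_cycle: int):
    """Run a leader CPU with a shadow CPU trailing `delay` cycles behind.

    Returns (cycles the shadow has executed, bus events still in the FIFO)
    at the moment the leader faults, or at the end if it never faults.
    """
    fifo = deque()      # bus events the shadow has not consumed yet
    shadow_cycles = 0
    for cycle in range(total_cycles):
        fifo.append(cycle)          # leader records this cycle's bus activity
        if len(fifo) > delay:
            fifo.popleft()          # shadow replays the oldest recorded cycle
            shadow_cycles += 1
        if cycle == fault_cycle:
            break                   # leader faulted: freeze both CPUs
    return shadow_cycles, list(fifo)


shadow_cycles, pending = run_with_shadow(2000, DELAY, fault_cycle=1000)
print(shadow_cycles, len(pending))
```

The shadow's state is intact but stale: everything still sitting in `pending` is work the fault handler would have to step through in software before it reaches the access that actually faulted.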
--------
I might need to do a deeper dive into this later, but I suspect all these early dual 68000 Unix workstations simply dealt with the issues of the executor/fixer setup and didn't implement proper segfault signal handlers. It's reasonably rare for programs to do anything in a segfault handler other than print a nice crash message.
Any unix program that did fancy things in segfault handlers wasn't portable, as many unix systems didn't have paging at all. It was enough to have a memory mapper with a few segments (base, size, and physical offset).
That's neat. For small loop buffers, I quite like the GreenArrays forth core. It has 18 bit words that hold 4 instructions each, and one of the opcodes decrements a loop counter and goes back to the start of the word. And it can run appreciably faster while it's doing that.
The loop buffer on the 68010 was almost useless, because not only was it only 6 bytes, it only held two instructions. One had to be the loop instruction (DBcc), so the loop body had to be a single instruction. Pretty much the only thing it could speed up in practice was an unoptimized memcpy.
Could anyone do any better on 68000? My incomplete history of CPU dedicated fast paths for moving data:
- 1982 Intel 186/286 'rep movsw' at theoretical 2 cycles per byte (I think it's closer to 4 in practice). Brilliant, then Intel drops the ball for 20 years :|
- 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) at hilarious 7 cycles per byte. Slower than unrolled code, 2x slower than unrolled code using 0 page. Afaik no loop buffer meant it's re-fetching the whole instruction every loop.
- 1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft HuC6280 Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) at hysterical 6 cycles per byte plus 17 cycles startup. (17 + 6x) = ~160KB/s at 7.16 MHz CPU. For comparison IBM XT with 4.77 MHz NEC V20 does >300KB/s
I'm curious about this too. I would expect any RISC architecture to gain relatively little from a loop buffer. The point of RISC is that instruction fetch/decode is substantially easier, if not trivial.
Interesting read. One thing I don’t understand is how much space the loop buffer takes on the die. With it removed, could future chips use that space for something more useful, like a bigger L2 cache?
I think most modern chips are routing constrained and not floorspace constrained. You can build tons of features but getting them all power and normalized signals is an absolute chore.
My understanding is that it's a pretty small optimization on the front end. It doesn't have a lot of entries to begin with (144) so the amount of space saved is probably negligible. Theoretically, the loop buffer would let you save power or improve performance in a tight loop. In practice, it doesn't seem to do either, and AMD removed it completely for Zen 5.
Judging from the diagrams, the loop buffer is using the same storage as the micro-op queue that's there anyway. If that is accurate (and it does seem plausible), then the area cost is just some additional control logic. I suspect the most expensive part is detecting a loop in the first place, but that's probably quite small compared to the size of the queue.
It says 144 micro-op entries per core. Not sure how many bytes that is, but L2 caches these days are around 1MB per core, so assuming the loop buffer die space is mostly storage (sounds like it) then it wouldn't make a notable difference.
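To put rough numbers on that comparison, here is a minimal sketch. The 144 entries and the ~1MB L2 come from the comments above; the bytes-per-entry widths are purely hypothetical values I picked to bracket the estimate, since the real micro-op encoding size isn't stated anywhere in the thread.

```python
ENTRIES = 144                 # micro-op queue entries per core, per the article
L2_BYTES = 1 * 1024 * 1024    # "around 1MB per core", taken as 1 MiB


def queue_fraction_of_l2(bytes_per_entry: int) -> float:
    """Storage of the whole queue as a fraction of a 1 MiB L2."""
    return ENTRIES * bytes_per_entry / L2_BYTES


# Hypothetical entry widths; the real width is not public in this thread.
for width in (8, 16, 32):
    total = ENTRIES * width
    print(f"{width:2d} B/entry -> {total:5d} B total, "
          f"{queue_fraction_of_l2(width):.2%} of L2")
```

Even at the widest guess the queue is well under 1% of the L2, and if the loop buffer really reuses the existing micro-op queue storage, removing it wouldn't free even that much.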
In the "power" section, it seems the analysis doesn't divide by the number of instructions executed per second.
Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).
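A minimal sketch of that normalization, assuming you have an average power reading, an average IPC, and a clock frequency for each run. The two IPC figures are the ones quoted from the article for Cyberpunk 2077; the 80 W and 5.0 GHz values are placeholders I made up to show the arithmetic, not measurements.

```python
def energy_per_instruction_nj(power_w: float, ipc: float, freq_hz: float) -> float:
    """Average energy per retired instruction, in nanojoules."""
    instructions_per_second = ipc * freq_hz
    return power_w / instructions_per_second * 1e9


# Placeholder values (assumptions, not measured data).
ASSUMED_POWER_W = 80.0
ASSUMED_FREQ_HZ = 5.0e9

# IPC figures quoted from the article: 1.25 with the loop buffer, 1.07 without.
on = energy_per_instruction_nj(ASSUMED_POWER_W, 1.25, ASSUMED_FREQ_HZ)
off = energy_per_instruction_nj(ASSUMED_POWER_W, 1.07, ASSUMED_FREQ_HZ)

print(f"loop buffer on:  {on:.2f} nJ/instruction")
print(f"loop buffer off: {off:.2f} nJ/instruction")
```

The point is only that identical wattage can hide a real difference: at equal power and clock, the higher-IPC run spends less energy per instruction, so comparing watts alone can miss the benefit.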
Every instruction takes a different number of clock cycles (and this varies between architectures or iterations of an architecture, such as Zen 4 to Zen 5), so that is not feasible unless running the workload produces exactly the same instructions per cycle, which is impossible due to multithreading/multitasking. Even the ordering and contents of RAM matter, since both can change everything.
While you can somewhat isolate for this by doing hundreds of runs for both on and off, that takes tons of time and still won’t be 100% accurate.
Even disabling the feature can cause the code to use a different branch which may shift everything around.
I am not specifically familiar with this issue, but I have seen cases where disabling a feature shifted the load from integer units to the FPU or the GPU as an example, or added 2 additional instructions while taking away 5.
If it saved power, wouldn’t that lead to less thermal throttling and thus improved performance? That power had to matter, or the feature wouldn’t have been worth adding in the first place.
Not necessarily. Let's say this optimization can save 0.1 W in certain situations. If one of those situations is common when the chip is idle, just keeping wifi alive, well hey, that's 0.1 W in a ~1 W total draw scenario. That's 10%, that's huge!
But when the CPU is pulling 100 W under load? Well, now we're talking about an amount so small it's irrelevant. Maybe with a well-calibrated scope you could figure out if it was on or not.
Since this is in the micro-op queue in the front end, it's going to come into play mostly on that very low total power draw side of things. So this would have been something they were doing to see if it helped for the laptop SKUs, not for the desktop ones.
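The arithmetic behind that comparison, as a quick sketch. The 0.1 W saving and the ~1 W and 100 W totals are the illustrative figures from the comment above, not measurements.

```python
def relative_saving(saved_w: float, total_w: float) -> float:
    """Fraction of total power draw that a fixed saving represents."""
    return saved_w / total_w


SAVED_W = 0.1  # illustrative saving from the comment above

for label, total_w in [("near idle", 1.0), ("under load", 100.0)]:
    print(f"{label}: {relative_saving(SAVED_W, total_w):.1%} of total draw")
```

The same fixed saving is 10% of the budget near idle and 0.1% under load, which is why a feature like this would show up on battery life long before it shows up in a desktop benchmark.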
You're probably right on the mark with this. Though even desktops and servers can benefit from lower idle power draw, so there is a chance that it might have been moved to a different C-state.
There were CPUs with a whole plethora of optional optimizations. For example, Cyrix packed their CPUs with goodies but had no money to test them, so they made it all optional.
L1, Branch Target Buffer, LSSER (load/store reordering), Loop Buffer, Memory Type Range Registers (Write Combining, Cacheability), all controlled using client side software.
Cyrix 5x86 testing of Loop Buffer showed 0.2% average boost and 2.7% maximum observable speed boost.
"Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”."
The problem is not agreeing or not. The problem is posting uninformed opinions without reading the story, which explains in detail that it is both true and not that important (i.e., not "huge").