The vulnerability is speculative execution, not branch prediction. The branch predictor is the thing you have to trick to force the processor to speculatively execute code in the victim program. Furthermore, you also need a valid timing source to read out the results of the speculative execution.
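For concreteness, here is a rough bounds-check-bypass sketch in C (all names are illustrative, not lifted from any particular exploit): the attacker first calls the victim with in-bounds indices to train the predictor, then with an out-of-bounds one; the speculatively executed body encodes the out-of-bounds byte into the cache, and timing loads from the probe buffer afterwards recovers it.

    /* Illustrative bounds-check-bypass gadget; names are made up. */
    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    size_t  array1_size = 16;
    uint8_t probe[256 * 4096];      /* attacker times loads from this afterwards */
    volatile uint8_t sink;          /* keeps the probe load from being optimized away */

    void victim_read(size_t x) {
        if (x < array1_size) {                  /* the branch the attacker trains, then mispredicts */
            uint8_t v = array1[x];              /* speculative out-of-bounds load when x is too large */
            sink = probe[(size_t)v * 4096];     /* leaves a per-value cache footprint */
        }
    }

The readout is the part that needs the timing source: flush probe from the cache, call victim_read with the malicious index, then measure which of the 256 probe lines became hot.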
As for how to stop that, short of boiling the ocean[0], you don't. Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.
Actual practical mitigations for speculative execution vulnerabilities are varied, but at a minimum you have to ensure process separation between a victim process holding secrets and any potential attackers that may have the opportunity to influence victim process execution. Intel was caught with their hands in their pants speculating across rings, which is why you could read kernel or hypervisor memory from userspace, but on not-poorly-designed CPUs the main victim you have to worry about is HTML iframes. Different origins aren't allowed to make HTTP requests to one another[1], but they can transclude[2] one another without permission. That traditionally loaded information from the origin into the attacker's process, which could be exfiltrated with timing attacks.
The web's solution to this was actually not to process-separate iframes, at least not initially, but to take away shared-memory multithreading entirely. If you deny the attacker a timing reference then it doesn't matter what they can make the victim speculatively execute. But to do this you have to take away multithreading because otherwise a thread can just repeatedly write known data in a loop to create a clock.
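The "counter thread" clock is easy to picture. Here is a rough sketch of the idea in C with pthreads rather than a web worker (purely illustrative), showing why shared memory alone is enough to rebuild a timer:

    /* One thread spins incrementing a shared counter; another reads it as a timestamp. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static atomic_uint_fast64_t ticks;

    static void *counter_thread(void *arg) {
        (void)arg;
        for (;;)
            atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, counter_thread, NULL);

        uint_fast64_t before = atomic_load_explicit(&ticks, memory_order_relaxed);
        /* ... the memory access being timed goes here ... */
        uint_fast64_t after = atomic_load_explicit(&ticks, memory_order_relaxed);
        printf("elapsed ~%llu ticks\n", (unsigned long long)(after - before));
        return 0;
    }

This is why browsers disabled SharedArrayBuffer (and coarsened performance.now()) after Spectre, and only re-enabled it behind cross-origin isolation.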
> Speculative execution is so valuable for performance that a computer without it is completely unusable.
Jim Keller's view aligns with this and goes further. My interpretation of his thinking is that predictors and speculation are the only meaningful differentiators of CPUs today. ISA doesn't matter anymore because modern compilers make high-performance software highly portable, and all CPUs end up bottlenecked on the quality and capacity of their predictors, regardless of ISA. For example, the burden of x86 complexity no longer matters because it amounts to a "tax" small enough to be lost in the noise.
That's from a designer making high performance RISC-V CPUs.
This article has several paragraphs discussing how the decode width of x86 front ends is limited by the need to discover instruction boundaries, which in turn limits the useful issue width. Big ARM cores have much larger decode width than x86 cores, so they don’t need SMT to keep their execution units busy.
ISA doesn’t matter any more because all the CISCiest CISCs and the RISCiest RISCs have been discarded (except for RISC-V…), so modern CPUs don’t have to cope with multiple memory addresses per instruction or indirect addressing.
> limited by the need to discover instruction boundaries
Keller specifically addressed this. His view is that yes, the x86 variable-length instruction problem is real, but it's not an important performance bottleneck because decoders are now sophisticated enough that other bottlenecks (the predictors) dominate net performance. Yes, this means more power usage and die area, but only part of the market is sensitive to that: where other considerations dominate, the cost premium of x86 can be ignored.
He also believes that "RISC-V is the future of computing," (not surprising from the CEO of a RISC-V vendor,) so it isn't as if the legacy ISAs are somehow just fine. But ISA complexity isn't the key determinant in that future. The keys are the open ISA, commoditization of designs and high software portability.
This makes sense to me. x86, despite its inherent issues, has fought off alternatives before. But now, as per Keller, there is a Cambrian explosion of RISC-V designs appearing, and what emerges from that is going to prevail against any legacy ISA, not just x86. Ultimately, therefore, the x86 "tax" doesn't actually matter.
> Big ARM cores have much larger decode width than x86 cores
Not in general they don't. Apple's does, and Qualcomm's newest Snapdragon X Elite also does. But most big ARM cores don't have larger decode widths than x86 cores. The Cortex-A78 has a 4-wide decode, the same as the majority of x86 CPUs on the market. ARM's latest & greatest Cortex-X2 is only 5-wide; it only just surpassed the original Zen design (4-wide).
Also Zen 5 bumps to an 8-wide decode, the same as what Apple & Qualcomm's best ARM chips can do.
This is the annoying thing about the ARM hype, which even smart people fall for. Yes, Apple Silicon is good, but it's not because of the ARM ISA. I still keep hearing that ARM is RISC, which hasn't been true since the 1990s.
RISC was not just about the total number of instructions, but the complexity of those instructions. Its fundamental conceit is that every instruction should ideally take just one cycle to execute. Intel's iAPX 432 is one of the most CISCy designs out there and a great case that ISA does matter. It had instructions for stuff like data structures, OOP, and garbage collection. These things could take LOADS of cycles to execute.
In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.
Thumb-2, Neon, VFPv4-D16, VFPv4 are mandatory instruction extension sets for ARM64, but there are also optional, widely implemented extensions like AES and SVE. Ain't nothing "reduced" about the ARM CPUs' instruction sets anymore.
> In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.
integer spec is the keyword here. ARM CPUs these days implement much more than integer spec. "If I ignore everything else, it's RISC" is not a good argument IMHO.
> Thumb-2, [...] are mandatory instruction extension sets for ARM64
Isn't Thumb-2 for 32-bit ARM only? AFAIK, having 32-bit ARM is optional; you can have 64-bit-only ARM nowadays.
(And 64-bit ARM no longer has what is IMO the least RISC instructions from 32-bit ARM: the "load multiple" and "store multiple" instructions, which had a built-in loop doing memory accesses to an arbitrary set of registers, were replaced by "load pair" and "store pair", which do a single memory access and work with a fixed number of registers. If you look at https://www.righto.com/2016/01/more-arm1-processor-reverse-e... you can see that the circuitry dedicated for these instructions took a lot of space in the original 32-bit ARM core.)
ARM64 threw out the book and started over. Stuff like Thumb-2 or VFPvX is for the legacy ISA. There are no variable instruction lengths, and NEON is baked into the core of the ISA.
AES in ARM (and in x86, since those instructions arrived after its shift toward more RISC-like internals) was designed to be RISC. A CISC approach would have just ONE instruction that takes the data pointer and number of rounds and does everything at once. In contrast, x86 and ARM use a couple of simple setup instructions, then an AES round instruction (1-2 cycles) that gets called repeatedly, then a couple more finalizing instructions.
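For example, here is roughly what that round-at-a-time style looks like with the x86 AES-NI intrinsics (the ARMv8 AESE/AESMC pair is used the same way); this sketch assumes the AES-128 key schedule has already been expanded into rk:

    /* One block of AES-128 using per-round instructions; rk[0..10] is the expanded key schedule. */
    #include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

    __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);          /* initial whitening */
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);   /* one full round per instruction */
        return _mm_aesenclast_si128(block, rk[10]);   /* final round omits MixColumns */
    }

Each _mm_aesenc_si128 is a single short-latency instruction, which is exactly the "cheap round primitive called in a loop" shape described above.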
If you look at the code on any machine, something like 95+% of it is integer code. Integers are fundamental and everything else is a performance optimization.
The float/vector pipeline is a different thing, but it still takes a RISC approach. Most things take 1-5 cycles, a few take 6-7, and a couple, like divide and some types of sqrt, can take up to a dozen or so cycles (there's a reason they are avoided). This all happens in its own separate pipeline with its own scheduler. All the ideas of being superscalar and avoiding bubbles apply here too. Even the worst case is a far cry from CISC chips where instructions could take dozens or even hundreds of cycles.
While I understand the argument, it would also be good to see some empirical evidence. So far, every x86 design has needed more power to reach the same performance level as ARM. Of course, Apple is still the outlier.
I think a few other things like memory models also matter and affect cpu architecture. E.g. the x86 total store order vs arm64's model. You potentially get to do a few nice optimizations on arm64 vs x86. I'm not sure how much of a difference that makes though.
Memory models are indeed important. That's why Apple extended ARM with total store order capability. A CPU can efficiently implement more than one memory model.
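A quick way to see the difference is a message-passing litmus test in C11 atomics (illustrative only; the compiler is also free to reorder relaxed accesses, so this is about the hardware model): at the hardware level, x86's TSO never lets the reader observe flag == 1 with data == 0, while the arm64 model allows it unless the flag store/load carry release/acquire semantics. That freedom is what arm64 implementations can exploit, and it is also why emulating x86 efficiently wants a TSO mode like Apple's.

    /* Message-passing litmus test: can the reader see flag == 1 but data == 0? */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int data, flag;

    static void *writer(void *arg) {
        (void)arg;
        atomic_store_explicit(&data, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);   /* wants memory_order_release on arm64 */
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ;                                                    /* wants memory_order_acquire on arm64 */
        printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }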
> How would you go about making that pile of garbage into something fast?
Make a Rosetta analog and translate to an instruction set that is amenable to efficient prediction, wide dispatch, etc., replacing all the iAPX hardware object-oriented nonsense with conventional logic. If necessary, extend the CPU to accommodate whatever memory-ordering behavior is needed for efficient execution, à la Apple Silicon ARM with x86 total store ordering.
> Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.
Pentiums have branch prediction and speculative execution. You need to go back to i486 if you don't want speculative execution. Most of the socket 5/7 processors from other makers also had branch predictors and speculative execution, but not the Centaur Winchip. The Cyrix 5x86 for socket 3 (486) had speculative execution, but it was disabled by default and is reported to be buggy (but helps performance on published benchmarks).
The wiki is wrong or at least misleading. Branch prediction is a form of speculative execution.
Even 486 (and possibly 386) had branch prediction (although a trivial one).
P6 was a huge deal because it added out of order execution.
Edit: I guess it is a matter of semantics: in the classical 5-stage in-order RISC, instructions after the speculated branch are fetched, decoded, etc., but they won't reach the execute stage before the speculation is resolved, so only the branch is technically "executed" speculatively, at the fetch stage. So there is less state to unwind compared to a true OoO machine that can run ahead.
According to this thing I found [0], the 486 predicts branches as not taken. Not sure exactly what that means, but it looks like it's mostly about the instruction fetch/decode, based on other notes. Ironically it sounds close to your 5-stage RISC description, extra ironic because the 486 is 5 stages too...
The P5 had a more "real" branch predictor, but also had the U/V pipelining.
But yes, as far as OoO goes, I think for x86 it was the Nx586 first, then the PPro (P6), then the Cyrix 6x86 [1] and the AMD K6 (courtesy of NexGen tech) the next year.
As far as I know, the 486 predicted all branches as not taken, so it would fetch and decode instructions after a branch and throw them away when the speculation was proven wrong.
I guess calling this speculative execution is a stretch.
[0] https://hackaday.com/2013/08/02/the-mill-cpu-architecture/
[1] At least not without the target origin allowing it via CORS
[2] e.g. hotlink images or embed iframes