The vulnerability is speculative execution, not branch prediction. The branch predictor is the thing you have to trick to force the processor to speculatively execute code in the victim program. Furthermore, you also need a valid timing source to read out the results of the speculative execution.
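For concreteness, here is a rough bounds-check-bypass sketch in C (all names are illustrative, not lifted from any particular exploit): the attacker first calls the victim with in-bounds indices to train the predictor, then with an out-of-bounds one; the speculatively executed body encodes the out-of-bounds byte into the cache, and timing loads from the probe buffer afterwards recovers it.

    /* Illustrative bounds-check-bypass gadget; names are made up. */
    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    size_t  array1_size = 16;
    uint8_t probe[256 * 4096];      /* attacker times loads from this afterwards */
    volatile uint8_t sink;          /* keeps the probe load from being optimized away */

    void victim_read(size_t x) {
        if (x < array1_size) {                  /* the branch the attacker trains, then mispredicts */
            uint8_t v = array1[x];              /* speculative out-of-bounds load when x is too large */
            sink = probe[(size_t)v * 4096];     /* leaves a per-value cache footprint */
        }
    }

The readout is the part that needs the timing source: flush probe from the cache, call victim_read with the malicious index, then measure which of the 256 probe lines became hot.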
As for how to stop that, short of boiling the ocean[0], you don't. Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.
Actual practical mitigations for speculative execution vulnerabilities are varied, but at a minimum you have to ensure process separation between a victim process holding secrets and any potential attackers that may have the opportunity to influence victim process execution. Intel was caught with their hands in their pants speculating across rings, which is why you could read kernel or hypervisor memory from userspace, but on not-poorly-designed CPUs the main victim you have to worry about is HTML iframes. Different origins aren't allowed to make HTTP requests to one another[1], but they can transclude[2] one another without permission. That traditionally loaded information from the origin into the attacker's process, which could be exfiltrated with timing attacks.
The web's solution to this was actually not to process-separate iframes, at least not initially, but to take away shared-memory multithreading entirely. If you deny the attacker a timing reference then it doesn't matter what they can make the victim speculatively execute. But to do this you have to take away multithreading because otherwise a thread can just repeatedly write known data in a loop to create a clock.
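The "counter thread" clock is easy to picture. Here is a rough sketch of the idea in C with pthreads rather than a web worker (purely illustrative), showing why shared memory alone is enough to rebuild a timer:

    /* One thread spins incrementing a shared counter; another reads it as a timestamp. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static atomic_uint_fast64_t ticks;

    static void *counter_thread(void *arg) {
        (void)arg;
        for (;;)
            atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, counter_thread, NULL);

        uint_fast64_t before = atomic_load_explicit(&ticks, memory_order_relaxed);
        /* ... the memory access being timed goes here ... */
        uint_fast64_t after = atomic_load_explicit(&ticks, memory_order_relaxed);
        printf("elapsed ~%llu ticks\n", (unsigned long long)(after - before));
        return 0;
    }

This is why browsers disabled SharedArrayBuffer (and coarsened performance.now()) after Spectre, and only re-enabled it behind cross-origin isolation.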
> Speculative execution is so valuable for performance that a computer without it is completely unusable.
Jim Keller's view aligns with this and goes further. My interpretation of his thinking is that predictors and speculation are the only meaningful differentiators of CPUs today. ISA doesn't matter anymore because modern compilers make high-performance software highly portable, and all CPUs end up bottlenecked on the quality and capacity of their predictors, regardless of ISA. For example, the burden of x86 complexity no longer matters because it amounts to a "tax" small enough to be lost in the noise.
That's from a designer making high performance RISC-V CPUs.
This article has several paragraphs discussing how the decode width of x86 front ends is limited by the need to discover instruction boundaries, which in turn limits the useful issue width. Big ARM cores have much larger decode width than x86 cores, so they don’t need SMT to keep their execution units busy.
ISA doesn’t matter any more because all the CISCiest CISCs and the RISCiest RISCs have been discarded (except for RISC-V…), so modern CPUs don’t have to cope with multiple memory addresses per instruction or indirect addressing.
> limited by the need to discover instruction boundaries
Keller specifically addressed this. His view is that yes, the x86 variable-length instruction problem is real, but it's not an important performance bottleneck because decoders are now sophisticated enough that other bottlenecks (the predictors) dominate net performance. Yes, this means more power usage and die area, but only part of the market is sensitive to that: where other considerations dominate, the cost premium of x86 can be ignored.
He also believes that "RISC-V is the future of computing," (not surprising from the CEO of a RISC-V vendor,) so it isn't as if the legacy ISAs are somehow just fine. But ISA complexity isn't the key determinant in that future. The keys are the open ISA, commoditization of designs and high software portability.
This makes sense to me. x86, despite its inherent issues, has fought off alternatives before. But now, as per Keller, there is a Cambrian explosion of RISC-V designs appearing, and what emerges from that is going to prevail against any legacy ISA, not just x86. Ultimately, therefore, the x86 "tax" doesn't actually matter.
> Big ARM cores have much larger decode width than x86 cores
Not in general they don't. Apple's does, and Qualcomm's newest Snapdragon X Elite also does. But most big ARM cores don't have larger decode widths than x86 cores. The Cortex-A78 has a 4-wide decode, the same as the majority of x86 CPUs on the market. ARM's latest & greatest Cortex-X2 is only 5-wide; it only just surpassed the original Zen design (4-wide).
Also Zen 5 bumps to an 8-wide decode, the same as what Apple & Qualcomm's best ARM chips can do.
This is the annoying thing about the ARM hype, which even smart people fall for. Yes, Apple Silicon is good, but it's not because of the ARM ISA. I still keep hearing that ARM is RISC, which hasn't been true since the 1990s.
RISC was not just about the total number of instructions, but the complexity of those instructions. Its fundamental conceit is that every instruction should ideally take just one cycle to execute. Intel's iAPX 432 is one of the most CISCy designs out there and a great case that ISA does matter. It had instructions for stuff like data structures, OOP, and garbage collection. These things could take LOADS of cycles to execute.
In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.
Thumb-2, Neon, VFPv4-D16, VFPv4 are mandatory instruction extension sets for ARM64, but there are also optional, widely implemented extensions like AES and SVE. Ain't nothing "reduced" about the ARM CPUs' instruction sets anymore.
> In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.
integer spec is the keyword here. ARM CPUs these days implement much more than integer spec. "If I ignore everything else, it's RISC" is not a good argument IMHO.
> Thumb-2, [...] are mandatory instruction extension sets for ARM64
Isn't Thumb-2 for 32-bit ARM only? AFAIK, having 32-bit ARM is optional; you can have 64-bit-only ARM nowadays.
(And 64-bit ARM no longer has what is IMO the least RISC instructions from 32-bit ARM: the "load multiple" and "store multiple" instructions, which had a built-in loop doing memory accesses to an arbitrary set of registers, were replaced by "load pair" and "store pair", which do a single memory access and work with a fixed number of registers. If you look at https://www.righto.com/2016/01/more-arm1-processor-reverse-e... you can see that the circuitry dedicated for these instructions took a lot of space in the original 32-bit ARM core.)
ARM64 threw out the book and started over. Stuff like Thumb-2 or VFPvX is for the legacy ISA. There are no variable instruction lengths, and NEON is baked into the core of the ISA.
AES in ARM (and in x86, since those instructions arrived after its shift toward more RISC-like internals) was designed to be RISC. A CISC approach would have just ONE instruction that takes the data pointer and number of rounds and does everything at once. In contrast, x86 and ARM use a couple of simple setup instructions, then an AES round instruction (1-2 cycles) that gets called repeatedly, then a couple more finalizing instructions.
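For example, here is roughly what that round-at-a-time style looks like with the x86 AES-NI intrinsics (the ARMv8 AESE/AESMC pair is used the same way); this sketch assumes the AES-128 key schedule has already been expanded into rk:

    /* One block of AES-128 using per-round instructions; rk[0..10] is the expanded key schedule. */
    #include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

    __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);          /* initial whitening */
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);   /* one full round per instruction */
        return _mm_aesenclast_si128(block, rk[10]);   /* final round omits MixColumns */
    }

Each _mm_aesenc_si128 is a single short-latency instruction, which is exactly the "cheap round primitive called in a loop" shape described above.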
If you look at the code on any machine, something like 95+% of it is integer code. Integers are fundamental and everything else is a performance optimization.
The float/vector pipeline is a different thing, but it still takes a RISC approach. Most things take 1-5 cycles, a few take 6-7, and a couple, like divide and some types of sqrt, can take up to a dozen or so cycles (there's a reason they are avoided). This all happens in its own separate pipeline with its own scheduler. All the ideas of being superscalar and avoiding bubbles apply here too. Even the worst case is a far cry from CISC chips where instructions could take dozens or even hundreds of cycles.
While I understand the argument, it would also be good to see some empirical evidence. So far, every x86 design has needed more power to reach the same performance level as ARM. Of course, Apple is still the outlier.
I think a few other things like memory models also matter and affect cpu architecture. E.g. the x86 total store order vs arm64's model. You potentially get to do a few nice optimizations on arm64 vs x86. I'm not sure how much of a difference that makes though.
Memory models are indeed important. That's why Apple extended ARM with total store order capability. A CPU can efficiently implement more than one memory model.
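A quick way to see the difference is a message-passing litmus test in C11 atomics (illustrative only; the compiler is also free to reorder relaxed accesses, so this is about the hardware model): at the hardware level, x86's TSO never lets the reader observe flag == 1 with data == 0, while the arm64 model allows it unless the flag store/load carry release/acquire semantics. That freedom is what arm64 implementations can exploit, and it is also why emulating x86 efficiently wants a TSO mode like Apple's.

    /* Message-passing litmus test: can the reader see flag == 1 but data == 0? */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int data, flag;

    static void *writer(void *arg) {
        (void)arg;
        atomic_store_explicit(&data, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);   /* wants memory_order_release on arm64 */
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ;                                                    /* wants memory_order_acquire on arm64 */
        printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }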
> How would you go about making that pile of garbage into something fast?
Make a Rosetta analog and translate to an instruction set that is amenable to efficient prediction, wide dispatch, etc., replacing all the iAPX hardware object-oriented nonsense with conventional logic. If necessary, extend the CPU to accommodate whatever memory-ordering behavior is needed for efficient execution, à la Apple Silicon ARM with x86 total store ordering.
> Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.
Pentiums have branch prediction and speculative execution. You need to go back to i486 if you don't want speculative execution. Most of the socket 5/7 processors from other makers also had branch predictors and speculative execution, but not the Centaur Winchip. The Cyrix 5x86 for socket 3 (486) had speculative execution, but it was disabled by default and is reported to be buggy (but helps performance on published benchmarks).
The wiki is wrong or at least misleading. Branch prediction is a form of speculative execution.
Even 486 (and possibly 386) had branch prediction (although a trivial one).
P6 was a huge deal because it added out of order execution.
Edit: I guess it is a matter of semantics: in the classical 5-stage in-order RISC, instructions after the speculated branch are fetched, decoded, etc., but they won't reach the execute stage before the speculation is resolved, so only the branch is technically "executed" speculatively, at the fetch stage. So there is less state to unwind compared to a true OoO machine that can run ahead.
According to this thing I found [0], the 486 predicts branches as not taken. Not sure exactly what that means, but it looks like it's mostly about the instruction fetch/decode, based on other notes. Ironically it sounds close to your 5-stage RISC description, extra ironic because the 486 is 5 stages too...
The P5 had a more "real" branch predictor, but also had the U/V pipelining.
But yes, as far as OoO goes, I think for x86 it was the Nx586 first, then the PPro (P6), then the Cyrix 6x86 [1] and the AMD K6 (courtesy of NexGen tech) the next year.
As far as I know, the 486 predicted all branches as not taken, so it would fetch and decode instructions after a branch and throw them away when the speculation was proven wrong.
I guess calling this speculative execution is a stretch.
[0] https://hackaday.com/2013/08/02/the-mill-cpu-architecture/
[1] At least not without the target origin allowing it via CORS
[2] e.g. hotlink images or embed iframes