
Intel chips are RISC under the hood these days (and have been for a long while - a decade or more). They're CISC at the ASM layer, before the instructions are decoded into micro-ops and dispatched.



The idea that Intel is “RISC under the hood” is too simplistic.

Instructions get decoded and some end up looking like RISC instructions, but there is so much macro- and micro-op fusion going on, as well as reordering, etc., that it is nothing like a RISC machine.

(The whole argument is kind of pointless anyway.)


With all the optimizations going on, high performance RISC designs don't look like RISC designs anymore either. The ISA has very little to do with whatever the execution units actually see or execute.


It is baffling to me that byte-code is essentially a 'high level language' these days.


And yet, when I first approached C, it was considered a "high level" language.


Because the functional units are utilizing microcode? Or do you mean something else?


Possibly stupid questions from someone completely ignorant about hardware:

If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer? If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

Alternatively, could Intel theoretically implement a new, non-x86 ASM layer that would decode down to better optimized microcode?


> If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer?

The microcode is the CPU firmware that turns x86 (or whatever) instructions into micro-ops. In theory, if you knew all the internals of your CPU, you could upload your own microcode that would run some custom ISA (which could be straight micro-ops, I guess).

> If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

The most important concern for modern code performance is cache efficiency, so an x86-style instruction set can actually lead to better performance than a vanilla RISC-style one (the complex instructions act as a de facto compression mechanism); compare ARM Thumb.
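
As a rough sketch of that compression effect (my own toy example; the encodings and byte counts in the comments are typical but not guaranteed, and depend on the compiler and register allocation):

    /* The same statement, *p += x, on three ISAs: */
    void bump(int *p, int x) {
        *p += x;
        /* x86-64:  add [rdi], esi     ; 1 insn,  2 bytes
           RV32I:   lw  a2, 0(a0)      ; 3 insns, 12 bytes
                    add a2, a2, a1
                    sw  a2, 0(a0)
           RV32IC:  c.lw  a2, 0(a0)    ; 3 insns,  6 bytes
                    c.add a2, a1
                    c.sw  a2, 0(a0)
           Thumb-2: ldr  r2, [r0]      ; 3 insns,  6 bytes
                    adds r2, r2, r1
                    str  r2, [r0]  */
    }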

Instruction sets that are more efficient than x86 are certainly possible, especially if you allowed the compiler to come up with a custom instruction set for your particular program and microcode to implement it. (It'd be like a slightly more efficient version of how the high-performance K programming language works: interpreted, but by an interpreter designed to be small enough to fit into L1 cache). But we're talking about a small difference; existing processors are designed to implement x86 efficiently, and there's been a huge amount of compiler work put into producing efficient x86.


CISC is still an asset rather than a liability, though, as it means you can fit more code into cache.


I don't think that's an advantage these days. The bottleneck seems to be decoding instructions, and that's easier to parallelize if instructions are fixed width. Case in point: The big cores on Apple's A11 and A12 SoCs can decode 7 instructions per cycle. Intel's Skylake can do 5. Intel CPUs also have μop caches because decoding x86 is so expensive.


Maybe the golden middle path is compressed RISC instructions, e.g. the RISC-V C extension, where the most commonly used instructions take 16 bits and the full 32-bit instructions are still available. Density is apparently slightly better than x86-64, while being easier to decode.

(Yes, I'm aware there's no high-performance RISC-V core available (yet) comparable to x86-64 or POWER, or even the higher-end ARM ones.)
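
A sketch of the "easier to decode" part: per the RISC-V spec, the length of an instruction is determined by the low bits of its first 16-bit parcel alone (ignoring the spec's reserved longer-than-32-bit formats), so a wide decoder can find every instruction boundary in a fetch group in parallel. The helper function is mine:

    #include <stdint.h>

    /* RISC-V length rule: low two bits == 11 means a 32-bit
       instruction; anything else is a 16-bit compressed one. */
    static inline int rv_insn_len(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }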


Sure, but Intel CISC instructions can do more, so in the end it's a wash.


That's not the case. Only one of Skylake's decoders can translate complex x86 instructions. The other 4 are simple decoders, and can only transform a simple x86 instruction into a single µop. At most, Skylake's decoder can emit 5 µops per cycle.[1]

1. https://en.wikichip.org/wiki/intel/microarchitectures/skylak...


... So what? Most code is hot and should be issued from the µop cache at 6 µops/cycle, with an "80%+ hit rate" according to your source.

You're really not making the case that decode is the bottleneck. Are you unaware of the mitigations that x86 designs have adopted to alleviate it, or are those mitigations your proof that the ISA is deficient?


That really isn't true in the modern world. x86 has things like load-op and large inline constants, but ARM has things like load/store multiple, predication, and more registers. They tend to take about the same number of instructions per executable and about the same number of bytes per instruction.

If you're comparing to MIPS, then sure, x86 is more efficient. And x86 instructions do more than RISC-V's, but most high-performance RISC-V designs use instruction compression and front-end fusion for similar pipeline and cache usage.


(Generalized) predication is not a thing in ARM64. Is Apple's CPU 7-wide even in 32-bit mode?

It is true, though, as chroma pointed out, that Intel can't decode load-op at full width.


You can fit more code into the same sized cache, but you also need an extra cache layer for the decoded µops, and a much more complicated fetch/decode/dispatch part of the pipeline. It clearly works, at least for the high per-core power levels that Intel targets, but it's not obvious whether it saves transistors or improves performance compared to having an instruction set that accurately reflects the true execution resources, and just increasing the L1i$ size. Ultimately, only one of the strategies is viable when you're trying to maintain binary compatibility across dozens of microarchitecture generations.


The fact is that a post-decode cache is desirable even on a high-performance RISC design, since skipping fetch and decode helps both performance and power usage there too.

IBM's POWER9, for example, has a predecode stage before L1.

You could say that, in general, RISCs can get away without extra complexity for longer, while x86 must implement it early (this is also true, for example, of memory speculation due to the more restrictive Intel memory model, or of optimized hardware TLB walkers). But in the end that can be an advantage for x86 (a more mature implementation).


In theory, yes. In practice x86-64, while it was the right solution for the market, isn't a very efficient encoding and doesn't fit any more code in cache than pragmatic RISC designs like ARM. It still beats more purist RISC designs like MIPS but not by as much as pure x86 did.

It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock. But legacy compatibility means that that scheme will not be x86 based.


>"It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock."

How might a self-synchronizing encoding scheme work? How could a decoder be divorced from the clock pulse? I am intrigued by this idea.


What I mean is self-synchronizing like UTF-8: for example, the first bit of a byte being 1 if it's the start of an instruction and 0 otherwise. Just enough to know where the instruction starts are without having to decode the instructions up to that point, and so that a jump to an address in the middle of an instruction can raise a fault. Checking the security of x86 executables can be hard sometimes because reading a stream of instructions starting from address FOO will give you innocuous instructions, whereas reading from address FOO+1 will give you a different stream of instructions that does something malicious.
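
A minimal sketch of what that boundary marking buys you (the bit layout here is a toy of my own, not any real ISA):

    #include <stddef.h>
    #include <stdint.h>

    /* Toy scheme: the top bit of each byte is 1 for an instruction's
       lead byte and 0 for a continuation byte, like UTF-8. Every
       boundary in a fetch window can then be found without serially
       decoding the instructions before it: */
    size_t find_starts(const uint8_t *win, size_t len, size_t *starts) {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            if (win[i] & 0x80)          /* lead byte => boundary */
                starts[n++] = i;
        return n;
    }

A jump target whose byte has the top bit clear is provably mid-instruction, so the CPU could fault on it immediately, which also kills the FOO vs. FOO+1 disassembly ambiguity.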


Sure, so what's your 6-byte ARM equivalent of FXSAVE/FXRSTOR?


What is an example of a commonly used complex instruction that is "simplified" or simply doesn't exist in RISC? (In asm, not binary.)


Load-op instructions do not normally exist on RISC, but are common on CISC.
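
A small example (mine; the comments show roughly what a compiler emits for the core of the function, assuming the standard calling conventions):

    /* x += *p: one load-op instruction on x86-64, a separate
       load and ALU op on a classic RISC. */
    int add_mem(int x, const int *p) {
        return x + *p;
        /* x86-64:  add edi, [rsi]     ; load folded into the add
           RISC-V:  lw  a2, 0(a1)      ; explicit load...
                    add a0, a0, a2     ; ...then the op */
    }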


You do have to worry about the µ-op cache nowadays.



