
Intel has always been a process-first, design-second company. The company was founded by physicists and chemists. Their process was always the best in the world, until just recently.

So they had a particular advantage, and exploited the heck out of it, but now the potency of that advantage is largely gone?




I don't know what country you're in, but in cricket there's a concept of innings and scoring runs. There's this dude who averaged nearly 100 per innings, while most others average 50.

Now think of the situation as him scoring a few knots. Is he old and retiring? Or is this just a slump in form? Nobody knows!

I worked for a design team and we were proud of our materials engineers.


Back in about 1996, most of the profs were going on about how x86 would crumble under the weight of the ISA, and RISC was the future. One of my profs knew people at Intel, and talked of a roadmap they had for kicking butt for the next dozen years. Turns out, the roadmap was more or less right.

Is there more roadmap?


There's just no way that's true. Their roadmap in 1996 was moving everyone to ia64/itanium. That was an unmitigated disaster and they were forced to license x64 from AMD.

If it weren't for their illegal activity (threats/bribes to partners) to stifle AMD's market penetration, the market would likely look very different today.


> There's just no way that's true. Their roadmap in 1996 was moving everyone to ia64/itanium. That was an unmitigated disaster and they were forced to license x64 from AMD.

Yup, and their x86 backup plan (Netburst scaling all the way to 10GHz) was a dead end too.


But their plan C (revive the Pentium III architecture) worked perfectly.

We will have to see if they have a plan C now (plan B being yet another iteration of the Lake architecture with little changes).


Their plan C was a complete fluke and only came together because Intel's Israeli team managed to put out Centrino. I don't think such a fluke is possible when we're at the limits of process design and everything takes tens of billions of dollars and half a decade of lead time to implement.


Having multiple competent design teams working on potentially competing products all the time is one of the strengths of Intel, I wouldn't call it a fluke.

Things do look dire right now, I agree.


I'm not that up on Intel at the moment. Why are they stuck at more iterations of the Lake architecture with little changes?

What was the plan "A"?


Get 10nm out.


Doesn't it look like they're shifting to do chiplets as well at the moment? Copying AMD might be their plan C, but it won't help if AMD can steam ahead with TSMC 7nm while Intel is locked to 14nm for a couple of years. That's going to hurt a lot.


TSMC's 7nm and Intel's 14nm are about the same in actual dimensions on silicon IIRC. The names for the processes are mostly fluff.


AFAIK, supposedly TSMC 7nm and Intel 10nm are about equivalent, but with 10nm being in limbo, TSMC is ahead now.


That’s also how I understand it, which seems to be supported by perf/watt numbers of Apple’s 2018 chips.


Likewise, if AMD wasn't a thing, maybe this laptop would be running Itanium instead.


Intel chips are RISC under the hood these days (and have been for a long while - a decade or more). They're CISC at the ASM layer, before the instructions are decoded and dispatched by the microcode.


The idea that Intel is “RISC under the hood” is too simplistic.

Instructions get decoded and some end up looking like RISC instructions, but there is so much macro- and micro-op fusion going on, as well as reordering, etc, that it is nothing like a RISC machine.

(The whole argument is kind of pointless anyway.)


With all the optimizations going on, high performance RISC designs don't look like RISC designs anymore either. The ISA has very little to do with whatever the execution units actually see or execute.


It is baffling to me that byte-code is essentially a 'high level language' these days.


And yet, when I first approached C, it was considered a "high level" language.


Because the functional units are utilizing microcode? Or do you mean something else?


Possibly stupid questions from someone completely ignorant about hardware:

If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer? If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

Alternatively, could Intel theoretically implement a new, non-x86 ASM layer that would decode down to better optimized microcode?


> If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer?

The microcode is the CPU firmware that turns x86 (or whatever) instructions into micro-ops. In theory if you knew about all the internals of your CPU you could upload your own microcode that would run some custom ISA (which could be straight micro-ops I guess).

> If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

The most important concern for modern code performance is cache efficiency, so an x86-style instruction set actually leads to better performance than a vanilla RISC-style one (the complex instructions act as a de facto compression mechanism) - compare ARM Thumb.

Instruction sets that are more efficient than x86 are certainly possible, especially if you allowed the compiler to come up with a custom instruction set for your particular program and microcode to implement it. (It'd be like a slightly more efficient version of how the high-performance K programming language works: interpreted, but by an interpreter designed to be small enough to fit into L1 cache). But we're talking about a small difference; existing processors are designed to implement x86 efficiently, and there's been a huge amount of compiler work put into producing efficient x86.
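
(To give a flavour of that last idea - this is only a toy sketch with a made-up "instruction set", not anything Intel or K actually ships: the compiler picks a handful of operations the program needs, and a tiny interpreter loop dispatches them.)

    # Toy program-specific "instruction set" plus a tiny interpreter loop.
    # Purely illustrative; a real scheme would target microcode/L1, not Python.
    PUSH, ADD, MUL, PRINT = range(4)   # hypothetical opcodes for one program

    def run(bytecode, constants):
        stack = []
        for op, arg in bytecode:
            if op == PUSH:
                stack.append(constants[arg])
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == MUL:
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == PRINT:
                print(stack.pop())

    # (2 + 3) * 4 encoded in the custom bytecode; prints 20.
    run([(PUSH, 0), (PUSH, 1), (ADD, 0), (PUSH, 2), (MUL, 0), (PRINT, 0)],
        constants=[2, 3, 4])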


CISC is still an asset rather than a liability, though, as it means you can fit more code into cache.


I don't think that's an advantage these days. The bottleneck seems to be decoding instructions, and that's easier to parallelize if instructions are fixed width. Case in point: The big cores on Apple's A11 and A12 SoCs can decode 7 instructions per cycle. Intel's Skylake can do 5. Intel CPUs also have μop caches because decoding x86 is so expensive.
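
To illustrate why fixed width parallelizes so easily, here's a toy sketch (in Python, purely conceptual, not a model of any real decoder): with fixed 4-byte instructions every decoder slot knows its start offset up front, while with variable-length instructions each start offset depends on the lengths of everything before it.

    # Toy illustration: fixed-width decode is embarrassingly parallel,
    # variable-width decode is inherently serial (or speculative).
    def fixed_width_starts(window_len, width=4):
        # Every decoder slot can grab its bytes independently of the others.
        return list(range(0, window_len, width))

    def variable_width_starts(lengths):
        # Each instruction's start depends on all the previous lengths
        # (x86 instructions can be anywhere from 1 to 15 bytes long).
        starts, offset = [], 0
        for length in lengths:
            starts.append(offset)
            offset += length
        return starts

    print(fixed_width_starts(16))                  # [0, 4, 8, 12]
    print(variable_width_starts([1, 3, 7, 2, 3]))  # [0, 1, 4, 11, 13]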


Maybe the golden middle path is compressed RISC instructions, e.g. the RISC-V C extension, where the most commonly used instructions take 16 bits and the full 32-bit instructions are still available. Density is apparently slightly better than x86-64, while being easier to decode.

(Yes, I'm aware there's no high-performance RISC-V core available (yet) comparable to x86-64 or POWER, or even the higher-end ARM ones.)
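
To make the compressed-encoding point concrete: a RISC-V decoder can tell a 16-bit compressed instruction from a 32-bit one just by looking at the two lowest bits of the first halfword (anything other than 0b11 means compressed), so instruction boundaries stay cheap to find. A minimal sketch of that length check, in Python for illustration:

    # Sketch of RISC-V instruction-length detection from the low bits of the
    # first 16-bit parcel, per the RVC encoding scheme (>32-bit formats ignored).
    def rv_instruction_length(first_halfword):
        if first_halfword & 0b11 != 0b11:
            return 2   # compressed (C extension) instruction, 16 bits
        return 4       # standard 32-bit instruction

    print(rv_instruction_length(0x4501))   # c.li a0, 0 -> 2 bytes
    print(rv_instruction_length(0x0513))   # low half of addi a0, x0, 10 -> 4 bytes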


Sure, but Intel CISC instructions can do more, so in the end it's a wash.


That's not the case. Only one of Skylake's decoders can translate complex x86 instructions. The other 4 are simple decoders, and can only transform a simple x86 instruction into a single µop. At most, Skylake's decoder can emit 5 µops per cycle.[1]

1. https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
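
To picture what that constraint means for throughput, here's a toy model (the 1-complex + 4-simple, 5-wide numbers come from the wikichip page above; the scheduling itself is just an illustration, not real hardware behaviour):

    # Toy model of a 1-complex + 4-simple decode group in the legacy decode
    # path. Illustrative only; real decoder steering is more involved.
    def decode_cycles(instrs):
        # instrs: list of 'simple'/'complex'. Each cycle takes at most one
        # complex instruction and five instructions total.
        cycles, i = 0, 0
        while i < len(instrs):
            complex_used, taken = 0, 0
            while i < len(instrs) and taken < 5:
                if instrs[i] == 'complex':
                    if complex_used == 1:
                        break          # a second complex instr waits a cycle
                    complex_used += 1
                taken += 1
                i += 1
            cycles += 1
        return cycles

    print(decode_cycles(['simple'] * 5))                              # 1
    print(decode_cycles(['complex', 'complex', 'simple', 'simple']))  # 2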


... so what? Most code is hot and should be issued from the µop cache at 6 µops/cycle, with an "80%+ hit rate", per your source.

You're really not making the case that "decode" is the bottleneck. Are you unaware of the mitigations that x86 designs have taken to alleviate it, or are those mitigations your proof that the ISA is deficient?


That really isn't true in the modern world. x86 has things like load-op and large inline constants but ARM has things like load or store multiple, predication, and more registers. They tend to take about the same number of instructions per executable and about the same number of bytes per instruction.

If you're comparing to MIPS then sure, x86 is more efficient. And x86 instructions do more than RISC-V's, but most high-performance RISC-V uses instruction compression and front-end fusion for similar pipeline and cache usage.


(Generalized) predication is not a thing in ARM64. Is Apple's CPU 7-wide even in 32-bit mode?

It is true though, as chroma pointed out, that Intel can't decode load-op at full width.


You can fit more code into the same sized cache, but you also need an extra cache layer for the decoded µops, and a much more complicated fetch/decode/dispatch part of the pipeline. It clearly works, at least for the high per-core power levels that Intel targets, but it's not obvious whether it saves transistors or improves performance compared to having an instruction set that accurately reflects the true execution resources, and just increasing the L1i$ size. Ultimately, only one of the strategies is viable when you're trying to maintain binary compatibility across dozens of microarchitecture generations.


The fact is that a post-decode cache is desirable even on a high-performance RISC design, as skipping fetch and decode there too helps both performance and power usage.

IBM's POWER9, for example, has a predecode stage before L1.

You could say that, in general, RISCs can get away without the extra complexity for longer while x86 must implement it early (this is also true, for example, for memory speculation due to the more restrictive Intel memory model, or for optimized hardware TLB walkers), but in the end it can be an advantage for x86 (a more mature implementation).


In theory, yes. In practice x86-64, while it was the right solution for the market, isn't a very efficient encoding and doesn't fit any more code in cache than pragmatic RISC designs like ARM. It still beats more purist RISC designs like MIPS but not by as much as pure x86 did.

It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock. But legacy compatibility means that that scheme will not be x86 based.


>"It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock."

How might a self-synchronizing encoding scheme work? How could a decoder be divorced from the clock pulse? I am intrigued by this idea.


What I mean is self-synchronizing like UTF-8. For example, the first bit of a byte being 1 if it's the start of an instruction and 0 otherwise. Just enough to know where the instruction starts are without having to decode the instructions up to that point, and so that a jump to an address that's in the middle of an instruction can raise a fault. Checking the security of x86 executables can be hard sometimes because reading a string of instructions starting from address FOO will give you a stream of innocuous instructions, whereas reading starting at address FOO+1 will give you a different stream of instructions that does something malicious.
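
A toy version of that marker-bit idea (just a sketch of the concept in Python, not any real ISA's encoding): reserve the top bit of every byte to flag "start of instruction", and you can find every boundary in a window without decoding anything, and reject a jump into the middle of an instruction.

    # Toy self-synchronizing encoding: the top bit of a byte is set only on
    # the first byte of an instruction (similar in spirit to UTF-8 lead bytes).
    def instruction_starts(code):
        # All offsets that begin an instruction, found with no decoding at all.
        return [i for i, byte in enumerate(code) if byte & 0x80]

    def check_jump_target(code, target):
        # A jump into the middle of an instruction can be rejected outright.
        if not code[target] & 0x80:
            raise ValueError("misaligned jump target: %#x" % target)

    # Three instructions of lengths 2, 1 and 3; start bytes have the top bit set.
    code = bytes([0x81, 0x05, 0x90, 0xA2, 0x10, 0x7F])
    print(instruction_starts(code))   # [0, 2, 3]
    check_jump_target(code, 3)        # fine: an instruction starts here
    # check_jump_target(code, 4)      # would raise: middle of an instruction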


Sure, so what's your 6-byte ARM equivalent for FXSAVE/FXRSTOR?


What is an example of a commonly used complex instruction that is "simplified"/DNE in RISC? (in asm, not binary)


Load-op instructions do not normally exist on RISC but are common on CISC - e.g. x86's add eax, [rdi], which loads from memory and adds in a single instruction, where a RISC would need a separate load followed by an add.


You do have to worry about the µ-op cache nowadays.


Those profs were still living in 1990 when the x86 tax was still a real issue. As cores get bigger the extra effort involved in handling the x86 ISA gets proportionally smaller. x86 has accumulated a lot of features over the years and figuring out how, e.g., call gates interact with mis-speculated branches means an x86 design will take more engineering effort than an equivalent RISC design. But with Intel's huge volumes they can more than afford that extra effort.

Of course Intel has traditionally always used their volume to be ahead in process technology and at the moment they seem to be slipping behind. So who knows.


>"As cores get bigger the extra effort involved in handling the x86 ISA gets proportionally smaller."

Can you elaborate on what you mean here? Do you mean as the number of cores gets bigger? Surely the size of the cores has been shrinking no?

>"Of course Intel has traditionally always used their volume to be ahead in process technology"

What's the correlation between larger volumes and quicker advances in process technology? Is it simply more cash to put back into R&D?


When RISC was first introduced, its big advantage was that by reducing the number of instructions it could handle, the whole processor could fit onto a single chip, whereas CISC designs took multiple chips. In the modern day it takes a lot more transistors and power to decode 4 x86 instructions in one cycle than 4 RISC instructions, because you know the RISC instructions are going to start at bytes 0, 4, 8, and 12, whereas the x86 instructions could be starting on any bytes in the window. So you have to look at most of the bytes as if they could be an instruction start until later in the cycle, when you figure out whether they were or not. And any given bit in the instruction might be put to more possible uses, increasing the logical depth of the decoder.

But that complexity only goes up linearly with pipeline depth, in contrast to structures like the ROB that grow as the square of the depth. So it's not really a big deal. An ARM server is more likely to just slap on 6 decoders to the front end because "why not?", whereas x86 processors will tend to limit themselves to 4, but that very rarely makes any sort of difference in normal code. The decode stage is just a small proportion of the overall transistor and power cost of a deeply pipelined out-of-order chip.

In, say, dual-issue in-order processors like an A53, the decode tax of x86 is actually an issue, and that's part of why you don't see BIG.little approaches in x86 land and why Atom did so poorly in the phone market.

For your second question, yes, spending more money means you can pursue more R&D and tend to bring up new process nodes more quickly. Being ahead means that your competitors can see which approaches worked out and which didn't, and so redirect their research more profitably, for a rubber-band effect; plus you're all reliant on the same suppliers for input equipment, so a given advantage in expenditure tends to lead to a finite rather than an ever-increasing lead.


Thanks for the thorough, detailed reply, I really appreciate it. I failed to grasp one thing you mentioned, which is:

>"is actually an issue and that's part of why you don't see BIG.little approaches in x86 land and why atom did so poorly in the phone market."

Is BIG an acronym here? I had trouble understanding that sentence. Cheers.


I was reproducing an ARM marketing term incorrectly.

https://en.wikipedia.org/wiki/ARM_big.LITTLE

Basically, the idea is that you have a number of small, low power cores together with a few larger, faster, but less efficient cores. Intel hasn't made anything doing that. Intel also tried to get x86 chips into phones but it didn't work out for them.


Thanks, this is actually a good read and clever bit of marketing. Cheers.


Your parallel is hard to follow for people who don't watch cricket. I have no idea how 100 or 50 "innings" relate to a few "knots". Are they like some sort of weird imperial measures? (furlongs vs fathoms?)


I suspect that "knots" was supposed to be "noughts", a.k.a zeros. That is, the last few times the 100-point batsman was at bat, he got struck out without scoring any points. Is he washed up?

I don't think it's a very useful analogy. :)


Knots as in ducks?



