Sandy bridge, Ivy bridge, Haswell and the mess of Intel processors feature list (ilsistemista.net)
116 points by shodanshok on Dec 26, 2014 | 69 comments



There is a pervasive issue in that most code written today is tacitly designed for the assumptions of microarchitectures from a decade ago, and most code is even older and is rarely rewritten.

Generally speaking, compilers have a limited ability to improve the basic design of code or deviate from the assumptions about hardware embedded in that code. More often than not, taking advantage of different microarchitectures is not just about the code generation, it is also about algorithm design and selection, which falls directly on the programmer. I currently design software that assumes the Haswell microarchitecture. While the code is written in C++ and the high level architecture is generally the same, the lower level idioms are different because they reflect the strength of the microarchitecture. If the code was for the latest ARM core, it would need to be quite different to be efficient. And I am not even talking about things like AVX2. Idiom changes to match microarchitectures are not "micro-optimizations" in the traditional sense; it will often buy 2x performance improvement overall versus well designed generic code tacitly optimized for another microarchitecture.

The reality is that almost no one writes code for e.g. the Haswell microarchitecture. The idea of writing code for a microarchitecture at the level of, say, C++ does not even enter most programmers' minds. Broadly speaking, a compiler will not allow you to take advantage of advancements in microarchitecture except by accident, which works sometimes but is not efficient in terms of return on the new microarchitecture. Compilers learn some micro-idioms, but most of the macro-idioms escape them (like designing algorithms for the ALU parallelism of a particular microarchitecture).

The last big microarchitecture change for Intel was Nehalem, the original "i7". A lot of code optimization assumptions from prior CPUs went out the window with that microarchitecture. The latest, Haswell, added more usable ALU parallelism and the BMI2 extensions, which are quite useful if you know how to exploit them but nothing Earth-shattering. AVX2 was nice, but too limited to have much value for most normal code.

That said, I am pretty excited about AVX-512, which will be a major microarchitecture extension once it is available. The giant caveat is that no one will be able to really take advantage of it without, once again, redesigning their code for that microarchitecture. A compiler won't be able to do that for you, which has been the real stumbling block for the adoption of advanced features in new microarchitectures. What constitutes an optimal algorithm or data structure is microarchitecture dependent.


It's worth pointing out that Sandy Bridge was a massive microarchitectural change from its predecessors, using a totally different out-of-order engine.

I don't know why the performance characteristics are so similar. I guess once you have a big fat memory subsystem on-die, everything else is diminishing returns?


> I don't know why the performance characteristics are so similar. I guess once you have a big fat memory subsystem on-die, everything else is diminishing returns?

I don't find that to be true. As Andrew is saying, the performance characteristics only look similar because the software is rarely written to optimize performance on modern processors. I have a 5-year-old overclocked Nehalem that can beat most current Haswells on single-threaded code of this sort just because of the higher clock speed. But for algorithms designed for the capabilities of the newer instruction sets, the different generations really start to distinguish themselves.

Instead of ~5% improvements between Intel generations on standard benchmarks, you can sometimes get 50% or more for architecture-specific algorithms. From Nehalem to Sandy Bridge to Haswell, the maximum per-cycle reads from L1 have gone from 16B to 32B to 64B. This means that approaches that would have been silly 5 years ago (like 16KB lookup tables from which you need to read 32B every cycle to sustain throughput) can be practical now.


> I don't know why the performance characteristics are so similar. I guess once you have a big fat memory subsystem on-die, everything else is diminishing returns?

The first-gen PRF design was built to have the throughput of the old one, while having much gentler design limits and lower power consumption. Haswell expanded the execution engine to 8-wide (4xALU, 2x load agu, 1x store agu, 1x store data), something the older design couldn't have done.


> The last big microarchitecture change for Intel was Nehalem, the original "i7". A lot of code optimization assumptions from prior CPUs went out the window with that microarchitecture.

I'd say the P6-NetBurst change was the biggest, and in many ways it was also a failure specifically because of its extremely different performance characteristics; it was far more "RISCier" than its predecessors or successors, and instruction sequences that were fast on the P6 became much slower (e.g. shifts). The long pipeline made branch penalties increase. It tried to compensate with high clock frequency, but that didn't always work. Existing software had to be recompiled to get much improvement (and compilers updated accordingly), something that a lot of people just didn't want to/couldn't do, for various reasons. Fortunately the change to P6-based Core brought back with it the more predictable performance characteristics of the P6, and its successors have been more incremental improvements on that than a radical new direction.

> it will often buy 2x performance improvement overall versus well designed generic code tacitly optimized for another microarchitecture

2x is a big gap, but it's still quite small compared to the 5-10x (or more) between NetBurst and P6 on the same code.


> [NetBurst] tried to compensate with high clock frequency, but that didn't always work.

In part because they grossly misjudged what speeds they could achieve. It was sold on the premise it could be driven to 10 GHz (maybe 5 GHz in some accounts; it was viewed as a "mar(keting)architecture" that could boast high raw speeds), but they had extreme difficulty achieving 3.8 GHz, and as I recall the systems of the day weren't really geared to dissipating the heat produced.


Do you have any recommendations on material to read on the subject of choosing optimal data structures / algorithms for a particular microarchitecture?


Most of this follows from understanding (1) what kinds of things a modern core can do simultaneously and (2) how long it takes a core to do those things. On top of this, there are many additional rules regarding memory and cache access behavior.

One of the most useful references for me is Agner Fog's instruction tables, which are updated regularly and you can download for free. They are genuinely an excellent resource. If you roughly understand how C or C++ is translated into machine instructions (not a big leap) then you can understand how bits of code will interact with the microarchitecture and what surrounding instructions can be executed concurrently in the same clock cycle. Basically you can have multiple 'threads' of execution running in a single core at the same time per clock cycle on a modern CPU. Good CPUs will try to do this for you but have limited ability to find this concurrency if you do not make it 'obvious' to the CPU with code idioms that make it impossible to miss. The amount of concurrency you can extract depends on the specific instructions your code is being translated into. Haswell has 4 ALUs per core but each ALU has different capabilities in terms of the subset of instructions it can execute; for example, some basic instructions can only be scheduled on two of the ALUs.

http://agner.org/optimize/instruction_tables.pdf

Additionally, the architecture docs from CPU vendors are always insightful about the topology of the silicon, how everything is wired together, and the tradeoffs that are never mentioned in high-level marketing docs.
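
To make the point about "obvious" concurrency concrete, here is a minimal C++ sketch. It is an illustration only, not code from the comment above: the function names and the choice of four accumulators are assumptions. The first loop is one long dependency chain, so each floating-point add has to wait for the previous one. The second splits the work into four independent chains that an out-of-order core is free to overlap. Without flags that permit reassociation of floating-point math, a compiler will generally not make this change for you, because it alters the order of the additions.

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: every add depends on the result of the previous add,
// so throughput is bounded by the add latency, not by the number of ALUs.
double sum_single(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// Four accumulators (the count is an illustrative assumption): four
// independent dependency chains that the core can execute concurrently.
double sum_four_chains(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; ++i) a0 += data[i];  // leftover elements
    return (a0 + a1) + (a2 + a3);
}
```

Both functions return the same value when every partial sum is exactly representable; in general the results can differ in the last bits because the additions happen in a different order. How many chains are worth having depends on the latency and throughput figures for the target core, which is what the instruction tables are for.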


Good post. I never thought of potential for macro idioms. Are you talking purely about loops or other things like synchronization or memory model related stuff? Can you give a few examples of these macro idioms?


Could you write a blog post about this? I would love to hear more about the work you do and the considerations you have to take into account between micro-architectures.


Do you have any examples of this handy? I haven't seen a lot of cases of idioms and algorithm design changing to match the exact CPU model. Approximate CPU generation, sure: try hard to cut down on memory traffic - but that's true for every CPU nowadays. Misaligned accesses aren't expensive anymore - but there's almost never any good reason to do them anyway. What else is there?


The author is complaining out of context on a couple of points here:

1) The B820 came out 1.25 years later than the B940. So, it's not hard to see that maybe Intel decided in that timeframe to add VT-x support.

2) None of the K-series CPUs support VT-d. For some reason, VT-d and overclocking are not compatible. This would have been more noticeable if, for example, the author had compared the i7-4770K to the i7-4770, not the i5-4570. The same distinction is there between the i5-4670 and i5-4670K.

3) The table is organized in a misleading fashion. Why put the weakest CPUs in the center except to intentionally disrupt the proper trend?

After fixing or striking these issues, this isn't even an article worth writing because the features suddenly make sense.


> "None of the K-series CPUs support VT-d. For some reason, VT-d and overclocking are not compatible."

No, compatible usually implies there's a technological or engineering justification. This is pure marketing. And the Devil's Canyon update to Haswell introduced some -K parts that have VT-d.

Edit: And support likewise implies that even if it's not a purely technical issue, that there's at least some tradeoff. Enabling virtualization has a price to the consumer, but doesn't cost Intel anything extra.


Not true. The 4790K supports VT-d. All indications are that it's a market segmentation choice Intel made.


As someone excited by the improvements that are possible moving from AVX to AVX2, I was recently surprised to learn that Intel has chips released this year that do not support AVX. AVX-512 is still supposedly scheduled for release next year with Skylake.

How long do you suppose until it's reasonable to presume that most processors will support it? SSE 4.2 came out in 2008 and appears to be the baseline now, so 5 years seems like a reasonable guess.

It's hard to believe this degree of ISA segmentation is going to benefit Intel in the end.


That's why I feel dynamic code generation is the future. JIT. It's madness to write high performance code to take CPU [1] and DRAM [2] specifics into account.

It's easy to write something that performs well on one system.

[1]: 32-bit or 64-bit -> how many registers are available. SSE2, SSE 4.1, AVX2, etc. Different prefetchers, cache configurations. Different instruction level limitations and performance. Differing amounts of cache with rather variable latencies.

[2]: Different number of memory channels, different distance between DRAM page changes.


Compilers generate static code that will branch to the correct code path based on your architecture. You don't need to JIT until there are so many paths your executable becomes too big.


Do the branches happen at run time or are the optimized routines selected at install time (or something else?)


At load time. gcc's approach is to substitute an entry in the PLT, the CPU dispatch logic being run at that time to select the contents of the entry.


Any references about this compiler?


https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/Function-Attrib... and http://pasky.or.cz/dev/glibc/ifunc.c

look up the target attribute and the ifunc attribute--it's basically a way to compile multiple versions of a function for different targets in a single source file and then use the dynamic linker to determine which one to resolve at runtime. obvious use is for things like optimized memcpy implementations.
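
A rough sketch of the same idea without the dynamic linker, for anyone who wants to see the moving parts. This is a hand-rolled function-pointer dispatch, not the ifunc/PLT mechanism itself: the selection happens on first call rather than at load time. It assumes GCC or Clang on x86 for `__attribute__((target("avx2")))`, `__builtin_cpu_init` and `__builtin_cpu_supports`, and falls back to a single generic version elsewhere. All function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: the target attribute and CPU builtins below are GCC/Clang x86 features.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

// Baseline version: built with whatever -march the rest of the program uses.
static std::uint64_t sum_bytes_generic(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if HAVE_X86_DISPATCH
// Same source, but the compiler is allowed to use AVX2 in this function only.
__attribute__((target("avx2")))
static std::uint64_t sum_bytes_avx2(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
#endif

using sum_fn = std::uint64_t (*)(const std::uint8_t*, std::size_t);

// The selector: queries the running CPU and picks an implementation.
static sum_fn pick_sum_bytes() {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_bytes_avx2;
#endif
    return sum_bytes_generic;
}

// Public entry point: the choice is made once, on the first call.
std::uint64_t sum_bytes(const std::uint8_t* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    static const sum_fn impl = pick_sum_bytes();
    return impl(p, n);
}
```

Callers just use `sum_bytes(buf, len)` and never see which version ran. With ifunc the dynamic linker runs the selector and patches the PLT entry, so there is no per-call indirection through a separate pointer.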


So, it's otherwise automatic, except I just have to write a selector routine that tries to decide the best performing routine to run at runtime and implementation for each individual case with varying hardware support.


You only need to go to all that trouble if you want high performance across a variety of machines. If you are merely after bragging rights or trying to satisfy someone else's requirement, the theory is that you can compile the exact same piece of high level code using different optimization targets, and the compiler will do all the work for you, providing maximum performance for each instruction set practically for free...

Even more practically, Agner has a typically excellent description of the strengths and weaknesses of the dispatch strategies used by different compilers in Section 13 (p. 122) here: http://www.agner.org/optimize/optimizing_cpp.pdf


Well, you can write that code that tries to decide, or you can let GCC emit that code for you: https://gcc.gnu.org/wiki/FunctionMultiVersioning

All you need to do is figure out what's the best version of that function to write for that target, and let GCC do the rest of the heavy lifting.
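
For reference, a small sketch of what that looks like in source, following the overload-by-target form described on that wiki page. Assumptions: this only kicks in for g++ on x86-64 with glibc (it relies on ifunc support), so the example guards it with a macro and compiles a single plain version everywhere else. The function name and the choice of `popcnt` as the specialised target are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: function multiversioning needs g++ on x86-64 with ifunc support (glibc).
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define USE_FMV 1
#else
#define USE_FMV 0
#endif

// Default version: portable bit counting, runs on any CPU.
#if USE_FMV
__attribute__((target("default")))
#endif
std::uint32_t count_set_bits(const std::uint64_t* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::uint32_t c = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::uint64_t x = p[i]; x != 0; x &= x - 1) ++c;  // clear lowest set bit
    return c;
}

#if USE_FMV
// Version for CPUs with the POPCNT instruction; the compiler emits the
// dispatch code that chooses between the two at run time.
__attribute__((target("popcnt")))
std::uint32_t count_set_bits(const std::uint64_t* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::uint32_t c = 0;
    for (std::size_t i = 0; i < n; ++i)
        c += static_cast<std::uint32_t>(__builtin_popcountll(p[i]));
    return c;
}
#endif
```

Call sites just call `count_set_bits(words, n)`. The part the compiler cannot do for you is the one mentioned above: deciding what the best version for each target actually looks like.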


icc does this, as an example. It got Intel into trouble because they changed the compiler at one point to use the slowest path on AMD processors.


I think that the SIMD instructions are well past the point of diminishing returns, both for the kinds of instructions provided and the width of the vectors. I'm more interested in how VT-d and TSX allow different software architectures, if and only if you can design your OS around the assumption that they're available. By only including them on select models, Intel's making them much less useful even for the customers who want to pay extra to have those features.


I used to write software rasterizers in a past life. AVX-512 is straight-up an observed 8x performance gain over SSE: a well-written software rasterizer is a very clever thread-scheduling algorithm wrapped around long lists of FMA-load/store sequences.

I know that ray-tracers, databases, and other 'big data' all gain equally-well from AVX-512. The IO-request depth on these new parts is such that, for ALU-heavy work, with good IO-spread, memory latency is completely hidden---fetch time is still ~250-300c, but you just no longer see it.

The only thing I miss is 'addsets', which was dropped from LRBni, as this instruction was 'the rasterizer function': it now requires a fairly involved sequence to replicate in AVX-512.

Pining for other lost things: if we had the LRBni up- and down- convert instructions, a software texture sampler would be a lot more feasible.


You only get those performance gains if you've got the cache and memory bandwidth - which is often very lacking.

E.g. AVX (8-wide) was added for SandyBridge, but it wasn't always that usable until Intel doubled the cache bandwidth with Ivy Bridge.

With ray tracing the increase in performance going from SSE (4-wide) to AVX (8-wide) for BVH intersection was only ~25-30% (instead of the theoretical 100% increase). You're generally limited by memory bandwidth.


Wider SIMD units may provide a linear performance increase for many workloads, but they have a super-linear impact on the up front cost of the chip, especially when you take into account the opportunity cost of those transistors: they could have been dedicated to something that might have also helped non-numerical workloads. Wide SIMD is great to have, but it doesn't come free, or else we wouldn't have GPUs.


It may just be that Intel has abandoned this use case to the GPU. Which obviously is bad news for those of us who still rely on CPU SIMD.

But wtallis' argument to the contrary is quite compelling: https://news.ycombinator.com/item?id=8801808


Does gcc support these new wider registers yet?


Yes and no. If you write the ASM you can use them, but its code generation likely won't (but then I haven't checked for 2 months).

Most things that can be vectorized will be placed in SSE rather than AVX. Also, GCC generally sucks at optimizing for SSE, or at determining when code should use SSE as opposed to standard registers.

Generally speaking the LLVM backend does better SSE and vectorization code generation. But some think it uses SSE too much/incorrectly.

So TLDR no

New wide registers are VERY new. They are barely supported; my processor has AVX2.0 and I still have to set up perf properly one of these days, because all of its fault codes aren't properly baked into the kernel yet (as of 3.17).

(Sorry for the lack of references)

Also the biggest drawback of AVX is they don't hold their state between context changes :/


CISC is rapidly becoming a requirement, however.

Power requirements scale as the square of the frequency, and also, perhaps even more importantly, leakage power, and hence, heat generated per unit area, increases as the process size decreases.

As such, as process sizes shrink, less and less actual area of the chip can be active at any one time. ("Dark silicon"). In other words, you might as well have areas of the chip that aren't used most of the time, because most of the chip can't be active at once anyways.


Image processing, etc. stuff I sometimes write seems to benefit pretty linearly from wider SIMD registers, like in AVX2. It'd be useful to get up to cache-line-wide SIMD registers. A cache line is the granularity the memory subsystem works at anyway.

I wonder when cache line size will be increased from 64 to 128 bytes. Hopefully that wouldn't affect the total number of lines, though. A mere 512 entries in the L1D cache is already such a pain.


I think that the multiple 32-way SIMD units on GPUs demonstrate there's plenty of room for widening on CPUs before we hit diminishing returns. The question in my mind is how much more like a GPU CPUs should become, and Xeon Phi demonstrated how easy it is to get the mix totally wrong.


There are clearly plenty of common workloads that can make use of arbitrarily wide SIMD units. But nothing about the GPU market provides a clear indication that those SIMD units are better grafted onto the CPU than delegated to a coprocessor.

The programming model is certainly simpler if they're part of the CPU's normal instruction pipeline, but on the other hand most CPUs don't have the memory bandwidth to feed more than a handful of GPU-style SIMD units and certainly don't have the power delivery or dissipation capacity to handle the compute resources of a mid-range GPU in addition to the normal CPU.

Having the big SIMD units on a coprocessor card seems to be working just fine for the kinds of workloads that really need a lot of vector processing power. Those workloads also seem to not benefit much from things like speculative out of order execution, which frees up a lot of die space and power draw budget for more SIMD units. To some extent it's inevitable that SIMD code can't be making too much use of CPU design features that benefit branchy code, or else the code couldn't be vectorized in the first place.

The workloads that are making heavy use of SIMD on the CPU are mainly using it for bursts of number crunching where the problem is too small to be worth the latency penalty of moving it to the GPU, but that penalty is getting smaller with each generation. The small integrated graphics processors common on non-server CPUs can all but eliminate that penalty, further reducing the demand for extra-wide SIMD in the CPU cores.


I've noticed that CPU companies are focusing more and more on power efficiency, potentially to compete with ARM. Maybe turning off the features described in the OP makes lower end CPUs run faster?

That might be significant since I also presume most laptops use those same lower end CPUs. Laptops being the ones that need to save energy much of the time. (Desktops are obviously plugged in.)

This is probably most significant in the Microsoft Surface (it uses a real Intel CPU if I recall correctly).

I wonder which features it has marked "off", and what impact that has on battery power.


Remember, within a generation these chips are almost all made from the same set of masks. None of these disabled features save any transistors or die area; the disabling just leaves you with a dead spot on the chip. So the real question is if you've got the hardware in place to provide these special features, can you save any power by not using those features?

The answer is that you only save power by not doing at all the things those features accelerate. It's never more efficient to implement AES through general-purpose instructions than using the fixed-function hardware. If you've got a workload that can benefit from the wider SIMD units, it won't save any energy to only use the narrower instructions. If you're trying to secure or virtualize your OS, doing it with VT-x and VT-d will be so much faster than emulating it in software that even if those features added several watts to the chip's power consumption you would still save energy.

Doing less can save power, but doing the same amount of stuff the hard way doesn't help anything.


That's only true if you are using those features 100% of the time though, right? By which I mean, there is no doubt a trade-off in play when you only use the features for 15% of your workload. This is because the chip must be powered on all the time but, say, the software implementation of AES doesn't always have to be running.


No, special-purpose functional units are very well power-gated on modern chips. They don't draw appreciably more power when not in use than when fused off.


You say that, but in practice Intel is still 5-10x off ARM when it comes to idle power.


Register renaming and instruction reordering and things like that can't be selectively powered down, and are very power-hungry. But if you were talking about when the entire core is suspended, then that's largely unrelated to the high-level feature set of the processor. It's about things like the analog characteristics of the transistors used: Intel builds their mainstream CPUs using circuits that can run at 4-5GHz without trouble while none of the mobile SoCs can come close regardless of exotic cooling.


Is that true for Merrifield/Moorefield? AFAIK they're pretty close, but only in the phone chips.


> Remember, within a generation these chips are almost all made from the same set of masks. None of these disabled features save any transistors or die area

I think it might be true for CPUID/stepping, but not for what are usually considered to be the generations.

For example, there look to be at least 5 die sizes for Sandy Bridge: http://en.wikipedia.org/wiki/Sandy_Bridge#Models_and_steppin...

I haven't tried to match up the supported ISA's with the different die sizes, but I presume they'd correlate. Do they not?


The quad-core Sandy Bridge could end up labeled as a Core i5, Core i7, or Xeon E3. The dual core Sandy Bridge could be labeled as a Pentium, Celeron, Core i3, Core i5, Core i7 (mobile), or Xeon E3.

The server designs (LGA2011) have QPI links and may be the only memory controllers capable of using registered memory. But among the processors targeting the LGA1155 and mobile platforms, they only had one core design and only one memory controller design that was shared across all products though never with everything enabled at once.


Yes, I agree, number of cores and size of L3 are common cases where functionality is disabled. But do you know which of the other factors listed in the article this is true for? For example, are there cases where two processors share the same die but one supports AVX and the other does not? I don't know the answer, but if there are not, this would substantially undercut the argument in the main article.


Sandy Bridge and Ivy Bridge Celerons and Pentiums use the same die as the corresponding Core i3s, but only the i3 supports AVX in the CPU and QuickSync in the GPU. AES instructions aren't available on the i3 but are present on the dual-core i5. Haswell no longer discriminates with QuickSync, and extends AES down to the i3s but still not the Celerons and Pentiums.

None of these features are unavailable due to being physically absent. They're just turned off. If Intel's microcode format and DRM were reverse engineered, some of the features could be turned back on.


Thanks for the great info. That was the part I couldn't figure out from the article or the specs.


Artificially disabling ECC (error correcting) SDRAM support is very disappointing. In my experience ECC significantly improves stability. Laptops (without ECC RAM) crash a few times per year, desktops with ECC RAM can run stable for years.


You can also disable defective components without being forced to bin the entire chip to a recycle bin.

That is honestly probably more of a factor than anything for the segregation between models. Get that yield up to improve margins significantly.


Disabling whole cores or lumps of cache to improve yields is one thing, but a lot of the features Intel is segmenting the market with are very tightly integrated with the rest of the chip. It would be extremely hard to detect and classify a defect as affecting HyperThreading or VT-x but leaving the core otherwise functional. It's a bit more plausible that defects in the high half of a vector SIMD unit could leave you with a SSE-capable but AVX-defective chip, but I suspect that in practice any broken functional units are treated as rendering the entire core inoperable.


The "lowest" end of the CPUs that the OP discusses are low end desktop/laptop parts. Furthermore, the biggest criticism from the author isn't the feature disabling per se, but how inconsistently Intel applies it across their overall performance bins. For example, the lack of TXT support in high end i7 parts when it's there for middle tier i5 parts. Or why the i7-3990K has VT-d and the i7-3770K doesn't (they really should just be distinguished by their clock/cache/TDP).


The 3770K has a basic version, the 3770, which has the VT-d extension. There is no such thing as a 3990K.


The author makes a point: when CPUs are "good enough" it's a real problem for CPU manufacturers. They don't price them to have a service life of 15 years, they expect them to be replaced every 2 - 3 years. And the option there is to featurize them. It is interesting to watch in the ARM space as well.


I don't think a lot of people understand that some of these features need more than a clear chicken bit. Some chips can be incompatible with features due to physical reasons.

Others of these features make little sense for some chips and some markets. For example, some of the VT-d features. The majority of personal users aren't scheduling more than one or two VMs max. Do we really need VPID and larger TLBs for such workloads? No. So why include the feature?

If binary size becomes an issue then we need a better solution. Maybe it's delaying some or all optimization until install time. Maybe it's providing individual binaries for different targets.


> Some chips can be incompatible with features due to physical reasons

I think this is a good point. So far, I think the AVX/AVX2 distinction is based only on release date --- the newer chips support the newer instruction set. This is a good sort of improvement. And it's probably only practical to support AVX and AVX2 if you have 256-bit registers to work with.

Does the lower end mobile Pentium branded line even have full physical support for 256-bit vectors? That is, are they supporting 128-bit SSE instructions in a full 256-bit register but not providing any instructions that use the upper lane?


There is a strong advantage to a homogeneous ISA. I am glad to see that newer chips have fuller support for these instructions. Even if they are still emulating them with micro-ops without real acceleration, it is still convenient.

Regarding the mobile lineup and AVX2, honestly I wouldn't know. It is very much possible that they disabled it to avoid competing with higher lineups. But a more likely (IMO) reason is that adding 256-bit support added enough cost that it was either unacceptable or simply not justified by target workloads when evaluated in simulations.

The physical reasons I was alluding to revolve around power, area, and capacitance (delay). These inform the cost ($$$) and performance of chips. Some of these big features (like AVX3) push the limits of what you can do without having to use high performance transistors or deal with intense power draws.

Another physical reason is that sometimes parts suffer defects during manufacturing. Modern design of chips is inherently modular, so sometimes the only option is to disable certain parts of the chip.


VT-d is branded as a virtualization feature, but it's also a killer security feature. It should be treated as a hard requirement for any system that has DMA-capable external interfaces like Firewire or Thunderbolt. It can be used to protect the system from exploits that run on the GPU or any of the other processors that are included in peripheral devices. But nobody's going to completely redesign the driver model of their OS if only 3% of their users can reap the benefits.


I actually never thought of it from that perspective. Can you expand on this? Is the primary threat DOS or malicious DMA ops from external interfaces and devices and the primary advantage mitigation by rerouting I/O from that device to a "null" VM?


Trying to buy me a present atm, new desktop machine.

This is actually one of my pain points, because none of the sites I visit list cpu features and I am interested in some (notably virtualization related extensions).


Intel publishes this information for every CPU they make at http://ark.intel.com/


Strangely enough, ECC is supported on some Core i3 models.


Moore's law is over. Pentium 4s reached 3 GHz 15 years ago. CPUs try things like parallelizing while the human brain thinks sequentially. CPUs are still getting faster but at an incrementally slower rate. A 4 year old Macbook Air is half as performant as the latest. In previous decades, the performance gains were 2-5X per year.

A paradigm shift (quantum computing?) needs to happen soon so computation can continue scaling.


Moore's law was never about performance.


Moore's law is about the number of transistors on a chip which is reaching a limit because heat cannot be dissipated fast enough. Heat is the primary limit on performance.


More transistors is still better, even if they need to be off most of the time. You can spend more silicon on advanced features, just like AVX, spending transistor count for performance at neutral power cost.


"In previous decades, the performance gains were 2-5X per year."

What decade was that?

According to this[1], it's never been better than ~50% year over year.

[1] http://www.cs.columbia.edu/~sedwards/classes/2012/3827-sprin...
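
For a sense of scale, the compounding arithmetic behind these figures can be sketched in a few lines of C++. The helper names are made up, and the inputs are only the numbers quoted in this thread (2x over 4 years, ~50% per year, "2x per year"), not measurements.

```cpp
#include <cassert>
#include <cmath>

// Constant yearly growth rate that compounds to `total_factor` over `years`.
double annual_growth_rate(double total_factor, double years) {
    assert(total_factor > 0.0 && years > 0.0);
    return std::pow(total_factor, 1.0 / years) - 1.0;
}

// Total improvement factor after `years` at a constant yearly rate.
double compounded_factor(double annual_rate, double years) {
    assert(annual_rate > -1.0 && years >= 0.0);
    return std::pow(1.0 + annual_rate, years);
}
```

`annual_growth_rate(2.0, 4.0)` comes out to roughly 0.19, so "twice as fast after 4 years" is about 19% per year. `compounded_factor(0.5, 4.0)` is 1.5^4, a little over 5x in the same period, and a sustained 2x per year would be 16x.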


The illusion of the Cartesian theater is seemingly sequential. Your brain is not.


Here's a bit more on linear thinking: http://en.wikipedia.org/wiki/Cocktail_party_effect

*Feel free to send more insults my way if it fulfills your ego. I am here to listen, but please keep the discussion constructive for the sake of the community.



