How I found a bug in Intel Skylake processors (2017)

lordnacho · on Nov 8, 2021

The problem with bugs deep in the stack is that it is really time consuming to establish that they are in fact as deep as they are.

I wrote a Swift iOS app once, and came across an issue with one of the collection classes.

Of course, nobody thinks that the Swift libs will be wrong as a first guess. So I worked through a number of hypotheses about my own code, slowly stripping out pieces that I thought might contain an error. And then combinations. I also tried reducing the number of entries just to simplify the logs. This worked, but of course you are not going to think that there's a library bug affecting collections with size > 16, and it wasn't actually a theory until I randomly decided to reduce the n. I also discovered that it worked just fine on release but not debug, so I thought maybe I have some race condition.

More and more stripping down occurred, until I eventually gave up using my own project and just started a new one just to see about the collection class. I did it for the sake of being thorough, rather than actually thinking the lib had a bug in its debug implementation. But lo and behold, when I managed to make it reproducible and put it on SO, someone from Apple acknowledged that they could also see it, and they fixed it.

Naturally if I'd gone direct to testing the lib I'd have saved a huge amount of time, but I guess that's the tradeoff from the most sensible heuristic: test your own code first, the bug is there.

gh123man · on Nov 8, 2021

> nobody thinks that the Swift libs will be wrong as a first guess

This is highly dependent on which version of Swift you started with! When Swift introduced the new substring API I hit a bug where certain UTF-8 character sequences caused an index out of bounds error internally. Unfortunately we learned this in production when an entire organization couldn't launch our app due to a string they were feeding through it.

That is how your trust in the standard libs is forever broken. Library and compiler bugs were quite common in the Swift 1-3 days.

jcelerier · on Nov 8, 2021

Yeah, over the course of my allegedly short career (I'm 29) I've reported dozens of bugs against GCC, Clang, MSVC, binutils, Qt, SDL, glibc, PortAudio, macOS and other foundational stuff... I'm not saying I automatically assume "toolchain bug", but my cutoff for seriously pondering "is it a bug in $underlying_stuff" is around 30 minutes of "I really can't see where in my code things were done wrong" and so far this heuristic has consistently held...

resonious · on Nov 9, 2021

Yup this kind of thing piles on and you start to understand why a lot of experienced programmers eventually adopt a "no libraries" policy.

I've also found that depending on my employer, the open source codebases we use often don't actually hold up to the standard of quality that we hold our own code to. But we don't check it because all libraries are assumed to work.

cesarb · on Nov 8, 2021

> but I guess that's the tradeoff from the most sensible heuristic: test your own code first, the bug is there.

Also known as "select is not broken" (see for instance https://blog.codinghorror.com/the-first-rule-of-programming-...).

yjftsjthsd-h · on Nov 8, 2021

Reminds me of: "It Is Never a Compiler Bug Until It Is" (https://r6.ca/blog/20200929T023701Z.html , https://news.ycombinator.com/item?id=24636326). The bottom of the modern stack is really reliable, until it isn't;)

Smoosh · on Nov 8, 2021

Not just "the modern stack". I work mainframes and always felt the IBM-supplied environment (compilers, transaction processing systems, databases) was rock solid.

Then one day I discovered APARs were a thing.

https://www.ibm.com/support/pages/open-apars-ibm-products-av...

twic · on Nov 8, 2021

Similar story with a bug in the IBM JDK's implementation of BigDecimal. Surely if anyone is going to get decimals right it's IBM! Took us a long time to stop looking at our code.

(turns out that IBM do get decimals right if you're running on z/Architecture, where the code diverts to some hardware-accelerated fast path; just not on x86-64 machines used by paupers like my project)

alasdair_ · on Nov 9, 2021

My personal favorite ios bug that I found was in the middle of a finance exam. The ios calculator app gave completely different answers depending on whether or not you tilted the phone slightly during the calculation.

It wasn’t just a visual thing - the calculation itself was changed by the toggling briefly to scientific mode and back again.

I no longer fully trust calculators.

dnautics · on Nov 9, 2021

I found a bug deep in the Erlang internals that apparently came about due to a missed case in an optimization that the team put in. It was not terribly hard to find though, on account that it was tripped by a unit test that was working just fine on an earlier version of BEAM, and thus was also easy to reduce to a minimal case by A/Bing against both versions.

CalChris · on Nov 8, 2021

Debian announcement

https://lists.debian.org/debian-devel/2017/06/msg00308.html

Ahrefs writeup

https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...

The Intel spec update still labels SKL150 as No Fix but there is a microcode update available. Dunno exactly what to make of that distinction.

https://www.intel.com/content/www/us/en/processors/core/desk...

Can an x86 program detect whether this update has been applied? Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?

BeeOnRope · on Nov 8, 2021

It was "fixed" in a microcode update by disabling the loop stream buffer (LSD) which is a special mode of operation for very small loops where the instruction decoders and uop cache in the CPU are shut down and the loop runs directly out of a small cache*. Since the problem arose only when the LSD was being used, in combination with hyperthreading and high byte register use, this effectively avoids the problem.

Of course, disabling the LSD has some costs: CPUs use more power and some loops are slower (though some are faster). These updates are usually applied silently without user consent, so you might quite surprised to find out that after a reboot your computation kernel suddenly draws more power or has slowed down or sped up.

> Can an x86 program detect whether this update has been applied? Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?

Yes. One way would be to check the microcode version (available in /proc/cpuinfo on Linux, among other places), since the version that introduced this fix is known.

Another way would be to run a small loop known to fit in the LSD and then check a performance counter event which counts uops delivered from the LSD, like lsd.uops. This counter is always zero when the LSD is disabled (or realistically you could just run any substantial code and check the counter since you always have some non-neglible portion of the uops coming from the LSD). This is how I check it from the command line in practice.

Finally, if you don't have easy access to the counters, you could create a loop that has a significant performance difference depending on whether it is coming from the LSD or not. For example, a loop that crosses a 32-byte boundary will run 2 or more cycles when using the decoder or uop cache, but could run in 1 cycle in the LSD. Timing such a loop would give you a strong indication about whether the LSD is enabled.

---

* Specifically, the cache used is not a dedicated one, but rather the IDQ (decoded instruction queue) is reused. This queue holds uops and is normally fed by the decoders or the uop cache on one end, and which feeds the allocation/rename engine on the other. In LSD mode, this queue stops being a queue and is instead used as a kind of cache with the loop operations "locked down" in the queue and just repeatedly replayed.

kaladin-jasnah · on Nov 8, 2021

Dumb question, but why is it abbreviated as LS_D_ when it's spelled loop stream _b_uffer?

CalChris · on Nov 8, 2021

It's actually spelled Loop Stream Detector and it dates to the Core 2 processor family which is circa 2006. The LSD is described in section 3.4.2.4 of the Intel Optimization Manual, Optimizing the Loop Stream Detector (LSD). AnandTech describes how it works.

https://www.anandtech.com/show/2594/4

BeeOnRope · on Nov 8, 2021

Yeah that's right. Not sure where I picked up the term "... buffer" but a search shows I've been using it for a while.

jeffbee · on Nov 9, 2021

Probably was just easy to mentally conflate it with Decoded Stream Buffer. It compresses better if you just change Detector to Buffer :-)

SavantIdiot · on Nov 8, 2021

This is a scary place to be: the top-level debug resource for a major project. It took almost two years to resolve, but was already known as SKL150. Looking at the clang vs. gcc assembly without knowledge of SKL150 would be literally impossible to debug. GCC -O1 vs -O2 is a clue, but even with the asm diffs, wth? Again, scary.

tinus_hn · on Nov 8, 2021

The world is a scary place; this is basically the same as rowhammer which is an issue in computers shipped today.

woodruffw · on Nov 8, 2021

Unless I'm misunderstanding what you mean, this isn't really like rowhammer at all -- it's a uarch/ucode bug, which is effectively a programming error within the CPU. Rowhammer is a physical flaw in how memory cells in DRAM are laid out, one that can be triggered by memory access patterns independent of CPU architecture and microarchitecture.

(There are also hundreds of errata like this one in every CPU generation. They're usually not easy to exploit, since they cause system instability rather than disclosing secret material or allowing unintended code execution.)

zsmi · on Nov 8, 2021

> Rowhammer is a physical flaw in how memory cells in DRAM are laid out

It's not really a flaw, more like a consequence of how memory cells are laid out. I mean most people want lots of bits in their DRAM. Maximizing this parameter necessitates that some will be in close proximity.

woodruffw · on Nov 8, 2021

To my (non-EE) mind, the flaw is the electrical leakage between the cells. Tight packing is a consequence of economic forces, but I assume there are also technical solutions that allow for tight packing (but either offset the performance or cost gains). Is that assumption wrong? (Genuinely asking!)

zsmi · on Nov 8, 2021

There was a good paper on it in 2014. [1] They describe the RowHammer attack as: opening and closing (activation and precharge) a DRAM row (aggressor row) at a high enough rate (hammering) such that it can cause bit-flips in physically nearby rows (victim row).

Colloquially, it's basically a change in voltage in one place can indirectly cause a change in voltage in another place via capacitive coupling. Capacitance increases proportional to the inverse of the separating distance so only in recent years have things shrunk to the size that makes it an issue.

Since having less bits in DRAM is basically not an option most mitigation techniques that I know of remove the possibility of hammering: possibilities include the OS, memory system controller, or DRAM controller changes.

[1] https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf

woodruffw · on Nov 8, 2021

Much appreciated, thank you.

tlb · on Nov 8, 2021

DRAM cells also decay over time (~ 60 milliseconds), but memory controllers have some logic to refresh every row on a regular schedule so it's not an issue.

They should also have logic to refresh adjacent rows if some number of consecutive accesses to a small group of rows is detected. This is rare in normal workloads, because those accesses normally come from cache. It's lame of chipmakers to not fix this. The fix would requires the DRAM controller (integrated into modern CPUs) to know more about the internals of DRAMs than they currently do.

zsmi · on Nov 8, 2021

In theory DDR5/LPDDR5 added a controller command for RowHammer mitigation but I haven't had time to research it yet.

See: https://arxiv.org/pdf/2108.06703.pdf

tinus_hn · on Nov 9, 2021

This is a flaw that randomly causes unpredictable things to happen if you rapidly put things in registers. That doesn’t look like a programming error to me.

woodruffw · on Nov 9, 2021

You’ve never seen a programming error triggered by a user rapidly mutating state?

The programming in question here is ucode instead of application code, but it’s programming nonetheless.

tinus_hn · on Nov 10, 2021

Do you have an example of that? All I can think of are weird edge cases with animations.

woodruffw · on Nov 11, 2021

The canonical example that I can think of is the entire "race condition" bug class. But also TOCTTOU, caching bugs, &c. (all of which are arguably also in the race condition category.)

wging · on Nov 8, 2021

Previous submission: https://news.ycombinator.com/item?id=14686277

(This is not a complaint; I found the post interesting.)

dang · on Nov 8, 2021

Thanks! Macroexpanded:

I found a bug in Intel Skylake processors - https://news.ycombinator.com/item?id=14686277 - July 2017 (99 comments)

13of40 · on Nov 8, 2021

> More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries

This was the bane of my existence when I worked on testing Windows years ago. New SDETs almost invariably fell into the trap of assuming any automation error was a "test bug" instead of a bug in OS code, even if the OS code in question was written last week.

userbinator · on Nov 8, 2021

"gcc/clang/icc/msvc won't usually issue the affected opcode pattern and it ends up being rare. SKL150 - Short loops using both the AH/BH/CH/DH registers and the corresponding wide register may result in unpredictable system behavior."

I think Intel should regression-test its CPUs using the decades of demoscene productions out there, especially those in the extreme-size-optimisation categories; testing with almost exclusively "mainstream" compiler output is IMHO a bad idea and a step down the path to "warranty void if VLC is used" (https://news.ycombinator.com/item?id=7205759 )

dimitrios1 · on Nov 8, 2021

Apologies if this is off topic -- but I am constantly impressed at some of the things I find that come from inria.fr. I first came across them when learning OCaml. Seems to be a top notch university.

woodruffw · on Nov 8, 2021

Inria is a research institute, not a university. But they do indeed do excellent work!

pabs3 · on Nov 9, 2021

This reminds me of sandsifter and Google's Silifuzz:

https://github.com/xoreaxeaxeax/sandsifter https://github.com/google/fuzzing/blob/master/docs/silifuzz.... https://www.phoronix.com/scan.php?page=news_item&px=Google-S...

Decabytes · on Nov 8, 2021

I’m glad I’m just a pleb programmer, who never has done anything so complicated that it would expose processor errata.

And even if I did, I wouldn’t have the expertise to even figure it out.

dfox · on Nov 8, 2021

The issue there is that the hardware is full of totally absurd bugs. If you target PC-like userspace or one of the two major mobile platforms it is somebody else's job to shield you from that. In general CPU level bugs are somewhat rare, but every single platform vendor had shipped some kind of silicon that contains peripherals that do not work as documented and only by chance work with the reference driver implementation.

brokenmachine · on Nov 8, 2021

Welcome to the 99.999999999%.

bjarneh · on Nov 8, 2021

> Binary search always fails? “The Java compiler is acting funny today!”

:-)

bruce343434 · on Nov 8, 2021

The link called "6th Generation Intel® Processor Family - Specification Update." 404's

facorreia · on Nov 8, 2021

2017.