One time, during my company's development of a product, we kept encountering a problem while writing a lot of data to an eMMC chip. Was it a problem with the circuit design? Our eMMC controller? Software driver?
It wasn't until we found a unit that did not exhibit the problem that we realised the problem was isolated to a particular batch of eMMC chips from the vendor (much like the OCaml team only realising it was a Skylake problem when they discovered their code worked fine on non-Skylake systems).
On a board, we were using an ADC chip. We bought the ICs from an official distributor. They sent us the chips, and we populated the boards (~200 of them).
None of the boards worked properly. We debugged for so long. After so much wasted time, we found the root cause with the help of the manufacturer and the distributor: the manufacturer had actually produced a different IC, but marked the packages with our part number. It just happened that the power/ground pins matched our IC, so there were no electrical problems.
Another case:
We designed an ASIC and taped it out. The silicon arrived, and we then sent it to a packaging house to get it into a QFP package. The packaging house misread our specification and placed the die in the package rotated 90 degrees. That was fun to debug!
The 90 degree rotation thing is surprisingly common. I've seen it done at least twice. Also several instances of edge connectors that were entirely backwards due to being placed in CAD on the wrong side of the board, or incorrectly pin numbered.
> This is a fair amount of optimisation passes and it would have been too time consuming to try them one by one
I loved the fact that they used bisection, but then I was shocked to read this. This isn't the case at all unless you're extremely unlucky. If you just do bisection again, you can test with half the flags on and the other half off. You can frequently discard half the flags this way to narrow down the set, and on average it should be much faster than testing them one by one.
Unfortunately that often doesn't work well for optimisation passes, as many passes end up relating to each other.
However, you can still keep removing sets of optimisations and hope the bug remains, and do a final pass where you remove each optimisation one at a time.
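For illustration, here is a minimal sketch of that in C. The flag names and the test() oracle are made up for this sketch; a real driver would recompile and rerun the failing test case with exactly the flags whose bit is set in the mask.

    #include <stdio.h>

    #define NFLAGS 8

    static const char *flags[NFLAGS] = {
        "-ftree-vectorize", "-funroll-loops", "-fgcse", "-finline-functions",
        "-fipa-cp", "-fpeel-loops", "-fschedule-insns", "-ftree-pre",
    };

    /* Stand-in oracle for this sketch: pretend the bug needs flags 1 and 5
       together.  A real driver would rebuild and rerun the test case with
       the flags whose bit is set in `mask`, returning 1 on reproduction. */
    static int test(unsigned mask)
    {
        return (mask & (1u << 1)) && (mask & (1u << 5));
    }

    /* Greedily try to drop chunks of flags, from half-sized chunks down to
       single flags, keeping any removal after which the bug still shows. */
    static unsigned minimize(unsigned suspect)
    {
        for (int width = NFLAGS / 2; width >= 1; width /= 2)
            for (int lo = 0; lo + width <= NFLAGS; lo += width) {
                unsigned chunk = (((1u << width) - 1u) << lo) & suspect;
                if (chunk && chunk != suspect && test(suspect & ~chunk))
                    suspect &= ~chunk;  /* bug survives: these are innocent */
            }
        return suspect;
    }

    int main(void)
    {
        unsigned culprits = minimize((1u << NFLAGS) - 1u);
        for (int i = 0; i < NFLAGS; i++)
            if (culprits & (1u << i))
                printf("still implicated: %s\n", flags[i]);
        return 0;
    }

The chunked passes pay off once the flag set gets large; the final width-1 sweep still catches interacting passes that the halving couldn't separate.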
Another problem is that in the past I found compiler bugs from running unusual sets of optimisation passes.
A useful technique in the design of an optimizing compiler is to include an "odometer test" on all endomorphic discretionary transformations. E.g., at a spot where one has a big go/no-go predicate that determines whether a change to the program is valid and beneficial, add a final term that's a call to a routine that returns true if odometer++ < limit. You can then use bisection on that limit and zero in quickly on the failing change.
This function is also a great place to put logging. I typically make it look like a printf() that returns a Boolean.
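Roughly like this in C (names are hypothetical, with the limit read from an environment variable so a driver script can bisect it):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    static long odometer = 0;
    static long limit = -1;   /* -1 means "unlimited" */

    /* The printf-that-returns-a-Boolean: true while we are under budget.
       Every discretionary change gets a number in the log, so a bisected
       limit maps directly to a line in the output. */
    static int odo(const char *fmt, ...)
    {
        long n = odometer++;
        if (limit >= 0 && n >= limit)
            return 0;
        va_list ap;
        va_start(ap, fmt);
        fprintf(stderr, "[odo %ld] ", n);
        vfprintf(stderr, fmt, ap);
        fputc('\n', stderr);
        va_end(ap);
        return 1;
    }

    /* At each transformation site the go/no-go predicate gains one term,
       e.g. (hypothetical names):

           if (change_is_valid(c) && change_is_beneficial(c)
               && odo("inline %s into %s", callee_name, caller_name))
               apply_change(c);

       Demo main so this compiles standalone: */
    int main(void)
    {
        const char *s = getenv("ODO_LIMIT");
        limit = s ? atol(s) : -1;
        for (int i = 0; i < 5; i++)
            odo("considering transformation %d", i);
        return 0;
    }

Then bisect on ODO_LIMIT: if the output program is good at limit N and bad at N+1, change number N is your culprit.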
If you haven't explicitly enabled microcode updates for your distribution, then chances are good it's disabled. You should check into it.
Most of the time that just requires installing a package, but you might also need to rebuild initramfs. For NixOS, it's just a single flag in configuration.nix as usual.
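(For the record, if I remember the NixOS option names right, it's one of:

    hardware.cpu.intel.updateMicrocode = true;   # in configuration.nix
    hardware.cpu.amd.updateMicrocode = true;

depending on your CPU.)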
When I was a newbie, I would often say things like "maybe the compiler is broken" or "maybe the cpu is doing it wrong."
Of course, it was never the tools. It was always me.
Years later, of course, I work on unreleased hardware. And yeah, sometimes it's broken. It's really fucking odd. Write C code, and holy shit, it's actually a compiler bug or CPU bug.
Life was simpler when I was just ignorant and egotistical. Now I'm... well, still ignorant and egotistical, I guess, but often deal with actual hardware bugs.
Finding an assembler bug is also interesting. I ran up against a nasm bug once that kept me debugging for quite a while. The good thing with assembler bugs, however, is that you can disassemble the generated machine code back into something really close to your original source code... That helped a lot.
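For example, something like this (file names made up; ndisasm ships with nasm):

    nasm -f bin suspect.asm -o suspect.bin
    ndisasm -b 64 suspect.bin    # then diff against what you thought you wrote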
Wouldn't the instruction be bigger if EAX was used? I guess it would have to have a 4-byte constant then; with this encoding it's one byte. How would you clear the 2 lowest bits in the second byte of EAX?
For reference, we're talking about:
andb $252, %ah
I believe it's not an effective speed optimization, given the way modern CPUs work, but it does produce a smaller instruction?
Additionally, what I'm missing in this story is what source code actually produces "I need to clear exactly these two bits while keeping the others" in a loop. For that CPU bug to manifest, it has to be inside "short loops of less than 64 instructions."
> Wouldn't the instruction be bigger if EAX was used? I guess it would have to have a 4-byte constant then; with this encoding it's one byte. How would you clear the 2 lowest bits in the second byte of EAX?
Indeed. Here is a list of encodings of the original instruction and its natural alternatives:
> and ah,0xfc: 80E4FC [3 bytes]
> and ax,0xfcff: 6625FFFC [4 bytes]
EDIT: Note that
> and eax,0xfffffcff: 25FFFCFFFF [5 bytes]
does something different: It also zeros the upper 32 bits of the rax register (Exercise: Why?), which is important since the next line
> movq %rax, (%rbx)
depends on those upper bits not having been changed.
At least, since the CPU bug manifests only when "AH, BH, CH or DH" are used, compilers could be patched to use the 4-byte variant. But if Intel properly updates all the affected processors, that won't be needed.
(Response to wolfgke's edit: don't forget the "sign extend" variants).
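For completeness, the sign-extended 64-bit variant, which does preserve the upper 32 bits (if I have the encoding right):

> and rax,0xfffffffffffffcff: 4825FFFCFFFF [6 bytes]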
We pondered reporting this directly to GCC too, so that they could implement a workaround. But despite the buzz this issue generated in the press, I believe Intel is right in saying that this bug is very unlikely to trigger.
They say it can only happen in tight loops, and although the C code is quite large, the hot path is a loop of only 17 instructions, with only sequential memory loads (since the GC is scanning the heap linearly and most OCaml values are small).
GCC actually did quite a good job here: it managed to lay out the code so that the "memory reachable" branch is the shortest one. Indeed, when scanning the major heap in a generational GC, assuming that most scanned values are reachable is reasonable.
In this case, the loop condition will likely be predicted taken and the other jumps predicted not taken, and the result is a very tight loop with no random memory loads or stores. I suspect those conditions are very rarely met all at once.
> Additionally, what I'm missing in this story is what source code actually produces [...] (this instruction)
The post indeed does not make it clear, but the two instructions
andb $252, %ah
movq %rax, (%rbx)
come from the default branch of the switch in the C snippet
Hd_hp (hp) = Whitehd_hd (hd);
This line is full of macros. hp ("heap pointer", I think?) points to a value on the heap.
In OCaml, each value has an additional header word in front of it, which contains several packed fields: the size of the block, a tag representing the kind of the value, and a few bits used by the GC to store the "color".
This bit of code resets the color to "white". It is part of the GC sweep phase, which is responsible for deallocating blocks of unreachable memory. If a block is still reachable, its status is reset so that the next GC phase starts again from a clean state.
Hence this instruction only updates the bits storing the color, and should not modify the other parts of the header, like the block size and the tag.
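For the curious, here is a rough sketch of that header layout in C, reconstructed from memory of the runtime's mlvalues.h and gc.h (check the real headers for the authoritative definitions):

    typedef unsigned long header_t;   /* the word in front of each value */

    #define Tag_hd(hd)     ((hd) & 0xFF)        /* bits 0-7:  value kind    */
    #define Color_hd(hd)   ((hd) & (3UL << 8))  /* bits 8-9:  GC color      */
    #define Wosize_hd(hd)  ((hd) >> 10)         /* bits 10+:  size in words */

    #define Caml_white (0UL << 8)
    #define Caml_black (3UL << 8)

    /* Reset the color to white, leaving tag and size untouched: */
    #define Whitehd_hd(hd) ((hd) & ~Caml_black)

    #define Hd_hp(hp) (*(header_t *) (hp))      /* header word at hp */

Clearing the two color bits (bits 8 and 9) is exactly clearing the low two bits of the second byte of the header word, which is why GCC, with the header sitting in RAX, emits the fateful andb $252, %ah.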
Wouldn't a different layout of these packed bits result in code that never has to clear bits in the second byte? Is that layout fully internal to the OCaml GC?
I believe it's internal to the compiler/runtime/C interface (though hidden under abstractions), unless Marshal (an unsafe low-level binary serialization format for OCaml values) depends on it.
The lowest byte is filled with the tag. I guess it would be possible to swap the color and the tag, or to put the color at the beginning, before the size, but that would probably come with different tradeoffs.
The tag is used, for instance, in pattern matching and in a lot of other code. With this layout it can be accessed easily (a single byte load), which might be the reason the color is put in the second byte.