One time, during my company's development of a product, we kept encountering a problem while writing a lot of data to an eMMC chip. Was it a problem with the circuit design? Our eMMC controller? Software driver?
It wasn't until we found a unit that did not exhibit the problem that we realised the problem was isolated to a particular batch of eMMC chips from the vendor (much like the OCaml team only realising it was a Skylake problem when they discovered their code worked fine on non-Skylake systems).
On a board, we were using an ADC chip. We bought the ICs from an official distributor. They sent us the chips, and we populated the boards (~200 of them).
None of the boards worked properly. We debugged for so long. After so much wasted time, we found the root cause with the help of the manufacturer and the distributor: the manufacturer had actually produced a different IC, but marked the packages with our part number. It just happened that the power/ground pins matched our IC, so there were no electrical problems.
Another case:
We designed an ASIC and taped it out. The silicon arrived, and we then sent it to a packaging house to get it into a QFP package. The packaging house misread our specification and placed the die in the package rotated 90 degrees. That was fun to debug!
The 90 degree rotation thing is surprisingly common. I've seen it done at least twice. Also several instances of edge connectors that were entirely backwards due to being placed in CAD on the wrong side of the board, or incorrectly pin numbered.
> This is a fair amount of optimisation passes and it would have been too time consuming to try them one by one
I loved the fact that they used bisection, but then I was shocked to read this. This isn't the case at all unless you're extremely unlucky. If you just do bisection again, you can test with half the flags on and the other half off. You can frequently discard half the flags this way to narrow down the set, and on average it should be much faster than testing them one by one.
Unfortunately that often doesn't work well for optimisation passes, as many passes end up relating to each other.
However, you can still keep removing sets of optimisations and hope the bug remains, and do a final pass where you remove each optimisation one at a time.
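For illustration, here is a minimal sketch of that in C. The flag names and the test() oracle are made up for this sketch; a real driver would recompile and rerun the failing test case with exactly the flags whose bit is set in the mask.

    #include <stdio.h>

    #define NFLAGS 8

    static const char *flags[NFLAGS] = {
        "-ftree-vectorize", "-funroll-loops", "-fgcse", "-finline-functions",
        "-fipa-cp", "-fpeel-loops", "-fschedule-insns", "-ftree-pre",
    };

    /* Stand-in oracle for this sketch: pretend the bug needs flags 1 and 5
       together.  A real driver would rebuild and rerun the test case with
       the flags whose bit is set in `mask`, returning 1 on reproduction. */
    static int test(unsigned mask)
    {
        return (mask & (1u << 1)) && (mask & (1u << 5));
    }

    /* Greedily try to drop chunks of flags, from half-sized chunks down to
       single flags, keeping any removal after which the bug still shows. */
    static unsigned minimize(unsigned suspect)
    {
        for (int width = NFLAGS / 2; width >= 1; width /= 2)
            for (int lo = 0; lo + width <= NFLAGS; lo += width) {
                unsigned chunk = (((1u << width) - 1u) << lo) & suspect;
                if (chunk && chunk != suspect && test(suspect & ~chunk))
                    suspect &= ~chunk;  /* bug survives: these are innocent */
            }
        return suspect;
    }

    int main(void)
    {
        unsigned culprits = minimize((1u << NFLAGS) - 1u);
        for (int i = 0; i < NFLAGS; i++)
            if (culprits & (1u << i))
                printf("still implicated: %s\n", flags[i]);
        return 0;
    }

The chunked passes pay off once the flag set gets large; the final width-1 sweep still catches interacting passes that the halving couldn't separate.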
Another problem is that in the past I found compiler bugs from running unusual sets of optimisation passes.
A useful technique in the design of an optimizing compiler is to include an "odometer test" on all endomorphic discretionary transformations. E.g., at a spot where one has a big go/no-go predicate that determines whether a change to the program is valid and beneficial, add a final term that's a call to a routine that returns true if odometer++ < limit. You can then use bisection on that limit and zero in quickly on the failing change.
This function is also a great place to put logging. I typically make it look like a printf() that returns a Boolean.
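Roughly like this in C (names are hypothetical, with the limit read from an environment variable so a driver script can bisect it):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    static long odometer = 0;
    static long limit = -1;   /* -1 means "unlimited" */

    /* The printf-that-returns-a-Boolean: true while we are under budget.
       Every discretionary change gets a number in the log, so a bisected
       limit maps directly to a line in the output. */
    static int odo(const char *fmt, ...)
    {
        long n = odometer++;
        if (limit >= 0 && n >= limit)
            return 0;
        va_list ap;
        va_start(ap, fmt);
        fprintf(stderr, "[odo %ld] ", n);
        vfprintf(stderr, fmt, ap);
        fputc('\n', stderr);
        va_end(ap);
        return 1;
    }

    /* At each transformation site the go/no-go predicate gains one term,
       e.g. (hypothetical names):

           if (change_is_valid(c) && change_is_beneficial(c)
               && odo("inline %s into %s", callee_name, caller_name))
               apply_change(c);

       Demo main so this compiles standalone: */
    int main(void)
    {
        const char *s = getenv("ODO_LIMIT");
        limit = s ? atol(s) : -1;
        for (int i = 0; i < 5; i++)
            odo("considering transformation %d", i);
        return 0;
    }

Then bisect on ODO_LIMIT: if the output program is good at limit N and bad at N+1, change number N is your culprit.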
If you haven't explicitly enabled microcode updates for your distribution, then chances are good it's disabled. You should check into it.
Most of the time that just requires installing a package, but you might also need to rebuild initramfs. For NixOS, it's just a single flag in configuration.nix as usual.
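(For the record, if I remember the NixOS option names right, it's one of:

    hardware.cpu.intel.updateMicrocode = true;   # in configuration.nix
    hardware.cpu.amd.updateMicrocode = true;

depending on your CPU.)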
When I was a newbie, I would often say things like "maybe the compiler is broken" or "maybe the cpu is doing it wrong."
Of course, it was never the tools. It was always me.
Years later, of course, I work on unreleased hardware. And yeah, sometimes it's broken. It's really fucking odd. Write C code, and holy shit, it's actually a compiler bug or CPU bug.
Life was simpler when I was just ignorant and egotistical. Now I'm... well, still ignorant and egotistical, I guess, but often deal with actual hardware bugs.
Finding an assembler bug is also interesting. I ran up against a nasm bug once that kept me debugging for quite a while. The good thing with assembler bugs, however, is that you can disassemble the generated machine code back into something really close to your original source code... That helped a lot.
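For example, something like this (file names made up; ndisasm ships with nasm):

    nasm -f bin suspect.asm -o suspect.bin
    ndisasm -b 64 suspect.bin    # then diff against what you thought you wrote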
Wouldn't the instruction be bigger if EAX was used? I guess it would have to have a 4-byte constant then; with this encoding it's one byte. How would you clear the 2 lowest bits in the second byte of EAX?
For reference, we're talking about:
andb $252, %ah
I believe it's not an effective speed optimization, given the way modern CPUs work, but it does produce a smaller instruction?
Additionally, what I'm missing in this story is what source code actually produces "I need to clear exactly these two bits while keeping the others" in a loop. For that CPU bug to manifest, it has to be inside "short loops of less than 64 instructions."
> Wouldn't the instruction be bigger if EAX was used? I guess it would have to have a 4-byte constant then; with this encoding it's one byte. How would you clear the 2 lowest bits in the second byte of EAX?
Indeed. Here is a list of encodings of the original instruction and its natural alternatives:
> and ah,0xfc: 80E4FC [3 bytes]
> and ax,0xfcff: 6625FFFC [4 bytes]
EDIT: Note that
> and eax,0xfffffcff: 25FFFCFFFF [5 bytes]
does something different: It also zeros the upper 32 bits of the rax register (Exercise: Why?), which is important since the next line
> movq %rax, (%rbx)
depends on those upper bits not having been changed.
At least, since the CPU bug manifests only when "AH, BH, CH or DH" are used, compilers could be patched to use the 4-byte variant. But if Intel properly updates all the affected processors, that won't be needed.
(Response to wolfgke's edit: don't forget the "sign extend" variants).
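For completeness, the sign-extended 64-bit variant, which does preserve the upper 32 bits (if I have the encoding right):

> and rax,0xfffffffffffffcff: 4825FFFCFFFF [6 bytes]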
We pondered reporting this directly to GCC too, so that they could implement a workaround. But despite the buzz this issue generated in the press, I believe Intel is right in saying that this bug is very unlikely to trigger.
They say it can only happen in tight loops, and although the C code is quite large, the hot path is a loop of only 17 instructions, with only sequential memory loads (since the GC is scanning the heap linearly and most OCaml values are small).
GCC actually did quite a good job here: it managed to lay out the code so that the "memory reachable" branch is the shortest one. Indeed, when scanning the major heap in a generational GC, assuming that most scanned values are reachable is reasonable.
In this case, the loop condition will likely be predicted taken and the other jumps predicted not taken, and the result is a very tight loop with no random memory loads or stores. I suspect those conditions are very rarely met all at once.
> Additionally, what I'm missing in this story is what source code actually produces [...] (this instruction)
The post indeed does not make it clear, but the two instructions
andb $252, %ah
movq %rax, (%rbx)
come from the default branch of the switch in the C snippet
Hd_hp (hp) = Whitehd_hd (hd);
This line is full of macros. hp ("heap pointer", I think?) points to a value on the heap.
In OCaml, each value has an additional header word in front of it, which contains several packed fields: the size of the block, a tag representing the kind of the value, and a few bits used by the GC to store the "color".
This bit of code resets the color to "white". It is part of the GC sweep phase, which is responsible for deallocating blocks of unreachable memory. If a block is still reachable, its status is reset so that the next GC phase starts again from a clean state.
Hence this instruction only updates the bits storing the color, and should not modify the other parts of the header, like the block size and the tag.
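For the curious, here is a rough sketch of that header layout in C, reconstructed from memory of the runtime's mlvalues.h and gc.h (check the real headers for the authoritative definitions):

    typedef unsigned long header_t;   /* the word in front of each value */

    #define Tag_hd(hd)     ((hd) & 0xFF)        /* bits 0-7:  value kind    */
    #define Color_hd(hd)   ((hd) & (3UL << 8))  /* bits 8-9:  GC color      */
    #define Wosize_hd(hd)  ((hd) >> 10)         /* bits 10+:  size in words */

    #define Caml_white (0UL << 8)
    #define Caml_black (3UL << 8)

    /* Reset the color to white, leaving tag and size untouched: */
    #define Whitehd_hd(hd) ((hd) & ~Caml_black)

    #define Hd_hp(hp) (*(header_t *) (hp))      /* header word at hp */

Clearing the two color bits (bits 8 and 9) is exactly clearing the low two bits of the second byte of the header word, which is why GCC, with the header sitting in RAX, emits the fateful andb $252, %ah.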
Wouldn't a different layout of these packed bits result in code that never has to clear bits in the second byte? Is that layout fully internal to the OCaml GC?
I believe it's internal to the compiler/runtime/C interface (though hidden under abstractions), unless Marshal (an unsafe low-level binary serialization format for OCaml values) depends on it.
The lowest byte is filled with the tag. I guess it would be possible to swap the color and the tag, or to put the color at the beginning, before the size, but that would probably come with different tradeoffs.
The tag is used, for instance, in pattern matching and in a lot of other code. With this layout it can be accessed easily (a single byte load), which might be the reason the color is put in the second byte.