The fun thing is that only one instruction used AH, and it can be done perfectly...

acqq · on June 29, 2017

Wouldn't the instruction be bigger if EAX were used? I guess it would then need a 4-byte constant, whereas with this encoding it's one byte. How else would you clear the 2 lowest bits in the second byte of EAX?

For reference, we are talking about:

    andb $252, %ah

I believe it's not an effective speed optimization, given the way modern CPUs work, but it does produce a smaller instruction?
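To make the question concrete, here is a minimal C sketch of what that instruction computes when the register is viewed as a whole 64-bit value. The names `AH_LOW2_MASK` and `clear_ah_low2` are made up for illustration; the only facts used are that AH is bits 8..15 of the register and that 252 is 0xFC.

```c
#include <stdint.h>

/* Hypothetical name: AH is bits 8..15 of the register, so the two lowest
   bits of AH are bits 8 and 9 of the full value. */
#define AH_LOW2_MASK ((uint64_t)0x300)

/* Clears bits 8 and 9 and leaves every other bit untouched, which is the
   effect of `andb $252, %ah` on RAX (252 == 0xFC). */
uint64_t clear_ah_low2(uint64_t rax)
{
    return rax & ~AH_LOW2_MASK;
}
```

Whether a compiler turns this expression into the one-byte-immediate form on AH or into a wider `and` is up to the compiler and its flags; the C source only states which bits must change.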

Additionally, what I'm missing in this story is which source code actually produces "I need to clear exactly these two bits while keeping the others" in a loop. For that CPU bug to manifest, it has to be inside "short loops of less than 64 instructions."

wolfgke · on June 29, 2017

> Wouldn't the instruction be bigger if EAX were used? I guess it would then need a 4-byte constant, whereas with this encoding it's one byte. How else would you clear the 2 lowest bits in the second byte of EAX?

Indeed. Here is a list of encodings of the original instruction and its natural alternatives:

> and ah,0xfc: 80E4FC [3 bytes]

> and ax,0xfcff: 6625FFFC [4 bytes]

EDIT: Note that

> and eax,0xfffffcff: 25FFFCFFFF [5 bytes]

does something different: it also zeroes the upper 32 bits of the rax register (Exercise: Why?), which matters because the next line

> movq %rax, (%rbx)

depends on those upper bits being left unchanged.
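As a rough illustration of the difference, the following C sketch models the effect of each of the three instructions on a 64-bit register value. The function names are hypothetical, and this is a model of the architectural result only, not of how the CPU executes anything. The third function relies on the x86-64 rule that writing a 32-bit register zero-extends into the full 64-bit register.

```c
#include <stdint.h>

/* and ah,0xfc : only bits 8..15 are written; the other 56 bits are kept. */
uint64_t model_and_ah(uint64_t rax)
{
    uint64_t ah = (rax >> 8) & 0xFF;
    ah &= 0xFC;
    return (rax & ~(uint64_t)0xFF00) | (ah << 8);
}

/* and ax,0xfcff : only bits 0..15 are written; the upper 48 bits are kept. */
uint64_t model_and_ax(uint64_t rax)
{
    uint64_t ax = rax & 0xFFFF;
    ax &= 0xFCFF;
    return (rax & ~(uint64_t)0xFFFF) | ax;
}

/* and eax,0xfffffcff : in 64-bit mode a write to a 32-bit register
   zero-extends into the 64-bit register, so bits 32..63 are lost. */
uint64_t model_and_eax(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax & 0xFFFFFCFFu;
    return (uint64_t)eax;
}
```

For a value such as 0x123456789ABCDEF0, the first two models both give 0x123456789ABCDCF0, while the third gives 0x000000009ABCDCF0, which would corrupt a 64-bit value stored afterwards.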

acqq · on June 29, 2017

Since the CPU bug manifests only when "AH, BH, CH or DH" are used, compilers could at least be patched to use the 4-byte variant. But if Intel properly updates all the affected processors, that won't be needed.

(Response to wolfgke's edit: don't forget the "sign extend" variants).

jorisgio · on June 29, 2017

We pondered reporting this directly to GCC too, so that they could implement a workaround. But despite the buzz this issue generated in the press, I believe Intel is right in saying that this bug is very unlikely to trigger.

They say it can only happen in tight loops, and although the C code is quite large, the hot path has only 17 instructions in the loop, with only sequential memory loads (since the GC is scanning the heap linearly and most OCaml values are small).

GCC actually did quite a good job here: it managed to lay out the code so that the "memory reachable" branch is the shortest one. Indeed, when scanning the major heap in a generational GC, assuming that most scanned values are reachable is reasonable.

In this case, the loop condition will likely be predicted true and the other jumps predicted false, and the result is a very tight loop with no random memory loads or stores. I suspect such a combination of conditions is very unlikely.

yuhong · on June 29, 2017

Fortunately, immediate constants are normally still 32-bit even when operating on 64-bit registers, so the only cost is the REX byte.
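The reason a 32-bit immediate is enough here is that it is sign-extended to 64 bits before the operation, so 0xFFFFFCFF becomes 0xFFFFFFFFFFFFFCFF and the upper half of the register survives. A small C sketch of that widening follows; the function names are hypothetical, and the code models the result only.

```c
#include <stdint.h>

/* Widens a 32-bit immediate the way a 64-bit operand-size instruction does:
   bit 31 is copied into bits 32..63. Uses only unsigned arithmetic, so it
   does not depend on implementation-defined signed conversions. */
uint64_t sign_extend_imm32(uint32_t imm)
{
    return ((uint64_t)imm ^ 0x80000000u) - 0x80000000u;
}

/* Model of a 64-bit `and` with a sign-extended 32-bit immediate. */
uint64_t model_and_rax_imm32(uint64_t rax, uint32_t imm)
{
    return rax & sign_extend_imm32(imm);
}
```

With the immediate 0xFFFFFCFF this clears only bits 8 and 9 of the 64-bit value. The encoding should be the 5-byte `25 FF FC FF FF` form with a REX.W prefix in front (`48 25 FF FC FF FF`, six bytes), which matches the "only the REX byte" cost mentioned above.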

jorisgio · on June 29, 2017

> Additionally, what I miss in this story is which source code actually produces [...] (this instruction)

Indeed, the post does not make it clear, but the two instructions

    andb    $252, %ah
    movq    %rax, (%rbx)

come from the default branch of the switch in the C snippet

    Hd_hp (hp) = Whitehd_hd (hd);

This line is full of macros. hp (heap pointer, I think?) points to a value on the heap.

In OCaml, each value has an additional word in front, which contains several packed parts: the size of the block, a tag that represents the kind of value, and a few bits used by the GC to store the "color".

This bit of code resets the color to "white". The code here is part of the GC sweep phase, which is responsible for deallocating blocks of unreachable memory. If the block is still reachable, its status is reset so that the next GC phase starts again from a clean state.

Hence this instruction only updates the bits storing the color value and should not modify other parts of the header, such as the size of the block and the tag.
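The packing described above can be sketched in C as follows. This is a simplified model, not the runtime's actual macros: all names are hypothetical, and the exact bit positions (tag in bits 0..7, color in bits 8..9, size in the remaining upper bits of a 64-bit word) are an assumption inferred from the `andb $252, %ah` instruction and from the tag living in the lowest byte. White is taken to be the all-zero color, since the instruction simply clears both bits.

```c
#include <stdint.h>

typedef uint64_t header_t;

/* Assumed layout: [ size: bits 10..63 | color: bits 8..9 | tag: bits 0..7 ] */
#define HD_TAG_MASK    ((header_t)0xFF)
#define HD_COLOR_SHIFT 8
#define HD_COLOR_MASK  ((header_t)3 << HD_COLOR_SHIFT)
#define HD_SIZE_SHIFT  10

/* Packs the three fields into one header word. */
header_t make_header(uint64_t size, unsigned color, unsigned tag)
{
    return ((header_t)size << HD_SIZE_SHIFT)
         | ((header_t)(color & 3) << HD_COLOR_SHIFT)
         | ((header_t)tag & HD_TAG_MASK);
}

unsigned header_tag(header_t hd)   { return (unsigned)(hd & HD_TAG_MASK); }
unsigned header_color(header_t hd) { return (unsigned)((hd & HD_COLOR_MASK) >> HD_COLOR_SHIFT); }
uint64_t header_size(header_t hd)  { return hd >> HD_SIZE_SHIFT; }

/* Resets the color to white (0) without touching the size or the tag. */
header_t whiten_header(header_t hd)
{
    return hd & ~HD_COLOR_MASK;
}
```

In this model, `whiten_header` is a single `and` with a mask that has only bits 8 and 9 cleared, which is the kind of operation a compiler may choose to emit as a byte-sized `and` on AH.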

acqq · on June 29, 2017

Wouldn't a different layout of these packed bits result in code that never has to clear bits in the second byte? Is that layout fully internal to the OCaml GC?

jorisgio · on June 29, 2017

I'm not really qualified to explain why such a layout was chosen; my understanding of OCaml internals is limited. The header layout is described here:

https://github.com/ocaml/ocaml/blob/4.04.2/byterun/caml/mlva...

I believe it's internal to the compiler/runtime/C interface (but hidden under abstraction), unless Marshal (an unsafe low-level binary serialization format for OCaml values) depends on it.

The lowest byte is filled with the tag. I guess it should be possible to swap the color and the tag, or put the color at the beginning before the size, but that probably would have a different tradeoff.

The tag is used, for instance, in pattern matching and in a lot of other code. With this layout it can be accessed easily, which might be the reason the color is put in the second byte.
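To illustrate the tradeoff, here is a small C comparison of reading the tag under the layout described above and under a purely hypothetical swapped layout with the color in bits 0..1 and the tag in bits 2..9. The swapped layout and all function names are invented for this sketch and are not taken from the OCaml runtime.

```c
#include <stdint.h>

/* Layout described in the thread: the tag is the lowest byte, so reading
   it is a plain byte mask (or a single byte load from memory). */
unsigned tag_current_layout(uint64_t hd)
{
    return (unsigned)(hd & 0xFF);
}

/* Hypothetical swapped layout: color in bits 0..1, tag in bits 2..9.
   Reading the tag now needs a shift in addition to the mask. */
unsigned tag_swapped_layout(uint64_t hd)
{
    return (unsigned)((hd >> 2) & 0xFF);
}

/* In the swapped layout, whitening touches only the lowest byte. */
uint64_t whiten_swapped_layout(uint64_t hd)
{
    return hd & ~(uint64_t)3;
}
```

Under the swapped layout, the color reset would no longer involve the second byte, but every tag read would pay for an extra shift, which fits the guess that the tag was given the cheapest position because it is read far more often.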

enjolras · on June 29, 2017

Indeed. Looking at the generated assembly, it looks like GCC is too smart for its own good.