The paper seems to make no mention of the memory barriers in play, merely talking about the atomic operation and the conditional branch. Despite that, they indicate they produced a proof-of-concept exploit leaking data.
Traditionally, you would use an acquire, L-LS, barrier on entry to a critical section to order any memory accesses in the critical section after the load of the atomic operation completes. Without that, the critical section may speculatively execute on your core while somebody else holds the lock, and that speculative execution would be confirmed if the owner releases before your delayed acquiring load completes.
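For concreteness, here is roughly what that looks like with C11 atomics (a minimal spinlock sketch of my own, not anything from the paper):

    #include <stdatomic.h>

    /* Hypothetical spinlock, only to show where the acquire/release ordering sits. */
    typedef struct { atomic_int locked; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        int expected = 0;
        /* The successful CAS is the acquiring load: nothing inside the critical
           section may be architecturally ordered before it. */
        while (!atomic_compare_exchange_weak_explicit(
                   &l->locked, &expected, 1,
                   memory_order_acquire, memory_order_relaxed)) {
            expected = 0; /* a failed CAS overwrites `expected`; reset and retry */
        }
    }

    static void spin_unlock(spinlock_t *l) {
        /* The releasing store: everything in the critical section is ordered before it. */
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }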
For this paper to be true, either their synchronization primitives do not have acquire barriers, which would be wrong to start with, or the acquire barrier must not be stalling later memory accesses, which is how I have seen it portrayed. How would that even work correctly on normal code?
To preserve the L-L ordering implicit in the L-LS ordering, you need to guarantee that any loads after the acquiring load will see all stores explicitly ordered before the releasing store. You would need to guarantee that any speculative load sees at least the value that memory location holds immediately before the release barrier on the other core, even if you happened to issue that load arbitrarily long before the release barrier on the other core. I guess you could track every single speculative access after an L-* barrier and its dependency chain, and then invalidate and re-compute as you see stores until you resolve all pertinent operations prior to your L-* barrier (such as the acquiring load in this case)? If the chip people are actually doing that, that seems... wild.
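The classic message-passing litmus test is the cleanest way I know to state that requirement (my own illustration, not from the paper):

    #include <stdatomic.h>

    int        payload = 0;   /* plain data, written before the "unlock" */
    atomic_int ready   = 0;   /* the flag / lock word */

    void producer(void) {
        payload = 42;                                            /* ordinary store  */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* releasing store */
    }

    void consumer(void) {
        if (atomic_load_explicit(&ready, memory_order_acquire))  /* acquiring load  */
            /* If the acquire load observed 1, this later load must observe 42,
               even if the core speculatively issued it long before `ready` was read. */
            (void)payload;
    }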
The concept of ordering is independent of architecture. What operations need to be ordered relative to other operations to be correct does not change. But, your architecture may implicitly guarantee certain orderings so you do not need to explicitly guarantee those orderings.
If I remember correctly, x86 basically gives you everything except S-L ordering automatically. That means a traditional acquire, L-LS, and a traditional release, LS-S, barrier should be "free" and require no explicit instructions. I forgot about that, so it makes sense that they would not mention explicit barriers. But that does not change the fact that you need that ordering to guarantee correctness; it is just built-in and automatic.
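The one ordering that is not free is exactly the S-L case, i.e. the store-buffer (Dekker-style) litmus test. Again my own sketch, and exact codegen depends on the compiler, but on x86-64 acquire loads and release stores are typically plain movs, while forbidding the outcome below costs an explicit xchg or mfence:

    #include <stdatomic.h>

    atomic_int x = 0, y = 0;

    /* With only release stores and acquire loads, x86 may still let both
       threads return 0, because each store can be ordered after the other
       core's load (S-L reordering through the store buffer).  Forbidding
       that outcome requires seq_cst, i.e. an explicit full barrier. */
    int thread0(void) {
        atomic_store_explicit(&x, 1, memory_order_seq_cst);
        return atomic_load_explicit(&y, memory_order_seq_cst);
    }

    int thread1(void) {
        atomic_store_explicit(&y, 1, memory_order_seq_cst);
        return atomic_load_explicit(&x, memory_order_seq_cst);
    }
    /* Under seq_cst the two threads cannot both return 0. */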
The oddity here is that speculative loads across an L-LS ordering require some pretty deep black magic to be going on at the hardware level. I have no idea how they could safely reorder stores across an L-S ordering and still maintain "as-if" the store happens after the load, and even safely reordering loads across an L-L ordering while maintaining the "as-if" seems pretty arcane. The fact that the paper claims their PoC works, and thus that the reordering must be happening, is wild. Either the processor is not correctly implementing the declared memory ordering semantics, or it is doing some real reordering magic.
This kind of memory order speculation is basically required on x86, since the strong semantics would otherwise prevent many useful reorderings (especially L-L reordering, which is absolutely critical for performance).
The basic way it works is that pretty much any reordering is allowed, speculatively, even if it violates the explicit or implicit barriers, but the operations are tracked in a memory order buffer (MOB) until retirement. The MOB can detect whether the reordering was observable by another core and, if so, flush the pipeline (the so-called "memory order nuke", visible in performance counters).
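A rough sketch of the kind of pattern where this bites (my own example; on Intel the relevant counter is, if I remember the event name right, machine_clears.memory_ordering):

    #include <stdatomic.h>

    atomic_int a = 0, b = 0;

    void writer(void) {
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_store_explicit(&b, 1, memory_order_relaxed);
    }

    int reader(void) {
        /* x86 promises these two loads stay in program order (L-L), but the
           core may issue the younger load of `a` first, speculatively. */
        int rb = atomic_load_explicit(&b, memory_order_relaxed);  /* older load   */
        int ra = atomic_load_explicit(&a, memory_order_relaxed);  /* younger load */
        /* If the writer's store to `a` invalidates that line after the younger
           load already executed but before it retires, the MOB notices the early
           value may be stale, flushes the pipeline, and replays the loads, so
           rb == 1 && ra == 0 is never architecturally visible. */
        return (rb << 1) | ra;
    }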
Memory order is only preserved at the architectural level; at the microarchitectural level, all kinds of reordering, optimization, and speculation are done to maximize execution throughput. Section 7 of the paper I'm linking below can help explain how these microarchitectural optimizations can break memory order beyond the rollback point, indeed causing a Machine Clear (nuke) of the entire pipeline. We analyzed the entire class of Machine Clears and showed how the different types can create transient execution paths through which memory can be transiently leaked.
Spectre-v1 really is the ghost that keeps haunting us. All the mitigations I'm aware of work by containing the domain, for example inter-process boundaries together with the MMU, to limit the surface that can be leaked. How are we developers supposed to reason about code when, under speculation, the conditions we wrote no longer uphold the invariants we encoded?
This is described pretty well in the paper, but briefly: the "race condition" they exploit happens because the CPU speculatively (and incorrectly) thinks it acquired a lock. Multiple threads being inside the same critical section at once can be exploitable and lead to arbitrary read/write and malicious code execution (one case demonstrated in the paper is some random NFC function). The researchers abuse the fact that the critical section can be entered speculatively to gain speculative malicious code execution, which they then use to construct a gadget that leaks arbitrary kernel memory into side channels.
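In rough pseudo-C, the shape of the bug is something like this (my own sketch of the idea, not the paper's actual code or real kernel structures):

    #include <stdatomic.h>

    struct obj {
        void (*handler)(void);    /* eventual target of the speculative "gadget" */
    };

    atomic_int  lock;             /* 0 = free, 1 = held */
    struct obj *shared;

    void victim(void) {
        int expected = 0;
        if (atomic_compare_exchange_strong_explicit(
                &lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
            /* Architecturally we only get here after winning the lock.  But the
               branch predictor can guess "taken" and start executing the body
               before the cmpxchg resolves -- a speculative race window in which
               another thread may concurrently free or swap `shared`.  Calling a
               stale or attacker-controlled handler then yields speculative code
               execution, whose loads leave cache traces a side channel can read. */
            shared->handler();
            atomic_store_explicit(&lock, 0, memory_order_release);
        }
    }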
I know what speculative execution is and have some understanding of Spectre already. What I don't get is what it means to exploit a speculative race condition.
So the thing with speculative attacks is that you can use an invariant that is broken speculatively to leak things. In this case they use a speculative race condition to gain speculative code execution, much like you can use a normal race condition to gain real code execution. Arbitrary speculative code execution can be used to leak data which can be picked up via side channels.
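The leaking part is plain Spectre machinery: if the speculatively executed code can be pointed at something like the following, the secret ends up encoded in which cache line is hot (a hypothetical gadget of my own, just to illustrate the idea):

    #include <stdint.h>

    uint8_t probe[256 * 64];   /* one cache line per possible byte value */

    /* Only ever reached speculatively, e.g. from the mis-speculated critical section. */
    void leak_byte(const uint8_t *secret) {
        volatile uint8_t *p = probe;
        /* This access never retires, but it still pulls one line of `probe`
           into the cache; the attacker later times probe[i * 64] for each i
           to recover *secret. */
        (void)p[(uint32_t)(*secret) * 64];
    }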