> That means the L1 (and probably L2) caches are flushed every time you read or write to them on ARM systems. (x86 has stricter guarantees and almost "natively" implements SeqCst, so you won't have as many flushes on x86). The cache must be flushed on EVERY operation declared SeqCst. Its absurdly slow, the absolute slowest memory model you can have.

SeqCst is definitely the slowest memory model to use, but the bit about flushing caches doesn't sound right. SeqCst on a processor with a relaxed memory consistency model usually 'just' requires core-local operations on the load/store buffers - typically draining the buffers (though even that is not strictly necessary). It isn't necessary to flush caches to enforce consistency if the memory system has coherent caches.

Fences for sequential consistency are certainly heavy on ARM and will show up as poor performance, but I'd be very surprised if any aarch64 processor flushed any cache on LDAR/STLR.

I'd appreciate a correction if I've misunderstood or if there are modern ARM processors that do flush L1 on LDAR/STLR/DMB ISH.
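For what it's worth, a minimal sketch of the usual lowering (the exact instruction choices are my assumption from typical GCC/Clang output and are compiler- and target-dependent):

    #include <atomic>

    std::atomic<int> g{0};

    void seq_cst_store(int v) {
        // aarch64: typically a single STLR -- no cache maintenance instruction.
        // x86-64: typically an implicitly-locked XCHG (or MOV + MFENCE).
        g.store(v, std::memory_order_seq_cst);
    }

    int seq_cst_load() {
        // aarch64: typically a single LDAR.
        // x86-64: a plain MOV (loads are already strongly ordered).
        return g.load(std::memory_order_seq_cst);
    }

Neither mapping includes a cache-maintenance instruction.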




You bring up some good points, and have forced me to consult documentation to double-check a lot of intricate details.

I'm using the following as a reference: https://static.docs.arm.com/100941/0100/armv8_a_memory_syste...

> Fences for sequential consistency are certainly heavy on ARM and will show up as poor performance, but I'd be very surprised if any aarch64 processor flushed any cache on LDAR/STLR.

I agree. But LDAR / STLR are weaker than full sequential consistency. They're only acquire/release consistency.

So it's the DMB ISH that I'm curious about. If the memory is never flushed out of the L1 cache, how can you possibly establish a total sequential ordering with other threads?

EDIT: I had a bad example here originally. Let's use the SeqCst example from here instead: https://en.cppreference.com/w/cpp/atomic/memory_order#Sequen...

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<bool> x = {false};
    std::atomic<bool> y = {false};
    std::atomic<int> z = {0};
     
    void write_x()
    {
        x.store(true, std::memory_order_seq_cst);
    }
     
    void write_y()
    {
        y.store(true, std::memory_order_seq_cst);
    }
     
    void read_x_then_y()
    {
        while (!x.load(std::memory_order_seq_cst))
            ;
        if (y.load(std::memory_order_seq_cst)) {
            ++z;
        }
    }
     
    void read_y_then_x()
    {
        while (!y.load(std::memory_order_seq_cst))
            ;
        if (x.load(std::memory_order_seq_cst)) {
            ++z;
        }
    }
     
    int main()
    {
        std::thread a(write_x);
        std::thread b(write_y);
        std::thread c(read_x_then_y);
        std::thread d(read_y_then_x);
        a.join(); b.join(); c.join(); d.join();
        assert(z.load() != 0);  // will never happen
    }
Variable x is written by the "write_x" thread, while variable y is written by the "write_y" thread. Assume they're on different cache lines.

* Thread C may see "X=True THEN Thread C's reads THEN Y=True" (formally: Thread A happened before Thread C, which happened before Thread B). In this case, Thread C does not increment Z.

* Thread D may instead see "Y=True THEN Thread D's reads THEN X=True" (formally: Thread B happened before Thread D, which happened before Thread A). In this case, Thread D does not increment Z.

If both of those happen, Z ends up 0. This apparent contradiction is possible under acquire-release consistency, but not under sequential consistency. LDAR and STLR provide only acquire-release consistency.
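To make the contrast concrete, here's a sketch of the same example weakened to acquire/release only (my own variant of the cppreference code, reusing x, y, z from above); with these orderings the assert in main() is allowed to fire:

    // x, y, z declared as in the example above.
    void write_x() { x.store(true, std::memory_order_release); }
    void write_y() { y.store(true, std::memory_order_release); }

    void read_x_then_y()
    {
        while (!x.load(std::memory_order_acquire))
            ;
        if (y.load(std::memory_order_acquire))   // may still see false...
            ++z;
    }

    void read_y_then_x()
    {
        while (!y.load(std::memory_order_acquire))
            ;
        if (x.load(std::memory_order_acquire))   // ...while this also sees false
            ++z;
    }
    // Both readers can take the "false" branch, leaving z == 0: acquire/release
    // imposes no single total order over the two independent stores.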

So now let's ask: how do you implement sequential consistency?

There's no way for Thread A or B to do things correctly from their side. Thread A independently writes "X" at its own leisure, not knowing that Thread B is writing a "related" variable somewhere else called Y. There is no opportunity for A or B to coordinate with each other: they have no idea the other thread exists.

---------

To solve this problem, I imagine that a MESI cache will simply invalidate the cache line (which "forcibly ejects" the data out of the cores running Thread A and Thread B). Whatever order Thread A and Thread B create (X then Y... or Y then X) will be resolved in the L3 cache or DDR4 RAM.

Now... perhaps there's a modern optimization that doesn't require flushing the cache? But that's my mental model of sequential consistency. It's slow, but that's the only way I'm aware of to establish the strong SeqCst guarantee.

---------

However, you are right to challenge me, because I don't know. If it makes you feel better, I'll "retreat" to the more easily defended statement: "Load/Store Buffers of the CPU Core need to be flushed every time you SeqCst load/store on ARM systems" (which should effectively kill out-of-order execution in the core).

The question I have is: how does the L3 cache perceive a total ordering without the L1 cache flushing? I guess you've given me the thought that it's possible for it to happen, but I'm not seeing how it would be done.


There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of. Such a system would perform so slowly as to be practically unusable.

At most there is usually a local dispatch stall while store buffers (not caches) are flushed and pending loads complete. In some scenarios, like a cache-line crossing atomic operation on x86, you might need additional non-local work such as asserting a lock signal on an external bus. There might be some other work such as ensuring that all arrived invalidation requests have been processed.

Still, you are talking in the range of 10s or maybe low 100s of cycles. Nothing like a cache flush which would probably be an impact of 10,000s of cycles or more (depending on how large the caches are and where the data to reload them comes from).
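A rough way to see the magnitude (a quick-and-dirty sketch, not a rigorous benchmark; absolute numbers vary a lot by microarchitecture):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr int N = 100'000'000;
        std::atomic<int> v{0};

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            v.store(i, std::memory_order_relaxed);
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            v.store(i, std::memory_order_seq_cst);
        auto t2 = std::chrono::steady_clock::now();

        auto ms = [](auto d) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
        };
        // Expect the seq_cst loop to cost on the order of tens of cycles per
        // store -- painful, but nothing like what a cache flush per store
        // would imply.
        std::printf("relaxed: %lld ms, seq_cst: %lld ms\n",
                    (long long)ms(t1 - t0), (long long)ms(t2 - t1));
    }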


Exactly.

I just wanted to add that #StoreLoad fences (e.g. mfence on x86), as far as I know, do not actually flush the store buffer per se. They just stall the pipeline (technically they only need to stall loads) until all stores prior to the fence have been flushed out by the normal store buffer operation, i.e. the store buffer is continuously draining as fast as it can all the time anyway.

You didn't imply otherwise, but I wanted to clarify that because I have seen comments elsewhere and in code claiming that a fence would make a prior store visible faster (i.e. the fence was added to improve latency instead of being required for correctness), which I do not think is the case, at least at a microarchitectural level (things are more complex when a compiler is involved of course).
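For context, the store-buffering (Dekker-style) litmus test is the case where a #StoreLoad fence is needed at all; a sketch with explicit fences (relaxed accesses plus seq_cst fences, roughly where an mfence or dmb ish would sit):

    #include <atomic>

    std::atomic<int> a{0}, b{0};
    int r1, r2;

    void thread1() {
        a.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // #StoreLoad
        r1 = b.load(std::memory_order_relaxed);
    }

    void thread2() {
        b.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // #StoreLoad
        r2 = a.load(std::memory_order_relaxed);
    }
    // Without the fences, r1 == 0 && r2 == 0 is allowed: each store can still
    // be sitting in its core's store buffer when the other core's load runs.
    // With the fences, that outcome is forbidden -- and the fence achieves
    // this by waiting for the store buffer to drain, not by touching the cache.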


Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state - so you can hide a lot of the cost of atomic operations by ensuring there is a long-as-possible series of non-load/store operations after them (of course, this is often not possible).

mfence used to work like that, but due to various bugs/errata was "upgraded" to now block execution of all subsequent instructions until it retires (like mfence) in addition to its store-draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can be less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the one at the head of the store buffer can be fetched early, so-called "RFO prefetch" - this gives MLP in the store pipeline. However, this will be limited by the available fill buffers and perhaps also by heuristics ramping back this feature when fill buffers are highly used even if some are available (since load latency is generally way more important than store latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by, e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).


> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state.

That's great to know. I suspected that was the case and they had moved from the stall the pipeline approach, but I had never tested it.


Sorry that should say "(like lfence)" not "(like mfence)".


> There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of.

Good to know. I've seen enough of your other posts to trust you at your word.

BTW: I'll have you know that AMD GPUs implement "__threadfence()" as "buffer_wbinvl1_vol", which is documented as:

> buffer_wbinvl1_vol -- Write back and invalidate the shader L1 only for lines that are marked volatile. Returns ACK to shader.

So I'm not completely making things up here. But yes, I'm currently looking at some GPU assembly which formed the basis of my assumption. So you can mark at least ONE obscure architecture that pushes data out of the L1 cache on a fence / memory barrier!


GPU L1 caches are typically not coherent, so flushing them is necessary on GPU architectures.


Right, I could have been clearer that I was restricting my comments to general purpose CPUs.


As a minor note, LDAR and STLR are poorly named: they are documented as providing multicopy atomicity in addition to acquire-release semantics, and they are the recommended way to implement C11 SeqCst.

Some of the below is almost certainly incorrect, apologies in advance.

Your example is more or less the Independent Read Independent Write (IRIW+syncs) litmus test. The short version is you don't need to flush any caches to get IRIW to always obey sequential consistency.

Originally, ARMv8 was non-multicopy atomic, but AFAIK there was never any non-multicopy atomic implementation, and a revised spec requires multicopy atomicity. This matters in terms of thinking about how sequential consistency for IRIW is implemented, but it works either way.

The core idea is that the DMB ISH ensures that the values received by the first reads in Threads C and D are visible to all threads. With multicopy atomicity this is free - e.g. if Thread C reads true from X, then all other threads will also read true.

(-> below indicates a sequenced before relationship)

So you get

Thread C: read X as true -> dmb ish -> read Y as false

Thread D: read Y as true -> dmb ish -> read X as false

However, you get some extra edges. If a thread reads one of the variables as false, then (by multicopy atomicity) the corresponding write of true was not yet visible to any thread at that point. So you get

(Thread C reads Y as false) -> (Thread D reads Y as true)

(Thread D reads X as false) -> (Thread C reads X as true)

Together with the per-thread orderings above, these edges are enough to produce a cycle, which forbids the z=0 case.

Essentially, threads A and B have to make sure that once any thread sees its write, all threads see its write. (Or alternatively, the dmb ish on Threads C and D has to ensure that future memory accesses from those threads don't proceed until the prior read is consistent across all cores.)
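Concretely, the shape being described maps onto the readers like this (a sketch; the dmb ish placement is the usual aarch64 lowering of a seq_cst fence, and the writers stay as in the original example):

    // Readers rewritten with relaxed loads and an explicit fence marking
    // where the DMB ISH sits; x, y, z as in the earlier example.
    void read_x_then_y()
    {
        while (!x.load(std::memory_order_relaxed))
            ;
        std::atomic_thread_fence(std::memory_order_seq_cst); // dmb ish
        if (y.load(std::memory_order_relaxed))
            ++z;
    }

    void read_y_then_x()
    {
        while (!y.load(std::memory_order_relaxed))
            ;
        std::atomic_thread_fence(std::memory_order_seq_cst); // dmb ish
        if (x.load(std::memory_order_relaxed))
            ++z;
    }
    // The fence only has to hold back the later load until the value returned
    // by the first load is visible to every core; no cache line needs to be
    // written back or invalidated.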


Thanks for the in-depth answer!

> Essentially, threads A and B have to make sure that once any thread sees its write, all threads see its write

Gotcha. So that's where cache snooping comes in and why that's sufficient to handle the case. I think I see it now. Again, thanks for breaking it down for me. So it appears the only thing necessary is for the DMB ISH to ensure that all cache-snoop messages are "processed" before proceeding.



