> It can be helpful to know that, on all modern processors, you can safety assig...

dmm · on June 15, 2016

If you're doing this in assembly, no problem so long as you understand the memory model of the cpu architecture you're targeting.

The memory model implemented by c compilers absolutely do not allow this. The problem is not cache coherency but optimization. The compiler assumes it understands the visibility of the variables and reorders, combines, and eliminates reads and writes. C is not a high-level assembler.

> A safer bet would be to use __atomic_load ... C11 atomics

These should work but be careful. Use the semantics defined by these constructs not the semantics of any assembly you might imagine the compiler generating.

speleo_engr · on June 15, 2016

Sure, but isn't your assembley code going to look a lot like C code with atomics anyway as you use barrier or in order execution instructions (e.g., dmb on ARM or eieio on PPC)?

Unklejoe · on June 15, 2016

> [Using "volatile" should make stores and loads happen where they are written in the code...]

Really? I was under the impression that "volatile" basically did nothing other than prevent the compiler from optimizing out a variable that appears to not ever be written to, but is actually written to via an ISR or something.

Also, how does the compiler usually handle global variables? Since they can be modified by some code that was linked in, does it assume them to be volatile?

Of course, this issue is slightly different than the atomic-ness of certain data types. We're talking more about situations where you have a series of atomic operations which you expect to happen in the order of which they're written.

exDM69 · on June 15, 2016

> Really? I was under the impression that "volatile" basically did nothing other than prevent the compiler from optimizing out a variable that appears to not ever be written to, but is actually written to via an ISR or something.

I think it also stops loads and stores to be reordered with other volatile loads and stores, and yes, it indeed is typically useful for memory mapped i/o, interrupt handlers and such. But most of the time it's not the right thing to do.

> Also, how does the compiler usually handle global variables? Since they can be modified by some code that was linked in, does it assume them to be volatile?

No, they're not considered volatile but may have stricter guarantees with reordering than local variables (esp. when combined with calls to foreign functions).

If multithreaded code needs to be correct and portable, using locks or atomics is the way to go.

vardump · on June 15, 2016

> I think it also stops loads and stores to be reordered with other volatile loads and stores, and yes, it indeed is typically useful for memory mapped i/o, interrupt handlers and such. But most of the time it's not the right thing to do.

It only prevents reordering by the compiler. CPU can still reorder them as much as it pleases, volatile doesn't affect it at all.

Of course you're pretty safe on x86. Not so on other platforms. Typically you find this kind of issues when porting x86 code to some other platform. X86 writes are always in order with other writes. Loads are always in order to other loads. Just need to remember loads are sometimes ordered before stores.

Again, volatile does not prevent CPU from reordering stores and loads.

Dylan16807 · on June 16, 2016

If the compiler isn't preventing CPU reordering, then it has been impossible pre-2011 to use volatile for its original purpose, communicating with memory-mapped device registers. That sounds like a broken compiler to me.

vardump · on June 16, 2016

Yeah, it'd not be very nice when DMA start bit in some DMA control register is poked before DMA address is still unset!

Those architectures that do this typically provide specialized always serialized instructions for communicating with memory mapped devices. If not, you need to have proper fences to ensure right order.

X86 stores are always in order (except for some specific instructions, but they're no concern in this context), so it doesn't need this.

makomk · on June 17, 2016

Memory-mapped device registers are generally mapped in such a way that accesses bypass cache and go straight to memory. This has a fairly large performance penalty, of course.

vvanders · on June 15, 2016

Yup.

Don't. Use. Volatile.

It's meant for reading from hardware registers and nothing else.

There's nothing to prevent the re-ordering of reads/code around your volatile block. This is doubly nasty in that Windows will insert a memory fence which makes everything look like you'd want it. Soon as you port to another platform, boom all sorts of concurrency issues.

Seriously, use atomics. That's what they're built for.

vardump · on June 15, 2016

> This is doubly nasty in that Windows will insert a memory fence

Windows is not inserting any memory fences anywhere in your code. MSVC cl compiler will also only do so when you specifically ask it to do so by using an intrinsic.

vvanders · on June 15, 2016

> Visual Studio interprets the volatile keyword differently depending on the target architecture. ... For architectures other than ARM, if no /volatile compiler option is specified, the compiler performs as if /volatile:ms were specified; therefore, for architectures other than ARM we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives and compiler intrinsics when you are dealing with memory that is shared across threads.

> /volatile:ms

> Selects Microsoft extended volatile semantics, which add memory ordering guarantees beyond the ISO-standard C++ language. Acquire/release semantics are guaranteed on volatile accesses. However, this option also forces the compiler to generate hardware memory barriers, which might add significant overhead on ARM and other weak memory-ordering architectures. If the compiler targets any platform except ARM, this is default interpretation of volatile.

From https://msdn.microsoft.com/en-us/library/12a04hfd.aspx

Really, just don't use volatile.

vardump · on June 15, 2016

I think I really need to read disassembler output of any code I compile with cl. Ugh.

I haven't seen it doing that on any of my code using volatile, mostly for CAS and fetch-and-add (lock xadd).

biokoda · on June 15, 2016

> A safer bet would be to use __atomic_load and __atomic_store

Safer and an order of magnitude slower.

exDM69 · on June 15, 2016

Sure, but the only way to be sure that a non-atomic store works as intended is to read the assembly output from the compiler.

It's also not very easy to predict the relative difference because it depends on which core(s) the threads get scheduled on. If both threads happen to get on the same core and the data stays in caches, we're talking about 1ns (L1) to 5ns (L2). An atomic load/store would be about 20ns (4..20x slower).

If they happen to get scheduled on different cores, the CPU interconnect will have to deal with its cache coherency protocol (similar to atomic load/store) and the result will be around the same.

We're talking about pretty miniscule amount of time here. It's a choice between fast if you're lucky but possibly incorrect anyway or marginally slower but guaranteed correctness.

Unklejoe · on June 15, 2016

Wouldn't this be the case anyway (cache coherency overhead) even if we used types that just-so-happened to be atomic on the platform, such as "int"?

In other words, isn't this a non issue when comparing the performance of atomic types and "regular" types, as they both would require this?

Also, do the atomic types handle ordering?

I could be going crazy, but I remember seeing some of those atomic types simply being typedef'd to "int" in some OS code before.

EDIT: I was thinking of the "atomic_t, atomic_inc, atomic_set, etc..." in the Linux Kernel. It guarantees atomic behavior, but not ordering. Memory barriers are still required when using it between different threads. The C11 types appear to be different, as they accept an ordering argument in their load/store functions.

anarazel · on June 15, 2016

Many of them actually do provide guarantees (as e.g. evidenced that a typical implementation of an x86 memory barrier is an "lock; addl $0,0(%%rsp)"). That's why there's now stuff like __smp_mb__before_atomic()/__smp_mb__after_atomic(), to avoid unnecessary atomic ops.

jblow · on June 15, 2016

No. The way to ensure that your store works atomically is to specify the assembly output of the compiler. Reading what it is outputting right now doesn't matter, because how do you know when it is going to change its mind?

exDM69 · on June 15, 2016

You don't know. That's why it's a terrible solution.

No-one inspects their binaries after every recompilation or compiler (flag) change.

anarazel · on June 15, 2016

Why would a relaxed load/store be slower? They'll usually just compile down to a mov or whatever the equivalent is?

Bromskloss · on June 15, 2016

Do you mean that a compiler optimisation could cause tearing of the mentioned data types? That's surprising to me and I'd like to know more about it.

biokoda · on June 15, 2016

Not tearing, but it could reorder the assignment so that it happens someplace else than you think it will. For this kind of thread synchronization memory barriers should be used at least.

exDM69 · on June 15, 2016

Exactly.

To give a concrete example (but artificial) what might happen, consider this case:

    global_flag = 0;
    for(this loop takes a long time) { ... } 
    global_flag = 1;
    for(this loop takes a long time) { ... }
    global_flag = 0;

If it is trying to communicate something to another thread (or audio callback) with global_flag, it might fail miserably because the compiler is free to drop two of the assignments and place `global_flag = 0` wherever in the code. Using `volatile` should prevent this compiler optimization but it doesn't give any guarantees about cache coherency, etc.

As far as I know, using atomic load and store is the only way (apart from locks) to guarantee correctness and portability.

cesarb · on June 15, 2016

Elaborating on your example.

    my_buffer->something = 42;
    global_pointer = my_buffer;
    global_flag = 1;

Even if everything is declared `volatile`, in C or C++, the compiler or the processor can reorder the stores and reads, so that the reading thread can see the "global_pointer = my_buffer" before the "->something = 42". The only safe way to do it is to add the appropriate memory barriers on both the writing and reading sides, which force the compiler and the processor to not reorder the writes/reads.

    my_buffer->something = 42;
    write_barrier(); // Not the actual function; will vary depending on your environment.
    global_pointer = my_buffer;
    write_barrier(); // Not the actual function; will vary depending on your environment.
    global_flag = 1;

And on the reading side, the corresponding read barriers.

It's simpler to use atomic loads and stores or locks, since their implementation already has the required barriers with all the details (quick: what's the difference between an acquire and a release atomic access?).

(Note that this is different in Java, where `volatile` always implies a memory barrier.)

exDM69 · on June 15, 2016

> what's the difference between an acquire and a release atomic access

ACQUIRE prevents reordering any loads and stores from after the barrier to before. RELEASE is the opposite, no load or store before it may happen after the barrier.

ACQUIRE is what you use when locking a spinlock, RELEASE is when you unlock.

Put these incorrectly and you're not guaranteed to have all loads and stores happening when the spinlock is locked, and your program is no longer guaranteed to behave as expected.

Bromskloss · on June 15, 2016

I thought _tearing_ referred to a value being read while in the midst of being updated, so that, for example, some of the read bits comes from the old value and some from the new value.

barrkel · on June 15, 2016

It does. It's a distinct issue and still very relevant if you're using shared memory without locks to communicate between threads.

_pmf_ · on June 15, 2016

I think the extreme case would be a packed struct where a multibyte member happens to span cache lines. This would not be causedby compiler optimization, but by manual deoptimization.