
There's zero benefit to just writing one byte. I'm having a hard time believing that a compiler capable of such analysis would use it for this purpose.



Conceivably there's a benefit when you can't load the value you want in one instruction (e.g. on many RISC architectures with fixed 32-bit instructions you can't encode a 32-bit immediate).

For example, on Alpha the most general way to load an arbitrary unsigned 32-bit value is

    LDAH reg, XXXX(r31)  ; upper 16 bits
    LDA  reg, YYYY(reg)  ; lower 16 bits
Suppose you have code that toggles some value in memory between, say, 65537 and 65538, e.g.

    x->y = 65537;
    // later in the same function
    x->y = (x->y == 65537 ? 65538 : 65537);
    // do something later assuming x->y is either 65537 or 65538
Conceivably, the compiler could emit

    LDAH r22, 0x0001(r31) ; r22 = 0x0000 0000 0001 0000
    LDA  r23, 0x0001(r22) ; r23 = 0x0000 0000 0001 0001
to materialize the first value, store r23 to shared memory and reload it later (keeping the high half alive in r22), and then for the toggle could emit

    ; reload r23 = x->y    ; r22 still holds 0x0000 0000 0001 0000
    BLBS r23, toggle_1     ; if (r23 & 1) goto toggle_1;
    LDA  r23, 0x0001(r22)  ; r23 = 0x0001 0000 + 1 = 65537
    BR   toggle_done       ; goto toggle_done;
    toggle_1:
    LDA  r23, 0x0002(r22)  ; r23 = 0x0001 0000 + 2 = 65538
    toggle_done:
    ; do something assuming r23 is either 65537 or 65538
since it knows the result of the original LDAH is shared by both of the new values, saving an instruction on each path.

And so, conceivably, if another thread changed the high 16 bits of x->y between the original store and the load before the comparison, we could observe a mixed value.

Of course, you'd have to create a condition where the compiler writes r23 back to memory and then loads it again but still assumes the loaded value is >= 65536 and < 131072.

Is this contrived? Absolutely. Is it _plausible_? Maybe.

Disclaimer: I've never used an Alpha in my life. This is all inferred from Raymond Chen's really amazing series about processors that Windows NT used to support. [0]

[0] https://devblogs.microsoft.com/oldnewthing/20170807-00/?p=96...


I think the point is that unless the language guarantees atomicity (and the compiler implements the guarantee correctly), counterintuitive behavior is permissible and somewhat common, and compiler behavior varies across and within CPU architectures.
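
To make "counterintuitive but permissible" concrete, here is a minimal C sketch (the `stop` flag and `worker` function are hypothetical, not from the article) of a classic legal transformation when nothing tells the compiler another thread may write the flag:

    int stop;  /* plain int: no atomicity or visibility guarantees */

    void worker(void) {
        /* Without atomics or volatile, the compiler may read `stop` once
           and rewrite this as `if (!stop) for (;;) ;`, because a data
           race on a plain int is undefined behavior in C11/C++11. */
        while (!stop) {
            /* do work */
        }
    }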


Actually, I agree with the general point; OP is right that you should use volatile. I was just a bit taken aback by the specific example above.

However, as PeCaN has pointed out, I'm probably suffering from tunnel vision from x86 monoculture.


For reference, volatile is still the wrong choice for thread safety. If you need concurrent accesses to have well-defined semantics, you must use atomics (note that those atomics might compile down to plain loads and stores if the platform supports it, but only the compiler is allowed to make that call).
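
A minimal C11 sketch of that last point (names are mine): the atomic version is well-defined under concurrent access, and on x86-64 the compiler will often lower it to the same plain mov a non-atomic assignment would produce.

    #include <stdatomic.h>

    atomic_int shared_flag;  /* C11 atomic integer */

    void publish(int v) {
        /* On x86-64 a relaxed store of an aligned 32-bit value typically
           compiles to a plain mov, but it is the compiler, not the
           programmer, that gets to decide that this is sufficient. */
        atomic_store_explicit(&shared_flag, v, memory_order_relaxed);
    }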


Volatile is, well, volatile. You can rely on little more than the compiler keeping every read and write (no hoisting or elision) and preserving source order among the volatile accesses themselves; it promises nothing about ordering relative to non-volatile accesses, and nothing about what the CPU reorders.

I believe MSVC used to insert memory barriers. Not sure if it still does. These days you are better off ignoring the keyword entirely and using the properly specified atomic facilities.
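
A sketch of the gap (hypothetical names, assuming no MSVC-style extensions): volatile keeps both stores in the emitted code, but only orders them relative to other volatile accesses, and adds no hardware fence.

    int payload;         /* ordinary object: the compiler may move it    */
    volatile int ready;  /* every access is emitted, in source order,
                            but only relative to other volatile accesses */

    void writer(void) {
        payload = 42;  /* may legally be sunk below the flag store ...   */
        ready = 1;     /* ... and even if it isn't, the CPU may reorder
                          the two stores; volatile implies no barrier    */
    }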


> I believe MSVC used to insert memory barriers. Not sure if it still does.

It does, but now there's a flag to disable them: /volatile:iso (https://docs.microsoft.com/en-us/cpp/build/reference/volatil...).
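
i.e., roughly (hypothetical file name; /volatile:ms is the legacy behavior with acquire/release semantics on volatile accesses, /volatile:iso the strict ISO one):

    cl /volatile:ms  example.c
    cl /volatile:iso example.c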


For the record I think the specific example from the OP is rather silly and of course never an optimization on x86(_64), but I'd never pass up a chance to talk about more fun architectures ;-)


I've seen variable-size accesses (sometimes 32-bit, sometimes bytewise) to memory locations (almost certainly field accesses via a dereferenced struct pointer, i.e. operator ->) when reverse engineering x86-64 code known to have been compiled with clang. However, that was specifically for bitwise access; I don't know whether the original source used C's bit-field syntax or explicit masking. I also don't recall whether it was just reads whose access size was inconsistent, or writes too. It's almost certainly a code-size optimisation when comparing against or storing immediate (compile-time constant) values.
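
For illustration, a hedged guess at the kind of source that produces this (hypothetical struct; I don't know what the original code looked like): clang is free to implement a bit-field write as a 1-byte read-modify-write when the field fits in one byte.

    struct flags {
        unsigned ready : 1;   /* first byte */
        unsigned kind  : 7;   /* still the first byte */
        unsigned count : 24;
    };

    void set_kind(struct flags *f) {
        /* May compile to a byte-sized RMW of the first byte only, a
           smaller encoding than a 32-bit load/mask/store sequence. */
        f->kind = 5;
    }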

Either way, if targeting a modern toolchain, I would strongly recommend using atomic_store()/atomic_load() from <stdatomic.h> for exchanging data between threads when "stronger" interlocked operations (CAS, exchange, atomic arithmetic, etc.) aren't required.
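
A minimal sketch of that baseline usage (names are mine, not from any particular codebase):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t shared_value;

    void producer(uint32_t v) {
        atomic_store(&shared_value, v);     /* seq_cst; never torn */
    }

    uint32_t consumer(void) {
        return atomic_load(&shared_value);  /* seq_cst; whole-value read */
    }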

Be very, very careful with the memory order argument when using the "explicit" versions of these functions. Apple subtly broke their IOSharedDataQueue (lock-free & wait-free circular buffer queue between kernel and user space, used for HID events, for example) in macOS 10.13.6 (fixed in 10.14.2 I think) because they used memory_order_relaxed for accessing the head and tail pointers when transitioning to the stdatomic API from OSAtomic. Presumably in a misguided attempt at "optimisation". The writer would only wake up the reader if the head/tail state before or after writing the message to the buffer indicated an empty queue. Unfortunately, due to memory access reordering in the CPU, the writer ended up sometimes seeing a state of memory that never existed on the reader thread, so the reader would wait indefinitely for the writer to wake it up after draining the queue.
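
Not Apple's actual code, but a minimal sketch of the shape of that bug (hypothetical names, single producer and consumer). The store-then-load handshake is Dekker-like, which is exactly where relaxed ordering falls apart and the seq_cst defaults matter:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static _Atomic uint32_t head;  /* advanced by the writer */
    static _Atomic uint32_t tail;  /* advanced by the reader as it drains */

    /* Writer side, after copying a message into slot `old_head`: decide
       whether the reader might be asleep and need a wakeup. */
    bool enqueue_needs_wakeup(uint32_t old_head) {
        atomic_store(&head, old_head + 1);  /* seq_cst by default */
        uint32_t t = atomic_load(&tail);    /* seq_cst by default */
        /* With memory_order_relaxed on these two accesses, the CPU may
           effectively perform the tail load before the head store; the
           writer then sees a stale tail, concludes the queue was not
           empty, skips the wakeup, and a reader that drained the queue
           and went to sleep in between waits forever. */
        return t == old_head;  /* queue was empty before this enqueue */
    }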

The use of volatile for exchanging data between threads, as recommended elsewhere in this thread, is iffy - make sure you know exactly what volatile means to the compiler you are using and to the version of the C/C++ standard you are targeting.



