This Is Why They Call It a Weakly-Ordered CPU (preshing.com)
239 points by octopus on Oct 19, 2012 | 39 comments



Nice blog post, though I personally prefer the ridiculousfish post [0] he links to in the end, that one's an instant classic.

He mentions Windows/x86 a couple of times. I only wish it were as simple as "this platform does not reorder." Having done low-level, heavily-multithreaded work on Windows for years: it'll behave like a strongly-ordered architecture 999 times out of 1,000 (or more). Then it'll bite you in the ass and do something unexpected. Basically, if you're doing your own synchronization primitives on x86, you have to pretty much rely on visual/theoretical verification, because tests won't error out w/ enough consistency. I've run a test (trying to get away w/ not using certain acquire/release semantics) for an entire week, only to have it error out at the last second (x86_64). Other times, I've shipped code that had been tested and vetted inside out for months, only to get the weirdest bug reports 3 or 4 months down the line in the most sporadic cases.

0: http://ridiculousfish.com/blog/posts/barrier.html


Well, to be specific about x86:

Reads always happen in the original order.

Writes always happen in the original order.

Reads can be moved in front of writes.
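
That last one is the classic demonstration. A minimal sketch (plain ints, so a real test would also have to suppress compiler reordering, and formally this is a data race):

    #include <cstdio>
    #include <thread>

    int X, Y, r1, r2;

    void thread1() { X = 1; r1 = Y; }  // store X, then load Y
    void thread2() { Y = 1; r2 = X; }  // store Y, then load X

    int main() {
        for (int i = 0; i < 1000000; ++i) {
            X = Y = 0;
            std::thread t1(thread1), t2(thread2);
            t1.join(); t2.join();
            // If each core's load moved ahead of its store,
            // both threads read the stale zero:
            if (r1 == 0 && r2 == 0)
                std::printf("reordered on iteration %d\n", i);
        }
    }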


I work for Intel. This is not correct. A lot depends on your cache type. The two basic ones are uncacheable and write-back.

What you wrote is true for UC. For WB, reads can happen in any order (especially due to cache pre-fetchers). Writes always happen in program order, unless you are in other cache types such as write-combining. WC is mainly used for graphics memory-mapped pixmaps, where the order doesn't matter.

But don't let this scare you too much. From the viewpoint of a single CPU, everything is in order. It's only when you look at it from the memory bus point-of-view that things get confusing.

Unless we are dealing with a memory-mapped IO device that has read side-effects, in which case you need to carefully choose a cache type.


Well, yes; as far as WB goes, I meant from the view of the program. Modern CPUs do all kinds of terrifying rewriting and speculation internally.


How do I control my cache type?


I don't think you can from a user-mode program. If you are dealing with normal memory, it will typically be WB, and that is the only thing most user-mode programs will encounter.


__builtin_prefetch
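
(That's a GCC/Clang builtin, and it only hints the hardware prefetcher; it doesn't actually change the page's cache type, which the OS sets via PAT/MTRRs. A sketch:)

    void warm(const int* p) {
        // Prefetch *p for reading (0) with high temporal locality (3).
        // Purely a hint to the prefetcher; no ordering or cache-type effect.
        __builtin_prefetch(p, 0, 3);
    }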


Yeah, it's that last one: "can" but usually won't.


I'm oddly uncomfortable with this article. It reinforces the idea of Memory Ordering as voodoo, rather than as something that can (and needs to be!) understood to properly write low level multicore code. Neither it nor the linked articles go into any details of how memory and cores actually interact, and without these details it would be very hard to get from "this seems to work" to "this is bug free".

You can try running the sample application on any Windows, MacOS or Linux machine with a multicore x86/64 CPU, but unless the compiler performs reordering on specific instructions, you'll never witness memory reordering at runtime.

It may just be poor wording, but I don't think this sentence makes sense -- it conflates compiler optimizations with memory reordering, and implies that this is dependent on the choice of operating system. While the author probably didn't mean this, it's clear from some of the comments in this thread that this is causing confusion for readers. Worse, it's just not true --- while this particular example might not cause problems, memory reordering is still an issue that needs to be dealt with on x86.

Analogies can be helpful for intuition, but I think this is a case where one really needs to understand what's happening under the hood. Treating the CPU as a black box is not a good idea here, and test-driven development is probably not a good approach to writing mutexes. Calling attention to the issue is great, but this is an area where you really want to know exactly what guarantees your processor provides, rather than trying things until you find something that seems to work.


> It reinforces the idea of Memory Ordering as voodoo

I thought there were some very solid bits in there, like the reminder that memory barriers always come in pairs.
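
A minimal C++11 sketch of that pairing (names here are illustrative, not from the article):

    #include <atomic>
    #include <cstdio>

    int g_payload;                  // plain data, "published" via g_ready
    std::atomic<int> g_ready(0);

    void producer() {
        g_payload = 42;
        g_ready.store(1, std::memory_order_release);  // release half...
    }

    void consumer() {
        if (g_ready.load(std::memory_order_acquire))  // ...pairs with acquire
            std::printf("%d\n", g_payload);           // guaranteed to see 42
    }

Drop either half of the pair and the guarantee disappears.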

> it conflates compiler optimizations with memory reordering

Well, the code as written is subject to reordering both by the compiler and the CPU. The author is just being honest that this isn't a perfectly crafted example. You could make a version not subject to reordering by the compiler, either by marking variables as volatile or by inserting "compiler barriers" into the code (e.g., empty volatile asm statements). I don't think this is conflation.
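
For instance, with GCC/Clang syntax:

    // A compiler barrier: emits no machine instruction, but the
    // compiler may not move memory accesses across it. It does
    // nothing about reordering by the CPU itself.
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")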

> Worse, it's just not true --- while this particular example might not cause problems,

I think what happened is that you interpreted the author's statement as one about memory reordering in general on the x86 architecture, but I think the author is referring to this specific application, which makes the statement true. (The sentence you quoted is bracketed by two sentences which are more explicitly about this demo application, but it's not explicit in the sentence you quoted.)

> memory reordering is still an issue that needs to be dealt with on x86.

The article links to another post by the same author with a similar experiment demonstrating reordering on x86. The example is a lot harder to follow, though, since x86 only allows reordering of stores and loads relative to each other.

http://preshing.com/20120515/memory-reordering-caught-in-the...

> test-driven development is probably not a good approach to writing mutexes

The experiment is there to show people how reordering can happen; it serves an educational purpose. Thermite is a bad way to heat your home, but it's a good experiment to demonstrate exothermic reactions.

Yes, the article could be written better. But there are so few people who write any articles at all on this subject, I'm glad to be able to read it.


I agree --- definitely better to have a possibly flawed article to discuss, rather than no article at all. And as a whole, the series has lots of great information.


> It may just be poor wording, but I don't think this sentence makes sense

I can see how you might have interpreted that sentence differently. I just tweaked it a little in the post, mainly changing "the sample" to "this sample". Hopefully it's precise enough for most now.


Thanks for the update, and for writing such articles in the first place!


This is why I don't like to use shared memory. It's hard to get right, for a variety of reasons.

At a low level, to try and make this work, you need to do more than worry about a mutex. You need the CPU's cache to be out of the way, the memory area protected, AND the memory bus transactions to be completed!

So... if C++11 works, this is what it must really do (some of this is handled by the hardware, but these all have to happen... and if there's a hardware bug, you need a software workaround):

1) Lock the memory area to the writing CPU (this could be a mutex with a memory range, but the safest, and slowest, option is to disable interrupts while you dick with memory. That's unlikely to be available at a high level).

2) Write the memory through the cache to the actual memory address, OR track the dirty bit to make sure CPU2 re-fetches the memory from CPU1's cache. AND go over to CPU2 and flip the dirty bit if it has this bit of memory in its cache...

3) Wait for all the memory to be written by the bus. Depending on the implementation of the bus, it's entirely possible to have CPU1's memory writes heading into memory, but not yet committed, when CPU2's request arrives, giving CPU2 a copy of old data! One way to try and fix this is to have CPU1 read-through-cache to the actual memory location, which the bus will flush correctly since the request is coming from the same device that did a previous write. (I used to do embedded programming and had to use this trick at times; it's possible this is the only bus that worked like this, YMMV.)

4) Release the locking mechanism and hope it's all correct.

Realizing that a '1 in a million' chance of failure probably equates to months between failures at most, you see why bugs with this stuff appear all the time. If you MUST use shared memory as your interface for some reason, you better be really careful. And maybe look to move to a different method ASAP.

Edit: changed memory controller to bus, oops
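
For what it's worth, here is roughly how those four steps map onto C++11, assuming hardware cache coherence (a sketch with illustrative names, not the standard's wording):

    #include <mutex>

    std::mutex m;        // steps 1 and 4: the locking mechanism
    int shared_value;

    void writer(int v) {
        std::lock_guard<std::mutex> lock(m);  // acquire barrier on lock
        shared_value = v;   // steps 2-3: MESI-style cache coherence does
                            // the dirty-line bookkeeping in hardware
    }                       // release barrier on unlock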


For those seeking more detail, Linux has a great reference on using memory barriers: http://www.kernel.org/doc/Documentation/memory-barriers.txt


Great article. But the ASCII art in it!


This is a really interesting article. Multi-core ARM seems to be the first really mainstream processor architecture that behaves this way. There have been others, like Alpha, but none have achieved the ubiquity that multi-core ARM has achieved. I suspect a side-effect of this is that many of the "threads are hard" effects that are hidden by x86 will come back to bite a lot of programmers. I think we are going to be seeing a lot more "threads are hard" and "threads are weird" posts in the near future, and hopefully better learning material about threading issues in the longer term. Even more hopefully, this might drive more research and development into abstractions for providing parallelism and concurrency in ways that hide the complexity of threads.


PowerPC has a weakly ordered memory model, so older Mac programmers are no strangers to this sort of stuff.

There have been many multi-core ARM devices out in both the Android and iOS worlds for a few years, and I haven't heard too many complaints. I think most good high-level programmers know that they need to use the high-level locks provided by the SDK, while anyone writing code that actually needs to deal with memory ordering probably already knows to use barriers and is aware of the pitfalls of different ISAs.


Same for game developers -- the Xbox 360 and PS3 were both multi-core PowerPC architectures (of sorts) that required developers to pay close attention to their barriers when writing concurrent code.


> so older Mac programmers are no strangers to this sort of stuff.

And embedded software programmers. There is still a market for chips that don't consume more power than an electric blanket.


You don't hear complaints because no one is going to find a bug caused by weak memory ordering without already knowing about memory reordering. In real buggy code, failures are so rare that most developers will never personally see one. Even the article's contrived case, which should fail a hundred times more often than real code does, only fails one time in a thousand. And if the bug only causes a slight graphical glitch, well, the developer will never notice it on their own. Or they'll explain it away as a hundred other things.

I've seen several people who thought they were smart write code that needed memory barriers and never realize it. Or think that volatile made things safe. Getting them to accept that their code was buggy was hard.
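
The usual shape of that bug, sketched (this "works" on x86 and can break on ARM/PowerPC; names are illustrative):

    // volatile stops the *compiler* from reordering these two stores,
    // but on a weakly-ordered CPU another core may still observe
    // g_ready == 1 before it sees g_payload == 42. You need a release
    // barrier (or std::atomic), not volatile.
    volatile int g_payload = 0;
    volatile int g_ready = 0;

    void producer() {
        g_payload = 42;
        g_ready = 1;
    }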


> PowerPC has a weakly ordered memory model

Right, I forgot about PowerPC.

> I think most good high-level programmers know that they need to use the high-level locks provided by the SDK, while anyone writing code that actually needs to deal with memory ordering probably already knows to use barriers and is aware of the pitfalls of different ISAs.

Agreed. The point I was trying to make is that, as these platforms become ubiquitous, they'll start attracting the "average" developer more and the "first mover" developers less. I can't prove that "more likely to move onto a new platform" is correlated with "more likely to be aware of threading pitfalls", but I suspect it to be true.


But old PowerPC macs weren't typically multicores, and even when they had multiple CPUs, people rarely programmed threaded software for them. Memory ordering is completely transparent to a single thread -- it's only when you add more of them that you start to have problems.


http://en.wikipedia.org/wiki/List_of_Macintosh_models_groupe...

All of the PowerMac G4s and PowerMac G5s from 1999-2006 spanning 350MHz to 2.7GHz were sold in multiprocessor configurations.

Maybe most programmers never worried about writing threaded software for them, but they certainly weren't uncommon.


Only the most expensive PowerMacs had multiple processors; iMacs, PowerBooks, and iBooks vastly outnumbered the top-end PowerMacs. Which also meant there wasn't much point in writing heavily threaded programs unless specifically targeting high-end users.


Valgrind has tools that supposedly can find certain classes of load/store race conditions. I've never used them in anger, so I can't vouch for them, but it would be interesting to do a test on the example in the article.

Memcheck is certainly a must-have tool for finding heisenbugs in low-level code - it would be wonderful to have an equally effective solution for race conditions.

http://valgrind.org/docs/manual/hg-manual.html

http://valgrind.org/docs/manual/drd-manual.html
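
Invocation is simple; assuming the article's demo compiles to a binary named ordering (a hypothetical name):

    valgrind --tool=helgrind ./ordering
    valgrind --tool=drd ./ordering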


I have very successfully used ThreadSanitizer [1] on Linux to find data races in a large C++ codebase. The Linux and OSX versions are based on Valgrind/Helgrind.

[1] http://code.google.com/p/data-race-test/wiki/ThreadSanitizer


Do the memory barriers in the ARM architecture also flush the caches? In Intel x86 architectures the hardware handles the coherency between all the caches, so a CPU core can directly read from the cache line of another core if it finds its own cache line to be dirty. Does this happen in ARM also?


Yay memory semantics!

A classic case where this sort of problem bit Java in the ass: the "double-checked locking pattern" for initializing Singletons. http://www.ibm.com/developerworks/java/library/j-dcl/index.h...

I'm not sure if this was ever fixed / improved enough to allow the programmer to make this work.


> A classic case where this sort of problem bit Java in the ass

I'm not sure what you mean by "bit Java in the ass". Perhaps pre-Java 5 (or pre-JSR 133, more accurately) the reason this was broken was poorly defined. In modern Java it's quite clear that the Java memory model implies that the double-checked-locking idiom doesn't work.

See http://jeremymanson.blogspot.com/2008/05/double-checked-lock... and http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLo... for more information. Making the variable volatile fixes it.
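
For the curious, the corrected idiom in C++11 terms, since Java's volatile gives equivalent acquire/release guarantees (a sketch):

    #include <atomic>
    #include <mutex>

    struct Singleton { /* ... */ };

    std::atomic<Singleton*> g_instance(nullptr);
    std::mutex g_mutex;

    Singleton* getInstance() {
        Singleton* p = g_instance.load(std::memory_order_acquire);
        if (p == nullptr) {                       // first check (no lock)
            std::lock_guard<std::mutex> lock(g_mutex);
            p = g_instance.load(std::memory_order_relaxed);
            if (p == nullptr) {                   // second check (locked)
                p = new Singleton;
                g_instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }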

> I'm not sure if this was ever fixed / improved enough to allow the programmer to make this work.

I don't know what the value of "making it work" would be. Uncontended access to a volatile is very cheap (see http://brooker.co.za/blog/2012/09/10/volatile.html for some timings on x86), and in very performance sensitive code you'll probably not want to do lazy allocation anyways.

The well defined and well-understood memory model in modern Java is one of the real strengths of the language (and platform).


The "fix" is to define an enum with one member named INSTANCE, if you need a Singleton. -- Effective Java, 2/ed, bu Josh Bloch


Which only works if the thing you need doesn't extend any existing classes and only implements interfaces.


And this is why you don't implement your own mutexes and instead use the ones provided by the OS.


A question: Why is it the CPU architecture that is weakly ordered, if it's the compiler that is reordering the statements? Couldn't you have a compiler on a weakly ordered arch that preserved order, and a compiler on x86 for example that could reorder your statements?

Isn't it the language spec / compiler that is in charge of this, rather than the CPU? I'd like to know more about this.


Both the compiler and the CPU can reorder code, as long as the results don't change from the perspective of the code that is executing.

For example, a compiler could take the following code:

    global_x = 1;
    global_y = 2;
    global_x = 3;
And change it into:

    global_x = 3;
    global_y = 2;
Compilers on all platforms reorder statements like this. You can prevent this by making your variables `volatile`, but 99% of the time when you write the word "volatile" in your code you're making a mistake.


In this case, the compiler could have re-ordered the instructions, but he checked to ensure that it hadn't.


It's not just the compiler reordering statements. The CPU can reorder loads and stores that appear in the machine code.


This is not about the compiler reordering statements. This happens in hardware. If a core executes two store instructions, another core might see them in any order.


Great article. CPU reordering is an effect which makes it notoriously difficult to implement lock-free code correctly.



