The problems with traditional multi-threaded concurrency go beyond complexity and safety. It also delivers relatively poor performance on modern hardware: shared structures have inherently poor locality, and context switching causes unnecessary data motion down in the silicon. Whether "message passing" avoids this depends on the implementation.
Ironically, the fastest model today on typical multi-core silicon looks a lot like the old-school single-process, single-core event-driven models you used to see when servers actually had a single core and no threads. One process per physical core, locked to the core, that has complete ownership of its resources. Other processes/cores on the same machine are logically treated little differently than if they were on another server. As a bonus, it is very easy to distribute software designed this way.
People used to design software this way before multithreading took off, and in the high-performance computing world they still do, because it has substantially higher throughput and better scalability than either lock-based concurrency or lock-free structures. It has been interesting to see it make a comeback as a model for high-concurrency server software, albeit with some distributed-systems flavor that was not there the first time around.
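For the curious, here is roughly what the pinning looks like. This is just a toy sketch in C++ (the names and layout are mine, not from any particular engine), and it pins threads rather than separate processes to keep it short; it also pins one worker per logical core for brevity. The idea is the same either way: each worker is glued to a core and owns its data outright.

```cpp
// Toy sketch of "one pinned worker per core, owning its own state".
// Assumes Linux + pthreads; build with: g++ -O2 -pthread pin.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

struct Worker {
    unsigned core;                 // the core this worker is pinned to
    std::vector<int> private_data; // owned exclusively by this worker, never shared

    void run() {
        // Pin the current thread to its core so the OS never migrates it.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // Event loop stand-in: all work on private_data happens here with no
        // locks, because no other thread ever touches it.
        private_data.assign(1000, 0);
        for (int& x : private_data) ++x;
        std::printf("core %u done\n", core);
    }
};

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<Worker> workers(cores);
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < cores; ++c) {
        workers[c].core = c;
        threads.emplace_back(&Worker::run, &workers[c]);
    }
    for (auto& t : threads) t.join();
}
```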
Shared-nothing is great when you can do it. But sometimes the cost of copying is too high, and that's what shared memory is for.
Take, for example, a simple texturing fragment shader in GLSL. You're not going to copy the entire texture to every single GPU unit; it might be a 4096x4096 texture you're rendering only a dozen pixels of. Rather, you take advantage of the caching behavior of the memory hierarchy to have each shading unit only cache the part it needs. This is what shared memory can do for you: it enables you to use the hardware to dynamically distribute the data around to maximize locality.
I did not mean to imply that you are giving every process a copy of all the data. The main trick is decomposing the application, data model, and operations such that every process may have a thousand discrete and disjoint shards of "stuff" that it shares with no other process. The large number of shards per process means that average load across shards will be relatively balanced. The "one shard per server/core" model is popular but poor architecture precisely because it is expensive to keep balanced.
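A toy illustration of the layout I mean (all names and numbers hypothetical): with on the order of a thousand shards per core and a fixed key-to-shard-to-core mapping, no two cores ever own the same data, and load stays statistically balanced even when individual shards run hot or cold.

```cpp
// Sketch of "many disjoint shards per core" with a static ownership mapping.
#include <cstdio>
#include <functional>
#include <string>

constexpr unsigned kCores = 16;
constexpr unsigned kShardsPerCore = 1024;           // ~1000 shards per core
constexpr unsigned kShards = kCores * kShardsPerCore;

// Every key maps to exactly one shard, and every shard is owned by exactly
// one core, so two cores never touch the same data.
unsigned shard_of(const std::string& key) {
    return std::hash<std::string>{}(key) % kShards;
}

unsigned owner_core(unsigned shard) {
    return shard % kCores;  // static assignment; rebalancing would remap shards
}

int main() {
    std::string key = "user:42";
    unsigned s = shard_of(key);
    std::printf("key %s -> shard %u owned by core %u\n",
                key.c_str(), s, owner_core(s));
}
```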
However, in these models you rarely move data between cores because it is expensive, both due to NUMA and cache effects. Instead, you move the operations to the data, just like you would in a big distributed system. This is the part most software engineers are not used to -- you move the operations to the threads that own the data rather than moving the data to the threads (traditional multithreading) that own the operations. Moving operations is almost always much cheaper than moving data, even within a single server, and operations are not updatable shared state.
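As a rough sketch of what "moving the operation to the data" can look like (hypothetical code, not any specific engine; a real implementation would typically use per-core lock-free rings rather than a mutexed queue): the thread that owns the data runs an event loop over a mailbox, and everyone else posts operations into it instead of touching the data.

```cpp
// Minimal sketch of "ship the operation to the core that owns the data".
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> ops;
    bool done = false;

    void post(std::function<void()> op) {          // called by other cores
        { std::lock_guard<std::mutex> lk(m); ops.push(std::move(op)); }
        cv.notify_one();
    }
    void run() {                                   // owning thread's event loop
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !ops.empty() || done; });
            if (ops.empty()) return;
            auto op = std::move(ops.front());
            ops.pop();
            lk.unlock();
            op();                                  // runs on the data owner's thread
        }
    }
    void stop() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }
};

int main() {
    long counter = 0;                              // owned solely by the worker thread
    Mailbox box;
    std::thread owner([&] { box.run(); });

    // Other threads never touch `counter`; they send the operation instead.
    box.post([&] { counter += 10; });
    box.post([&] { std::printf("counter = %ld\n", counter); });

    box.stop();
    owner.join();
}
```

Note that the only shared state here is the mailbox itself; the data (`counter`) is read and written by exactly one thread, so the hot path needs no locks at all.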
This turns out to be a very effective architecture for highly concurrent, write-heavy software like database engines. It is much faster than, for example, the currently trendy lock-free architectures. Most of the performance benefit comes from much better locality and fewer stalls and context switches, but it has the added benefit of implementation simplicity, since your main execution path is not sharing anything.
Don't you lose all of your performance gains in RPC overhead? And how do you avoid latency in the data thread? (Do you have one thread per lockable object? Won't that be more than one thread per core?) These are the reasons lock-free is so popular.
> Don't you lose all of your performance gains in RPC overhead?
If one did, then why would anyone who knew what they were talking about (or even just knew how to write and use a decent performance test) advocate this method? :)
In a database engine, you ultimately need to move some data around, too. After all, you can't move a network connection to five threads at once, and not all aggregations can be decomposed into pieces that return small amounts of information. Sharing memory often brings quite a substantial speedup.
Agreed. But let's say each of these processes bound to a core needs to log important information. That logging should probably be done on a separate thread. So it is really just a matter of using multithreading where appropriate and not using it just to use it.
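Right, and the usual shape of that is something like the following (a toy sketch with hypothetical names): the pinned worker's hot path just enqueues the line and returns, and a single background thread does the slow write.

```cpp
// Sketch of an async logger: the hot path never blocks on log I/O.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> lines_;
    bool stopping_ = false;
    std::thread writer_;

public:
    AsyncLogger() : writer_([this] { drain(); }) {}
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lk(m_); stopping_ = true; }
        cv_.notify_one();
        writer_.join();                            // flush remaining lines on shutdown
    }
    void log(std::string line) {                   // cheap: enqueue and return
        { std::lock_guard<std::mutex> lk(m_); lines_.push(std::move(line)); }
        cv_.notify_one();
    }

private:
    void drain() {                                 // the slow write happens here
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !lines_.empty() || stopping_; });
            if (lines_.empty()) return;
            std::string line = std::move(lines_.front());
            lines_.pop();
            lk.unlock();
            std::fprintf(stderr, "%s\n", line.c_str());
        }
    }
};

int main() {
    AsyncLogger log;
    log.log("worker on core 3: flushed 128 pages");  // returns immediately
}
```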