
Do you think we'll see a lot of HPC apps using C++11 for parallelism within a node, or will they stick to MPI for that? Most of the apps I've seen are MPI-only or OpenMP + MPI. I'm not convinced C++11 will be very relevant to them, given the minor overhead of MPI within a node (and the productivity savings of having only one API for parallelism).
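
For concreteness, by "C++11 for parallelism within a node" I mean something like the toy sketch below (my own illustration, not from any real app): work split across std::thread workers instead of one MPI rank per core.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Split a reduction across hardware threads using only C++11 facilities.
    double parallel_sum(const std::vector<double>& data) {
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(nthreads, 0.0);
        std::vector<std::thread> workers;
        std::size_t chunk = data.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
            workers.emplace_back([&, t, begin, end] {
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0.0);
            });
        }
        for (auto& w : workers) w.join();  // wait for every worker
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }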



MPI is a messaging interface rather than a parallelism API. Even in C++, most of the parallelism constructs are something of a straw man, because many high-performance computing codes are written as single-threaded processes locked to individual cores and communicating over a messaging interface of some type. The parallelism lives at a higher level than either the messaging interface or the code itself. Many supercomputing platforms support MPI, but not all of them do.
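
To make that concrete, here's a minimal sketch of the one-single-threaded-process-per-core model (illustrative only): every rank owns its data, and all sharing is an explicit message.

    #include <mpi.h>
    #include <cstdio>

    // One single-threaded process per core; all sharing is an explicit message.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = static_cast<double>(rank);  // data this rank owns
        double from_left = 0.0;
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        // Shift data around a ring of ranks; MPI_Sendrecv avoids deadlock.
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                     &from_left, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        std::printf("rank %d got %g from rank %d\n", rank, from_left, left);
        MPI_Finalize();
        return 0;
    }

Launched with the launcher's core-binding option and one rank per core, each process stays pinned to its core and nothing is shared except through messages, whether the peer is on the same chip or across the network.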

The practice of a single process locked to each core, communicating over a messaging interface, has trickled down into more general massively distributed systems work because it has very good properties on modern hardware. You end up doing a fair amount of functional programming in this model because multiple tasks are managed via coroutines and other lightweight event models. This architecture is very easy to scale out because it treats every core -- on the same chip, the same motherboard, or across the network -- as a remote resource that has to be messaged.

MPI has one significant problem for massively parallel systems: it has tended to be brittle when failures occur, and on sufficiently large systems failures are routine. There are ways to work around this, but it is not the most resilient basis for communication in extremely large systems. At the high end of HPC, MPI and similar interfaces are commonly used, but many of the next-generation non-HPC systems operating at a similar scale use custom network processing engines built on top of IP that give more fine-grained control over network behavior and semantics. This approach is no faster than MPI (often slower) and tends to be a bit more complex, but it allows robustness and resilience to be built in at a lower level. MPI was designed around a set of assumptions that hold for many classic supercomputing applications but don't match many current use cases.


The major thing that MPI did right, and that almost all other models have done wrong, is library support. Things like attribute caching on communicators are essential to me as a parallel library developer, but look superfluous in the simple examples and for most applications.
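
For readers who haven't seen it, attribute caching looks roughly like the sketch below (simplified; PlanCache and the function names are made up): a library stashes its per-communicator state on the communicator itself and finds it again later, without the application having to carry that state around.

    #include <mpi.h>
    #include <cstdio>

    // Hypothetical per-communicator state a library might want to cache,
    // e.g. a precomputed communication schedule.
    struct PlanCache { int prepared; };

    static int plan_keyval = MPI_KEYVAL_INVALID;

    void library_setup(MPI_Comm comm) {
        if (plan_keyval == MPI_KEYVAL_INVALID)
            MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                                   &plan_keyval, nullptr);
        static PlanCache cache = {1};                  // simplified: real code allocates per communicator
        MPI_Comm_set_attr(comm, plan_keyval, &cache);  // stash it on the communicator
    }

    void library_call(MPI_Comm comm) {
        PlanCache* cache = nullptr;
        int found = 0;
        // Retrieve the cached state without the application passing it in.
        MPI_Comm_get_attr(comm, plan_keyval, &cache, &found);
        std::printf("plan %s on this communicator\n",
                    (found && cache->prepared) ? "found" : "missing");
    }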

The other thing that is increasingly important in the multicore CPU space is memory locality. It's vastly more common to be limited by memory bandwidth and latency than by the execution unit. When we start analyzing approaches with a parallel complexity model based on memory movement instead of flops, the separate address space in the MPI model doesn't look so bad. The main thing that it doesn't support is cooperative cache sharing (e.g. weakly synchronized using buddy prefetch), which is becoming especially important as we get multiple threads per core.
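
Back-of-envelope version of the bandwidth point (numbers are illustrative, not measured):

    #include <cstddef>
    #include <vector>

    // STREAM-style triad: a[i] = b[i] + s * c[i].
    // Each iteration does 2 flops but moves 24 bytes (read b, read c, write a),
    // roughly 0.08 flop/byte.  At, say, 20 GB/s of memory bandwidth per core,
    // that caps the loop near 1.7 Gflop/s, far below what the execution units
    // could deliver, so the memory system sets the pace.
    void triad(std::vector<double>& a, const std::vector<double>& b,
               const std::vector<double>& c, double s) {
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = b[i] + s * c[i];
    }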

As for fault tolerance, the MPI forum was not happy with any of the deeper proposals for MPI-3. They recognize that it's an important issue and many people think it will be a large enough change that the next standard will be MPI-4. From my perspective, the main thing I want is a partial checkpointing system by which I can perform partial restart and reattach communicators. Everything else can be handled by other libraries. My colleagues in the MPI-FT working group expect something like this to be supported in the next round, likely with preliminary implementations in the next couple years. For now, there is MPIX_Comm_group_failed(), MPIX_Comm_remote_group_failed(), and MPIX_Comm_reenable_anysource().
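
Those MPIX_* calls come from the early run-through-stabilization prototypes and their exact signatures are implementation-specific, so here is only the portable groundwork they build on (a minimal sketch, run with at least two ranks): switch MPI from abort-on-error to returning error codes, then react when an operation fails.

    #include <mpi.h>
    #include <cstdio>

    // Portable groundwork for fault handling: by default MPI aborts the whole
    // job on any error, so step one is asking for error codes back and checking
    // them.  (The MPIX_* stabilization calls would then let a library query the
    // failed group and re-enable wildcard receives; their signatures are
    // prototype-specific, so they are not shown here.)
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf = 3.14;
        if (rank == 1)
            MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            int rc = MPI_Recv(&buf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rc != MPI_SUCCESS) {
                char msg[MPI_MAX_ERROR_STRING];
                int len = 0;
                MPI_Error_string(rc, msg, &len);
                std::fprintf(stderr, "recv failed: %s\n", msg);
                // A fault-aware library would decide here whether to shrink,
                // restart, or reattach communicators.
            }
        }

        MPI_Finalize();
        return 0;
    }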


Any chance you can comment on the fault tolerance proposal in MPI-3?

Also, do you have any examples of custom network engines built on top of IP?



