> There are problems that exist in distributed system environments that do not exist in single-machine environments. Partial failure, concurrency, and latency are three problems that make distributed computing fundamentally different from local computing.
IMO this can frame our thinking too much. Concurrency and latency are problems in single systems as well. It would be great to have single systems that handle partial failure, can be updated incrementally, etc. Thinking of 'distributed computing' as a different domain than single system computing is not a useful long term approach. In the end very few systems are 'single machine' systems, if at all - machines are all connected to each other and participate in other, larger, systems.
> IMO this can frame our thinking too much. Concurrency and latency are problems in single systems as well.
Right, but every step away from the CPU die magnifies the latency and failure problems by orders of magnitude. Memory access has latency many orders of magnitude higher than CPU registers and is more susceptible to random bit flips. Another thread's memory is orders of magnitude more expensive still due to cache-coherence protocols, and more prone to failures necessitating retries due to contention. Another server's data is yet more orders of magnitude away.
At some levels you can ignore latency and failures as practically nonexistent, and as you say, in concurrency you can't ignore them, but failures at this level are transient. Failures in distributed systems typically persist because the higher latency makes retries much more costly, so we look for other shortcuts that typically aren't as effective as the solutions employed in concurrency.
Some classes of problems are masked on a single system. For instance, failure of the networking card doesn't preclude the processes from talking to each other. A process that's running on a system that has hung can't make any forward progress so it doesn't care if another process it's talking to is in the same state.
Our desktop computers look more and more like clusters in their own right (how many SoCs are inside of your computer right now?). There's a lot of talk lately about having high performance and low power CPUs in the same machine working in tandem.
I figure at some point we will start formally treating desktops and laptops like small clusters, at which point these distinctions might start to disappear.
> Some classes of problems are masked on a single system.
Yes, but you can also have IPC failures, syscall hangs, single-thread hangs, etc. Instead of treating these as special cases, they could be handled by the generic patterns that apply to 'distributed computing', if the system were designed similarly.
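A minimal sketch of that idea (names and policy are illustrative, not any particular framework): wrap a local call with the same timeout-and-retry discipline you'd apply to a remote one, so a hung syscall or IPC partner surfaces as an ordinary timeout.

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout=1.0, retries=3):
    """Apply a distributed-systems-style timeout/retry policy to a
    *local* call. Note this sketch can't forcibly cancel a truly hung
    Python thread; it only stops waiting on it."""
    for attempt in range(retries):
        # Fresh executor per attempt so a hung task doesn't block a retry.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            continue  # retry, exactly as we would with a remote peer
        finally:
            pool.shutdown(wait=False)
    raise TimeoutError(f"local call failed after {retries} attempts")

result = call_with_timeout(sum, [1, 2, 3])
print(result)  # 6
```

The point isn't the mechanism, which is crude here, but that the failure model is uniform: the caller no longer needs to know whether the slow party is a thread, a process, or a machine.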
This looks like a reasonably complete survey. However, one important topic is missing. That is the strong consistency algorithms used by databases like Spanner and FaunaDB.
I don't see anything in there even about basic building blocks like Raft. Maybe I'm missing it, or maybe there's room for someone to write another chapter.
... looks awesome, and something that seems only possible with dynamic code... or at least code that can introspect its own AST.
EDIT: the cool part about it, I think, is that you don't have to shoehorn your algorithm into any particular model like map reduce or something else. You just write idiomatic Python code and... magic
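To make the "introspect its own AST" point concrete, here is a toy sketch: a framework can parse a function's source into an AST and discover data-parallel structure (like a comprehension) without the author writing to any special model. The source string is inlined here for self-containment; in practice it would come from `inspect.getsource(fn)`.

```python
import ast

# Ordinary, idiomatic Python the user would write:
source = """
def double_all(xs):
    return [x * 2 for x in xs]
"""

# A framework can walk the AST and notice the parallelizable shape.
tree = ast.parse(source)
node_types = {type(node).__name__ for node in ast.walk(tree)}

print("FunctionDef" in node_types)  # True
print("ListComp" in node_types)     # True: a candidate for distribution
```

A system that sees a `ListComp` over an iterable can, in principle, rewrite it into a distributed map, which is the "magic" part: the user never shoehorned anything into a MapReduce-shaped API.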
I am a bit torn with this, because while I love people reading and then writing up things, there are so many changes I think they should make but I don't want to just barge in and make pull requests.
E.g., they largely miss DryadLINQ (disclosure: not an author, but worked with them), which I would rate as the second most important paper after MapReduce in the "batch compute" space. It shows up only as what you should use to write Dryad programs, because Dryad's programming model "is less elegant than MapReduce"; harsh.
The streaming section (disclosure: Naiad co-author) dances around and then suggests that Google Dataflow is somehow a unification of previous work, which is... well, you can read the Google Dataflow paper
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
and do a search for "Naiad"; it's not even on their radar.
It feels a bit like the critiques are perhaps too-credulous readings of the related work sections of other papers.
I'm not sure of a great way to get to a consensus understanding of the position of these works in the historical record, or whether it is even possible with sensitive authors (hi), but it would be cool to have. Any recommendations would be great.
Well-researched and comprehensive. The survey of distributed languages includes not just Erlang and Akka, but also E and Cloud Haskell. The overview of CAP is deep and rigorous while not wasting words. This will be a great book once it's finished.
This should mention Microsoft Research's Concurrent Revisions [1]. I think it's by far the easiest abstraction to grasp for ordinary programmers, particularly if you've ever used version control. In that case, you already know how they work! Revisions scale from incremental, parallel and concurrent programming all the way up to distributed computation. Very neat work.