> There are problems that exist in distributed system environments that do not exist in single-machine environments. Partial failure, concurrency, and latency are three problems that make distributed computing fundamentally different from local computing.
IMO this can frame our thinking too much. Concurrency and latency are problems in single systems as well. It would be great to have single systems that handle partial failure, can be updated incrementally, etc. Thinking of 'distributed computing' as a different domain than single system computing is not a useful long term approach. In the end very few systems are 'single machine' systems, if at all - machines are all connected to each other and participate in other, larger, systems.
> IMO this can frame our thinking too much. Concurrency and latency are problems in single systems as well.
Right, but every step away from the CPU die magnifies the latency and failure problems by orders of magnitude. Memory access has latency many orders of magnitude higher than CPU registers and is more susceptible to random bit flips. Another thread's memory is orders of magnitude more expensive still due to cache-coherence protocols, and more prone to failures necessitating retries due to contention. Another server's data is yet more orders of magnitude away.
At some levels you can ignore latency and failures as practically nonexistent, and as you say, in concurrency you can't ignore them, but failures at this level are transient. Failures in distributed systems typically persist because the higher latency makes retries much more costly, so we look for other shortcuts that typically aren't as effective as the solutions employed in concurrency.
Some classes of problems are masked on a single system. For instance, failure of the networking card doesn't preclude the processes from talking to each other. A process that's running on a system that has hung can't make any forward progress so it doesn't care if another process it's talking to is in the same state.
Our desktop computers look more and more like clusters in their own right (how many SoCs are inside of your computer right now?). There's a lot of talk lately about having high performance and low power CPUs in the same machine working in tandem.
I figure at some point we will start formally treating desktops and laptops like small clusters, at which point these distinctions might start to disappear.
> Some classes of problems are masked on a single system.
Yes, but you can also have IPC failures, syscall hangs, single-thread hangs, etc. Instead of treating these as special cases, they could be handled by the generic patterns that apply to 'distributed computing', if the system were designed similarly.
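A minimal sketch of that idea (names and policy are illustrative, not any particular framework): wrap a local call with the same timeout-and-retry discipline you'd apply to a remote one, so a hung syscall or IPC partner surfaces as an ordinary timeout.

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout=1.0, retries=3):
    """Apply a distributed-systems-style timeout/retry policy to a
    *local* call. Note this sketch can't forcibly cancel a truly hung
    Python thread; it only stops waiting on it."""
    for attempt in range(retries):
        # Fresh executor per attempt so a hung task doesn't block a retry.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            continue  # retry, exactly as we would with a remote peer
        finally:
            pool.shutdown(wait=False)
    raise TimeoutError(f"local call failed after {retries} attempts")

result = call_with_timeout(sum, [1, 2, 3])
print(result)  # 6
```

The point isn't the mechanism, which is crude here, but that the failure model is uniform: the caller no longer needs to know whether the slow party is a thread, a process, or a machine.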
This looks like a reasonably complete survey. However, one important topic is missing. That is the strong consistency algorithms used by databases like Spanner and FaunaDB.
I don't see anything in there even about basic building blocks like Raft. Maybe I'm missing it, or maybe there's room for someone to write another chapter.
... looks awesome, and something that seems only possible with dynamic code... or at least code that can introspect its own AST.
EDIT: the cool part about it, I think, is that you don't have to shoehorn your algorithm into any particular model like map reduce or something else. You just write idiomatic Python code and... magic
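To make the "introspect its own AST" point concrete, here is a toy sketch: a framework can parse a function's source into an AST and discover data-parallel structure (like a comprehension) without the author writing to any special model. The source string is inlined here for self-containment; in practice it would come from `inspect.getsource(fn)`.

```python
import ast

# Ordinary, idiomatic Python the user would write:
source = """
def double_all(xs):
    return [x * 2 for x in xs]
"""

# A framework can walk the AST and notice the parallelizable shape.
tree = ast.parse(source)
node_types = {type(node).__name__ for node in ast.walk(tree)}

print("FunctionDef" in node_types)  # True
print("ListComp" in node_types)     # True: a candidate for distribution
```

A system that sees a `ListComp` over an iterable can, in principle, rewrite it into a distributed map, which is the "magic" part: the user never shoehorned anything into a MapReduce-shaped API.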
I am a bit torn with this, because while I love people reading and then writing up things, there are so many changes I think they should make but I don't want to just barge in and make pull requests.
E.g., they largely miss DryadLINQ (disclosure: not an author, but worked with them), which I would rate as the second most important paper after MapReduce in the "batch compute" space. It shows up only as what you should use to write Dryad programs, because Dryad's programming model "is less elegant than MapReduce"; harsh.
The streaming section (disclosure: Naiad co-author) dances around and then suggests that Google Dataflow is somehow a unification of previous work, which is... well, you can read the Google Dataflow paper
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
and do a search for "Naiad"; it's not even on their radar.
It feels a bit like the critiques are perhaps too-credulous readings of the related work sections of other papers.
I'm not sure of a great way to get to a consensus understanding of the position of these works in the historical record, or whether it is even possible with sensitive authors (hi), but it would be cool to have. Any recommendations would be great.
Well-researched and comprehensive. The survey of distributed languages includes not just Erlang and Akka, but also E and Cloud Haskell. The overview of CAP is deep and rigorous while not wasting words. This will be a great book once it's finished.
This should mention Microsoft Research's Concurrent Revisions [1]. I think it's by far the easiest abstraction to grasp for ordinary programmers, particularly if you've ever used version control. In that case, you already know how they work! Revisions scale from incremental, parallel and concurrent programming all the way up to distributed computation. Very neat work.