I'm happy you enjoyed the post and thanks for the comment. If I understand your ...

ahachete · on Aug 28, 2015

Yes, I got that, maybe I wasn't clear enough. My main concern is that slow nodes (rather than failing nodes) may easily provoke latency spikes, and that seems to me a quite frequent situation. The good point about quorum writes is that outliers are ignored, but with the ISR, as you need to wait for all of them, outliers are not ignored (until, maybe, removed from the ISR). I understand the advantages and the compromise here. But I would like to see if this is a good compromise, as outliers may have a big impact.

fpj · on Aug 28, 2015

Got it, yeah, quorum systems have higher tolerance to tail latency, there is no question about it. we do mention it briefly in the post, but we don't have numbers to show. I'm not aware of it being a major concern for kafka deployments, but I can say that for Apache BookKeeper, we ended up adding the notion of ack quorums to get around such latency spikes. I'll see if I can gather some more information about kafka that we can share at a later time. Thanks for raising this point.

ahachete · on Aug 28, 2015

That would be awesome to have some numbers about this topic. Thanks for your interest. I guess with other systems like Paxos it could be solved by separating the notion of learners and acceptors, which are usually collapsed in the same nodes. In this case, you may have more learners than acceptors (solving the N^2 communication growth with the number of nodes) while still solving the tail latency by running quorum among the acceptors.

linuxhansl · on Aug 28, 2015

I think a further interesting observation is that one can actually trade time uncertainty (i.e. latency spikes) for location uncertainty.

One can drive latency almost arbitrarily low if one is willing to give up exact knowledge of a where the data is.

A simple example is N out of M writes (N and M can be arbitrary with N <= M). M is a set of machines, N is number of replicas we want to have. Now at any given time we write to the N machines that respond fastest. Data is now arbitrarily sprayed over the M machines, but as long as the data itself has ordering information the right state can be reassembled by talking to the M machines.

The "spray" can now be controlled by favouring some nodes from M up to a timeout (i.e. we put more uncertainty in time). We can reduce the reassembly work by using learners and hence increase the likelihood that one machine has all state.

fpj · on Aug 28, 2015

yeah, this is precisely what we do in Apache BookKeeper, and we map N to ack quorums and M to write quorums.