Distributed consensus reloaded: ZooKeeper and replication in Kafka (confluent.io)
82 points by nehanarkhede on Aug 27, 2015 | 18 comments



Very insightful post. My main concern with this approach is that, since the ISR has a fixed size (at any given time), as the post explains, it is subject to latency spikes if a node in the ISR is slow. This seems to me like something that would happen a lot in real deployments, and it's a major drawback.

So while it provides a lot of flexibility and decreases the communication overhead when using many nodes, it may introduce performance degradation and latency spikes when nodes are slow. I'd probably prefer a design where some kind of reduced majority is required rather than the full list of nodes, so that outliers would be filtered out.

Are there good numbers available for metadata writes at, say, the 99.99th-99.999th percentile, to see how bad this can get?


I'm happy you enjoyed the post and thanks for the comment.

If I understand your comment correctly, it isn't the case that the ISR is fixed. The ISR has a minimum size, but it can change over time, so brokers can be removed from the ISR and they can rejoin later once they catch up.


Yes, I got that; maybe I wasn't clear enough. My main concern is that slow nodes (rather than failing nodes) may easily provoke latency spikes, and that seems to me like quite a frequent situation. The good thing about quorum writes is that outliers are ignored, but with the ISR, since you need to wait for all of its members, outliers are not ignored (until, maybe, they are removed from the ISR). I understand the advantages and the compromise here. But I would like to see whether this is a good compromise, as outliers may have a big impact.
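To make the concern concrete, here's a toy sketch (the latency numbers are invented) of why a commit that waits for the whole ISR is pinned to the slowest replica, while a majority quorum simply ignores the straggler:

    import java.util.Arrays;

    // Invented per-replica ack latencies in ms; the last replica is a slow outlier.
    long[] ackLatenciesMs = {3, 4, 5, 120};

    // ISR-style commit: the leader waits for every in-sync replica,
    // so the slowest one dictates the commit latency.
    long isrCommitMs = Arrays.stream(ackLatenciesMs).max().getAsLong();  // 120 ms

    // Majority-quorum commit (3 out of 4 acks here): the slowest replica is ignored.
    long[] sorted = ackLatenciesMs.clone();
    Arrays.sort(sorted);
    long quorumCommitMs = sorted[2];                                     // 5 ms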


Got it, yeah, quorum systems have higher tolerance to tail latency, there is no question about it. We do mention it briefly in the post, but we don't have numbers to show. I'm not aware of it being a major concern for Kafka deployments, but I can say that for Apache BookKeeper we ended up adding the notion of ack quorums to get around such latency spikes. I'll see if I can gather some more information about Kafka that we can share at a later time. Thanks for raising this point.
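(For reference, this is roughly what the knob looks like in the BookKeeper client; the sizes below are only illustrative: each entry is written to 3 bookies out of an ensemble of 5, but the write completes as soon as any 2 of those 3 acknowledge it, which is what absorbs a slow bookie.)

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.LedgerHandle;

    // Illustrative sizes only; exception handling omitted.
    BookKeeper bk = new BookKeeper("zk1:2181");   // ZooKeeper connect string (example)
    LedgerHandle lh = bk.createLedger(
            5,                               // ensemble size: bookies available to the ledger
            3,                               // write quorum: each entry is sent to 3 of them
            2,                               // ack quorum: 2 acks are enough to complete a write
            BookKeeper.DigestType.CRC32,
            "secret".getBytes());            // ledger password (example)
    lh.addEntry("some event".getBytes());    // returns once the ack quorum has responded
    lh.close();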


It would be awesome to have some numbers on this topic. Thanks for your interest. I guess that with other systems like Paxos it could be solved by separating the roles of learners and acceptors, which are usually collapsed onto the same nodes. In this case, you may have more learners than acceptors (solving the N^2 growth of communication with the number of nodes) while still addressing the tail latency by running a quorum among the acceptors.


I think a further interesting observation is that one can actually trade time uncertainty (i.e. latency spikes) for location uncertainty.

One can drive latency almost arbitrarily low if one is willing to give up exact knowledge of where the data is.

A simple example is N-out-of-M writes (N and M can be arbitrary with N <= M). M is a set of machines, and N is the number of replicas we want to have. At any given time we write to the N machines that respond fastest. Data is now arbitrarily sprayed over the M machines, but as long as the data itself carries ordering information, the right state can be reassembled by talking to the M machines.

The "spray" can now be controlled by favouring some nodes from M up to a timeout (i.e. we put more uncertainty in time). We can reduce the reassembly work by using learners and hence increase the likelihood that one machine has all state.


Yeah, this is precisely what we do in Apache BookKeeper, where we map N to ack quorums and M to write quorums.


It seems similar to the primary-backup instance of the Vertical Paxos family. In the primary-backup Vertical Paxos, one can tolerate f faults with f+1 replicas as long as a reliable external reconfiguration master is there to replace failed replicas and make sure everyone agrees on the configuration. Here the external reconfiguration master would be ZooKeeper and the primary-backup protocol the ISR protocol.

http://research.microsoft.com/en-us/um/people/lamport/pubs/v...


Yeah, there are lots of similarities to Vertical Paxos. It also shares a lot with van Renesse and Schneider's classic Chain Replication paper (https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...), with the use of a configuration master to configure a chain that can't solve uniform consensus by itself.

That's not a criticism at all, though. There are a lot of good design and systems ideas there, and they are well explained in this post.


I'm glad you enjoyed the post. The chain replication work indeed relies on a Paxos-based master service for reconfiguration, but I believe that's the only similarity; the replication scheme is otherwise different.


Thanks for the comment, the Vertical Paxos reference is indeed relevant here.


I find this post a bit disappointing.

The subject is exciting and I wish to learn more about how to design an efficient and safe replication scheme on top of two coordination protocols, each with its core set of guarantees and constraints.

The post gives a good overview of the trade-off between safety and performance made with in-sync replicas (ISR) compared to quorum acknowledgement.

But it remains very vague on how to deal with the problem stated in the "ZooKeeper and consensus" section: being consistent does not mean that the values read [by two workers] are necessarily the same, only that they are computed from consistently growing prefixes of the sequence of updates. I totally fail to understand the way proposed to break the tie.

I would expect the diagrams to better show late replicas and even stale views of the ISR. I would expect more evidence of how the system ensures that a message delivered to a consumer is never retracted.


Thanks for your comments, and I'm sorry the post did not match your expectations. If you write to me directly, I'll be more than happy to clarify any question you may have. I'm not sure, for example, what is confusing you about the "ZooKeeper and consensus" section. I'm also not sure what kind of evidence you're after regarding published messages not being lost.


Thanks for offering to help!

What I find vague in "ZooKeeper and consensus" is the answer to "Why does this proposal work compared to the original one?":

>> Because each of the workers has “proposed” a single value

Does the process read or propose the value?

>> and no changes to those values are supposed to occur.

Not supposed to? How do you ensure that?

>> Over time, as the configurator changes the value,

>> they can agree on the different values by running independent instances

>> of this simple agreement recipe.

Sorry, but I fail to see which recipe you are referring to.


>> Does the process read or propose the value?

It does both: it first proposes by writing a sequential znode and then reads the children (all proposals are written under some parent znode). This certainly assumes some experience with ZK, and I wonder if that's the problem. The goal was not to go into a discussion of the ZooKeeper API, but I'm happy to clarify if that's what is preventing you from getting the point.

>> Not supposed to? How do you ensure that?

It is not supposed to in the sense that, if this is implemented correctly, the value proposed by each client won't change. The recipe guarantees it because it assumes that each client writes just once.

>> Sorry, but I fail to see which recipe you are referring to.

I'm referring to these three steps: creating a sequential znode under a known parent, reading the children, and picking the value in the znode with the smallest sequence number.
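For completeness, here is a rough sketch of those three steps with the plain ZooKeeper Java client; the `/proposals` parent path, the open handle `zk`, and `myValue` are assumptions, and exception handling is omitted:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // 1. Propose: each client writes its value exactly once, as a sequential znode.
    zk.create("/proposals/p-", myValue,
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

    // 2. Read all proposals under the known parent znode.
    List<String> children = zk.getChildren("/proposals", false);

    // 3. Agree: every client picks the value in the znode with the smallest
    //    sequence number, so they all decide on the same value.
    Collections.sort(children);
    byte[] agreed = zk.getData("/proposals/" + children.get(0), false, null);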


Thanks for these explanations! Things are much clearer now.

Indeed, I have no experience programming ZooKeeper, even though I use it a lot behind tools like Kafka or Storm.


Well written and understandable. The use of illustrations was particularly good.


Thank you!



