Doozer: a consistent, highly-available data store from Heroku labs

benblack · on April 13, 2011

(Reposting my blog comment here)

"I mean primarily the ability to face a network partition and continue providing write service to all clients transitively connected to a majority of servers. Secondarily, I mean the ability to continue providing read-only service to all clients connected to any server."

A truly available system, in the sense of CAP, allows writes, not just reads, even under partition, even for clients not connected to the "majority" nodes. This leads inevitably to the need for conflict detection and resolution, and that whole "eventually consistent" thing. What you are describing is very useful, hence the existence of things like Chubby and ZK, but is most definitely not "available", per CAP.

Folks might also be interested in the classic paper on constructing locks from low level primitives like CAS, Herlihy's 'Wait-free Synchronization" http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A01...

jdf · on April 14, 2011

There are a number of other systems that allow this same approach of consistency and high availability. For example, Cassandra, which is freely available (as required by the poster), appears to be able to give you this behavior if you set ConsistencyLevel to QUORUM.

Clustrix, the company I work for, offers a full SQL data store with similar quorum semantics. However, it's not free.

Google Megastore allows similar consistency semantics (with its own data model) in a cross data center "cloud" fashion. It's also not free, but it would probably be suitable for some set of Heroku customers, particularly if they're already using Google App Engine.

leef · on April 14, 2011

You actually cannot implement strongly consistent compare-and-set on a quorum based system. This is because quorums do not have strong failure semantics.

For example, lets say you have key A with data 'foo' living on all 3 nodes of a quorum based system. You perform a write to key A with data 'bar' and one host accepts the write and two fail resulting in your client getting a failure back. Reading data back from all nodes will return 'foo'. Time goes along and read repair or some other anti-entropy process kicks in a replicates 'bar' to the other hosts. Now a read for key A returns 'bar'. In short, when a client performs a write and gets a failure back they actually have no idea if it will update the data store or not. You cannot implement compare-and-set in that type of environment.

krarick · on April 14, 2011

That is not how paxos works. It guarantees that all hosts in the system learn the same sequence of writes. The scenario where "one host accepts the write and two fail" does not happen. This is proved in http://research.microsoft.com/users/lamport/pubs/paxos-simpl...

leef · on April 14, 2011

Right, I was talking about the quorum semantics as used by systems like Dynamo and Cassandra. Paxos provides a much stronger guarantee than simple quorums.

vicaya · on April 14, 2011

"The only freely-available tool that attempts to provide both of these is zookeeper, but zookeeper is specialized, focused on locking and server management. We need a general-purpose tool that provides the same guarantees in a clean, well-designed package."

I think they dismissed Zookeeper too quickly without trying to understand it first. The zookeeper primitives (get/set nodes and their children) seems simpler and cleaner than doozer is client protocol, AFAICT. Zookeeper should scale better (especially for readers) than typical Paxos based systems as well.

leef · on April 15, 2011

The problem with zookeeper is there is no safe way to dynamically change the nodes in the cluster. That's where Paxos comes in since it is provably correct even with nodes joining and leaving.

sigil · on April 13, 2011

Cool. Is the consensus algorithm straight-up Paxos, or are there modifications? The Paxos Made Live [1] paper from Google raised some interesting issues.

[1] http://labs.google.com/papers/paxos_made_live.html

bmizerany · on April 13, 2011

Yes, it is Paxos. We have a post coming up about it.

sigil · on April 13, 2011

Great, looking forward to it.

besquared · on April 13, 2011

What happens when you CAS the same key from two sides of a network partition and the partitions rejoin?

benblack · on April 13, 2011

As explained in the post, the minority side of the partition is not writable.

besquared · on April 13, 2011

Ahh I see it's in the extra small note at the bottom

"I mean primarily the ability to face a network partition and continue providing write service to all clients transitively connected to a majority of servers."

Probably should be a little more prominent considering it's one of the first questions you'd want answered but I was a bit careless to miss it.

packetslave · on April 13, 2011

One of the Fraggles wanders by and eats half your data.