Abadi's take seemed more interesting to me since it isn't rubbishing CAP as such, but noting that C, A & P are not symmetrical and also that there is a fourth consideration - latency - that most discussions of this topic do not take into account even though it is clearly a design factor in some existing NoSQL databases.
I think latency is the unwritten, default assumption in all of these cases. High-latency, strongly consistent, shared-nothing distributed relational databases are over two decades old (starting with Stonebraker's "The Case for Shared Nothing"). The problem is that they're unusable for OLTP workloads (without non-commodity hardware appliances using SSDs and connected over InfiniBand, e.g., Exadata) even if you don't take CAP into account.
Latency isn't a "tunable" variable in the same sense as C and P, in that there are multiple ways to lower it. With eventual consistency you can use smaller quorums to lower the latency of reads, writes, or both. You can also use strong consistency to avoid doing quorums at read time, at the cost of higher-latency writes. Higher-latency writes may be acceptable, at least with 2PC. I am not sure many systems use Paxos to replicate a transaction log on every write, due to the cost: I believe Scalaris may, but I can't find the information right now.
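To make the quorum/latency point concrete, here's a toy sketch (my own illustration, not taken from any real system): a quorum operation finishes when the R-th (or W-th) fastest replica responds, so a smaller quorum means waiting on fewer replicas:

    import random

    def quorum_latency(replica_latencies_ms, quorum_size):
        # A quorum op completes when the quorum_size-th fastest
        # replica has responded.
        return sorted(replica_latencies_ms)[quorum_size - 1]

    random.seed(1)
    latencies = [random.uniform(1, 50) for _ in range(3)]  # N = 3 replicas

    print(quorum_latency(latencies, 1))  # R = 1: fastest replica wins
    print(quorum_latency(latencies, 3))  # R = N: gated by the slowest replica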
Latency is also a multipart problem, right? The software can always attempt to improve its response time, but it can't change the latency on the wire.
It seems that he ignores a certain necessity in his CA == CP premise. Specifically, that premise is that to achieve CA the system must be a single logical unit. Once the network partitions, anyone in the correct partition (the same one as the CA system) still has access, so the system is still available. This sounds like bunk, because the system is not available to those on the wrong partition; however, that is not an availability issue, it is a partition issue -- pretty much by definition.
Wish my DB systems course had even covered this stuff. It's an interesting take on it; however, in practice Cassandra does have a strong consistency guarantee when you don't have a network partition, so I think the point he tries to make about the C vs. A trade-off is weaker in light of this.
The popular misconception about eventually consistent systems is that eventual consistency means you don't get to read your writes. In reality, however, eventual consistency means eventual consistency of individual replicas: you can still get a consistent view of the key space even if (for a very small window of time) the individual replicas are inconsistent.
You can use quorum protocols and version vectors to get guaranteed read-your-writes consistency (when R + W > N). The cost is higher latency (for the operations that require quorums) and loss of availability during a network partition (without additional heuristics such as "preferred R" or "preferred W").
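A toy sketch of why R + W > N gives read-your-writes (hypothetical code, nothing Dynamo-specific): any read set of R replicas must overlap any write set of W replicas, so at least one replica in the read set holds the latest version:

    class QuorumStore:
        # Toy quorum replication: N replicas, write to W, read from R.
        # If R + W > N, every read quorum overlaps every write quorum,
        # so at least one replica in the read set has the latest version.
        def __init__(self, n, w, r):
            assert w <= n and r <= n
            self.w, self.r = w, r
            self.replicas = [{} for _ in range(n)]  # key -> (version, value)

        def put(self, key, value, version):
            # The write lands on only W replicas; the rest lag behind,
            # as they would after a transient failure.
            for replica in self.replicas[:self.w]:
                replica[key] = (version, value)

        def get(self, key):
            # Read R replicas and keep the highest version seen.
            answers = [rep.get(key, (0, None)) for rep in self.replicas[-self.r:]]
            return max(answers, key=lambda ver_val: ver_val[0])

    store = QuorumStore(n=3, w=2, r=2)  # R + W = 4 > N = 3
    store.put("k", "v1", version=1)
    print(store.get("k"))               # (1, 'v1') -- the write is visible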
Even with weak eventual consistency (when R + W < N), in a Dynamo-like system, not being able to read your writes is still only a failure condition (it occurs when the first W nodes in the preference list for a key fail).
With traditional MySQL or Postgres/Slony replication between master/slave or master/master, your guarantees are actually much lower. You're not guaranteed to be able to read your writes (if you use transactions you're guaranteed W=1, else you're guaranteed W=0). Such a system is also "likely consistent" rather than "eventually consistent": there is no guarantee that consistency will eventually take place.
One thing to note is that PNUTS takes a hybrid approach: there's strong consistency (2PC) on all the replicas within a "datacenter" (my unsupported assertion is that this actually means "on all replicas on a single core switch" rather than a datacenter per se), but eventual consistency between "datacenters". This means there's no need to do quorum reads or use preference lists in a LAN environment: you can go to any replica and get the right value back, so you get read-your-writes consistency with lower latency. Developers also don't need to worry about sending vector clocks along with each put, and it's easier to implement mutations of high-contention keys (in an eventually consistent system this would have to be done using "optimistic locking" and vector clocks, e.g. < http://project-voldemort.com/javadoc/all/voldemort/client/St... >). Having an agreed-upon transaction log helps too: there's no need for multiple consistency-repair mechanisms for nodes that have been temporarily out of service (in Dynamo's case: read-repair, hinted hand-off and Merkle trees): you just replay the delta in the transaction log (starting from the latest checkpoint on the out-of-date replica and ending at the "high water mark").
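For the curious, the optimistic-locking-with-vector-clocks dance looks roughly like this (a generic sketch of the technique, not Voldemort's actual API): each version carries a clock, dominated versions are discarded, and concurrent survivors must be reconciled by the application:

    def descends(a, b):
        # True if vector clock a has seen everything clock b has.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def concurrent_versions(versions):
        # Keep only versions not dominated by another clock; if more than
        # one survives, the writes were concurrent and the application
        # (or a merge function) must reconcile them.
        return [(clock, value) for clock, value in versions
                if not any(descends(other, clock) and other != clock
                           for other, _ in versions)]

    ancestor = ({"A": 1}, "cart:[]")
    client1  = ({"A": 1, "B": 1}, "cart:[milk]")  # updated via node B
    client2  = ({"A": 1, "C": 1}, "cart:[eggs]")  # updated via node C

    print(concurrent_versions([ancestor, client1, client2]))
    # -> the two client versions survive as siblings; the ancestor is dominated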
The apparent downsides are more expensive writes (2PC can be pricier than simple quorum writes -- which is sometimes an acceptable trade-off), loss of fault tolerance when the 2PC coordinator node crashes (in reality, this is likely very rare), and higher demands on the network. Anything that can create a partition in a "CA" LAN zone (e.g., network switches, Ethernet cables) becomes a single point of failure (however, if you put all Cassandra/Voldemort/Riak nodes on a single rack, behind a single switch, you've got the same problem).
The "unwritten" downside is complexity and lack of symmetry in the implementation: different components require different hardware and software profiles and the like, and failure of one type of node is more costly than failure of another.
That's not to say it's wrong; it's just a different set of trade-offs: complexity of application development due to loss of replication transparency (i.e., having to deal with vector clocks) vs. complexity of system development and operations.
Note that if you read the BigTable papers from Google you'll see the same story. Google File System itself is actually eventually consistent (which is fairly common, e.g., CodaFS). BigTable itself, however, is strongly consistent: Chubby (which employs Paxos, a form of 3PC) ensures that requests are routed to specific tablet servers (which then checkpoint to GFS) and that only one master is active at a time. Note that Paxos is not used to explicitly agree on a replicated transaction log (the "atomic, totally ordered multicast" problem); my own guess is that this would simply be too expensive.
Note that in BigTable routing information is consistent across the entire cluster, unlike a Dynamo-style system, which uses gossip for routing-table propagation (the advantage being that you only need to specify a small number of "seeds" when connecting to or joining a cluster; the disadvantages being that dead nodes aren't seen instantly -- additional failure detection is needed -- and potential scalability issues with gossip on clusters of >1000 machines).
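Gossip-based routing-table propagation, very roughly (again a toy sketch of my own, not any particular implementation): each node periodically merges a random peer's view, taking the highest heartbeat per entry, which is why dead nodes are only noticed with a lag:

    import random

    def gossip_round(views):
        # One round: every node merges its routing/membership view with
        # one random peer, keeping the highest heartbeat per entry
        # (push-pull, so both sides converge).
        for i, view in enumerate(views):
            peer = views[random.choice([j for j in range(len(views)) if j != i])]
            merged = {n: max(view.get(n, 0), peer.get(n, 0))
                      for n in set(view) | set(peer)}
            view.update(merged)
            peer.update(merged)

    # Three nodes; node 0 initially knows only itself.
    views = [{"n0": 5}, {"n0": 1, "n1": 7}, {"n1": 2, "n2": 9}]
    for _ in range(4):  # at this scale a few rounds almost surely converge
        gossip_round(views)
    print(views)        # all views end up as {n0: 5, n1: 7, n2: 9}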
When a single BigTable cell was insufficient, multiple BigTable cells were set up with eventually consistent replication between them. Unfortunately there's no public paper (yet) on Spanner, but my bet is that Spanner is built on that same pattern (an eventually consistent layer on top of BigTable).
It's also worth reading Google's "Paxos Made Live", which shows the complexity of implementing Paxos: it's a beautiful algorithm, but with many corner cases. Fortunately ZooKeeper (which provides Chubby-like functionality, e.g., distributed locks and "ephemeral nodes", and uses its own form of 3PC -- ZAB) is available and open source.
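For instance, with the kazoo Python client, a Chubby-style distributed lock is a few lines (the host address and paths below are made up):

    from kazoo.client import KazooClient

    # Assumes a ZooKeeper server reachable at this (hypothetical) address.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # The lock recipe is built on ephemeral sequential znodes: if this
    # client dies, its znode -- and thus the lock -- vanishes automatically.
    lock = zk.Lock("/app/leader", "my-node-id")
    with lock:
        print("only one client in the ensemble gets here at a time")

    zk.stop()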
Dynamo, PNUTS and BigTable all have another unstated assumption: if you want technology to be an advantage, hire top talent, e.g., Vogels, Dean, Ghemawat. More specifically, the sort of top talent that can take a "business problem" and boil it down to a form where it's solvable by elegant data structures and algorithms, deciding which trade-offs make sense.
I remember the Yahoo guys saying they would open-source PNUTS/Sherpa at the very first NoSQL meetup. It's a shame that it's been so long and they have not followed through on that.
They open-sourced Traffic Server, which is responsible for request routing. However, Traffic Server came from an acquisition (Inktomi) and was always intended as a commercial project that could run on other people's systems.
I'd wager that the real issue is the dependency on some of the core Yahoo libraries: there are a whole number of them, and they're often so widely used that a lot of ex-Yahoo engineers are surprised to discover they're not open source. The reason they're not open source is that some of them would not be useful outside of Yahoo and would create maintenance problems, e.g., features that were once missing in FreeBSD are now available in recent versions of FreeBSD as well as Linux, but had to be ported forward to preserve API compatibility.