There are two main reasons why, when I was researching scalable databases, I primarily gravitated towards Dynamo-style replication (Cassandra, Voldemort, and at the time, Dynomite):
- There is no such thing as failover. Dynamo replication takes node failure in stride. This is what you want for a robust system where "Network Partitions are Rare, Server Failures are Not." Not only does it prevent temporary unavailability during failover, it rules out an entire class of difficult, edge-case bugs (a class that every master-election-and-failover system out there has been plagued with).
- It generalizes to multiple datacenters as easily as to multiple machines, allowing local latencies for reads AND writes, in contrast to master-based systems where you always have to hit the master (possibly cross-DC) for at least writes. (Couchbase is unusual in that it apparently forces read-from-master as well.) Cassandra has pushed this the farthest, allowing you to choose synchronous replication to local replicas and asynchronous to remote ones, for instance: http://www.datastax.com/docs/1.2/dml/data_consistency
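To make that last point concrete, here is a minimal sketch using the DataStax Python driver, assuming a keyspace named "app" replicated with NetworkTopologyStrategy across two hypothetical DCs; LOCAL_QUORUM keeps the synchronous part of the write (and read) inside the local DC while remote replicas are updated asynchronously:

```python
# Sketch only: assumes a cluster with two DCs and a keyspace created with
# NetworkTopologyStrategy, e.g.:
#   CREATE KEYSPACE app WITH replication =
#     {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3};
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])        # contact point in the local DC
session = cluster.connect("app")

# LOCAL_QUORUM: wait for a quorum of replicas in the local DC only;
# replicas in the remote DC receive the write asynchronously.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, (42, "alice"))

# Reads can likewise be served entirely by local replicas.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(read, (42,)).one()
```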
I'm not denying that Dynamo features more consistent availability, but it does so at the cost of either temporal consistency or a much larger amount of resources. That's the point of the article: the tradeoffs are expensive, and you can achieve effectively the same performance and availability with fewer resources.
Dynamo may rule out a certain class of bugs, but that doesn't mean other systems must also have those bugs.
And Couchbase also has multi-data-center master/master capabilities, with any topology: chain, ring, hub and spoke, or any combination thereof.
> I'm not denying that Dynamo features more consistent availability, but it does so at either cost of temporal consistency or a much larger amount of resources.
The trouble is that your article outlines problems with pretty much all quorum systems, including multi-Paxos: the setup where there's an elected leader -- elected by the first round of Paxos -- which then performs subsequent writes using a single-round-trip second round.
Incidentally, that is very close to what you've proposed, except you've chosen not to perform quorum reads. That is perfectly fine, but there are also hidden costs -- not only do you need a leader election, you also need synchronization barriers. Before a node can handle writes or serve the latest reads, it must back-fill its transaction log up to the last entry (which it does by reading from its peers).
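To illustrate that barrier, here is a rough sketch (hypothetical fetch_last_index / fetch_entries RPC helpers, not any particular implementation): a newly elected leader must back-fill its log from peers before it may accept traffic.

```python
# Hypothetical sketch of a new leader's catch-up barrier.
# fetch_last_index(peer) and fetch_entries(peer, lo, hi) are assumed RPC helpers.

def become_leader(local_log, peers, fetch_last_index, fetch_entries):
    """Back-fill the local log from peers before accepting traffic."""
    last_local = len(local_log)

    # Find how far ahead any peer is.
    last_known = max([last_local] + [fetch_last_index(p) for p in peers])

    # Pull any missing entries (this sketch just tries peers in order).
    for peer in peers:
        if last_local >= last_known:
            break
        entries = fetch_entries(peer, last_local + 1, last_known)
        local_log.extend(entries)
        last_local = len(local_log)

    if last_local < last_known:
        raise RuntimeError("could not back-fill log; refuse to serve")

    # Only now is it safe to serve writes and "latest" reads.
    return local_log
```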
So while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity.
You also hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to hold a leader election whenever the node fails to meet a read SLA (which is going to result in a disaster -- I am speaking from personal experience), you will have to accept that you're bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not drag down whole-cluster latency.
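A toy illustration of why quorums tolerate stragglers, assuming an asyncio setting and a hypothetical read_replica coroutine: the read is fanned out to all N replicas and returns as soon as the fastest R respond, so one slow node stays off the critical path.

```python
import asyncio

async def quorum_read(replicas, read_replica, r):
    """Send the read to every replica, return once the fastest r respond.

    read_replica(replica) is an assumed coroutine performing one replica read.
    """
    tasks = [asyncio.ensure_future(read_replica(rep)) for rep in replicas]
    responses = []
    for fut in asyncio.as_completed(tasks):
        responses.append(await fut)
        if len(responses) >= r:
            break
    # Cancel whatever is still in flight (e.g. the straggler).
    for t in tasks:
        t.cancel()
    return responses  # caller reconciles versions / picks the newest
```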
The core point of Dynamo is low latency, availability, and handling of all kinds of partitions: clean partitions (long-term single-node failures), transient failures (garbage collection pauses, slow disks, network blips, etc.), or even more complex dependent failures.
The reality, of course, is that availability is neither the sole nor the principal concern of every system. It's perfectly fine to trade off availability for other goals -- you just need to be aware of that trade-off.
You may want to note the evolution of Google's services: BigTable -> MegaStore -> Spanner. Essentially, they started out with a non-HA system (BigTable), found that every team began implementing HA on its own, then added quorum-based protocols (Paxos), and finally added an optimization (TrueTime) to reduce read latency.
One quick point - AFAIK Megastore uses quorums across data centers but uses BigTable within a data center - so master-based. Is that not true? If it is true, then your point does need to be qualified - cross-DC failure models are (I assume) different from those within a DC.
Temporal consistency is overrated for many applications, and still more applications achieve consistency with a locking mechanism or by serializing reads and writes elsewhere.
As for resources, I've heard it said that your architecture is working for you if you can throw more resources at the problem and it scales linearly. If I have a key that gets accessed 2000 times/s on a single Couchbase node, I'm going to get long tails on every request that goes to that node.
Can you possibly be a bit more specific when talking about 'resources' and 'much larger'? It's hard to evaluate whether this is even an issue otherwise.
Also, can you elaborate more on the multi-DC part of Couchbase? The documentation seems to imply it only supports two-DC replication, and I find the "currently supports continuous replication of data" line to be odd. What are the future plans for Couchbase in this area?
There's at least one good reason for Dynamo's write-to-all and read-from-all mechanism: latency.
What you've called 'W=2' in Couchbase is "write to master and at least one slave." Dynamo-style 'W=2' means "write to any two replicas." This can decrease tail latencies since you don't have to wait for the master--any two will do; similarly for 'R=2'. Indeed, Dynamo 'W=2, R=2' will incur more read load than master-based reads (at least double, but not necessarily triple, in your figures). So I think it's more accurately a trade-off between latency and server load.
Anyway, I'm pretty sure CASSANDRA-4705 (https://issues.apache.org/jira/browse/CASSANDRA-4705), which allows for Dean-style redundant requests, both decreases the read load (at least from the factor of N in your post) and should still reduce tail latency without compromising on semantics.
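For readers unfamiliar with the idea, a Dean-style redundant (hedged) request can be sketched roughly as follows: send to one replica, and only fan out to a backup if the first hasn't answered within a small latency budget. This is just an illustration with hypothetical helpers, not Cassandra's actual speculative-retry implementation:

```python
import asyncio

async def hedged_read(primary, backup, read_replica, hedge_after=0.010):
    """Try the primary replica; after `hedge_after` seconds, also try a backup.

    read_replica(replica) is an assumed coroutine; 10 ms stands in for a
    p99-based latency budget. Whichever response arrives first wins.
    """
    first = asyncio.ensure_future(read_replica(primary))
    try:
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_after)
    except asyncio.TimeoutError:
        second = asyncio.ensure_future(read_replica(backup))
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return done.pop().result()
```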
I don't have skin in this game, but I'm pretty sure that the Dynamo engineers had a good idea of what they were doing. (That said, the regular [non-linearizable] semantics for R+W>N are sort of annoying compared to a master-slave system, but can be fixed with write-backs.)
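On that last parenthetical -- the write-back fix is, roughly, read repair on the read path: read from the replicas, pick the newest version, and push it back to the stale copies before returning. A rough sketch with hypothetical version-tagged helpers:

```python
def read_with_write_back(replicas, read_replica, write_replica):
    """Read from all replicas, return the newest value, repair stale copies.

    read_replica(rep) -> (timestamp, value) and write_replica(rep, ts, value)
    are assumed helpers; timestamps stand in for whatever versioning is used.
    """
    results = {rep: read_replica(rep) for rep in replicas}
    newest_ts, newest_val = max(results.values(), key=lambda tv: tv[0])

    # Write-back: bring stale replicas up to the winning version before
    # acknowledging the read, so later reads can't go "backwards".
    for rep, (ts, _) in results.items():
        if ts < newest_ts:
            write_replica(rep, newest_ts, newest_val)

    return newest_val
```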
Good point. But "writes" are very fast; in our tests write latency is less than half of read latency, so we can easily do master-to-slave replication within the SLA. But your point is correct: a Dynamo system is faster at achieving the same replication factor.
I think it's worth elaborating that the primary advantage of the Dynamo model is not in the best- or average-case, but when everything does not go as planned -- when the master gets behind, when EC2 network latency spikes, etc. Then "any two replicas" instead of "master plus one more" is much more robust.
This also exactly describes how HBase works. I've always preferred HBase to Cassandra for this exact reason. You put far less read load on your servers and you don't have to worry about most of the things on http://wiki.apache.org/cassandra/Operations.
Another benefit that is not mentioned: with a master-based system you can easily move who is responsible for the data if a server starts to hotspot. In Cassandra you have to use random key distribution, because if a server hotspots, the only solution is to split the token ring, which is an intensive operation that is hard to do while the server is under heavy load.
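A toy token-ring sketch (hypothetical node names) shows why a hot key is stuck: its hash pins it to one owner, so relieving that node means moving token boundaries rather than simply reassigning the key.

```python
import bisect
import hashlib

class TokenRing:
    """Toy hash ring: each node owns the arc up to its token."""

    def __init__(self, nodes):
        # In this sketch, a node's token is just the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        i = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["node-a", "node-b", "node-c"])
# A hot key always hashes to the same owner; relieving that node means
# splitting/moving tokens, not simply reassigning the key elsewhere.
print(ring.owner("very-hot-key"))
```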
I hear the "master design is bad" argument all the time.
From many angles it is a bad design, but from others it is not. First off, it is simpler and easier to grok and check for bugs. Debugging a running system is easier.
It is also easier to implement different distribution strategies and failure/placement groups, because the algorithm is centralized.
If things go wrong it is easier to track where your data is.
No gossip rings to converge. And it is easier to grow the cluster.
There are a few problems with read=any (or really read=one, since 'any' is only supported for writes). One issue is that it will still send the request out to all the servers in the quorum, so even though you don't have to wait on a quorum, you still put the read load on them.
The second and more important issue is that if you want consistent data, the only way to use read=one is with write=all, which means you would have no resilience to a server outage. So in a normal Cassandra cluster where you have write=quorum, a read=one can come back with no data (the replica you hit may not have seen the write yet). In my experience this happened frequently enough with long GC pauses on one of the nodes that it wasn't useful in production.
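The arithmetic behind that: a read set and a write set are only guaranteed to intersect when R + W > N. A quick sanity check, assuming N = 3:

```python
def overlap_guaranteed(n, r, w):
    """True if any read set of size r must intersect any write set of size w."""
    return r + w > n

N = 3
print(overlap_guaranteed(N, r=1, w=3))  # read=one, write=all       -> True
print(overlap_guaranteed(N, r=1, w=2))  # read=one, write=quorum    -> False (may miss the write)
print(overlap_guaranteed(N, r=2, w=2))  # read=quorum, write=quorum -> True
```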
Great article, Damien. This idea that network partitions are exceedingly rare is the reason why ElasticSearch goes CA vs. the AP that many other NoSQL datastores choose.
Not only are network partitions rare, the most disastrous case where the cluster splits in half is even rarer. Usually, you have a small part of the cluster partition away.
I hope people don't take this as a Dynamo vs. Couch discussion, because the relative importance of partition tolerance is a topic that spans all datastores that give up on ACID.
Network partitions are extremely rare only for small clusters. For very large clusters or multi-datacenter clusters there is much more hardware than a single switch between servers. Then the likelihood that something cuts off a whole room full of servers from the rest of your cluster is not something safe to neglect.
When your units of networking concern are "availability zones" (i.e. data centers) rather than just switches, wouldn't network failures now be more common than server failures?
Agreed - the assumption that the MTBF of a single network device is a relevant statistic is undoubtedly incorrect anywhere but in a hobbyist's network.
Even if switch failures are rarer, Couchbase at W=1 will silently drop data during a network partition while Dynamo at W=2 won't, so how is the comparison at the end valid?
"...if the client wanted true multi-node durability, then the write wouldn't have succeeded (the client would timeout waiting for replicas(s) to receive the update) and the client wouldn't unknowingly lose data."
Why compare the MTBF of a single network switch to the MTBF of a node? Why not compare the MTBF of a single network switch to the MTBF of a single CPU or a motherboard? Unless you're talking about a hobby-size network, there is usually much more than a single network switch between the nodes.
"The original Dynamo design was based on a core set of strong distributed systems principles resulting in an ultra-scalable and highly reliable database system. Amazon DynamoDB, which is a new service, continues to build on these principles, and also builds on our years of experience with running non-relational databases and cloud services, such as Amazon SimpleDB and Amazon S3, at scale."
/Cassandra project chair