The network is reliable

Dave_Rosenthal · on June 4, 2013

Disclaimer: I'm co-founder of FoundationDB.

This is a great blog post. It shows all of the weird faults that motivate the CAP theorem's relevance to real world distributed systems. I especially like the part about the NIC "dropping inbound but not outbound packets". Good times!

I want to highlight a conceptual jump that I think some people will make when reading this article that is incorrect. That jump is: "Because network partitions occur in the real-world, then we should choose CAP availability." In fact, what we actually need is systems that deal gracefully with partitions, regardless of whether they are CP or AP. An AP system needs to "reconverge" in a timely way, avoid long latencies, etc. A CP system needs to avoid split-brain scenarios, remain up despite some nodes being partitioned away, etc. Mostly, either type of system needs relentless testing that seems to be dangerously uncommon.

ryanobjc · on June 4, 2013

I'd also like to point out that the CAP proof doesn't really talk about "split brain mode"... it talks about "partitions" which are narrowly defined in the network model.

So even if one is "P"-protected in the "CAP" theorem sense, you might not be able to handle a full split brain partition.

aphyr · on June 4, 2013

So even if one is "P"-protected in the "CAP" theorem sense, you might not be able to handle a full split brain partition.

I don't think so. Split-brain (assuming you're talking about totally isolated network components, and not "disagreement about authoritative state between nodes") is a special case of a partition: any partition-tolerant system is tolerant of totally isolating partitions by definition. It might do that either by entering split-brain and becoming non-linearizable, or by refusing requests.

ryanobjc · on June 5, 2013

"by refusing requests"

So any system that can handle a split brain by crash failing successfully handles the situation?

The definitions here are so broad, and doesnt really match my experience, which is namely detecting a fully split brain is difficult for humans to do, harder for systems, and generally causes major problems.

Over-provisioning your system is one solution (eg dynamo with N=5), but that isnt acceptable to everyone...

aphyr · on June 5, 2013

So any system that can handle a split brain by crash failing successfully handles the situation?

We seem to disagree about what "availability" means in the context of CAP. Nothing has to crash. All that a serializable system has to do under partition is refuse to respond to at least one possible request. Exactly which requests is algorithm-dependent.

For example, a system with e3PC quorum reads and writes will handle partitions by refusing to allow reads or writes in any minority component. If it writes to all nodes and can read from any node, it will refuse to accept any writes during partition, but all nodes can service read requests consistently. Conversely, if it writes to any node and reads from all, it can always handle write requests, but reads will fail.

Pragmatically, the desire for single-node-failure tolerance leads most CP systems to accept operations only when a majority of nodes are accessible. Systems like Paxos, Viewstamped Replication, ZAB, and Raft provide CP semantics in exactly this way: a partition which divides a 5-node cluster into a 3-node component and a 2-node component will continue to work correctly in the 3-node component, and the 2-node component will refuse requests until the partition is resolved.

Not all systems will shut down a component completely. Most sharded CP databases will run, say, Paxos quorums between groups of three or five replicas, but overlap those replicas around the hash ring. This is the design described by Eric Brewer for Stasis, and is the planned design for Cassandra's CaS and Riak's CP write support. These systems offer serializability over any particular key (or shard, e.g. a single quorum), but not between keys. During failure, some keys will be consistently accessible in one component, and a disjoint set of keys will be accessible in another.

The definitions here are so broad, and doesnt really match my experience, which is namely detecting a fully split brain is difficult for humans to do, harder for systems, and generally causes major problems.

You are correct--designing CP algorithms is difficult. Luckily, you don't have to, because we have several already. You might start with

https://bitbucket.org/cometpeak/ringmaster/src/1e9fc082d1e5/...

http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F...

http://pmg.csail.mit.edu/papers/vr-revisited.pdf

https://ramcloud.stanford.edu/wiki/download/attachments/1137...

IgorPartola · on June 4, 2013

I have seen some really interesting network partitions between SoftLayer data centers. The scenarios I saw were actually very frustrating because these weren't complete partitions (in the CAP theory sense), but just really high latency between data centers that was very sporadic. The CAP theorem addresses the situation where there is 100% packet loss both ways, but you can get "decreased availability" and "decreased consistency" with high latency. In other words, if you plan on traversing the WAN with your system, your system should take into account the high latency situation as well as classical split brain.

P.S.: at some point we stopped having these issues and overall I think SoftLayer is a great provider, at least for their bare metal offerings.

bonzoesc · on June 4, 2013

This is a pretty exhaustive list of network issues and the software/data problems they cause. Thanks!

ryanobjc · on June 4, 2013

Interesting story, the Amazon datacenters were built in such a way they were reliant on WAN links between them to act as one.

So take your average datacenter, let's say it has 60,000 machines, and now you put it in to 3 datacenters of 20,000 machines each. For reliability reasons. (nevermind the math or common sense)

This is what amazon did. And the single fiber link between the datacenters had a tendency to get dug up by everyone pretty much. During these outages each DC had its external connectivity working.

And now you know why Dynamo was built the way it was. Full split brain mode was common because service owners would put 50% of their fleet in 2 datacenters. Usually EXACTLY 50% as well, since humans tend to "snap to" even numbers.

One last point, in the CAP theorem proof, "partitions" really just means loss of packets to/from one node. That is the smallest/general case.

Even if one can handle single node failure well (eg: dynamo systems) it may or may not be able to handle full split-brain mode. That is a lot harder and there isn't any real CS to help you here (not that the CAP theorem is helpful)

aphyr · on June 4, 2013

One last point, in the CAP theorem proof, "partitions" really just means loss of packets to/from one node. That is the smallest/general case.

I don't know where you got this idea from. The formal model in Gilbert and Lynch's proof is arbitrary packet loss, which is equivalent to an arbitrary pattern of isolation between groups (of any size) of nodes. See section 2.3:

http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pd...

Even if one can handle single node failure well (eg: dynamo systems) it may or may not be able to handle full split-brain mode.

Just so we're clear, Dynamo does handle totally isolated components during partition. This is an explicit goal of the paper, and it provides specific guarantees about availability in those circumstances.

That is a lot harder and there isn't any real CS to help you here (not that the CAP theorem is helpful)

There is a great deal of CS to help you deal with partitions. You might start with Lamport, Lynch, Gray, Brewer, Chandra, Cheriton & Skeen, etc...

fragmede · on June 4, 2013

While the article has collected an impressive list of public failures and post-mortems, I wouldn't exactly call them common. It also closes with the assertion that "some networks really are reliable. Engineers at major financial firms report that despite putting serious effort into designing systems that gracefully tolerate partitions, their networks rarely, if ever, exhibit partition behavior. Cautious engineering (and lots of money) can prevent outages."

What does lots of money actually get you in this case? The more esoteric the hardware, the less real-world testing I would expect it to have gone through. Something like Solaris on SPARC might avoid unreliable NIC/drivers combinations but that's only one of the failure modes listed.

Having a completely identical set of hardware purely for testing would be nice which is where lots of money would help, but some of these failures are so arcane, they sound like they'd be hard to replicate on duplicate hardware, never mind testing for them in the first place.

aphyr · on June 4, 2013

While the article has collected an impressive list of public failures and post-mortems, I wouldn't exactly call them common.

Remember, risk is a combination of frequency and severity. A low probability of catastrophic failure can be worth addressing: look at car crashes, for example. Moreover, some of these uncommon events have widespread consequences--remember the EBS/RDS outage? That partition took out hundreds of companies for several hours.

Part of the problem is that because networks and applications are very different from one another, and because we tend to fix known problems in systems, it's rare for us to see the same failure twice. That's part of why these failures trend towards the esoteric side.

Having a completely identical set of hardware purely for testing would be nice which is where lots of money would help, but some of these failures are so arcane, they sound like they'd be hard to replicate on duplicate hardware, never mind testing for them in the first place.

It varies. Amazon and Google are tightly constrained by computational costs: efficiency matters. Merrill Lynch, meanwhile, has some 27 billion dollars in annual revenue--and requires nowhere near Amazon's computational capacity to realize that income. They can afford to buy expensive hardware, to rack it in tightly controlled datacenters, and to, should they desire, rehearse network transitions on isolated redundant hardware. They have more predictable load and slower-changing application goals than most startups. They also have fixed operating hours for many services, which helps them meet strict SLAs.

All this comes at the cost of increased complexity, longer developer and operations engineering lead time, and higher capital costs. However it is possible to design networks with significantly reduced probability of failure, and a few organizations achieve that goal to varying degrees. We just wanted to acknowledge it in the post.

quanticle · on June 4, 2013

There's also the fact that when Amazon and Google have failures, those failures are public. They affect a large number of consumers and they're widely reported. If a Merril Lynch internal system goes down and costs the company $10 million, is it going to get reported? Will there be a public post-mortem that you can cite? Probably not. Companies of all sorts, especially financial companies, don't like airing their dirty laundry in public and will only do so when there is significant market or regulatory pressure to do so.

So who knows if Merrill Lynch, Morgan Stanley or Goldman Sachs' networks are as reliable as Amazon's or Google's? I don't see any of the former companies admitting to serious network events, much less posting public post-mortems that detail what, exactly, went wrong.

fragmede · on June 4, 2013

Excellent article, btw.

> They can afford to buy expensive hardware,

Simply saying their hardware is more expensive doesn't say much. Amazon and Google aren't running their data-centers on $50 Gig-E switches they picked up on sale from Fry's either - the switches in their data centers are also high-end switches that cost more than a car.

What about Merrill Lynch-grade expensive hardware actually avoid problems on this list? Do they have super-redundant systems where I can swap out a bad motherboard without any downtime - which exists, is super expensive, but still wouldn't have prevented half the issues on the list (in-and-of-itself anyway. Having a single monolithic system that 'never' goes down vs an N-machine HA setup would have avoided heartbeat-related issues since there is no heartbeat to do.)

> to rack it in tightly controlled datacenters, and to, should they desire,

Google and Amazon owns the datacenters in their entirety for some locations, so I'm curious what those tighter controls are/do.

> rehearse network transitions on isolated redundant hardware.

Isolated redundant hardware would help in certain test cases, but that doesn't test everything - often (and especially in the case of such esoteric corner-case issues), a live load is the only way to test the system.

> They have more predictable load and slower-changing application goals than most startups.

Oh absolutely, but that's not due to money or the engineers, but simply because their 'product' is strictly defined, whereas a startup may not even know what their product is, and which may change during their lifetime.

> They also have fixed operating hours for many services, which helps them meet strict SLAs.

Ah, that is certainly an advantage for keeping systems up - scheduled downtime. Taking the system offline to perform upgrades/other maintenance is certainly something that does not get enough attention.

> We just wanted to acknowledge it in the post.

Sure. I'm taking umbrage at phrase "on the other hand, some networks really are reliable" without solid evidence to back it up. Major financial firms have very little motivation to doing the same sort of public-disclosure and post-mortem that this article is really about, and furthermore the limited feature set and user base of their 'product' only serves to hamper any exposure it could have.

Somewhere inside that major financial firms who's networks 'rarely' exhibit partition behavior, is a sysadmin who h stories will never see the light of day, and we are seeing 'how the meat is made', so to speak.

When was the last time you did a search via Google that threw an exception vs. how much do we hear about their software breaking or their equipment failing (in that pseudo open fashion of theirs - who knows how many servers they actually have now).

josh2600 · on June 4, 2013

Merrill is buying top of the line Cisco gear. Google is making its own switches and routers because of costs. Cisco is expensive because the median time to failure is so high.

Google and Amazon own their datacenters, but they're playing the bargain basement game. You won't see racks upon racks of Nexus 7k gear in Google, at least not to my knowledge (please correct me if I'm wrong).

You're right that live loads are different from test loads (entropy) but having a separate set of test servers still holds a lot of value.

It's just like Aphyr said, when your tolerances are higher, but your costs of delivering said tolerances as an aggregate are lower, you can afford to over-provision. A lot of these failures come from blocked I/O or network capacity failures that just wouldn't happen in a production finance environment because of over-provisioning.

I hope this lends some light onto your queries :).

mindcrime · on June 4, 2013

Indeed. And consider the "black swan"[1] scenario... inductive logic and empirical observation are inherently flawed[2] as a means to determining "the truth", since the counter-example that disproves your theory can happen tomorrow. Or, in other words:

The network is reliable... until it isn't.

[1]: http://en.wikipedia.org/wiki/Black_swan_theory

[2]: http://en.wikipedia.org/wiki/Problem_of_induction

arielweisberg · on June 4, 2013

Based on my work supporting a variety of customers I would say that partitions are extremely common.

Don't focus on network partitions (or hardware based partitions) because that is only one potential cause of a partition. Partitions are legion. Any system that implements automated fail over has to account for them in a way that produces acceptable results for the use case.

spydum · on June 4, 2013

What I took away was what already was obvious: most outages are attributed to maintenance/changes. In a stable system, network events are exceedingly rare. It is the fact that we are constantly changing these networks that brings about instability..

aphyr · on June 5, 2013

Unless you have GC pauses, or run out of memory, or have flaky ram, or bought shoddy nics, or hit a kernel bug, or have a WAN link, or use EC2, or rely on a hosting provider or colo's network. ;-)