
While the article has collected an impressive list of public failures and post-mortems, I wouldn't exactly call them common. It also closes with the assertion that "some networks really are reliable. Engineers at major financial firms report that despite putting serious effort into designing systems that gracefully tolerate partitions, their networks rarely, if ever, exhibit partition behavior. Cautious engineering (and lots of money) can prevent outages."

What does lots of money actually get you in this case? The more esoteric the hardware, the less real-world testing I would expect it to have gone through. Something like Solaris on SPARC might avoid unreliable NIC/driver combinations, but that's only one of the failure modes listed.

Having a completely identical set of hardware purely for testing would be nice, which is where lots of money would help, but some of these failures are so arcane that they sound like they'd be hard to replicate on duplicate hardware, never mind testing for them in the first place.




> While the article has collected an impressive list of public failures and post-mortems, I wouldn't exactly call them common.

Remember, risk is a combination of frequency and severity. A low probability of catastrophic failure can be worth addressing: look at car crashes, for example. Moreover, some of these uncommon events have widespread consequences--remember the EBS/RDS outage? That partition took out hundreds of companies for several hours.
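To make that concrete, here is a rough back-of-the-envelope calculation (the numbers are invented purely for illustration, not taken from any real outage):

    # Hypothetical figures, purely for illustration.
    partitions_per_year = 0.5        # a "rare" event: one every two years
    outage_hours_per_partition = 4
    loss_per_outage_hour = 250_000   # dollars

    expected_annual_loss = (partitions_per_year
                            * outage_hours_per_partition
                            * loss_per_outage_hour)
    print(expected_annual_loss)      # 500000.0 -- "rare" is still expensive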

Part of the problem is that because networks and applications are very different from one another, and because we tend to fix known problems in systems, it's rare for us to see the same failure twice. That's part of why these failures trend towards the esoteric side.

> Having a completely identical set of hardware purely for testing would be nice, which is where lots of money would help, but some of these failures are so arcane that they sound like they'd be hard to replicate on duplicate hardware, never mind testing for them in the first place.

It varies. Amazon and Google are tightly constrained by computational costs: efficiency matters. Merrill Lynch, meanwhile, has some 27 billion dollars in annual revenue--and requires nowhere near Amazon's computational capacity to realize that income. They can afford to buy expensive hardware, to rack it in tightly controlled datacenters, and to, should they desire, rehearse network transitions on isolated redundant hardware. They have more predictable load and slower-changing application goals than most startups. They also have fixed operating hours for many services, which helps them meet strict SLAs.

All this comes at the cost of increased complexity, longer developer and operations engineering lead time, and higher capital costs. However, it is possible to design networks with a significantly reduced probability of failure, and a few organizations achieve that goal to varying degrees. We just wanted to acknowledge it in the post.


There's also the fact that when Amazon and Google have failures, those failures are public. They affect a large number of consumers and they're widely reported. If a Merrill Lynch internal system goes down and costs the company $10 million, is it going to get reported? Will there be a public post-mortem that you can cite? Probably not. Companies of all sorts, especially financial companies, don't like airing their dirty laundry in public and will only do so under significant market or regulatory pressure.

So who knows if Merrill Lynch, Morgan Stanley or Goldman Sachs' networks are as reliable as Amazon's or Google's? I don't see any of the former companies admitting to serious network events, much less posting public post-mortems that detail what, exactly, went wrong.


Excellent article, btw.

> They can afford to buy expensive hardware,

Simply saying their hardware is more expensive doesn't say much. Amazon and Google aren't running their data-centers on $50 Gig-E switches they picked up on sale from Fry's either - the switches in their data centers are also high-end switches that cost more than a car.

What about Merrill Lynch-grade expensive hardware actually avoids the problems on this list? Do they have super-redundant systems where I can swap out a bad motherboard without any downtime? Such systems exist and are super expensive, but they still wouldn't have prevented half the issues on the list, at least not in and of themselves (although a single monolithic system that 'never' goes down, as opposed to an N-machine HA setup, would have avoided the heartbeat-related issues, since there is no heartbeat to run).

> to rack it in tightly controlled datacenters, and to, should they desire,

Google and Amazon own their datacenters in their entirety in some locations, so I'm curious what those tighter controls actually are and do.

> rehearse network transitions on isolated redundant hardware.

Isolated redundant hardware would help in certain test cases, but that doesn't test everything - often (and especially in the case of such esoteric corner-case issues), a live load is the only way to test the system.

> They have more predictable load and slower-changing application goals than most startups.

Oh, absolutely, but that's not due to money or the engineers; it's simply because their 'product' is strictly defined, whereas a startup may not even know what its product is, and that product may change over its lifetime.

> They also have fixed operating hours for many services, which helps them meet strict SLAs.

Ah, that is certainly an advantage for keeping systems up - scheduled downtime. Taking the system offline to perform upgrades/other maintenance is certainly something that does not get enough attention.

> We just wanted to acknowledge it in the post.

Sure. I'm taking umbrage at the phrase "on the other hand, some networks really are reliable" without solid evidence to back it up. Major financial firms have very little motivation to do the same sort of public disclosure and post-mortem that this article is really about, and furthermore the limited feature set and user base of their 'product' only serve to limit any exposure it could get.

Somewhere inside those major financial firms whose networks 'rarely' exhibit partition behavior is a sysadmin whose stories will never see the light of day, so we never get to see 'how the meat is made', so to speak.

When was the last time a Google search threw an exception at you, versus how much we hear about their software breaking or their equipment failing (in that pseudo-open fashion of theirs - who knows how many servers they actually have now)?


Merrill is buying top-of-the-line Cisco gear. Google is making its own switches and routers because of cost. Cisco is expensive because the mean time to failure is so high.

Google and Amazon own their datacenters, but they're playing the bargain-basement game. You won't see racks upon racks of Nexus 7k gear at Google, at least not to my knowledge (please correct me if I'm wrong).

You're right that live loads are different from test loads (entropy), but having a separate set of test servers still holds a lot of value.

It's just like Aphyr said: when your tolerances are higher but your aggregate cost of delivering those tolerances is lower, you can afford to over-provision. A lot of these failures come from blocked I/O or network capacity failures that just wouldn't happen in a production finance environment because of over-provisioning.
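To put rough numbers on what that headroom looks like (the figures here are invented, just to illustrate the over-provisioning argument):

    # Toy numbers, purely illustrative.
    link_capacity_gbps = 40
    peak_load_gbps = 6
    headroom = 1 - peak_load_gbps / link_capacity_gbps
    print(f"{headroom:.0%} headroom")   # 85% headroom -- congestion-driven drops become very unlikely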

I hope this sheds some light on your queries :).


Indeed. And consider the "black swan"[1] scenario... inductive logic and empirical observation are inherently flawed[2] as a means of determining "the truth", since the counter-example that disproves your theory can happen tomorrow. Or, in other words:

The network is reliable... until it isn't.

[1]: http://en.wikipedia.org/wiki/Black_swan_theory

[2]: http://en.wikipedia.org/wiki/Problem_of_induction


Based on my work supporting a variety of customers I would say that partitions are extremely common.

Don't focus on network partitions (or hardware-based partitions), because those are only one potential cause of a partition. Partitions are legion. Any system that implements automated failover has to account for them in a way that produces acceptable results for the use case.
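As a rough illustration of what 'accounting for them' can look like (a minimal sketch with made-up names, not any particular product's logic), an automated failover routine should at least confirm it can still reach a majority of the cluster before promoting itself, so a partitioned minority can't elect a second primary:

    # Minimal sketch of a quorum check gating automated failover.
    def should_promote(reachable_peers: int, cluster_size: int) -> bool:
        # Count ourselves, then require a strict majority of the cluster.
        return (reachable_peers + 1) > cluster_size // 2

    # On a 5-node cluster, a node that can only see one other peer must not promote:
    print(should_promote(reachable_peers=1, cluster_size=5))  # False
    print(should_promote(reachable_peers=2, cluster_size=5))  # True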


What I took away was what was already obvious: most outages are attributed to maintenance/changes. In a stable system, network events are exceedingly rare. It is the fact that we are constantly changing these networks that brings about instability.


Unless you have GC pauses, or run out of memory, or have flaky RAM, or bought shoddy NICs, or hit a kernel bug, or have a WAN link, or use EC2, or rely on a hosting provider or colo's network. ;-)
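And from the caller's side of the socket, most of those look exactly like a partition. A quick sketch (the timeout value is made up) of why:

    import socket

    # A peer stuck in a long GC pause, a box with a dead NIC, and a genuine
    # network partition all produce the same observable symptom here: a timeout.
    def probe(host: str, port: int, timeout_s: float = 1.0) -> str:
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return "reachable"
        except OSError:
            return "paused? crashed? partitioned? -- can't tell from here"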



