Netflix: Post-mortem of 22 Oct AWS degradation

dusing · on Oct 29, 2012

"On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors."

Our sys admins couldn't get to reddit.

jedberg · on Oct 29, 2012

Actually, that is completely true. I was trying to post a link to my Airbnb talk, noticed reddit was down, and then noticed Airbnb was down too.

One thing we might start doing is actually having alarms when two or more major AWS sites go down.
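For what it's worth, a rough sketch of that kind of alarm in Python. The function names, the list of sites, and the threshold of two are all made up for illustration; this is not how Netflix actually monitors anything.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with anything other than a 5xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; only 5xx counts as "down" here.
        return err.code < 500
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return False


def check_sites(
    sites: Iterable[str],
    probe: Callable[[str], bool] = is_up,
    threshold: int = 2,  # assumption: "two or more" sites down trips the alarm
) -> Tuple[bool, List[str]]:
    """Probe each site and report (alarm, list_of_down_sites)."""
    down = [site for site in sites if not probe(site)]
    return len(down) >= threshold, down
```

You would call `check_sites` on a schedule with a handful of well-known AWS-hosted URLs and page someone when the first element of the result is true. Passing a fake `probe` makes it easy to test without touching the network.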

samstave · on Oct 29, 2012

Where is the link to said Airbnb talk?

jedberg · on Oct 29, 2012

https://www.airbnb.com/techtalks

azylman · on Oct 29, 2012

This is completely unrelated to the post, but I was at the Airbnb tech talk and it was extremely interesting - thanks for putting that on!

snprbob86 · on Oct 29, 2012

> If you like thinking about high availability and how to build more resilient systems, we have many openings throughout the company

Money can't buy recruiting opportunities like these. This is exemplary engineering and marketing.

ericcholis · on Oct 29, 2012

I enjoy reading these Sysops articles from Netflix. They provide a pretty good blueprint for working inside the cloud.

3amOpsGuy · on Oct 29, 2012

They are good. I'd enjoy reading more from their ops guys - likely others wouldn't though :-( a significant portion of ops concerns would probably appear unsexy to non ops people, but horses for courses, whatever floats your boat and all that. I love it.

waven · on Oct 29, 2012

3amOpsGuy, is there any chance I could email/contact you somehow regarding some ops advice? thanks!

3amOpsGuy · on Oct 29, 2012

3amopsguy@gmail.com

confluence · on Oct 30, 2012

I saw a comment a few weeks ago where one fellow HNer ran his entire startup on AWS spot instance pricing - so that he was forced to program in a state of continuous chaos as his demand/spot instances popped randomly and continuously into and out of existence while his service was running. It's like programming on quicksand.

This is probably a step too far - but maybe it is a natural extension of NFLX's Chaos Monkey.

If you want ~100% uptime with no QoS degradation, your system must be constantly under catastrophic attack.

This is probably a major reason why vol based risk models in finance are completely pointless.

Value at risk of any investable securities (including cash) is 100% all the time - any other number is bullshit. All volatility based risk models are useful if you like watching squiggly lines or pricing options, but they are essentially a random anchor that helps us sleep at night (anchoring bias).

signifiers · on Oct 29, 2012

Key line of the article: “Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.”

As an architecture decision, the choice to avoid EBS is hotly debated, though many high-profile systems besides Netflix (SimpleGeo, Sprint.ly) have moved almost exclusively to EC2 instance-backed (local disk) VMs and, as a result, avoided the pain of the last 3 major AWS outages.

EzGraphs · on Oct 29, 2012

Netflix has spent a lot of time and energy devising solutions that minimize disruptions. Are any of you Cloud-Savvy folks using Netflix's tools on Amazon? If so, are you using them off-the-shelf or did you need to customize them to suit your site? In particular:

Asgard is cited in the post as making a "zone evacuation" relatively straightforward.

Astyanax (their Cassandra client) is designed with "smarts" that allow it to choose from available nodes should one or more be unavailable.

These (and several other tools) are available from Netflix at Github:

https://github.com/Netflix
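To make the "choose from available nodes" idea concrete, here is a minimal Python sketch of a client that skips nodes it has recently seen fail. This is not Astyanax's API or its actual algorithm; `FailoverClient`, `NoHostAvailable`, the `send` callable, and the 30-second retry window are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Sequence


class NoHostAvailable(Exception):
    """Raised when every known host is marked down or failed just now."""


class FailoverClient:
    def __init__(
        self,
        hosts: Sequence[str],
        send: Callable[[str, Any], Any],  # hypothetical transport: (host, request) -> response
        retry_after: float = 30.0,        # assumption: seconds to avoid a failed host
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._hosts = list(hosts)
        self._send = send
        self._retry_after = retry_after
        self._clock = clock
        self._down_until: Dict[str, float] = {}

    def execute(self, request: Any) -> Any:
        """Try hosts in order, skipping any still inside their retry window."""
        now = self._clock()
        for host in self._hosts:
            if self._down_until.get(host, 0.0) > now:
                continue  # recently failed; do not try it again yet
            try:
                return self._send(host, request)
            except ConnectionError:
                # Mark the host down and fall through to the next one.
                self._down_until[host] = now + self._retry_after
        raise NoHostAvailable("no reachable host for request")
```

A real client would also balance load across healthy nodes and be token-aware; this only shows the failover part.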

shuw · on Oct 29, 2012

If this practice were widely adopted, I wonder if AWS would experience a bank run/DoS in the affected and neighboring zones.

jedberg · on Oct 29, 2012

It would certainly make reservations a lot more important.

Terretta · on Oct 30, 2012

You don't have to wonder. Every single AZ AWS postmortem says that's exactly what happens.

smoyer · on Oct 29, 2012

My impression is that the Netflix team understands AWS better than Amazon does ... but certainly better than most other AWS customers. Kudos.

ams6110 · on Oct 30, 2012

OTOH, most AWS customers are not Netflix and couldn't afford the sort of high availability architecture Netflix has.

smoyer · on Oct 30, 2012

Not "OTOH" ... what you've said is the truth. Some of the methods Netflix uses still apply but others are indeed financially non-viable. I'd love to have a Simian Army of my own and I think that would translate to any PaaS provider.

crb · on Oct 29, 2012

One of the more interesting findings from the most recent AWS outages is that Elastic Load Balancing (ELB), the best-practice way to handle multi-AZ deployment, has a dependency on EBS. I know from various talks that Netflix try not to use EBS these days, but I wonder if you had any ELB problems, and how you might have coped if you had evacuated a zone but your LBs were out?