20+ hour outage due to EC2/EBS on BitBucket (bitbucket.org)
40 points by zx on Oct 3, 2009 | 47 comments



Unfortunately this doesn't sound like an EBS issue but a systems architecture flaw. I apologize in advance if my analysis is wrong here, but I think it's important to understand.

We use EBS extensively on our infrastructure. Occasionally an EBS volume will fail. We ran into an issue where a volume had a spike in IO load, which I think was the case here.

EBS volumes are not magic, they are just chunks of physical disks. Disks fail. You should have the system architected so that you can handle such failures.

Here are a few things you can do (we don't do all of these, but enough to ensure we won't lose data or have downtime in the case of a failure):

- Mount several EBS volumes, use RAID (see the sketch after this list).

- If it's a database, set up a failover node on a separate EBS volume.

- Take regular snapshot backups.

- Take regular full backups to S3.
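
For illustration only, here's roughly what assembling attached EBS devices into a software RAID array could look like, driven from Python (the device names, RAID level, and mount point are assumptions, not a recommendation for any particular setup):

    import subprocess

    # Assumes four fresh, empty EBS volumes are already attached as
    # /dev/sdf../dev/sdi and that this runs as root on the instance.
    subprocess.check_call(['mdadm', '--create', '/dev/md0',
                           '--level=10', '--raid-devices=4',
                           '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi'])

    # Put a filesystem on the array and mount it.
    subprocess.check_call(['mkfs.xfs', '/dev/md0'])
    subprocess.check_call(['mount', '/dev/md0', '/data'])

RAID 10 across EBS volumes mainly buys you throughput and survival of a single bad volume; it's not a substitute for the snapshots and S3 backups above.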

It's also very important to have everything highly automated. If an EBS volume fails for us, it's one command to switch to the failover node. If that doesn't work, it's another command to spin up a new machine off the last available snapshot, with a few hours of data loss (worst-case scenario). Everything is highly monitored with Nagios + Ganglia, so we know when bad stuff happens.
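
To make that concrete, here's a minimal sketch of what a "spin up from the last snapshot" command might look like with the boto Python library; the snapshot ID, instance ID, zone, and device below are placeholders, not anything from an actual setup:

    import time
    import boto

    SNAPSHOT_ID = 'snap-00000000'  # placeholder: last known-good snapshot
    INSTANCE_ID = 'i-00000000'     # placeholder: the standby instance
    ZONE = 'us-east-1a'            # must match the instance's availability zone
    SIZE_GB = 100

    conn = boto.connect_ec2()  # picks up AWS credentials from the environment

    # Create a fresh volume from the snapshot...
    vol = conn.create_volume(SIZE_GB, ZONE, snapshot=SNAPSHOT_ID)

    # ...wait until it's usable...
    while vol.update() != 'available':
        time.sleep(5)

    # ...and attach it to the standby instance as /dev/sdf.
    conn.attach_volume(vol.id, INSTANCE_ID, '/dev/sdf')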

The two or three times we've had issues with EBS, we were able to either switch to a failover node or take a snapshot and mount a new volume from there. I haven't set up RAID on EC2, but I'd imagine this is also a very good route to protect your data.

Remember, the cloud isn't magic. The only advantage you get with the cloud is rapid provisioning and unlimited capacity if you need it. You still have to build a shared-nothing, reliable architecture within the framework the cloud gives you. We've found EC2 and EBS to work out very well, but of course there were growing pains as you learn very quickly where your single points of failure are! I get the sense that the overall reliability of resources such as instances or volumes is definitely lower than what you'd expect from a standard hosting provider, whatever the reason may be.

Edit: Of course, you could also load your data into a distributed data store like Cassandra, which handles some of this failover and replication magic automatically.


You're absolutely right that the cloud is not magic, but you do get some guarantees with EBS. From their website:

"Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component."

We don't keep the database on the same EBS volume, and we have segmented database traffic out to several EBS volumes (for WAL, etc.). That's not the issue.

We take regular snapshot backups. We didn't lose any data. We have everything, we just can't get to it.

Regardless of what might make sense in this situation, it's not working for us. We've moved both our instances and the volumes to different availability zones, to no avail.

I just received a call from AWS engineering, assuring us that we are currently their top priority and that a team of engineers is working to fix the problem. They're seeing the issue on their end, and fortunately for them, it seems rather isolated to our instance.

Could we have taken precautions to prevent this problem? Maybe. We hadn't, because we didn't anticipate a problem as exotic as this one. The only way to keep persistent data on EC2 is using EBS, and right now, it doesn't work for us, at all. This is not a common problem that could've been solved with backups or snapshots, or whatever.


>The only way to keep persistent data on EC2 is using EBS, and right now, it doesn't work for us, at all.

S3 should work too? Unless it was a global EBS failure, you should be able to restore from any backup to a new set of instances and stores; why doesn't that work?


...As data you can access as a filesystem. S3 is great, but pretending it's a filesystem is going to get you awful performance.

As I said, our data is not lost, we have snapshots and backups, it's sitting right there on the mount, we're just not getting any sort of acceptable throughput. New instances do not fix the problem.

Ironically, we were looking into having S3 as the backend for our data, for scalability/redundancy purposes, but this pretty much puts a stop to that.


Oh, I wasn't suggesting pretending it's a file system; I had been thinking of a place to dump the data for backups, thinking fresh instances + fresh EBS would solve the problem. I think you answered this already in the other post: you booted a new instance and a new EBS volume from a backup and the problem remained? This seems like such a horrendous failure on AWS's part, unless it has something to do with how you are accessing the EBS (too many connections or something). I could understand if a given EBS volume fails, but if you can restore the data from an independent backup, spin back up with new instances and a new EBS volume, and still hit the same problem, that indicates a very concerning systemic problem in EBS!


(I don't know why I can't reply to the post below, so I'll reply to myself):

Yes, we did try this, and it produced the same problem.


When we had the problem, we fixed it by snapshotting the screwed-up volume and creating a new volume from that snap. Did you guys try this?
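
In case it helps anyone, that snapshot-and-recreate dance is roughly this with boto (the volume ID and zone are placeholders):

    import time
    import boto

    BAD_VOLUME_ID = 'vol-00000000'  # placeholder: the misbehaving volume
    ZONE = 'us-east-1a'

    conn = boto.connect_ec2()

    # Snapshot the degraded volume first (snapshots are stored in S3).
    snap = conn.create_snapshot(BAD_VOLUME_ID, 'copy of degraded volume')
    while snap.update() != 'completed':
        time.sleep(10)

    # Then build a replacement volume from that snapshot.
    new_vol = conn.create_volume(snap.volume_size, ZONE, snapshot=snap.id)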


I'm here to answer questions if there are any (I run Bitbucket.)


It was fixed around 4am (GMT+2) last night, with the assistance of Amazon. I'm just going to summarize what happened here:

We were attacked. Massive UDP DDoS. The flood of traffic prevented us from accessing our EBS store at any acceptable speed, which is what caused everyone to think the problem was between our EC2 instances and the EBS volume. Of course, this also explains why booting up a new instance and EBS volume didn't help anything.

Also, it's happening again now, and we're working with Amazon to remedy it once more.


Is there anything Amazon could have done to prevent this (or at least made diagnosing it easier), or is it a problem with your particular application?


We're talking a UDP flood here, saturating our bandwidth. It never reached our servers; it just ate all the bandwidth on our connection. I guess what Amazon could have done is been quicker in spotting the DDoS and taken measures to prevent it.


So you never saw any evidence of this DDoS yourself? I'm somewhat skeptical of this explanation. It seems to me that with shared infrastructure it'd be difficult to saturate just one customer's connection. It also doesn't make sense to me that this could be done without the traffic ever reaching your server. You used the phrases "our bandwidth" and "our connection"; do things really work this way on the AWS cloud?

Anyway, I'm really sorry you guys had to go through all of this, and I hope whatever it is that caused it is fixed.


So it was actually entirely unrelated to EBS? The reason it was taking 10 seconds to do an "ls" was simply a saturated connection to your server, not too much EBS activity?


What do you think of the other post criticizing your architecture? I am doing my own architecture work on EC2, so I am trying to understand exactly what caused this failure. Is this a problem with your instance being able to access any EBS volume? Why couldn't you spin up another instance with a fresh EBS volume from a backup and redirect DNS to that instance?


When this sort of thing happens, it's very easy to point out all the things you should've done differently. In retrospect, everything's easier.

You can't anticipate everything, and as I've pointed out in another comment here, this one is rather exotic.

Quick summary of what the problem is: we have an EBS volume. It mounts fine, appears fine. The problem is that it's excruciatingly slow. We can't serve data from the volume at any speed, really. Running an "ls" takes over a minute, even in a small directory.

All systems are running, everything should be fine, but seeing as we can't read the data fast enough, we've been forced to put a static page explaining what's going on.

Booting a new instance, re-creating the volume from a recent snapshot, doesn't help. The exact same problem persists. Why? We don't know. Amazon's figuring it out.

We're doing everything we can to remedy the problem, but unfortunately right now, that consists of our team drinking coffee to not fall asleep, waiting for the final call from Amazon telling us they've sorted it out.


OK, that answers my other question. I think the fundamental issue (aside from the Amazon issue) is that you had bytes living on a single EBS disk that weren't replicated to another disk. For important data, this is probably a bad idea regardless of backup strategy, etc.

Edit: By the way, the point here isn't to say "you guys screwed up" but to underscore that these types of issues aren't 100% Amazon's fault either; both parties had issues, and the tone of your post seems to be "EC2 and EBS are not reliable, we are switching off of it" when the truth lies somewhere in the middle.


I don't think that this was a replication issue, based on this comment:

> Booting a new instance, re-creating the volume from a recent snapshot, doesn't help. The exact same problem persists. Why? We don't know. Amazon's figuring it out.

If you can recreate the volume from a snapshot and hit it with fresh instances and run into the same problem, this is quite worrying. If it had resolved after a restore from backup, I would have felt better about EBS. As I see it, there are only these options:

1) There is a general, systemic failure in EBS. You ran into it and highlighted it to AWS and they are fixing some problem. If other people are not having the same problem as you, I would be more inclined to think of #2.

2) Some usage pattern violates an assumption that was made when EBS was designed and screws it up. Restoring from the backup reproduces the usage pattern. This could be simultaneous connections or the number of distinct files in the volume, for example. One way to test this would be to split the data in the drive into a larger number of smaller EBSes (EBSii? whatever the plural is :) or throttle the simultaneous connections and see what happens.

Did I miss anything?


Yes, the actual cause: a problem on the wire between them and Amazon, a UDP flood eating up the available bandwidth.

I didn't guess it either :-)


So, do you think setting a smaller max size on your EBS volumes would have avoided this by spreading the traffic? Say, if you were using 1TB, using ten 100GB volumes instead and federating queries across them?


I apologize if that's the tone I'm conveying. I guess I'm just frustrated due to the time it's taking to fix it.


Yea, it sounds like you guys are pretty much at Amazon's mercy right now, which sucks. I think a better way to look at these types of things is "what could we do to prevent this from happening again" and make a post about EBS gotchas.

Since it sounds like you guys are talking about straight-up file storage, I'd guess a good option would be to set up an HDFS cluster and be smart about locality/replication to minimize the latency.

Edit: Oh, and one more thing: take a backup that doesn't involve EBS snapshots. Maybe biweekly dump the entire sucker to S3 or something, or have it getting pushed there all the time. This is something we've been meaning to do, since the snapshotting capabilities of EBS are still a bit too magical for me to sleep well at night. (To be fair though, they've worked great when we've needed them to.)
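
For what it's worth, a dumb periodic dump to S3 via boto could look something like this (the bucket name, key, and tarball path are made up; the tarball itself would come from whatever backup job you already run):

    import boto
    from boto.s3.key import Key

    BUCKET_NAME = 'example-repo-backups'    # placeholder bucket name
    DUMP_PATH = '/tmp/repos-backup.tar.gz'  # placeholder: output of your backup job

    s3 = boto.connect_s3()
    bucket = s3.create_bucket(BUCKET_NAME)  # returns the bucket if you already own it

    # Upload the dump; a dated key would let you keep a history of backups.
    k = Key(bucket)
    k.key = 'backups/repos-latest.tar.gz'
    k.set_contents_from_filename(DUMP_PATH)

Cron that biweekly and you have a restore path that doesn't depend on EBS snapshots at all.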


What are you thinking in terms of movement / failover at this point?


We've been contacted by several hosting providers, and right now, Rackspace seems pretty nice.

The problem isn't that we don't have failover here, it's that we store all repositories on a single EBS volume. This has worked great for us in the past, but as of last night, that volume has become virtually unavailable to us. It doesn't matter which instance we mount it on, the throughput we get from it is excruciating.

If, or at this point when, we move, the disk architecture will look different, and general failover will be less of an issue.

Amazon has for the past 8-10 hours been investigating the issue, and we're left pretty dumbfounded as to what has happened exactly. I'll summarize everything in a blog post once the chaos is over.


Site seems to be back up, any better idea what happened?


Ugh. I am working on the next-gen architecture for our site, and I wanted to focus on hosting it in a cloud, but all these outages give me no confidence that cloud-based hosting is really all that ready for primetime yet.


IMHO, AWS Windows instances may not be ready for prime time. This issue seems not to affect Linux-based instances.

There are other cloud providers like the YC favorite, Slicehost; I have no direct experience with them but may soon try them out.


Slicehost is not really a "cloud provider". You buy a VPS and use it "forever".


Well, it's both. Rackspace Cloud Servers (basically, Slicehost post-acquisition) has an EC2-like API to spin up and down servers on demand, billed by the hour.


Everything fails. Design systems that minimize the impact failures have on your customers. Moving to another provider isn't going to fix the problem. Data like this should be stored in more than one place.


In reality, replicating across two clouds is difficult and expensive. It's quite possible that Bitbucket wouldn't exist at all if they had used such an architecture.


I'm looking at Cassandra as a possible solution for things like this.


Seems like a popular theme right now: one of the GitHub guys is working on a Cassandra Git backend at http://github.com/schacon/agitmemnon, and Paul Querna of ASF infrastructure is looking at doing the same for svn.


That's an overgeneralization. And difficult and expensive relative to what? When evaluating the cost of solutions to problems like this, the basic idea is to compare the cost of the solution against the cost of the problem (if realized), with the likelihood of the problem being realized factored in.

It's quite possible Bitbucket will cease to exist if they continue to have problems like this. I'm sure their customers are not happy. It does sound like they have plans to fix this once they get back up and running.


Despite this, Bitbucket is great. Private repositories with a free account option; you don't get that on GitHub.


OTOH, since Github has some of my money, they have a bit more of an obligation (and incentive) to keep their servers up. Free services come and go as the owner pleases. (Hello, ma.gnol.ia.)


Not really; Bitbucket has gotten quite a bit of my money.

And yes, I am considering changing that, but mostly because I can't find a good Mercurial client for Windows.


From what I can piece together, it seems the real problem isn't EBS; it's that the security groups are implemented at the host level (the machine on which your instance runs). This means that the UDP flood reached your host, where it got dropped due to the security group rules, but it still had a performance impact (on EBS, in your case) just because of the sheer volume of packets. The trouble was that nobody could see these packets and diagnose the problem correctly. If you had temporarily allowed all traffic into your security group and done a tcpdump, you'd have gone "whoa!" and headed in the correct direction to fix the problem. Interesting...
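
If anyone wants to try this next time, the "open the group and look" step could be sketched roughly like this with boto plus tcpdump (the group name is a guess; run it as root and revoke the rule as soon as you've looked):

    import subprocess
    import boto

    GROUP = 'default'  # placeholder security group name

    conn = boto.connect_ec2()

    # Temporarily allow all UDP in, so the flood actually reaches the instance
    # where you can observe it.
    conn.authorize_security_group(group_name=GROUP, ip_protocol='udp',
                                  from_port=0, to_port=65535,
                                  cidr_ip='0.0.0.0/0')

    # Grab a few thousand packets; a flood shows up as a wall of UDP traffic.
    subprocess.call(['tcpdump', '-n', '-c', '5000', 'udp'])

    # Close the hole again.
    conn.revoke_security_group(group_name=GROUP, ip_protocol='udp',
                               from_port=0, to_port=65535,
                               cidr_ip='0.0.0.0/0')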


Dear HN readers: what, in your view, would be one way of architecting the storage so as to avoid similar AWS EBS outages in the future?


Money. You buy a second set of instances in a different availability zone and fail over to them in case of problems. You buy a second datacenter at a different ISP, keep it in sync, and fail over to it when your primary fails.
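
A rough sketch of the first half of that with boto, just to show how little code the mechanics take (the AMI, key pair, and Elastic IP are placeholders; the hard part is keeping the standby's data in sync, which this does nothing about):

    import time
    import boto

    AMI_ID = 'ami-00000000'       # placeholder: an AMI baked with your app
    KEY_NAME = 'prod-keypair'     # placeholder key pair name
    ELASTIC_IP = '198.51.100.10'  # placeholder: an already-allocated Elastic IP

    conn = boto.connect_ec2()

    # Bring up a replacement in a different availability zone...
    res = conn.run_instances(AMI_ID, instance_type='m1.large',
                             key_name=KEY_NAME, placement='us-east-1b')
    instance = res.instances[0]
    while instance.update() != 'running':
        time.sleep(5)

    # ...then repoint the public IP at it.
    conn.associate_address(instance.id, ELASTIC_IP)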

Eventually you architect your application to distribute load over multiple facilities and to become resilient against component failure.

Until then: you do nothing, grab a beer, relax, and wait as the Amazon guys sweat their asses off to fix it. You pat yourself on the back because it is not your ass in the trenches right now.

On top of that, you have the perfect excuse for the follow-up blog post.


An unfortunate event; however, it reiterates to system admins why the cloud should only be used as a low tier of storage, for now.


What you are saying makes no sense.

Everything fails occasionally. Amazon probably has a team of highly specialized engineers on the task right now, working under the pressure of a few dozen disgruntled customers and under the eyes of the worldwide press.

Could your company respond with equal intensity if this was your own hardware? Will your SAN supplier whip their staff on-site as fast as they will for Amazon?


"Everything fails occassionally. Amazon probably ..."

I wish I could use that line to explain to my company why our business has come to a complete halt.

Granted, your business is completely web-based and in the cloud, which is why I specifically mentioned to system admins that the cloud is not reliable enough to be a high-level tier of storage yet. Why doesn't that make sense? I wasn't trying to offend you or your decisions.

Also, yes, my company and my SAN supplier would have staff on-site. But we have control over our own hardware, so there's really no comparison.


> I wish I could use that line to explain to my company why our business has come to a complete halt.

It's a completely valid and reasonable business decision.

For most companies, the risk of Amazon downtime is simply not a deciding factor when held against what it would cost to maintain their own datacenter with remotely similar properties.

> a high-level tier of storage yet. Why doesn't that make sense?

I guess I didn't get what you mean by "high-level tier storage". Most companies have at most two tiers: live and snapshot backup. If you're a bank or Fortune XXX with truly multi-tiered storage, then yes, your in-house staff might be able to do it better. But it will probably cost quite a bit more than EC2, and the business case for that is IMHO the exception rather than the rule.

> But we have control over our own hardware, so there's really no comparison.

Well, I think you overestimate your capabilities there (unless you are a Fortune 500 company). Amazon doesn't face downtime over disk or server failures, and neither would you. The real question is who can debug and resolve complicated failure modes faster (you know, nasty stuff, heisenbugs).

Not meaning to offend you either, but my money would be on Amazon. That's why I questioned your broad statement of "not a high-level tier of storage". How much higher level than backed by a 50,000-server operation can it get?


I completely agree with your points. Again, I was speaking to people who manage their own SANs and may be looking to use the cloud as an additional tier of storage, with the same reliability as a local array. Reliable in the sense that they would never have to worry about Internet latency, network nodes going down, or anything else that they have absolutely no knowledge of or control over that could potentially affect the performance or operation of an application, ultimately preventing them from meeting business requirements.

If there are no business requirements to meet, I have no arguments; the cloud is where I'm at!


Even Google is down from time to time.

Basically, Bitbucket is having some downtime because of Amazon. But if it wasn't Amazon, it could be something else -- failing hardware, earthquakes, disgruntled datacenter employees, whatever. At least the Bitbucket folks can theoretically sit back and relax while Someone Else fixes this (rare) hardware problem.


To be fair, cloud storage is a huge step up in terms of reliability for small to medium-sized startups, where the capital necessary to roll out their own hardware and the people necessary to maintain that hardware are not within their budget. A lot of startups resort to very risky setups because they can't afford increased reliability, and Amazon and Rackspace seem to solve that problem. For a larger company, it makes more sense to spend capital on reliability, because they can afford it and they want to hold the ax when something bad occurs.


Did you guys try mounting a new, empty volume on a new instance, to see if it was all of EBS or just that one EBS volume? Have you thought of not using EBS and using S3 as the data store?



