How SmugMug survived the Amazonpocalypse

josephruscio · on April 25, 2011

I'm finding a lot of these articles about "surviving" the outage fairly frustrating. They generally boil down to a combination of the following:

1. "We use a multi-AZ strategy!" - This outage affected multiple AZ's concurrently. If you did not see downtime, this means you were fortunate to have at least one unaffected AZ. This is pure luck however, many sites with the same level of preparation had significant downtime. (Note: A multi-AZ strategy is sage and would have minimized your downtime, but does not warrant a survival claim in this case.)

2. "We aren't using EBS!" - Not a single article I've seen has claimed that they weren't using EBS because they feared a multi-day/multi-AZ outage. They weren't using it because it lacks predictable I/O performance in comparison to S3. You can't retroactively claim wisdom in the category of availability for this choice.

3. "We don't host component <X> on AWS!" - Taking this argument to it's logical end, any service that doesn't host on AWS could write one of these articles e.g. "We host on Rackspace so we didn't go down!"

In short, if you don't have a completely multi-region strategy (including your relational data-store) implemented purely on AWS, your blog post is decreasing the signal-to-noise ratio on this issue.

hopeless · on April 25, 2011

Best quote: "Start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you"

It sounds stupid but if you really do have a resilient and redunant infrastructure it shouldn't matter. If you fear someone randomly unplugging things then you have work to do ;-)

sfrench · on April 25, 2011

Netflix (who also survived the outage) wrote a blogpost last year in which they talked about a system they call "Chaos Monkey" which does this exact thing.

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...

joegester · on April 25, 2011

I don't think they meant that they were doing that on live systems. Sounds like a debugging tool.

boredguy8 · on April 25, 2011

If the point is "test your supposed redundancy," I agree 100%. But please, for your customers' sake, don't do it in the middle of the day, especially if it's your first test.

SoftwareMaven · on April 25, 2011

To be clear, he did say start practicing during planned maintenance times, but I agree: your first try on a live system shouldn't be during peak times.

cagenut · on April 25, 2011

I think smugmug's cloud/colo hybrid is more likely to become the norm than the all-cloud dream of not having to deal with hardware anymore. When it comes to the "undifferentiated heavy lifting" aws wins. s3 for bulk storage, ec2 for asynchronous computing, cdn's for edge/delivery. But when it comes to your core data (meta data? the 64bit picture_id as opposed to the 2megabyte jpg) you just cannot beat raid10 ssd type colo'd setups right now.

Essentially I think we're going to be in an 80/20-ish cloud/colo sweet spot situation for years to come.

lordmatty · on April 25, 2011

Well done Smugmug.

Perhaps you should diversify into cardiac monitoring!

Joakal · on April 25, 2011

A little bit of information on their configuration in the past; http://don.blogs.smugmug.com/2008/06/03/skynet-lives-aka-ec2...

joevandyk · on April 25, 2011

One thing I'm curious about -- if you want to spread your instances over multiple zones and you are using postgresql, how do writes work? Won't latency to the master be slow if instances in one zone are trying to write to the master located in the other zone?

whakojacko · on April 25, 2011

Latency between different availability zones in the same region is generally pretty good. "Over the last month, the median is 2.09 ms, 90th percentile is 20ms, 99.th percentile is 47ms. This is based on over 250,000 pings -- one every 10 seconds over the last 30 days." from http://www.quora.com/What-are-typical-ping-times-between-dif...

Certainly 10% being 20ms or more is a little troubling, but if this is only for writes (ie reads come from a slave in the same AZ) you are probably ok.

rgrieselhuber · on April 25, 2011

Great article. It would be nice to know what they are doing for whatever database they are using (mysql, etc.) because they are not using RDS / EBS.

onethumb · on April 25, 2011

Yes, I'm seriously overdue on a blog entry about current state-of-the-art for DBs at SmugMug. :( You can watch my keynote from the MySQL conference two years ago to see what we used to do, but things have progressed since then. http://don.blogs.smugmug.com/2010/04/15/my-mysql-keynote-sli...

teoruiz · on April 25, 2011

That would be awesome. +1 on my side.

snewman · on April 25, 2011

While the article doesn't answer this, it does explain how they manage to run a database on AWS without depending on EBS: their database isn't hosted by Amazon at all. "...the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance." So, while the actual solution they're using might be of interest, it wouldn't say much about how to host a database in AWS without relying on EBS.

dajobe · on April 25, 2011

It also doesn't explain how if they don't trust EBS they are going to replace their MySQL on dedicated servers. Sounds like they are considering the Cassandra route. Which gained a query language last week.

petedoyle · on April 25, 2011

Very interesting having the DBs hosted in another datacenter. I've always assumed it'd add too much latency, but it looks like that's not the case.

Here's a traceroute between an EC2 instance in us-east-1a and rackspace.com (which resolved to one of their VA datacenters): http://pastebin.com/RF5VrTic

Sub 2ms. It also looks like the us-east-1a is peered directly with whichever rackspace datacenter served the request.

OstiaAntica · on April 25, 2011

The AWS issues in the NoVa center are continuing today, our RDS is still not fully accessible.

Terretta · on April 25, 2011

I've been interested to see AWS status page misrepresent this through its icons while annotating the continuing issues.

1) Long before EBS API was returned, AWS adjusted the "Amazon Elastic Compute Cloud (N. Virginia)" status[1] for 24 April to show operational (green). This has since been corrected in the "Amazon EC2 (N. Virginia)" Status History.

2) Their own RDS service, which is instances backed by EBS, remained unavailable for its users, proving that #1 was false. If they couldn't operate a service (RDS) built on themselves (EC2) normally, the underlying service (EC2) should not have been considered Operational in the status page.

3) At present, the icon for "Amazon Elastic Compute Cloud (N. Virginia)" is Green for "Service is operating normally" instead of Yellow for "Performance issues", though the text description is not "Service is operating normally." but "Instance connectivity, latency and error rates."

4) It seems from anecdotal observation they're using the status page at least as "median status", or perhaps closer to "20th percentile status", meaning >80% of something can be down before it toggles to "Service Disruption".

[1] http://status.aws.amazon.com/

idonthack · on April 25, 2011

The tl;dr:

>we don’t use Elastic Block Storage (EBS), which is the main component that failed last week.

pilif · on April 25, 2011

This gets even more weight when you consider that EBS was broken across multiple availability zones, which means that, had they used EBS, their first point would be invalidated.

mikeryan · on April 25, 2011

More importantly smugmug was smart enough, when moving to the cloud to realize which components were the most failure prone and to stay away from those.

Not using EBS wasn't luck it was a conscious decision.

SoftwareMaven · on April 25, 2011

They didn't not choose it because of concerns about availability, they didn't choose it because of run-time performance concerns. I don't think you can argue that those concerns even imply anything about availability, much less have some kind of causal relationship.

SmugMug got lucky in their choice. If performance had been consistent with EBS, they would have used it and most likely gone down like so many others.

onethumb · on April 25, 2011

Not true. Our primary decision was based on unpredictable latency, but the fact that we didn't/don't trust EBS played a huge role. EBS mucks up our basic availability scenario - systems are no longer individual, disposable, replaceable units. I'm sorry if that wasn't clear from the blog post - I'll go re-read that part and update.