Google App Engine Team's Post-mortem for February 24 Outage (groups.google.com)
48 points by anurag on March 5, 2010 | 20 comments



Good to see Google being open about the problem. A couple of things:

- Doesn't each server have its own UPS? I remember some guy from Google showing a machine design with a battery pack attached to each machine. Why did a data center power outage crash 25% of the machines at the same time?

- Shouldn't the servers have been restarted and recovered automatically? The post said 25% of servers didn't get UPS power in time and crashed. That implies power was restored shortly afterward and the servers could simply have been restarted. They're just dumb servers that can be shut down, restarted, and recovered at any time. If enough of them had been restarted, BigTable/Datastore would have had enough data nodes to continue, and the downtime would have been a couple of minutes to tens of minutes.

- Lack of precise monitoring. The first clue of a problem was a drop in traffic and posts on the outside discussion group. A drop in traffic can be caused by so many things. Shouldn't there be health-check monitors on the BigTable and Datastore clusters? Something like: if more than 10% (25% in this case) of the nodes in a cluster are down, raise an alarm and page someone right away (see the sketch after this list).

- Was there capacity planning and stress testing to determine what percentage of capacity can serve traffic in each DC and each cluster? 25% of servers going down taking out the whole BigTable cluster sounds like too small a safety margin.
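
A minimal sketch of the kind of cluster health check I have in mind; get_cluster_nodes() and page_oncall() are made-up placeholders rather than real BigTable/Datastore APIs, and the 10% threshold is just the example above:

  # Hypothetical health-check monitor; the helpers passed in are assumed,
  # not actual Google internals.
  DOWN_FRACTION_ALERT = 0.10

  def check_cluster_health(cluster_name, get_cluster_nodes, page_oncall):
      nodes = get_cluster_nodes(cluster_name)        # [(hostname, is_healthy), ...]
      down = [host for host, healthy in nodes if not healthy]
      down_fraction = len(down) / float(len(nodes))
      if down_fraction >= DOWN_FRACTION_ALERT:
          page_oncall("%s: %d of %d nodes down (%.0f%%)"
                      % (cluster_name, len(down), len(nodes), down_fraction * 100))
      return down_fraction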

I don't envy the on-call staff. The pressure must be tremendous.


I seem to remember the on-board batteries only last a few minutes, long enough to fail over to another datacenter, supposedly.


A lot of companies have nice official apologies, but this one really stands out. Google is not just saying sorry; they are actually implementing serious changes, which probably represent millions of dollars of development, to help make sure this doesn't happen again.


This is the part that I find fascinating:

  7:48 AM - Internal monitoring graphs first begin to
  show that traffic has problems in our primary datacenter
  9:35 AM - An engineer with familiarity with the unplanned 
  failover procedure is reached
And the rest of the post reads as though there was only one engineer making decisions for the first two hours.


For a seriously detailed analysis of the challenges of provisioning App Engine, take a look at Ryan Barrett's video:

http://sites.google.com/site/io/under-the-covers-of-the-goog...


A strangely prosaic failure. A large portion of the entries there seem to be the on-call engineer simply trying to figure out what the failover procedure actually is. Once he finally gets a copy of it, it takes 16 minutes to get back up:

  9:53 AM - After engineering team consultation with the relevant
  engineers, now online, the correct unplanned failover procedure 
  operations document is confirmed, and is ready to be used by 
  the oncall engineer. The actual unplanned failover procedure for 
  reads and writes begins. 
  10:09 AM - The unplanned failover procedure completes, without 
  any problems. Traffic resumes serving normally, read and write. 
  App Engine is considered up at this time.


It's probably just a matter of configuring the routers to redirect traffic to the backup data center, where all the servers are already running on standby.

The only tricky part is to hold all writes on the backup and flush all pending updates from primary to backup. That's what they did: they put the backup in read-only mode and slowly turned read/write back on after a while.
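
Roughly something like this sketch; every object and method here is a hypothetical placeholder, not actual App Engine or Google infrastructure:

  # Hypothetical outline of the unplanned failover sequence described above.
  def unplanned_failover(primary, backup, router):
      backup.set_mode("read_only")               # serve reads from the backup right away
      router.redirect_traffic(to=backup)         # point incoming traffic at the backup DC
      pending = primary.drain_pending_updates()  # whatever replication hadn't shipped yet
      backup.apply_updates(pending)              # catch the backup up
      backup.set_mode("read_write")              # then allow writes again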

I don't understand why they lost so much data. 0.00002% is a lot considering the dataset is really large. Shouldn't there be replication from primary to backup? I assume Google is rich enough to afford multiple links and fat pipes.
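
For a sense of scale (the 1 PB total below is an assumed figure, not something from the post):

  # Back-of-the-envelope: what 0.00002% means against an assumed 1 PB dataset.
  total_bytes = 10 ** 15           # 1 PB, hypothetical
  lost_fraction = 0.00002 / 100.0  # 0.00002%
  print(total_bytes * lost_fraction / 1e6, "MB")   # -> 200.0 MB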


This is the most open I've seen Google be about this kind of operations issue. Very helpful of them, and I hope they continue to do it!


They do this every time there's a system or process failure that affects paying customers. They're also quite good about proactively crediting accounts when SLAs are breached. Not quite as good as Netflix, but better than most.


Here is an even more detailed and open post about an outage in July 2009. I remember finding the timeline of events fascinating.

http://groups.google.com/group/google-appengine/msg/ba95ded9...


- Implement a regular bi-monthly audit of our operations docs to ensure that all needed procedures are properly findable, and all out-of-date docs are properly marked "Deprecated."

Surely that leaves a two-month window in which weird things can happen?

How about:

- All features/changes that could affect the document set require a documentation update, a documentation review and training of all relevant staff before deployment.

That should ensure the document set is consistent and the staff is aware of the changes. My thinking in general is to replace periodic reviews with processes that ensure the reviews aren't necessary.

Any improvements/suggestions/reasons why it wouldn't be better?


Sounds like a great way to drown the team in process. I don't mean to sound snarky, but you often need to find the right balance between process and making sure the team can write code and ship features without having to spin up a ton of paperwork and training.


You don't have to use the same people to develop, document and train. Most of the time it's a very bad idea.

Also, each change in the dev environment doesn't kick off a bunch of admin. However, each change in a live environment with hundreds of thousands of customers, who in turn have businesses with collectively millions of customers, should be as close to perfect as you can get. You just can't get that if you only document something up to two months later.


Processes and rules are in place because of these kinds of painful experiences. No one likes process. But no one likes to be under the gun fixing a downed production server either.


Surely that leaves a two-month window in which weird things can happen?

They could mean twice a month.


All features/changes that could affect the document set require a documentation update, a documentation review and training of all relevant staff before deployment.

They are talking about an audit that verifies that what you describe does in fact happen. Restating that certain things should happen is not going to help, since people are usually not aware of what they are not doing. To improve, you need to point out where they fall short, and to be able to point that out, you need audits. Procedures are there to ensure nothing goes wrong; audits are there to ensure procedures are followed.


I've been working on a standard guideline for postmortem communication, and ran this post against that template: http://www.transparentuptime.com/2010/03/google-app-engine-d...


Power is the great uptime equalizer. Show me the most elegant HA design you can dream up, and chances are I'll show you a system that's one unexpected power failure away from going down.


Indeed, I can remember us all feeling quite nice and smug about our racks having dual A and B feeds from separate mega UPSs, generators, etc.

We didn't feel quite so smug when we found out that some of the racks actually had both sides plugged into the A feed, which was probably down to confusion between us, the data center owner, and the electrician installing the feeds.


Re the questions about replication across datacenters, see:

http://code.google.com/events/io/sessions/TransactionsAcross...

It discusses both App Engine's approach and the underlying factors and tradeoffs that apply to any similar system.

For the executive summary, see slide 33 from:

http://snarfed.org/space/transactions_across_datacenters_io....



