I get the impression that one of the biggest issues was missed. They did not test the standard load against the secondary server; they assigned a machine with lower specs to the task, and there's nothing in the future actions that indicates they'll change it... Even if they go for the new and shiny, they can end up in the same situation the next time their master fails.
I hope they just left that out of the blog post, rather than actually not fixing it first.
I have no opinions on MongoDB, but it really seems like this particular problem arose because they skimped on disaster recovery, i.e. their failover hardware was less powerful than their production hardware. The root cause of their downtime was inadequate planning.
That's like paying for car insurance and only realizing after you get into an accident that the policy covers almost nothing; the money you spent on it was wasted. They paid for the secondary failover hardware, but it was effectively useless, since they were down for two days. The only thing it may have mitigated was how long they were down, but the primary objective of that hardware, i.e. keeping them up in case of a disaster, was a complete failure.
I've worked at a company that was completely down for a day, worldwide, due to a "disaster", even though we had spent millions on diesel generators, etc. I blame the "checkbox" mentality, where people only look to satisfy requirements but no one actually has ownership over the process and the details. Unfortunately, in my case, no one got fired over this complete misstep, which is another problem... zero accountability.
Seems like Netflix's Chaos Monkey is not a bad idea, actually. I don't mean you have to kill your services randomly while there are users on them... but switching from your master to your secondary (why are you even making a distinction, anyway?) should be a pretty standard operation.
Even normal upgrades (hardware fails; it's a question of when, not if) could be handled transparently just by making the "secondary" server a first-class citizen.
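A regular failover drill doesn't need much tooling, either. Here's a rough sketch of the idea, assuming a MongoDB replica set and pymongo; the hosts and timings are made up, and you'd run this in a quiet window:

    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect, ConnectionFailure

    # Hypothetical hosts; point this at your own replica set.
    client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

    def failover_drill(step_down_secs=60):
        """Ask the current primary to step down so a secondary takes over."""
        try:
            # replSetStepDown is a standard admin command. The primary drops
            # client connections while stepping down, so a network error here
            # is expected and doesn't mean the drill failed.
            client.admin.command("replSetStepDown", step_down_secs)
        except (AutoReconnect, ConnectionFailure):
            pass
        # The driver rediscovers the new primary on its own; the election can
        # take a few seconds, so log the result and verify.
        print("current primary:", client.primary)

If the secondary can't carry normal traffic for the length of the drill, you find that out during the drill instead of during an outage.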
It's also advisable to try to keep the core dataset (which you absolutely depend on) as light as possible.
Split the heavy stuff out onto other servers.
Then have emergency flags in the webapps so you can run them in a low-feature mode. If you bake this concept in when you're building the webapps, dealing with drama is much less stressful.
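To make the "emergency flags" idea concrete, here's a minimal sketch in Python; the flag name, the decorator, and the analytics call are all placeholders, not anyone's actual implementation:

    import functools
    import os

    # Hypothetical flag; flip it via config/env when the heavy backend is in trouble.
    LOW_FEATURE_MODE = os.environ.get("LOW_FEATURE_MODE") == "1"

    def degradable(fallback):
        """Serve a cheap fallback when the app is running in low-feature mode."""
        def wrap(view):
            @functools.wraps(view)
            def inner(*args, **kwargs):
                if LOW_FEATURE_MODE:
                    return fallback(*args, **kwargs)
                return view(*args, **kwargs)
            return inner
        return wrap

    def query_analytics_cluster(user_id):
        # Stand-in for an expensive call to one of the "heavy" servers.
        return ["..."]

    # In an emergency, recommendations just come back empty instead of
    # dragging down every page that includes them.
    @degradable(fallback=lambda user_id: [])
    def recommendations(user_id):
        return query_analytics_cluster(user_id)

The point is that the degraded path already exists and has been exercised before the drama starts, rather than being invented at 3 a.m.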
This seems like such an obvious issue that I can't understand how they overlooked it. If you are using failover as a strategy, your failover machine has to be up to the task.