I get the impression that one of the biggest issues was missed. They did not test the standard load against the secondary server; they assigned a machine with lower specs to the task, and there's nothing in the future actions that indicates they'll change it... Even if they go for the new and shiny, they can end up in the same situation the next time their master fails.
I hope they just left that out of the blog post, rather than actually not fixing it first.
I have no opinions on MongoDB, but it really seems like this particular problem arose because they skimped on disaster recovery, i.e. their failover hardware was less powerful than their production hardware. The root cause of their downtime was inadequate planning.
That's like paying for car insurance and only realizing after you get into an accident that the policy covers almost nothing; the money you spent on it was wasted. They paid for the secondary failover hardware, but it was effectively useless, since they were down for two days. The only thing it may have mitigated was how long they were down, but the primary objective of that hardware, i.e. keeping them up in case of a disaster, was a complete failure.
I've worked at a company that was completely down for a day, worldwide, due to a "disaster", even though we had spent millions on diesel generators, etc. I blame the "checkbox" mentality, where people only look to satisfy requirements but no one actually has ownership over the process and the details. Unfortunately, in my case, no one got fired over this complete misstep, which is another problem... zero accountability.
Seems like Netflix's Chaos Monkey is not a bad idea, actually. I don't mean you have to kill your services randomly while there are users on them... but switching from your master to your secondary (why are you even making a distinction, anyway?) should be a pretty standard operation.
Even normal upgrades (hardware fails; it's a question of when, not if) could be handled transparently just by making the "secondary" server a first-class citizen.
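A regular failover drill doesn't need much tooling, either. Here's a rough sketch of the idea, assuming a MongoDB replica set and pymongo; the hosts and timings are made up, and you'd run this in a quiet window:

    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect, ConnectionFailure

    # Hypothetical hosts; point this at your own replica set.
    client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

    def failover_drill(step_down_secs=60):
        """Ask the current primary to step down so a secondary takes over."""
        try:
            # replSetStepDown is a standard admin command. The primary drops
            # client connections while stepping down, so a network error here
            # is expected and doesn't mean the drill failed.
            client.admin.command("replSetStepDown", step_down_secs)
        except (AutoReconnect, ConnectionFailure):
            pass
        # The driver rediscovers the new primary on its own; the election can
        # take a few seconds, so log the result and verify.
        print("current primary:", client.primary)

If the secondary can't carry normal traffic for the length of the drill, you find that out during the drill instead of during an outage.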
It's also advisable to try to keep the core dataset (which you absolutely depend on) as light as possible.
Split the heavy stuff out onto other servers.
Then have emergency flags in the webapps so you can run them in a low-feature mode. If you bake this concept in when you're building the webapps, dealing with drama is much less stressful.
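To make the "emergency flags" idea concrete, here's a minimal sketch in Python; the flag name, the decorator, and the analytics call are all placeholders, not anyone's actual implementation:

    import functools
    import os

    # Hypothetical flag; flip it via config/env when the heavy backend is in trouble.
    LOW_FEATURE_MODE = os.environ.get("LOW_FEATURE_MODE") == "1"

    def degradable(fallback):
        """Serve a cheap fallback when the app is running in low-feature mode."""
        def wrap(view):
            @functools.wraps(view)
            def inner(*args, **kwargs):
                if LOW_FEATURE_MODE:
                    return fallback(*args, **kwargs)
                return view(*args, **kwargs)
            return inner
        return wrap

    def query_analytics_cluster(user_id):
        # Stand-in for an expensive call to one of the "heavy" servers.
        return ["..."]

    # In an emergency, recommendations just come back empty instead of
    # dragging down every page that includes them.
    @degradable(fallback=lambda user_id: [])
    def recommendations(user_id):
        return query_analytics_cluster(user_id)

The point is that the degraded path already exists and has been exercised before the drama starts, rather than being invented at 3 a.m.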
This seems like such an obvious issue that I can't understand how they overlooked it. If you are using failover as a strategy, your failover machine has to be up to the task.