
> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.

I liked this - the human element is often underemphasised in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes.

I don’t know how it would work for an org of Cloudflare’s size, but I know our plans for a significant outage have staff work and sleep in shifts, to try to avoid that problem as well.

Issue there is that you need a way to hand over the current state of the outage to new staff as they wake up/come online.




I’m curious, have these plans ever been tested in a real incident?

Like Mike Tyson says, everyone has a plan until they get punched in the face.


The biggest key to implementing these types of plans is that when the shit hits the fan, you send a third of the people home - so they can come back in 10-20 hours and relieve those who are still there.

If you don't do that, you're still going to be scrambling.



