> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.
I liked this - the human element is often underemphasised in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes.
I don’t know how it would work for an org of Cloudflare’s size, but I know our plans for a significant outage have staff working/sleeping in shifts, to try to avoid that problem as well.
Issue there is that you need a way to hand over the current state of the outage to new staff as they wake up/come online.
The biggest key to implementing these types of plans is that when the shit hits the fan, you send a third of the people home - so they can come back in 10-20 hours and relieve those who are still there.
If you don't do that, you're still going to be scrambling.