There are a couple of good stories about massive outages and good incident response. Mine is just one of them (and at some level, I was very lucky).
There's also the one where all the frontend servers worldwide went into a crash loop from a bad configuration push. The SRE doing the push noticed some "weirdness" and rolled back even before the full scope of the issue was known. That one's in the SRE book.