
There are a couple of good stories about massive outages and good incident response. Mine is just one of them (and at some level, I was very lucky).

There's also the one where all the frontend servers worldwide went into a crash loop from a bad configuration push. The SRE doing the push noticed some "weirdness" and rolled back even before the full scope of the issue was known. That one's in the SRE book.




Site Reliability Engineering.[0] Google's SRE book is a pretty interesting read.

0. https://landing.google.com/sre/interview/ben-treynor.html


GFE? SRE?


SRE is Site Reliability Engineer; GFE is the "Google Front End".



