It is hard to pick the most absurd line. I quite enjoyed "2018: Scaling our infrastructure to keep the site up on Sunday’s (we were having constant outages every week)". Take a moment to reflect that this implies achieving one 9 of reliability required an emergency all-hands response that bypassed their normal processes. I may not be an expert in high-reliability systems but that isn't how I expect the problem to be tackled.
My read is the management team are not very capable and that this gentleman is not a natural leader. Doesn't sound like he knows how to build systems in a planned and thoughtful manner and has embraced a classic lurching-into-crisis strategy for want of understanding any alternatives.
EDIT: This also juxtaposes beautifully with the "‘keeping the lights on’ fallacy" he mentions. Those crazy engineers with their belief that they need to try and keep the lights on. Also, when I forcefully pull them away from that, why do the lights go out on Sunday???
> I may not be an expert in high-reliability systems but that isn't how I expect the problem to be tackled.
You probably know way better than me, but in my experience, configuring things correctly on healthy hardware gets you 99.99% availability by default. Adding some surplus capacity adds another 9, at least.
Then you build from there (autoscaling, hardware failovers, etc. etc.).
That does depend somewhat on what you count as "downtime". Larger distributed systems may not suffer a complete outage in the whole year, and more than one of the services my team runs has that level of successful response ratio.
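For anyone who wants the "nines" in this subthread as concrete numbers, here is a rough sketch of the arithmetic. It assumes a 365-day year and treats availability as simple uptime over wall-clock time; the function names are made up for illustration, and the second helper covers the "successful response ratio" way of counting instead.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(nines: int) -> float:
    """Availability fraction for a given number of nines (1 -> 0.9, 4 -> 0.9999)."""
    return 1 - 10 ** -nines


def downtime_hours_per_year(nines: int) -> float:
    """Downtime budget per year, in hours, at the given number of nines."""
    return HOURS_PER_YEAR * 10 ** -nines


def success_ratio(ok_responses: int, total_responses: int) -> float:
    """Request-based alternative: the share of requests answered successfully."""
    if total_responses <= 0:
        raise ValueError("total_responses must be positive")
    return ok_responses / total_responses


if __name__ == "__main__":
    for n in range(1, 6):
        hours = downtime_hours_per_year(n)
        print(f"{n} nine(s): {availability(n):.5%} up, "
              f"about {hours:.2f} h ({hours * 60:.1f} min) down per year")
```

By this measure one 9 leaves room for roughly 36 days of downtime a year, while four 9s leaves under an hour, which is why the two ways of counting "downtime" can give very different impressions of the same system.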
People who set up infrastructure that "just works" don't get promoted. You want to be extinguishing fires and be seen doing it. To be the hero who worked their ass off all weekend to save the company. That's how you think you'll get that juicy promotion.
In my experience, this is only true of immature companies and/or immature management. A good tech lead or manager knows that if there are no problems, the infra people are doing their jobs very well.