A website is not a subway system. There are no "huge costs" to having basic rolling deployments set up. In fact, extended outages and maintenance windows are likely the more expensive option, and they suggest there are humans running commands behind the scenes rather than automation.
I definitely am (but we are on Hacker News after all).
Without knowing anything about this particular issue, at a basic level I'd think any non-trivial site should have a system for rolling out changes to a small percentage of servers, plus automated monitoring that fires alerts if it notices any disruption in functionality. If nothing goes wrong, gradually roll out to 100%. If something does, stop the rollout and go back to the last stable version, and then engineers can look at what went wrong. (Roughly the loop sketched below.)
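Something like this, to make it concrete. The deploy and monitoring hooks (deploy_to, error_rate, rollback) are hypothetical placeholders for whatever your infrastructure actually exposes, not any particular tool's API:

    # Minimal canary-rollout sketch. The hooks passed in are hypothetical
    # stand-ins, not a real deploy tool's API.
    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of servers updated at each step
    ERROR_THRESHOLD = 0.02             # abort if the error rate exceeds 2%
    SOAK_SECONDS = 600                 # let monitoring watch each stage

    def rollout(new_version, old_version, deploy_to, error_rate, rollback):
        for fraction in STAGES:
            deploy_to(fraction, new_version)      # push to this slice of servers
            time.sleep(SOAK_SECONDS)              # give monitoring time to notice
            if error_rate() > ERROR_THRESHOLD:    # the "alert fires" condition
                rollback(old_version)             # back to the last stable version
                return False                      # stop; engineers investigate
        return True                               # fully rolled out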
In my experience that mostly applies only to the application layer. Updates to distributed databases in particular are often not as simple as "stop and roll back": you have to re-establish a stable quorum, or some similar dynamic state, which you can't easily get back by restoring a snapshot. There your best shot is often to learn how to handle all the failure scenarios manually and run as many tests on a staging system as it takes until you feel comfortable.
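To illustrate the difference, here is a hedged sketch of a node-by-node database upgrade gated on quorum health. The cluster object and its has_quorum(), upgrade_node() and node_healthy() methods are hypothetical stand-ins for whatever your database actually exposes:

    import time

    def rolling_db_upgrade(cluster, nodes, new_version, timeout=300):
        # `cluster` and `nodes` are hypothetical handles, not a real client API.
        for node in nodes:
            if not cluster.has_quorum():
                # Taking another node down now could leave the cluster unable
                # to agree on anything; stop and deal with it by hand.
                raise RuntimeError(f"quorum lost before upgrading {node}")
            cluster.upgrade_node(node, new_version)
            deadline = time.time() + timeout
            while not cluster.node_healthy(node):    # wait for the node to rejoin
                if time.time() > deadline:
                    raise RuntimeError(f"{node} did not come back healthy")
                time.sleep(5)
        # Note there is no generic rollback step here: recovering usually means
        # rejoining nodes and re-establishing quorum, not restoring a snapshot.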
The revenue per minute of these sites is in the hundreds of thousands of dollars. While that can be budgeted for if needed, it's not just inconvenient for users; it could mean missing a revenue target.
No site is that high other than Google or Facebook. Reddit actually generates fairly little revenue and can easily make up for lost inventory with more ads on the page.
Right. Reddit's ad revenue is forecast at $100M this year, so roughly $190/min. I was referring more to Facebook's $65B annual revenue, which works out to about $123,668/min. That's roughly an FTE's annual salary every minute, so throwing people at uptime has a lot of ROI.
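The back-of-the-envelope math, for anyone who wants to check it:

    # Revenue-per-minute figures from the comment above.
    MINUTES_PER_YEAR = 365 * 24 * 60                        # 525,600

    reddit_per_min = 100_000_000 / MINUTES_PER_YEAR         # ~$190/min
    facebook_per_min = 65_000_000_000 / MINUTES_PER_YEAR    # ~$123,668/min

    print(f"Reddit:   ${reddit_per_min:,.0f}/min")
    print(f"Facebook: ${facebook_per_min:,.0f}/min")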
It's more that the engineers working there should be embarrassed, not that it's important. You'd think they would have been able to improve their infrastructure after a few years, but it's still as flaky now as it was back then. That doesn't speak well of the caliber of engineers there.
That's unfair. Engineers don't always get to exercise their skills or best practices if their CTO gets in the way, or if other external obstacles block the CTO. I've worked with very talented people who are totally wasted because the company just doesn't care enough about getting the tech right to make that investment.
Autonomy to get things right when it's the right call is sort of a rarity and a real luxury in my experience. I've built a ton of things I knew could have been built better if the rest of the company saw the value in doing so.
When given the option to do your job as asked or quit because you don't have your way, I'd rather just do my job and wait for opportunities to improve things if they present themselves.