A website is not a subway system. There are no "huge costs" to having basic rolling deployments set up. In fact, extended outages and maintenance windows are likely the more expensive option, and they suggest there are humans running commands behind the scenes rather than automation.
I definitely am (but we are on Hacker News after all).
Without knowing anything about this particular issue, at a basic level I'd think any non-trivial site should have a system for rolling out changes to a small percentage of servers, plus automated monitoring that fires alerts if it notices any disruption in functionality. If nothing goes wrong, gradually roll out to 100%. If something does, stop the rollout and go back to the last stable version, and then engineers can look at what went wrong. (Roughly the loop sketched below.)
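Something like this, to make it concrete. The deploy and monitoring hooks (deploy_to, error_rate, rollback) are hypothetical placeholders for whatever your infrastructure actually exposes, not any particular tool's API:

    # Minimal canary-rollout sketch. The hooks passed in are hypothetical
    # stand-ins, not a real deploy tool's API.
    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of servers updated at each step
    ERROR_THRESHOLD = 0.02             # abort if the error rate exceeds 2%
    SOAK_SECONDS = 600                 # let monitoring watch each stage

    def rollout(new_version, old_version, deploy_to, error_rate, rollback):
        for fraction in STAGES:
            deploy_to(fraction, new_version)      # push to this slice of servers
            time.sleep(SOAK_SECONDS)              # give monitoring time to notice
            if error_rate() > ERROR_THRESHOLD:    # the "alert fires" condition
                rollback(old_version)             # back to the last stable version
                return False                      # stop; engineers investigate
        return True                               # fully rolled out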
In my experience that mostly applies only to the application layer. Updates to distributed databases in particular are often not as simple as "stop and roll back": you have to re-establish a stable quorum, or some similar dynamic state, which you can't easily get back by restoring a snapshot. There your best shot is often to learn how to handle all the failure scenarios manually and run as many tests on a staging system as it takes until you feel comfortable.
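To illustrate the difference, here is a hedged sketch of a node-by-node database upgrade gated on quorum health. The cluster object and its has_quorum(), upgrade_node() and node_healthy() methods are hypothetical stand-ins for whatever your database actually exposes:

    import time

    def rolling_db_upgrade(cluster, nodes, new_version, timeout=300):
        # `cluster` and `nodes` are hypothetical handles, not a real client API.
        for node in nodes:
            if not cluster.has_quorum():
                # Taking another node down now could leave the cluster unable
                # to agree on anything; stop and deal with it by hand.
                raise RuntimeError(f"quorum lost before upgrading {node}")
            cluster.upgrade_node(node, new_version)
            deadline = time.time() + timeout
            while not cluster.node_healthy(node):    # wait for the node to rejoin
                if time.time() > deadline:
                    raise RuntimeError(f"{node} did not come back healthy")
                time.sleep(5)
        # Note there is no generic rollback step here: recovering usually means
        # rejoining nodes and re-establishing quorum, not restoring a snapshot.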
The revenue per minute of these sites is in the hundreds of thousands of dollars. While that can be budgeted for if needed, it's not just inconvenient for users; it could mean missing a revenue target.
No site is that high other than Google or Facebook. Reddit actually generates fairly little revenue and can easily make up for lost inventory with more ads on the page.
Right. Reddit's ad revenue is forecast at $100M this year, so roughly $190/min. I was referring more to Facebook's $65B annual revenue, which works out to about $123,668/min. That's roughly an FTE's annual salary every minute, so throwing people at uptime has a lot of ROI.
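The back-of-the-envelope math, for anyone who wants to check it:

    # Revenue-per-minute figures from the comment above.
    MINUTES_PER_YEAR = 365 * 24 * 60                        # 525,600

    reddit_per_min = 100_000_000 / MINUTES_PER_YEAR         # ~$190/min
    facebook_per_min = 65_000_000_000 / MINUTES_PER_YEAR    # ~$123,668/min

    print(f"Reddit:   ${reddit_per_min:,.0f}/min")
    print(f"Facebook: ${facebook_per_min:,.0f}/min")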
It's more that the engineers working there should be embarrassed, not that it's important. You'd think they would have been able to improve their infrastructure after a few years, but it's still as flaky now as it was back then. That doesn't speak well of the caliber of engineers there.
That's unfair. Engineers don't always get to exercise their skills or best practices if their CTO gets in the way, or if other external obstacles block the CTO. I've worked with very talented people who are totally wasted because the company just doesn't care enough about getting the tech right to make that investment.
Autonomy to get things right when it's the right call is sort of a rarity and a real luxury in my experience. I've built a ton of things I knew could have been built better if the rest of the company saw the value in doing so.
When given the option to do your job as asked or quit because you don't have your way, I'd rather just do my job and wait for opportunities to improve things if they present themselves.