Craigslist has had the benefit that its infrastructure is very simple, and takes almost no maintenance to run. Something at the complexity level of Facebook would probably die in weeks if the servers were left to run unmaintained. That may happen to Twitter if the slow bleed continues and the service becomes unprofitable to run.



> Something at the complexity level of Facebook would probably die in weeks if the servers were left to run unmaintained

I'm not sure that's true. The site runs best when most employees are out of the office during the last week or two of December and when employees are busy writing peer reviews on the last day or two before that portion of the review cycle ends.

Craigslist without maintenance probably turns into just scams and spam.


Well, yes. Hate me, but slowing down changes and applying them only very deliberately makes a system more stable. The idea behind continuous deployment, shrinking each change in order to shrink its risk, is a good one. But if the goal is reliability, it's hard to compete with only making well-planned, well-coordinated changes geared towards improving reliability.

For example, we have a process based on the error budget from the SRE book. Once half of the downtime allowed by our customers' SLAs has been spent, every change to the systems has to be approved by leadership. Cosmetic bugs can be tolerated for two weeks in order not to risk anything.
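To make that concrete, here is a minimal sketch of such a freeze check; the 99.9% SLO, the 30-day window, the halfway threshold, and the function name are all illustrative assumptions, not our actual tooling:

    # Minimal sketch of an error-budget freeze check (Python).
    # SLO, window, and threshold values are made up for illustration.
    def freeze_required(observed_downtime_min: float,
                        slo_availability: float = 0.999,
                        window_days: int = 30,
                        freeze_fraction: float = 0.5) -> bool:
        """True once more than `freeze_fraction` of the window's downtime budget is spent."""
        budget_min = window_days * 24 * 60 * (1 - slo_availability)  # ~43.2 min at 99.9%
        return observed_downtime_min > freeze_fraction * budget_min

    # 25 minutes of downtime against a ~43-minute monthly budget crosses the
    # halfway mark, so further changes would need leadership approval.
    print(freeze_required(25.0))  # True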

On the other hand, if the necessary maintenance of a system's supporting infrastructure stops, the system is headed towards a cliff. How fast certainly depends on how deep the stack goes and how many redundancies and self-healing mechanisms are built in. If you have redundant storage arrays with redundant drives, redundant systems built on top of those, and smart failover strategies, the overall infrastructure can tolerate a terrifying amount of damage while still running reliably.
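As a rough back-of-the-envelope illustration (the availability numbers are made up), independent replicas multiply out like this, which is why a layered, redundant stack absorbs so much damage for so long:

    # Rough sketch: availability of a layer that survives as long as one replica is up.
    def layer_availability(single_availability: float, replicas: int) -> float:
        return 1 - (1 - single_availability) ** replicas

    # Hypothetical numbers: 99% drives mirrored, 99.5% servers tripled.
    drives = layer_availability(0.99, 2)    # 0.9999
    servers = layer_availability(0.995, 3)  # ~0.999999875
    stack = drives * servers                # independent layers in series multiply
    print(f"{stack:.6f}")  # ~0.999900, but only while failed parts keep getting replaced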

But once that redundancy degrades and rots away, and the last remaining linchpin drive or system instance fails, it turns into an unsalvageable clusterfuck very, very quickly. Especially if you fire important core SRE and ops staff.


I think you may be thinking of feature pauses, which are when all sites run best, but they still very much have 24/7 SRE coverage.


> the service becomes unprofitable to run.

When was Twitter profitable?!


2018/2019



