
"Disaster recovery" is a term. I'm not saying it's a disaster in the sense that some horrible thing has happened. Like most people said, it's maybe annoying to some of Twitter's most avid users but fairly innocuous.

The point is that, at least in the case of Heroku and EC2 (I'm not sure yet what caused Twitter's outage), the failures weren't caused by some "16-sigma" event such as a plane hitting an electrical tower (a tragic event that happened in Palo Alto a couple of years ago). They were things like insufficiently tested software and processes, and misconfigured devices. Preventing those failures doesn't add much cost, except maybe incrementally more man-hours spent on testing and auditing configurations. They're not million-dollar diesel units that require permits, etc.

My point is that if a device misconfiguration can take down EC2, or a single piece of bad data in their data stream can cause massive failure, then the entire system is much more fragile than it's been sold to everyone as being. And if they didn't realize this was the case, then the risk of failure was a lot higher than they had assessed.



