
I'm looking at this a bit differently. My reading of this is "a series of subtle and bizarre failures combined in a way which nobody could ever have anticipated". I think I'm a pretty good architect and coder, but I would never claim that I could design a system which couldn't fail in this sort of way -- in fact, "a background task is unable to complete, resulting in it gradually increasing its memory usage, ultimately causing a system to fail" is the one-line description of an outage Tarsnap had in December of last year.
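As an illustration of that failure mode (not Tarsnap's or Amazon's actual code), here is a minimal hypothetical sketch: a background worker that buffers work it cannot flush downstream, so its memory grows without bound until the process, and then the system, falls over.

    # Hypothetical sketch of the failure mode described above: a background
    # worker buffers items it cannot flush downstream. While flush() keeps
    # failing, the buffer (and the process's memory) grows without bound.
    import time

    def flush(batch):
        # Stand-in for the downstream call that has started failing.
        raise ConnectionError("downstream unavailable")

    def background_worker(incoming):
        pending = []                  # grows forever while flush() fails
        for item in incoming:
            pending.append(item)
            try:
                flush(pending)
                pending.clear()
            except ConnectionError:
                time.sleep(1)         # retry later, but keep accumulating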



That's the problem. As a designer, your goal is not to claim that you can design an unflawed system. Instead, it is to use all your humility (and skill) to design things simple enough that they are unlikely to fail, because past a certain level of complexity you lose the ability to prevent failures or even analyze the failure modes.

I would like to know how much of the design in places like AWS is made more complex by the requirements of HA itself, but my guess is, a lot.


I certainly didn't mean to imply that they should have predicted this -- my reason for scoring the number of simultaneous issues is to indicate what a shitstorm this was.

That said, there are some genuine deep-rooted design flaws at work here, as others have pointed out, primarily Amazon's use of EBS for critical services in their own cloud.


That sounds a lot like people using cron for complicated tasks that repeat every 5 minutes (DB queries, for example). Before you know it, the jobs pile up on top of each other, locking the DB and spiraling out of control.
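For what it's worth, here's a minimal sketch of one common guard against that pile-up, assuming a Unix-like host; the lock path and task name are made up for the example. Each cron invocation tries a non-blocking file lock and exits immediately if a previous run still holds it, so overlapping runs never stack.

    # Hypothetical guard against overlapping cron runs: take a non-blocking
    # lock on a file and exit if the previous invocation is still running.
    import fcntl
    import sys

    LOCK_PATH = "/tmp/report_job.lock"   # made-up path for this example

    def run_expensive_db_query():
        pass                              # placeholder for the real 5-minute task

    def main():
        lock_file = open(LOCK_PATH, "w")
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit(0)                   # previous run still active; skip this one
        run_expensive_db_query()

    if __name__ == "__main__":
        main()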



