
I'm looking at this a bit differently. My reading of this is "a series of subtle and bizarre failures combined in a way which nobody could ever have anticipated". I think I'm a pretty good architect and coder, but I would never claim that I could design a system which couldn't fail in this sort of way -- in fact, "a background task is unable to complete, resulting in it gradually increasing its memory usage, ultimately causing a system to fail" is the one-line description of an outage Tarsnap had in December of last year.
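As an illustration of that failure mode (not Tarsnap's or Amazon's actual code), here is a minimal hypothetical sketch: a background worker that buffers work it cannot flush downstream, so its memory grows without bound until the process, and then the system, falls over.

    # Hypothetical sketch of the failure mode described above: a background
    # worker buffers items it cannot flush downstream. While flush() keeps
    # failing, the buffer (and the process's memory) grows without bound.
    import time

    def flush(batch):
        # Stand-in for the downstream call that has started failing.
        raise ConnectionError("downstream unavailable")

    def background_worker(incoming):
        pending = []                  # grows forever while flush() fails
        for item in incoming:
            pending.append(item)
            try:
                flush(pending)
                pending.clear()
            except ConnectionError:
                time.sleep(1)         # retry later, but keep accumulating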



That's the problem. As a designer, your goal is not to claim that you can design an unflawed system. Instead, it is to use all your humility (and skill) to design things simple enough that they are unlikely to fail, because past a certain level of complexity you lose the ability to prevent failures or even analyze the failure modes.

I would like to know how much of the design in places like AWS is made more complex by the requirements of HA itself, but my guess is, a lot.


I certainly didn't mean to imply that they should have predicted this -- my reason for scoring the number of simultaneous issues is to indicate what a shitstorm this was.

That said, there are some genuine deep-rooted design flaws at work here, as others have pointed out, primarily Amazon's use of EBS for critical services in their own cloud.


That sounds a lot like people using cron for complicated tasks that repeat every 5 minutes (DB queries, for example). Before you know it, the jobs pile up on top of each other, locking the DB and spiraling out of control.
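For what it's worth, here's a minimal sketch of one common guard against that pile-up, assuming a Unix-like host; the lock path and task name are made up for the example. Each cron invocation tries a non-blocking file lock and exits immediately if a previous run still holds it, so overlapping runs never stack.

    # Hypothetical guard against overlapping cron runs: take a non-blocking
    # lock on a file and exit if the previous invocation is still running.
    import fcntl
    import sys

    LOCK_PATH = "/tmp/report_job.lock"   # made-up path for this example

    def run_expensive_db_query():
        pass                              # placeholder for the real 5-minute task

    def main():
        lock_file = open(LOCK_PATH, "w")
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit(0)                   # previous run still active; skip this one
        run_expensive_db_query()

    if __name__ == "__main__":
        main()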



