I think this is another good example of how we as an industry are still unable to assess risk properly.
I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Just like the recent outages of Heroku and EC2, and just like the financial crisis of 2008, which was laughably called a "16-sigma event", it seems clear that the actual assessment of risk is pretty poor. The way that Heroku failed, where invalid data in a stream caused failure, and the way that EC2 failed, where a single misconfigured device caused widespread failure, just show that the entire area of risk management is still in its infancy. My employer went down globally for an entire day because of an electrical grid problem, and the diesel generators didn't fail over properly, because of a misconfiguration.
You would think that after decades there would be better analysis and higher-quality "best practices", but the field still appears to be rather immature at this stage. Is this because the assessment of risk at a company is left to people who don't understand risk? And is there an opportunity for "consultants" who do understand it, kind of like security consultants?
> Whoever was in charge of disaster recovery obviously didn't really understand the risk.
That's not necessarily true. People don't die when twitter is down, and whatever twitter's business model actually is, I am not even sure there is a monetary penalty to them being down (unlike, say, Amazon being down which results in lost orders). They may have made the calculation that it was not cost effective engineering-wise to chase that extra 0.001% of reliability.
[Edit: Pedantry shield: Ok, ok, should have said people don't die because twitter is down. Obviously people are dying all the time, and some will indeed expire while twitter is down].
> [Edit: Pedantry shield: Ok, ok, should have said people don't die because twitter is down. Obviously people are dying all the time, and some will indeed expire while twitter is down].
And here I was hoping we could just take down Twitter and live forever...
This is going to lead to a loss of confidence amongst many - investors, advertisers, partners - who thought that Twitter was finally past its initial scaling/teething issues. It's not like there is anyone who hasn't heard of Twitter at this point. So yes, yes there can be bad publicity once something reaches the scale of Twitter.
An e-commerce site being down does not lose exactly as many orders as it would normally take in during that time. Most people will try again later, with the possible exception of first-time buyers and the likely exception of impatient commodity buyers with accounts at alternatives.
I would argue that people can die when Twitter goes down. In Mexico citizens are using Twitter to alert each other when Drug War violence erupts: readwriteweb.com/archives/shouting_fire_in_a_crowded_hashtag_narcocensorship.php
Maybe they need to tackle the underlying drug problem, rather than have Twitter fix its downtime problem? The world revolved before Twitter, and it will continue to revolve long afterwards.
I don't know the internal workings of Twitter, but if it isn't significantly easier to improve its downtime than to fix Mexico's drug violence, I'd argue they've made some non-trivial engineering mistakes somewhere.
"I think this is another good example of how we as an industry are still unable to adequately assess risk properly."
It is likely that what you mean by "properly" is impossible. At large enough scales, what you end up with is a Gaussian distribution of errors in accordance with the Central Limit Theorem... except that there's a Black Swan spike in the low-probability, high-consequence events, and you basically can't spend enough money to ever get rid of them. Ever. Even if you try, you just end up piling equipment and people and procedures which will, themselves, create the black swan when they fail.
I think you're trying to imply that if only they'd understood better, this could absolutely have been prevented. No. Some specific action would probably have been able to avert this but you simply don't have a 100% chance of calling those actions in advance, no matter how good you are.
The state space of these systems is incomprehensibly enormous and there is no feasible way in which you can get all the failures out of it, neither in theory nor in practice.
Living in terror of the absolute certainty of eventual failure is left as an exercise for the reader.
I didn't imply at all that all failures can be prevented. I'm saying that most people's assessments of risk are usually wrong. And the occurrence of a single point of failure that can take down an entire system deemed low-risk seems to happen an awful lot.
It occurs not only in the technology industry, but also in things like financial risk analysis. For example, people could mitigate the risk of a bond defaulting by buying a credit default swap. However, most people failed to assess the risk of their counterparty going belly up, like AIG or Lehman. This failure in risk assessment is in large part why the financial crisis was so widespread.
Another striking example is the failure of Fukushima I and II. Basically what happened was that they had a power failure (of the external line)! They thought that was somehow too unlikely to account for, which I really have trouble understanding. Isn't it obvious that multiple systems can fail because of some unaccounted-for external event? One that affects both on-site and external power? And in the Japanese case, the trigger was not even something you'd need much imagination for: an earthquake. Japan sits directly on one of the biggest geological faults on earth, and they didn't account for a simultaneous power outage!
So, yes, many if not most people are very poor risk assessors.
Then again, this might rather be a capacity (or the lack thereof) of the organization within which the risk is assessed. This is what the software engineering quip "Most problems are people problems" means, I think. In some environments it is hard to bring up the unlikely, catastrophic scenarios without being seen as overly pessimistic and somehow not subscribed enough to the success of the undertaking as a whole. So you assess risk overly optimistically in order to further your career, not accurately.
Pedantic footnote: the Fukushima Daiichi plant failure wasn't just a power failure: they had multiple backup diesel generators and batteries, and the diesels kicked in after the earthquake hit, the reactors tripped, and they lost the grid connection. The problem was that all the diesel generators and fuel were at ground level, and the sea wall wasn't high enough to keep the tsunami from flooding them a few minutes later.
If they'd had a couple of gennies on the rooftops, they ... well, they wouldn't have been fine but they'd have had a fighting chance to keep the scrammed reactors from melting down. Or if they'd had a higher sea wall (like Onagawa) they'd have been fine.
So: not one, not two, but four power systems failed in order to result in the meltdowns -- and one of them would have worked if they had been located slightly differently.
(Otherwise, your point about people being poor risk assessors is spot-on. And worse: even if some people are acutely conscious of risk, once decision-making responsibility devolves to a committee, the risk-aware folks may be overruled by those who Just Don't See The Problem.)
Yes, I know the full scenario was a bit more involved.
My main beef with their failure handling is actually this: you need to be able to face a situation where _all_ your smart emergency systems fail. In the case of an NPP this can mean an almost global environmental crisis, the need to relocate millions of people, hundreds of square kilometers made uninhabitable, etc. In that case you don't really want to rely on five generators on some roof, which may or may not work on that day.
And this is not something I am making up here and now. I remember discussing nuclear safety in high school, and the bottom line was: NPPs are OK, since they become subcritical when _everything_ fails, because the control rods slide down into the reactor vessel.
But after Fukushima I read that the situation there, with that specific reactor model, is unfortunately different. Tough luck. That model still needs some cooling, because even a fully shut-down reactor still produces about 1% of its total power as decay heat, and that is enough to bring the reactor into an 'undefined' state, IIRC. And it is easy to imagine what that means for a station that has just been struck by an earthquake anyway.
My whole point is: your comment makes it appear as if the safety layers were actually plentiful, and I would (respectfully, of course) disagree with that. I think they were poor.
That point is important: if NPPs aren't built safely, then what is? My guess is: nothing.
So what to do? Design for failure. (Politically, technically, economically, can be applied everywhere.)
> So what to do? Design for failure. (Politically, technically, economically, can be applied everywhere.)
Yup.
On a similar note, the French response to Fukushima Daiichi is rather interesting (France relies on nuclear generation for over 80% of its electricity):
"The ASN has also come up with an elegant technical solution to get around the (universal) dilemma of how to protect a plant from external threats, such as natural disasters. The report recommends that all reactors, irrespective of their perceived vulnerability, should add a 'hard core' layer of safety systems, with control rooms, generators and pumps housed in bunkers able to withstand physical threats far beyond those that the plants themselves are designed to resist."
(And a mobile emergency force that can move in and stabilize a reactor after an unforeseen catastrophic disaster that kills everyone on-site and destroys most of the safety systems.)
In other words, they now expect unpredictable Bad Things to happen and are trying to build a flexible framework for dealing with them, rather than simply relying on procedures for addressing the known problems.
": you need to be able to face a situation where _all_ you smart emergency systems fail. "
Totally agree with you here - until recently, no nuclear power plant was designed such that it could survive the failure of all of its emergency systems. Hopefully, given the negative repercussions of Fukushima on the industry, engineers are rethinking their approach to nuclear power.
> I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Is this really a valid conclusion to come to at this point? I expect downtime in any service I operate. It's just how the world works. Does that mean I don't understand the risks and am misleading the board?
Any assessment of risk entails certain larger assumptions about the world, many of which often turn out to be mere guesses.
Consider all the prices that are set to their current levels b/c nobody expects the collapse of the US political system to occur. Yet there is a nonzero probability that it will occur.
On one hand this seems like an absurd example, yet it exemplifies the kind of blind spot we are prone to when assessing risk. We generally address all the risks we can directly control, then classify the rest as "systemic" which essentially means that we are not able to compute them so we're going to ignore them.
Yet many systems which we assume to be stable or predictable (governments, companies, markets, weather patterns, social trends, etc.) have unexpected aberrations now and then which can have very significant consequences. Since these tend to impact most companies equally, the market will converge on an equilibrium where no firms do anything to hedge against these things.
Do you want to pay extra bank fees so that your bank can hedge against the collapse of the US currency for your checking account? Probably not. Do you want to triple your hosting costs to hedge against a massive US power grid failure? Probably not. The same applies to asteroid risk and sudden ice age risk.
On the other hand, if you have lots of money saved, you may wish to hedge against the collapse of one currency or another, and if your business would end if you suffered a few hours of downtime, you might want to invest in massive amounts of redundancy.
Every morning when we all commute to work we risk death. Some exposure to systemic risk is considered acceptable, and part of the character of any person or business is the kind of risk exposure we tolerate day to day. A doctor working in an AIDS clinic risks needle sticks and HIV, a startup doubling its users each month risks downtime but also risks a cash flow crisis.
Your examples describe two entirely different systems. The failover of a software product is drastically different from the failover of a power system. Trying to map everything back to a common best practice under the category of "risk" seems like it would miss out on important intricacies.
Risk management is about determining how to identify risks; as such, it is applicable everywhere. However, much like security is applicable everywhere, securing Fort Knox is a very different endeavor than securing a web site.
That's not really fair: we as humans are bad at risk analysis. Bruce Schneier has written extensively on the role of cognitive biases and the like in measuring perceived risk.
I've been looking for years for a good reference that says "humans are bad at probability estimation", but this one doesn't seem to work: he's simply saying that we're very good at estimating regular risks (if I understand correctly).
You've never done DR, have you? It's a business process with a cost, and there are RTOs (recovery time objectives) and RPOs (recovery point objectives). Systems can and will go down. So long as the recovery meets the defined objectives, then DR has been performed correctly. There is a limited amount of money and resources that businesses can spend on DR, COOPs, etc. You should understand that.
A one-hour RTO and RPO will cost way more than 24-hour recovery. Edit: and the business managers decide how much they wish to spend on DR. It's a trade-off, and anyone who has ever done it understands that.
I'd guess the higher-ups at Twitter have run the cost-benefit in their heads (and probably many spreadsheets) plenty of times, and in most cases spending your limited resources on disaster recovery preparation just isn't worth it. Their site being down does not qualify as a "disaster" - they'll be back up soon, and then we'll all be tweeting away again within minutes.
"Disaster recovery" is a term. I'm not saying it's a disaster in the sense that some horrible thing has happened. Like most people said, it's maybe annoying to some of Twitter's most avid users but fairly innocuous.
The point is that at least in the case of Heroku and EC2 (I'm not sure what caused Twitter's outage yet), the causes of failures weren't something like a "16-sigma" event like a plane hitting an electrical tower (a tragic event that happened in Palo Alto a couple of years ago). They were things like insufficiently tested software and processes, and misconfigured devices. These things do not add cost, except maybe incrementally more man-hours in terms of testing and auditing of configurations. They are not million-dollar diesel units that require permits, etc.
My point is that if a device misconfiguration can take down EC2, or a single piece of bad data in a stream can cause massive failure, it means that the entire system is much more fragile than it has been sold as. If they didn't realize this was the case, it means that the risk of failure was a lot higher than they had assessed.
This is a great point, and this isn't just about Twitter but also about many other sites and services that seem to depend on it. It looks like a lot of people have created a distributed system version of dependency hell for themselves, where they rely on a multitude of third parties not to change behaviour or go down. Additionally, in many cases and perhaps perfectly legitimately from a cost-benefit perspective, the envisaged way to recover from this kind of problem is to assume that people can quickly and frantically hack their way out of it at short notice.
(expanding my reply) No risk assessment in the world will stop a cage monkey from tripping over a pile of 1Us and falling onto the big red button. Figure out what your pain threshold is and live with it.
I know of a chap, called Terry, who became known as Total Network Terry. TNT. Because he accidentally plugged the wrong cable into the wrong jack in the wrong cable cupboard on the wrong floor.
Just finding it took days to sort out, apparently.
That's like when I unplugged what I thought was our T1 line, but it turned out it was a neighbor's, who happened to be an ISP, which was a new business direction we were just moving into. And they didn't believe it was an accident...
What I was trying to point out is that catastrophe is inevitable and risk assessment isn't a panacea. I'd rather invest money in proactive countermeasures to unknown risks than try to think of all the things that could go wrong (which you'll never have the money in your life to fix anyway).
But also, to the thread parent's comment about Twitter execs not realizing there was a minor risk of catastrophic failure: Executives only care that their money-making baby keeps running. In the past I've seen execs demand that an engineer call them at 3 AM if the production site goes down for more than 5 minutes... even though that call is pointless. I think they just assume there's no point in getting involved with the plan because the plan will never be perfect, but at least they can be aware of a problem so they can cover their asses and tell a higher-up that it's being worked on. At the end of the day, even the guys at the top don't really give a shit about the product, they just care about their paycheck.
Not necessarily. Even if the expected value of taking a risk is positive, taking the risk can be undesirable, e.g. due to large variance.
Statistical distributions that include a risk-aversion parameter exist precisely to model this kind of problem (unfortunately their names have slipped my mind, otherwise I'd provide links).
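As a rough sketch of the idea, though, a standard mean-variance utility captures the trade-off:

    U(X) = E[X] - lambda * Var(X),    lambda > 0 (risk-aversion parameter)

A gamble with positive expected value is still rejected under U whenever Var(X) > E[X] / lambda.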
It is simply a matter of perceived value and cost-benefit. Why would a CIO spend millions on DR when the probability of disaster is so minute that the risk manager cannot even calculate it? OK, there is a risk that a plane will hit the PDC: .00000002%. And ultimately, will our business grind to a halt? Or can we use a manual workaround until backups recover to the SDC and we capture the data lost since the last backup? I mean, I have a hard time taking this sort of risk seriously unless I'm running dialysis machines and someone's life is at risk.
> ... and that there is an opportunity for "consultants" who understand this, kind of like security consultants?
Risk management consultants already exist! Many companies just choose to assess it internally, or the consultants themselves lack practical experience.
I think that a lot of you guys are confusing "Disaster Recovery" with "Business Continuity".
Disaster Recovery is a reactive approach. It's what you do to get things back up AFTER a system or site has failed.
Business Continuity is a proactive approach. It's what you do to ensure that your critical services will remain viable whenever disaster occurs.
In the cases of Heroku, Amazon, Twitter, and many more, their Disaster Recovery strategies have been successful. The fact that they came back online without major data loss is proof of that. Their business continuity strategies, however, have been found wanting.
We don't know exactly whether the service troubles in the downtimes you cite were caused by a disaster, so maybe even the disaster recovery side is lacking.
I hope they write up a post-mortem on the fallout (hopefully it won't be a post-mortem of Twitter). Those things are always extremely interesting with big infrastructure like this.
Always fun when you're developing against an API, and then have to perform a frantic investigation to work out if your latest code change broke everything... or it's just the API endpoint itself.
In Ruby, at least, there are some good tools to automate this for you. Using, say, vcr, you can automatically "record" your API calls (into "cassettes") as your tests run; these data are then used when you run your unit tests subsequently. When you plan to integrate/push, simply delete the "cassettes" and run your tests again. That way, any API changes are picked up prior to integration.
I have a suite here that takes about 3 minutes to run from scratch, but just over 1 second as unit tests.
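A minimal sketch of what that looks like (the cassette directory, cassette name and endpoint below are just placeholders):

    require 'vcr'
    require 'net/http'

    VCR.configure do |c|
      c.cassette_library_dir = 'spec/cassettes'  # where recorded responses are stored
      c.hook_into :webmock
    end

    # First run: the real HTTP call is made and recorded to spec/cassettes/twitter_search.yml.
    # Subsequent runs: the recorded response is replayed, so the suite never touches the network.
    VCR.use_cassette('twitter_search') do
      body = Net::HTTP.get(URI('http://search.twitter.com/search.json?q=hacker+news'))
    end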
Unit test (as already mentioned) and build up static JSON/XML/whatever files to run your code against, so that you can integration-test your stuff without integration-testing the 3rd-party stuff.
At some point you end up outside of the scope of testing, but you can build bootstraps to make sure that your mocks match whatever they are mocking.
Obviously, this does you no good when something like an API service goes down, but that issue is irrelevant to the question that was posed. Having some static file to test against, whether it is accurate or not, will instantly tell you whether it is your code or the third party that is breaking.
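For example, something like this with webmock (the endpoint pattern and fixture path are made up for illustration):

    require 'webmock'
    include WebMock::API
    WebMock.enable!

    # Any GET to the Twitter API now returns our canned JSON fixture,
    # so the test exercises our own parsing code, not the third party.
    stub_request(:get, %r{api\.twitter\.com/1/statuses/user_timeline\.json})
      .to_return(status: 200,
                 body: File.read('test/fixtures/user_timeline.json'),
                 headers: { 'Content-Type' => 'application/json' })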
Sure, being upset/getting angry just because of a little bit of Twitter downtime is stupid, but that doesn't take away from the fact that one of the biggest and most important discussion and communication channels the web has is completely down.
I don't believe it is newsworthy. In fact, it's sad that it is being written about because it shows the utter shallowness of what passes for 'Silicon Valley News' at the moment. And it's sad to see this as a 'top story' on Hacker News taking up space.
I think it's only newsworthy in this space - where, as some of us are the ones responsible for these systems - we are trying to learn why this happened in order to prevent it.
I agree that learning from why Twitter was down will be interesting and when that story comes out I hope it will be high on Hacker News. But this sort of news story about something that's happening right now is symptomatic of the useless '24 hour news' cycle of noise.
> But this sort of news story about something that's happening right now is symptomatic of the useless '24 hour news' cycle of noise.
The single most important part of the internet is the immediate availability of news (to me, anyway). I've never heard anyone complain about that before; why do you think it's not worth knowing and talking about events as they happen? 'Twitter is down' isn't noise. Years and years ago there were stories about fire departments (SF I think?) that started using Twitter to send out fire notices. Here in Montreal, the police tweet very quickly and accurately about our daily student protests.
Twitter is extremely important to a huge number of people, and when it goes down, a site like HN definitely should be talking about it. It's big news and it's almost exclusively relevant while it's happening.
So, here in the UK Twitter is back for me. It looks like I was without it for about 45 minutes. Call it an hour for a nice round figure.
Think about these two scenarios:
1. During that one hour you spend your time focussed on talking about this event as it's happening, speculating, and having an emotional response (because you can't access something you want to, so you find a group of people experiencing the same thing and all get together to experience the frustration).
2. Tomorrow you read a story that says "Twitter was down for one hour yesterday" with some detail about what happened.
I believe that the latter is preferable. It's more efficient, less emotional and more useful. The former is the same as watching some 'Breaking News' event while it is happening.
Now imagine that the one hour of downtime happened when you were asleep. You've missed nothing.
There are two scenarios where this news is important: if your business depends on Twitter, and if you are trying to assess the reliability of Twitter. The latter can be achieved by #2 above; only the former needs real-time updates, and that doesn't mean general news reporting, just your own monitoring.
It seems like your perspective is that HN exists for the links it points to. I read HN for the discussion, and this thread has many good discussions in it.
This is more like a large manufacturer having a power outage. A fire would destroy machinery or stock. Nothing's getting lost here, something's just unavailable.
But a large manufacturer having a power outage for an hour, like we are seeing here, would cause a great deal of trouble with anything downstream of that manufacturer. And I'd be willing to bet that industry sources, mailing lists, forums (whatever the manufacturing equivalent of HN is) would be discussing the outage, how bad it is, how this is going to completely screw up profits, how irritated they are, etc, etc.
You don't think GM or Ford have ever lost power for an hour? Did you ever read about it? Did the car dealerships suddenly run out of cars for one hour three weeks later? I think you are vastly overestimating the impact transient events like this have.
And I think you are vastly underestimating the impact perceived by the people involved, which was kind of my point.
Both Twitter going down for an hour and a plant that assembles cars going down for an hour aren't that big of a deal in the huge scheme of things, but for people who are intimately connected (either work there, know someone who does, are emotionally connected to the product in some fashion, etc), it feels a lot bigger than it is.
it's called hacker news. but for starters, "hacker" implies not being spineless or giving a fuck about peer groups. hackers also don't like censorship; so you see how hacker news is anything but. it's an involuntary joke, nothing more.
the word has been appropriated by those who are needy like that: you don't call yourself hacker, just like you don't call yourself saint. only mediocre people to whom it never applied and never will would do that. the end.
That's a good blog post. But "x is down" is newsworthy for a sufficiently large number of users of x and sufficiently long downtime. That's because consumer experience with this or that online service influences the online service's reputation. A service with few users, or users who have lower expectations, can endure more downtime without loss of reputation than a service with many users who expect the service simply always to be available, at least as available as broadcast television or plain old telephone service.
also, "us in the media", what kind of whore talk is that even? twitter is correctly referrerd to in w3c docs as medium preventing intelligent discussion. so it was down? GOOD. people are inconvenienced? even better! it cannot possibly have hit anyone or anything that was worth fuck all.
I'm not sure why you suggest RSS is somehow synonymous with Twitter, but I will say that in addition to RSS buttons, almost every major media company on the planet has a Twitter button on its article pages (I work on one, which is why I say "us in media"). Because many sites don't do proper async JS when it comes to social buttons, an outage on socnets can be crippling.
This deserves to be the top comment. Your one liner nailed it. Twitter was down long enough for far more people not to notice than did notice. Shit goes down. It always will. Whining about how whoever needs backups or failover protection or distributed networks of servers across the planet or should use a VPS instead or a dedicated server instead or Heroku instead or EC2 instead or a combination of all that crap doesn't make you right. It makes you a speculator. No amount of fallbacks will give you 100% uptime ever. And calling this a massive failure is also ludicrous. It's just some downtime. It went right back up so chill.
These posts are so incredibly annoying. We can see if service x is down for ourselves. That isn't news. I could maybe accept these stories if the link on the front page was to a blog post stating that not only is service x down but why it went down for sure plus an added lesson we can learn from it. Short of that it's become an easy way for people to build up a trillion karma points. And if you want to tell me you don't care about karma then you're either lying or you have none. Enough with this crap. We'll find out ourselves but most of us won't actually because we have lives and by the time we go online to check our favorite wank-off site it'll probably be back up again like the past fifty times I've seen a story about Heroku/AWS/Twitter being down.
Conversely, this has always been my complaint against the general hoopla and bandwagoning around Twitter -- they didn't do anything technically innovative. When they first started generating buzz, I instantly thought, "Oh, did they create a new messaging protocol that is universally accessible, redundant and easy to use? That's killer!" But that was not the case.
Yes, they had a good idea and executed very well, but as I see it, Twitter is nothing more than your run-of-the-mill 4chan board. I still don't understand the draw, but then again, there are a lot of facets of modern society that I simply have no explanation for (reality TV?!?) and have been better off not worrying about them further.
99.9% of people don't care at all about how technically innovative something is; they only care about how useful it is. And Twitter is incredibly useful.
"Incredibly useful" would imply that if you told me about Twitter, I wouldn't believe you. As much as I wish it were otherwise, it's not that hard to believe that someone invented IRC over HTTP and gave it a silly name.
Many people, myself included, were skeptical of the value of Twitter until they used it. In that sense, we literally did not believe how useful Twitter was. Also, Twitter is almost nothing like IRC in any respect other than being an electronic communication medium.
> Oh, did they create a new messaging protocol that is universally accessible, redundant and easy to use?
Something the vast majority of people couldn't care less about. Twitter has, in effect, created a new messaging protocol, in the broadest sense of it. It's accessible on your computer, your phone, even your TV if you try hard enough. It's integrated with hundreds of apps and sites. Technically speaking it isn't doing anything particularly amazing (although the sheer scale they deal with is), but that's not really the point.
A massively decentralised protocol would not exactly be the model of reliability, either. I'm not even sure how that could possibly work in a service like Twitter.
It would probably work like an IRC network. A network might involve hundreds of servers, but you only notice which server another person is using when there's a netsplit. Split users can reconnect using another server and rejoin the channel with only a slight interruption.
An IRC-styled Twitter would need some sort of synchronization service to handle twitsplits.
What is great about Twitter is that it allows both the sender and the receiver to choose whether they want to interact over the web, through an app, via SMS or even email (receiving DMs). Take the lowest common denominator across all platforms (no subject, max 160 chars in SMS) and sit in the middle as an abstraction.
"Twitter is down" means something very different from "Some people are having trouble accessing Twitter." My own traceroute results during the outage pointed to network problems, and I saw people saying they could access it.
Sounds like you just invented a new product: XIsDown.com, where not only do you get notified that X is down, you get to whine about it with other people.
I'm glad this made it to the front page. Is the topic itself newsworthy? Not on its own. Is all the discussion that's flooding into this thread worth having?
Yep. Even the subthread from the person complaining that this isn't newsworthy.
If you're debugging webservices that suddenly slow down (timeouts of 10s), this may be your cause if they depend on s.twitter.com, search.twitter.com or api.twitter.com.
As a workaround for those systems, add entries for s.twitter.com, search.twitter.com and api.twitter.com to your /etc/hosts file that map them back to 127.0.0.1.
This obviously breaks Twitter integration, but it also makes sure page loads don't explode when waiting for remote resources.
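Concretely, the lines to append to /etc/hosts look like this:

    127.0.0.1 s.twitter.com
    127.0.0.1 search.twitter.com
    127.0.0.1 api.twitter.com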
Have just had to add timeouts to a few requests to the Twitter API. It was completely hanging a whole site.
Learnt something though: never trust an API to respond in a reasonable amount of time!
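Roughly what that looks like, for anyone else bitten by this (the endpoint and timeout values are just illustrative, and the rescue classes assume Ruby 2.0+):

    require 'net/http'

    uri = URI('http://api.twitter.com/1/statuses/user_timeline.json?screen_name=example')

    http = Net::HTTP.new(uri.host, uri.port)
    http.open_timeout = 2  # seconds allowed to establish the TCP connection
    http.read_timeout = 5  # seconds allowed to wait for the response

    begin
      response = http.request(Net::HTTP::Get.new(uri.request_uri))
      tweets   = response.is_a?(Net::HTTPSuccess) ? response.body : nil
    rescue Net::OpenTimeout, Net::ReadTimeout
      tweets = nil  # degrade gracefully: render the page without tweets
    end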
Why is this on the HN front page? This is an entirely worthless post. It adds no value. Nobody is going to reread this at any point in the future. Utterly worthless.