I think this is another good example of how we as an industry are still unable to assess risk properly.
I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Just like the recent outages of Heroku and EC2, and just like the financial crisis of 2008 which was laughably called a "16-sigma event", it seems clear that the actual assessment of risk is pretty poor. The way Heroku failed, where invalid data in a stream caused failure, and the way EC2 failed, where a single misconfigured device caused widespread failure, show that the entire area of risk management is still in its infancy. My employer went down globally for an entire day because of an electrical grid problem, and the diesel generators didn't fail over properly because of a misconfiguration.
You would think after decades there would be better analysis and higher-quality "best practices", but the field still appears to be rather immature at this stage. Is this because the assessment of risk at a company is left to people who don't understand risk, and is there an opportunity for "consultants" who understand this, kind of like security consultants?
> Whoever was in charge of disaster recovery obviously didn't really understand the risk.
That's not necessarily true. People don't die when twitter is down, and whatever twitter's business model actually is, I am not even sure there is a monetary penalty to them being down (unlike, say, Amazon being down which results in lost orders). They may have made the calculation that it was not cost effective engineering-wise to chase that extra 0.001% of reliability.
[Edit: Pedantry shield: Ok, ok, should have said people don't die because twitter is down. Obviously people are dying all the time, and some will indeed expire while twitter is down].
> [Edit: Pedantry shield: Ok, ok, should have said people don't die because twitter is down. Obviously people are dying all the time, and some will indeed expire while twitter is down].
And here I was hoping we could just take down Twitter and live forever...
This is going to lead to a loss of confidence amongst many - investors, advertisers, partners - who thought that Twitter was finally past its initial scaling/teething issues. It's not like there is anyone that hasn't heard of Twitter at this point in time. So yes, yes there can be bad publicity once something reaches the scale of twitter.
An e-commerce site being down does not lose exactly the number of orders it would normally take in that window. Most people will try again later, with the possible exception of first-time buyers and the likely exception of impatient commodity buyers who have accounts elsewhere.
I would argue that people can die when Twitter goes down. In Mexico citizens are using Twitter to alert each other when Drug War violence erupts: readwriteweb.com/archives/shouting_fire_in_a_crowded_hashtag_narcocensorship.php
Maybe they need to tackle the underlying drug problem, rather than have Twitter fix its downtime problem? The world revolved before Twitter, and it will continue to revolve long afterwards.
I don't know the internal workings of Twitter, but if it isn't significantly easier to improve its downtime than to fix Mexico's drug violence, I'd argue they've made some non-trivial engineering mistakes somewhere.
"I think this is another good example of how we as an industry are still unable to adequately assess risk properly."
It is likely that what you mean by "properly" is impossible. At large enough scales, what you end up with is a Gaussian distribution of errors in accordance with the Central Limit Theorem... except that there's a Black Swan spike in the low-probability, high-consequence events, and you basically can't spend enough money to ever get rid of them. Ever. Even if you try, you just end up piling equipment and people and procedures which will, themselves, create the black swan when they fail.
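To make the Black Swan point concrete, here is a toy simulation (Python, all numbers invented, not a model of any real system): the routine errors average into a neat bell curve, while a single rare high-consequence event still dominates the worst case.

    # Lots of small independent errors sum to a tidy Gaussian (Central Limit
    # Theorem), but one rare high-consequence event still dominates the worst
    # outcome no matter how tame the routine noise looks.
    import numpy as np

    rng = np.random.default_rng(0)
    days = 100_000

    # Routine trouble: the summed cost of 100 small independent glitches per day.
    routine = rng.uniform(0, 1, size=(days, 100)).sum(axis=1)  # ~Gaussian by CLT

    # Rare catastrophe: roughly 1-in-10,000 days, cost far outside the Gaussian bulk.
    catastrophe = (rng.random(days) < 1e-4) * 50_000

    total = routine + catastrophe
    print("mean daily cost:       ", round(float(total.mean()), 1))
    print("99.9th percentile cost:", round(float(np.percentile(total, 99.9)), 1))
    print("worst single day:      ", round(float(total.max()), 1))
    # The mean and even the 99.9th percentile look perfectly manageable; the
    # worst day is driven almost entirely by the rare spike the percentiles hide.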
I think you're trying to imply that if only they'd understood better, this could absolutely have been prevented. No. Some specific action would probably have been able to avert this but you simply don't have a 100% chance of calling those actions in advance, no matter how good you are.
The state space of these systems is incomprehensibly enormous and there is no feasible way in which you can get all the failures out of it, neither in theory nor in practice.
Living in terror of the absolute certainty of eventual failure is left as an exercise for the reader.
I didn't imply at all that all failures can be prevented. I'm saying that most people's assessments of risk are usually wrong. And single-point failures that take down an entire system deemed low-risk seem to happen an awful lot.
It not only occurs in the technology industry, but also even in things like financial risk analysis. For example, people could mitigate the risk of a bond defaulting by buying a credit default swap. However, most people failed to assess the risk of their counter-party going belly up, like AIG or Lehman. This failure in risk assessment is in large part why the financial crisis was so widespread.
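A toy numerical version of the counterparty point, with made-up probabilities: the CDS only pays out if the protection seller is still solvent, and in a systemic crisis the bond's default and the seller's default are strongly correlated.

    # Illustrative only: a hedge that assumes independent failures quietly
    # concentrates the residual loss in exactly the crisis it was bought for.
    import random

    random.seed(1)
    trials = 200_000
    notional = 100.0
    unhedged_loss = hedged_loss = 0.0

    for _ in range(trials):
        crisis = random.random() < 0.01            # rare systemic event
        bond_defaults = random.random() < (0.50 if crisis else 0.02)
        seller_defaults = random.random() < (0.50 if crisis else 0.001)

        loss = notional if bond_defaults else 0.0
        unhedged_loss += loss
        # The CDS reimburses the loss only when the seller can actually pay.
        hedged_loss += loss if (bond_defaults and seller_defaults) else 0.0

    print("expected loss, unhedged:", round(unhedged_loss / trials, 3))
    print("expected loss, 'hedged':", round(hedged_loss / trials, 3))
    # Assuming independence, the hedge looks nearly perfect; with correlated
    # defaults a meaningful residual loss remains, and it all lands in the crisis.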
Another striking example is the failure of Fukushima I and II. Basically what happened was that they had a power failure (of the external line)! They thought that was somehow too unlikely to account for, which I really have trouble understanding. Isn't it obvious that multiple systems can fail because of some unaccounted-for external event? One that affects both on-site and external power? And in the Japanese case, the trigger was not even something you'd need much imagination for: an earthquake. Japan sits directly on one of the biggest geological faults on earth, and they didn't account for a simultaneous power outage!
So, yes, many if not most people are very poor risk assessors.
Then again, this might rather be a capacity (or the lack thereof) of the organization within which the risk is assessed. This is what the software engineering quip "Most problems are people problems" means, I think. In some environments it is hard to bring up the unlikely, catastrophic scenarios without being seen as overly pessimistic and not sufficiently committed to the success of the undertaking as a whole. So you assess risk overly optimistically in order to further your career, not to assess it accurately.
Pedantic footnote: the Fukushima Daiichi plant failure wasn't just a power failure: they had multiple backup diesel generators and batteries, and the diesels kicked in after the earthquake hit, the reactors tripped, and they lost the grid connection. The problem was that all the diesel generators and fuel were at ground level, and the sea wall wasn't high enough to keep the tsunami from flooding them a few minutes later.
If they'd had a couple of gennies on the rooftops, they ... well, they wouldn't have been fine but they'd have had a fighting chance to keep the scrammed reactors from melting down. Or if they'd had a higher sea wall (like Onagawa) they'd have been fine.
So: not one, not two, but four power systems failed in order to result in the meltdowns -- and one of them would have worked if they had been located slightly differently.
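To put rough numbers on why "four systems failed" isn't as improbable as it sounds (the probabilities here are invented for illustration): redundancy arithmetic assumes independent failures, and a common cause like flooding breaks that assumption.

    # If the backups fail independently, losing all of them looks astronomically
    # unlikely; one common cause that takes out everything at ground level makes
    # the combined failure roughly as likely as the common cause itself.
    p_single = 0.01          # assumed chance any one power system fails on demand
    n_systems = 4

    p_all_independent = p_single ** n_systems
    p_common_cause = 0.001   # assumed chance one event (e.g. a flood) kills them all

    print("all fail, if independent:   %.2e" % p_all_independent)   # 1.00e-08
    print("all fail, via common cause: %.2e" % p_common_cause)      # 1.00e-03
    # The redundancy math only holds if the failure modes really are independent,
    # which co-locating every generator at ground level quietly broke.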
(Otherwise, your point about people being poor risk assessors is spot-on. And worse: even if some people are acutely conscious of risk, once decision-making responsibility devolves to a committee, the risk-aware folks may be overruled by those who Just Don't See The Problem.)
Yes, I know the full scenario was a bit more involved.
My main beef with their failure handling is actually this: you need to be able to face a situation where _all_ your smart emergency systems fail. In the case of an NPP this can mean an almost global environmental crisis, the need to relocate millions of people, making hundreds of square kilometers uninhabitable, etc. In that case you don't really want to rely on five generators on some roof, which may or may not work on that day.
And this is not something I'm just making up now. I remember discussing nuclear safety in high school, and the bottom line was: NPPs are OK, since they become subcritical when _everything_ fails, because the control rods slide down into the reactor vessel.
But after Fukushima I read that the situation there, with that specific reactor model, is unfortunately different. Tough luck. Even a scrammed reactor of that type still needs some cooling: it keeps producing about 1% of its total power, and that is enough to bring the reactor into an 'undefined' state, IIRC. And it is easy to imagine what that means for a station that has just been struck by an earthquake anyway.
My whole point is: your comment makes it appear as if the safety layers were actually plentiful, and I would (respectfully, of course) disagree with that. I think the layering was poor.
That point is important: if NPPs aren't built safely, then what is? My guess is: nothing.
So what to do? Design for failure. (Politically, technically, economically, can be applied everywhere.)
> So what to do? Design for failure. (Politically, technically, economically, can be applied everywhere.)
Yup.
On a similar note, the French response to Fukushima Daiichi is rather interesting (France relies on nuclear generation for over 80% of its electricity):
"The ASN has also come up with an elegant technical solution to get around the (universal) dilemma of how to protect a plant from external threats, such as natural disasters. The report recommends that all reactors, irrespective of their perceived vulnerability, should add a 'hard core' layer of safety systems, with control rooms, generators and pumps housed in bunkers able to withstand physical threats far beyond those that the plants themselves are designed to resist."
(And a mobile emergency force that can move in and stabilize a reactor after an unforeseen catastrophic disaster that kills everyone on-site and destroys most of the safety systems.)
In other words, they now expect unpredictable Bad Things to happen and are trying to build a flexible framework for dealing with them, rather than simply relying on procedures for addressing the known problems.
": you need to be able to face a situation where _all_ you smart emergency systems fail. "
Totally agree with you here. Until recently, no nuclear power plant was designed to survive the failure of all of its emergency systems. Hopefully, with the negative repercussions of Fukushima on the industry, engineers are rethinking their approach to nuclear power.
> I'm fairly certain that the higher-ups in Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.
Is this really a valid conclusion to come to at this point? I expect downtime in any service I operate. It's just how the world works. Does that mean I don't understand the risks and am misleading the board?
Any assessment of risk entails certain larger assumptions about the world, many of which often turn out to be mere guesses.
Consider all the prices that are set to their current levels b/c nobody expects the collapse of the US political system to occur. Yet there is a nonzero probability that it will occur.
On one hand this seems like an absurd example, yet it exemplifies the kind of blind spot we are prone to when assessing risk. We generally address all the risks we can directly control, then classify the rest as "systemic" which essentially means that we are not able to compute them so we're going to ignore them.
Yet many systems which we assume to be stable or predictable (governments, companies, markets, weather patterns, social trends, etc.) have unexpected aberrations now and then which can have very significant consequences. Since these tend to impact most companies equally, the market will converge on an equilibrium where no firms do anything to hedge against these things.
Do you want to pay extra bank fees so that your bank can hedge against the collapse of the US currency for your checking account? Probably not. Do you want to triple your hosting costs to hedge against a massive US power grid failure? Probably not. The same applies to asteroid risk and sudden ice age risk.
On the other hand, if you have lots of money saved, you may wish to hedge against the collapse of one currency or another, and if your business would end if you suffered a few hours of downtime, you might want to invest in massive amounts of redundancy.
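As a back-of-the-envelope sketch of that trade-off (all figures invented), the decision comes down to comparing the price of the hedge with the expected loss it removes:

    # Hedge only when the expected loss you avoid exceeds what the hedge costs.
    hosting_cost = 10_000            # assumed monthly hosting spend
    hedge_cost = 2 * hosting_cost    # "triple your hosting costs" = +2x per month

    p_grid_failure = 1 / 1200        # assumed ~once-a-century event, per month
    loss_if_down = 500_000           # assumed business impact of the outage

    expected_monthly_loss = p_grid_failure * loss_if_down
    print("hedge cost per month:   ", hedge_cost)                        # 20000
    print("expected loss per month:", round(expected_monthly_loss, 2))   # 416.67
    # For most businesses the hedge costs far more than the risk it removes;
    # for a business that dies after a few hours of downtime, loss_if_down is
    # effectively unbounded and the same arithmetic flips the other way.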
Every morning when we all commute to work we risk death. Some exposure to systemic risk is considered acceptable, and part of the character of any person or business is the kind of risk exposure we tolerate day to day. A doctor working in an AIDS clinic risks needle sticks and HIV, a startup doubling its users each month risks downtime but also risks a cash flow crisis.
Your examples describe two entirely different systems. The failover of a software product is drastically different from the failover of a power system. Trying to map everything back to a common best practice under the category of "risk" seems like it would miss out on important intricacies.
Risk management is about determining how to identify risks, as such, it is applicable everywhere. However, much like security is applicable everywhere, securing Fort Knox is a very different endeavor than securing a web site.
That's not really fair; we as humans are bad at risk analysis. Bruce Schneier has written extensively on the role of cognitive biases et al. in measuring perceived risk.
I've been looking for years for a good reference that says "humans are bad at probability estimation", but this one doesn't seem to work: he's simply saying that we're very good at estimating regular risks (if I understand correctly).
You've never done DR, have you? It's a business process with a cost, and there are RTOs (recovery time objectives) and RPOs (recovery point objectives). Systems can and will go down. So long as the recovery meets the defined objectives, then DR has been performed correctly. There is a limited amount of money and resources that businesses can spend on DR, COOPs, etc. You should understand that.
A one-hour RTO and RPO will cost way more than 24-hour recovery. Edit: and the business managers decide how much they wish to spend on DR. It's a trade-off, and anyone who has ever done it understands that.
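A rough sketch of how those objectives turn into a spreadsheet decision (the tier names, costs, and probabilities below are all made up): the business picks the cheapest tier whose expected outage loss it can tolerate.

    # Tighter RTO/RPO targets mean hotter standby infrastructure and more
    # frequent replication, so each tier trades yearly cost against outage length.
    TIERS = {
        #  name                 (rto_hours, rpo_hours, yearly_cost)
        "hot standby":          (1,   0.25, 500_000),
        "warm standby":         (4,   1,    150_000),
        "restore from backup":  (24,  24,    20_000),
    }

    outages_per_year = 0.5          # assumed frequency of a DR-worthy event
    loss_per_hour_down = 10_000     # assumed revenue/productivity impact

    for name, (rto, rpo, cost) in TIERS.items():
        expected_loss = outages_per_year * rto * loss_per_hour_down
        print(f"{name:20s} total yearly cost ~ {cost + expected_loss:>9,.0f}")
    # Whichever line is smallest is the "correct" DR plan for this business,
    # even though the cheapest tier guarantees a much longer outage.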
I'd guess the higher ups at Twitter have run the cost benefit in their head (and probably many spreadsheets) plenty of times, and in most cases spending your limited resource on disaster recovery preparation just isn't worth it. Their site being down does not qualify as a "disaster" - they'll be back up soon, then we'll all be tweeting away again within minutes.
"Disaster recovery" is a term. I'm not saying it's a disaster in the sense that some horrible thing has happened. Like most people said, it's maybe annoying to some of Twitter's most avid users but fairly innocuous.
The point is that at least in the case of Heroku and EC2 (I'm not sure what caused Twitter's outage yet), the causes of failures weren't something like a "16-sigma" event like a plane hitting an electrical tower (a tragic event that happened in Palo Alto a couple of years ago). They were things like insufficiently tested software and processes, and misconfigured devices. These things do not add cost, except maybe incrementally more man-hours in terms of testing and auditing of configurations. They are not million-dollar diesel units that require permits, etc.
My point is that if a device misconfiguration can take down EC2, or a single piece of bad data in a stream can cause massive failure, it means the entire system is much more fragile than it's been sold as. If they didn't realize this was the case, it means the risk of failure was a lot higher than they had assessed.
This is a great point, and this isn't just about Twitter but also about many other sites and services that seem to depend on it. It looks like a lot of people have created a distributed system version of dependency hell for themselves, where they rely on a multitude of third parties not to change behaviour or go down. Additionally, in many cases and perhaps perfectly legitimately from a cost-benefit perspective, the envisaged way to recover from this kind of problem is to assume that people can quickly and frantically hack their way out of it at short notice.
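A quick illustration of how that dependency stack compounds (the availability figures are invented): a site that hard-depends on several third parties is, at best, only as available as the product of their availabilities.

    # Each dependency looks fine on its own; multiplied together, someone
    # else's outages add up to a surprising amount of your downtime.
    deps = {
        "your own stack":   0.999,
        "hosting provider": 0.999,
        "auth provider":    0.9995,
        "twitter API":      0.998,
        "payment gateway":  0.9995,
    }

    availability = 1.0
    for name, a in deps.items():
        availability *= a

    downtime_hours = (1 - availability) * 24 * 365
    print(f"combined availability: {availability:.4f}")              # ~0.9950
    print(f"expected downtime:     {downtime_hours:.0f} hours/year") # ~44
    # Nearly two days a year of outages, none of which are under your control.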
(expanding my reply) No risk assessment in the world will stop a cage monkey from tripping over a pile of 1Us and falling onto the big red button. Figure out what your pain threshold is and live with it.
I know of a chap, called Terry, who became known as Total Network Terry. TNT. Because he accidentally plugged the wrong cable into the wrong jack in the wrong cable cupboard on the wrong floor.
Just finding it took days to sort out, apparently.
That's like when I unplugged what I thought was our T1 line, but it turned out it was a neighbor's, who happened to be an ISP, which was a new business direction we were just moving into. And they didn't believe it was an accident...
What I was trying to point out is that catastrophe is inevitable and risk assessment isn't a panacea. I'd rather invest money in proactive countermeasures to unknown risks than try to think of all the things that could go wrong (which you'll never have the money in your life to fix anyway).
But also to the thread parent's comment about Twitter execs not realizing there was a minor risk of catastrophic failure: executives only care that their money-making baby keeps running. In the past I've seen execs demand that an engineer call them at 3AM if the production site goes down for more than 5 minutes... even though that call is pointless. I think they just assume there's no point in getting involved with the plan because the plan will never be perfect, but at least they can be aware of a problem so they can cover their asses and tell a higher-up that it's being worked on. At the end of the day, even the guys at the top don't really give a shit about the product, they just care about their paycheck.
Not necessarily. Even if the expected value of taking a risk is positive, taking the risk can be undesirable, e.g. due to large variance.
Statistical distributions that include a risk-aversion parameter exist precisely to model this kind of problem (unfortunately their names have slipped my mind, otherwise I'd provide links).
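For a concrete textbook-style illustration (log utility is just one common way to encode risk aversion, not necessarily the distributions meant above): a bet with positive expected value can still be rationally declined.

    # With a concave (risk-averse) utility the downside hurts more than the
    # upside helps, so a positive-EV bet with large variance gets rejected.
    import math

    wealth = 100_000
    # Bet: 50% win 60,000 / 50% lose 50,000 -> expected value +5,000.
    outcomes = [(0.5, 60_000), (0.5, -50_000)]

    expected_value = sum(p * x for p, x in outcomes)
    expected_utility_take = sum(p * math.log(wealth + x) for p, x in outcomes)
    utility_pass = math.log(wealth)

    print("expected value of bet:", expected_value)                  # 5000.0
    print("take the bet?", expected_utility_take > utility_pass)     # False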
It is simply a matter of perceived value and cost-benefit. Why would a CIO spend millions on DR when the probability of disaster is so minute that the risk manager cannot even calculate it? OK, there is a risk that a plane will hit the PDC: .00000002%. And will our business ultimately grind to a halt? Or can we use a manual workaround until backups recover to the SDC and we capture the data lost since the last backup? I mean, I have a hard time taking this sort of risk seriously unless I'm running dialysis machines and someone's life is at risk.
> ... and that there is an opportunity for "consultants" who understand this, kind of like security consultants?
Risk management consultants already exist! Many companies just choose to assess risk internally, or the consultants themselves lack practical experience.