This event is predicted in Sidney Dekker’s book “Drift into Failure”, which basically postulates that in order to prevent local failures we set up failure-prevention systems that increase complexity beyond our ability to handle it, and thereby introduce systemic failures that are global. It’s a sobering book to read if you ever thought we could make systems fault tolerant.
More local expertise is really the only answer. Any organization that just outsources everything is prone to this. Not that organizations that don't outsource aren't prone to other things, but at least their failures will be asynchronous.
Funny thing is that for decades there were predictions that we would need millions more IT workers. It was assumed companies needed local knowledge in-house. Instead what we got was more and more outsourced systems and centralized services. Today this is one of the many downsides.
The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive. So companies that do their own IT would be routinely outcompeted by ones that outsource, only for the latter to get into trouble when the black swan swoops in. By then, all the other kinds of companies are mostly extinct, unless their investors had the superhuman foresight and discipline to invest for years in something that, year after year, looks like losing money.
> The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive.
Is that because of economies of scale or because the vendor is just cutting costs while hiding their negligence?
I don't understand how a single vendor was able to deploy an update to all of these systems virtually simultaneously, and _that_ wasn't identified as a risk. This smells of mindless box checking rather than sincere risk assessment and security auditing.
Kinda both, I think, with the addition of the principal-agent problem. If you found a formula that provides the client with an acceptable CYA picture, it is very scalable. The model of "an IT person knowledgeable in security, modern threats, and the company's business" is not very scalable. The former, as we now know, is prone to catastrophic failures, but those are rare enough that a particular decision-maker isn't bothered by them.
Depressing thought that this phenomenon is some kind of Nash equilibrium: in the space of competition between firms, the equilibrium is for companies to outsource IT labor, saving on IT costs and passing those savings on to whatever service they provide. -> Firms that outsource out-compete their competition + expose their services to black-swan catastrophic risk.
Is regulation the only way out of this, from a game theory perspective?
The whole market in which CrowdStrike can exist is a result of regulation, albeit bad regulation.
And since the returns of selling endpoint protection are increasing with volume, the market can, over time, only be an oligopoly or monopoly.
It is a skewed market with artificially increased demand.
Also, the outsourcing is not only about cost and compliance. There is at least a third force: in a situation like this, no CTO who bought CrowdStrike products will be blamed. He did what was considered best industry practice (a box-ticking approach to security). From their perspective it is risk mitigation.
In theory, since most security incidents (not this one) involve the loss of personal customer data, if end customers were willing to pay a premium for proper handling of their data, AND if firms that don’t outsource and instead pay for competent administrators within their hierarchy had a means of signaling that, the equilibrium could be pushed to where you would like it to be.
Those are two very questionable ifs.
Also, how do you recognise a competent administrator (even IT companies have problems with that), and how many are available in your area (you want them to live in the vicinity), even if you are willing to pay them like the most senior devs?
If you want to regulate the problem away, a lot of influencing factors have to be considered.
Also a major point in The Black Swan, where Taleb argues that it is better for banks to fail more often than to be protected from any adversity. Eventually they become "too big to fail", and if something is too big to fail, you are fragile to a catastrophic failure.
I was wondering when someone would bring up Taleb RE: this incident.
I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and a large number of experts spent years warning people about it and being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyway.
That is interesting, where does he talk about this? I'm curious to hear his reasoning. What I remember from the Black Swan is that Black Swan events are (1) rare, (2) have a non-linear/massive impact, (3) and easy to predict retrospectively. That is, a lot of people will say "of course that happened" after the fact but were never too concerned about it beforehand.
Apart from a few doomsayers, I am not aware of anybody who was warning us about a CrowdStrike type of event. I don't know much about public health, but it was my understanding that there were playbooks for an epidemic.
Even if we had a proper playbook (and we likely do), the failure is so distributed that one would need a lot of books and a lot of incident commanders to fix the problem. We are dead in the water.
I think it was "predicted" by Sunburst, the Solarwinds hack.
I don't think centrally distributed anti-virus software is the only way to maintain reliability. Rather, I'd say companies tend to centralize things like administration because it's cost-effective, and because they aren't actually concerned about a global outage like this.
JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.
Many systems are fault tolerant, and many systems can be made fault tolerant. But once you drift into a level of complexity spawned by many levels of dependencies, it definitely becomes more difficult for system A to understand the threats from system B and so on.
Do you know of any fault tolerant system? Asking because in all the cases I know, when we make a system "fault tolerant" we increase its complexity and introduce new systemic failure modes related to the fault-tolerance machinery itself, making the system effectively not fault tolerant.
In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.
You can make a system tolerant to certain faults. Other faults are left "untolerated".
A system that can tolerate anything, and so has perfect availability, seems clearly impossible. So yeah, you're totally right, it's always a tradeoff. That's reasonable, as long as you trade smart.
I wonder if the people deciding to install Crowdstrike are aware of this. If they traded intentionally, and this is something they accepted, I guess it's fine. If not... I further wonder if they will change anything in the aftermath.
There will be lawsuits, there will be negotiations for better contracts, and likely there will be processes put in place to make it look like something was done at a deeper level. And yet this will happen again next year or the year after, at another company. I would be surprised if there was a risk assessment for the software that is supposed to be the answer to the risk assessment in the first place. Will be interesting to see what happens once the dust settles.
- This system has a single point of failure; it is not fault tolerant. Let's introduce these three things to make it fault-tolerant.
- Now you have three single points of failure...
It really depends on the size of the system and the definition of fault tolerance. If I have a website calling out to 10 APIs and one API failure takes down the site, that is not fault tolerance. If that one API failure gets caught and the rest operate as normal, that is fault tolerance, but 10% of the system is down. If you go to almost any site and open the dev console, you'll see errors coming from parts of the system; that is fault tolerance. Any twin-engine airplane is fault tolerant...until both engines fail. I would say the solar system is fault tolerant, the universe even more so if you consider it a system.
tldr there are levels to fault tolerance and I understand what you are saying. I am not sure if you are advocating for getting rid of fault handling, but generally you can mitigate the big scary monsters and what is left is the really edge case issues, and there really is no stopping one of those from time to time given we live in a world where anything can happen at anytime.
This instance really seems like a human related error around deployment standards...and humans will always make mistakes.
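The "one API fails, the other nine keep serving" pattern from the earlier comment can be sketched in a few lines. This is a minimal illustration, not anyone's production code; the service names and the deliberately failing `fetch` are made up for the example:

```python
import concurrent.futures

def fetch(name):
    # Hypothetical per-service call; in real code this would be an
    # HTTP request with a timeout. "recommendations" simulates an outage.
    if name == "recommendations":
        raise TimeoutError(f"{name} did not respond")
    return f"{name}: ok"

def render_page(services):
    """Call each backing service independently, so one failure degrades
    only its own section instead of taking down the whole page."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(fetch, s): s for s in services}
        for fut in concurrent.futures.as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                # The failure is caught and contained: this section is
                # marked unavailable, the rest of the page still renders.
                results[name] = f"{name}: unavailable ({exc})"
    return results

page = render_page(["search", "profile", "recommendations"])
```

This is the "fault tolerance, but 10% of the system is down" case: the site stays up, with one degraded section, and the remaining risk is exactly the human/deployment errors mentioned above.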
Well, you usually put a load balancer in front of multiple instances of your service to handle individual server failures. In the basic no-LB case, your single server fails, you restart it, and move on (a local failure). With a load balancer, the LB introduces its own global risks: the load balancer itself can fail, which you can restart, but it can also have a bug and stop handling the sticky sessions your servers rely on. Now you have a much harder-to-track brown-out affecting every one of your users for a longer time; it's hard to diagnose, might end up with hard-to-fix data and transaction issues, and restarting the whole thing might not be enough.
So yeah, there is no fault tolerance if the timeframe is large enough, there are just less events, with much higher costs. It's a tradeoff.
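The tradeoff described above can be made concrete with a toy failover loop. This is a sketch under stated assumptions (the backends are fake callables standing in for real servers): per-server crashes are masked, but the failover code itself is new shared machinery whose bugs would hit every request at once.

```python
def call_with_failover(replicas, request):
    """Try each replica in turn. A single server crash (local failure)
    is masked, but this routing logic is itself a new global dependency:
    a bug here affects every request, like a buggy load balancer."""
    last_error = None
    for server in replicas:
        try:
            return server(request)
        except ConnectionError as exc:
            last_error = exc  # local failure: fall through to the next replica
    # Only a correlated, global failure reaches the caller.
    raise RuntimeError("all replicas failed") from last_error

# Hypothetical backends: one is down, the other answers.
def down(req):
    raise ConnectionError("server unreachable")

def up(req):
    return f"handled {req}"

print(call_with_failover([down, up], "GET /"))
```

The frequent, cheap event (one `ConnectionError`) is absorbed; what remains is the rare, expensive event where everything fails together, which is exactly the trade the comment describes.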
The cynic in me thinks that the one advantage of these complex CYA systems is that when they fail catastrophically, like CrowdStrike did, we can all "outsource" the blame to them.
It's also in line with arguments made by Ted Kaczynski (the Unabomber)
> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.
Crazy how much he was right. If he hadn't gone down the path of violence out of self-loathing and anger, he might have lived to see a huge audience and following.
I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.
There was a quote last year during the "Twitter files" hearing, something like, "it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly".
Perhaps ironically, I had a difficult time using Google to find the exact wording of the quote or its source. The only verbatim result was from a NYPost article about the hearing.
>I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.
Be realistic: none of his ideas would be blacklisted. They sound good on paper, but the instant it's time for everyone to return to mud huts and farming, 99% of people will return to their PlayStations and ACs.
He wasn't "silenced" because the government was out to get him, no one talks about his ideas because they are just bad. Most people will give up on ecofascism once you tell them that you won't be able to eat strawberries out of season.
"would be blacklisted, deplatformed, or deamplified by consolidated authorities"
Sorry. Not true. You have Black Swan (Taleb) and Drift into Failure (Dekker) among many other books. These ideas are very well known to anyone who makes the effort.
The only thing that got the Unabomber blacklisted is that he started sending bombs to people. His manifesto was a dime a dozen; half the time you can expect a politician boosting such stuff for temporary polling wins.
Hell, if we take his alleged cousins (I haven't vetted the genealogy tree), his body count isn't even that impressive.
I think a surprising amount of people already share this view, even if they don't go into extensive treatment with references like Dekker presumably does (I haven't read it).
I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" while John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.
You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers for how to achieve this without having to increase complexity; in fact, it often shows that simpler systems increase resiliency.
Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.
> if you ever thought we could make systems fault tolerant
The only possible way to fault tolerance is simplicity, and then more simplicity.
Things like CrowdStrike take the opposite approach: add a lot of fragile complexity attempting to catch problems, while introducing more attack surface than they remove. This will never succeed.
As an architect of secure, real-time systems, the hardest lesson I had to learn is there's no such thing as a secure, real-time system in the absolute sense. Don't tell my boss.
I haven't read it, but I'd take a leap and presume it's somewhere between the people who say "C is unsafe" and those who say "some other language takes care of all of these things".