"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."
This is where they admit that:
1. They deployed changes to their software directly to customer production machines;
2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
3. This was cosmically stupid and they’re going to stop doing that.
Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
Combined with this, presented as a change they could potentially make, it's a killer:
> Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
They weren't doing any test deployments at all before blasting the world with an update? Reckless.
Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means the updates just won't get pushed and those organizations will be chronically exposed.
That said, maybe Crowdstrike should consider validating every step of the delivery pipeline before pushing to customers.
Presumably you could roll out to 1% and report issues back to the vendor before the update was applied to the last 99%. So a headache but not "stop the world and reboot" levels of hassle.
Those eager would take it immediately, those conservative would wait (and be celebrated by C-suite later when SHTF). Still a much better scenario than what happened.
> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means the updates just won't get pushed and those organizations will be chronically exposed.
I have a similar feeling.
At the very least perhaps have an "A" and a "B" update channel, where "B" is x hours behind A. This way if, in an HA configuration, one side goes down there's time to deal with it while your B-side is still up.
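To make the A/B idea concrete, here is a minimal sketch, assuming a hypothetical update service where channel B simply trails channel A by a fixed lag; `ContentRelease` and `get_release_for_channel` are invented names for illustration, not anything CrowdStrike actually exposes.

```python
# Sketch of an A/B update channel where B lags A by a fixed delay.
# All names here (ContentRelease, get_release_for_channel) are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

B_CHANNEL_LAG = timedelta(hours=6)  # "x hours behind A"

@dataclass
class ContentRelease:
    version: str
    published_at: datetime

def get_release_for_channel(releases: list[ContentRelease],
                            channel: str,
                            now: Optional[datetime] = None) -> Optional[ContentRelease]:
    """Return the newest release a host on this channel is allowed to run.

    Channel A gets the newest release immediately; channel B only sees
    releases that have already been live on A for B_CHANNEL_LAG.
    """
    now = now or datetime.utcnow()
    cutoff = now if channel == "A" else now - B_CHANNEL_LAG
    eligible = [r for r in releases if r.published_at <= cutoff]
    return max(eligible, key=lambda r: r.published_at, default=None)
```

In an HA pair you'd pin one node to channel A and its partner to channel B, so a bad release can take down at most one side while the lag window is open, and a release that melts channel A can be withdrawn before B ever sees it.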
> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means the updates just won't get pushed and those organizations will be chronically exposed.
Being chronically exposed may be the right call, in the same way that Roman cities didn't have walls.
> So for instance if you run a ransomware business and shut down, like, a marketing agency or a dating app or a cryptocurrency exchange until it pays you a ransom in Bitcoin, that’s great, that’s good money. A crime, sure, but good money. But if you shut down the biggest oil pipeline in the U.S. for days, that’s dangerous, that’s a U.S. national security issue, that gets you too much attention and runs the risk of blowing up your whole business. So:
>> In its own statement, the DarkSide group hinted that an affiliate may have been behind the attack and that it never intended to cause such upheaval.
>> In a message posted on the dark web, where DarkSide maintains a site, the group suggested one of its customers was behind the attack and promised to do a better job vetting them going forward.
>> “We are apolitical. We do not participate in geopolitics,” the message says. “Our goal is to make money and not creating problems for society. From today, we introduce moderation and check each company that our partners want to encrypt to avoid social consequences in the future.”
> If you want to use their ransomware software to do crimes, apparently you have to submit a resume demonstrating that you are good at committing crimes. (“Hopeful affiliates are subject to DarkSide’s rigorous vetting process, which examines the candidate’s ‘work history,’ areas of expertise, and past profits among other things.”) But not too good! The goal is to bring a midsize company to its knees and extract a large ransom, not to bring society to its knees and extract terrible vengeance.
> We have talked about this before, and one category of crime that a ransomware compliance officer might reject is “hacks that are so big and disastrous that they could call down the wrath of the US government and shut down the whole business.” But another category of off-limits crime appears to be “hacks that are so morally reprehensible that they will lead to other criminals boycotting your business.”
>> A global ransomware operator issued an apology and offered to unlock the data targeted in a ransomware attack on Toronto’s Hospital for Sick Children, a move cybersecurity experts say is rare, if not unprecedented, for the infamous group.
>> LockBit’s apology, meanwhile, appears to be a way of managing its image, said [cybersecurity researcher Chester] Wisniewski.
>> He suggested the move could be directed at those partners who might see the attack on a children’s hospital as a step too far.
> If you are one of the providers, you have to choose your hacker partners carefully so that they do the right amount of crime: You don’t want incompetent or unambitious hackers who can’t make any money, but you also don’t want overly ambitious hackers who hack, you know, the US Department of Defense, or the Hospital for Sick Children. Meanwhile you also have to market yourself to hacker partners so that they choose your services, which again requires that you have a reputation for being good and bold at crime, but not too bold. Your hacker partners want to do crime, but they have their limits, and if you get a reputation for murdering sick children that will cost you some criminal business.
> I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
Absolutely this is what will happen.
I don't know much about how AV-definition-style updates are handled across the cybersecurity industry, but I imagine no vendor does rolling updates today because rolling updates require opt-in/opt-out, which slows how quickly the vendor can respond to an attack, which in turn affects their "reputation" as well.
"I bought Vendor-A's solution but I still got hacked and had to pay ransomware" (with the side note: because I did not apply the latest critical AV-definition update) is what vendors worry about.
Now that this Global Outage happened, it will change the landscape a bit.
> If you don't send them fast to your customer and your customer gets compromised, your reputation gets hit.
> If you send them fast, this BSOD happened.
> It's more like damn if you do, damn if you don't.
What about notifications? If someone has an update policy that disables auto-updates on a critical piece of infrastructure, you can still let them know that a critical update is available. Then they can follow their own checklist to make sure everything goes well.
Okay, but who has more domain knowledge about when to deploy? The "security expert" who created the "security product" that operates with root privileges and full telemetry, or the IT staff member who looked at said "security expert's" value proposition and didn't have an issue with it?
Honestly, this reads as a suggestion that even more blame ought to be shifted to the customer.
> They deployed changes to their software directly to customer production machines; 2. They didn’t allow their clients any opportunity to test those changes before they took effect; and 3. This was cosmically stupid and they’re going to stop doing that.
Is it really all that surprising? This is basically their business model - it's a fancy virus scanner that is supposed to instantly respond to threats.
> They didn’t allow their clients any opportunity to test those changes before they took effect
I’d argue that anyone who agrees to this is the idiot. Sure, they share blame for being the source of the problem, but any CXO who signed off on software that a third party can update whenever they’d like is also at fault. It’s not an “if” situation, it’s a “when”.
I felt exactly the same when I read about the outage. What kind of CTO would allow 3rd party "security" software to automatically update? That's just crazy. Of course, your own security team would do some careful (canary-like) upgrades locally... run for a bit... run some tests, then sign-off. Then upgrade in a staged manner.
This is a great point that I never considered. Many companies subscribing to CrowdStrike services probably thought they took a shortcut to completely outsource their cyber-security needs. Oops, that was a mistake.
>I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
If indeed this happens, I'd hail this event as a victory overall; but industry experience tells me that most of those companies will say "it'd never happen with us, we're a lot more careful", and keep doing what they're doing.
I really wish we would get some regulation as a result of this. I know people that almost died due to hospitals being down. It should be absolutely mandatory for users, IT departments, etc. to be able to control when and where updates happen on their infrastructure but *especially* so for critical infrastructure.
But canary / smoke tests, you can do, if the vendor provides the right tools.
It's a cycle: pick the latest release, do some small-cluster testing, including rollback testing, then roll out to 1%; if those machines are (mostly) still available in 5 minutes, roll out to another 2%; if that cumulative 3% is (mostly) still available in 5 minutes, roll out to another 4%, etc. If updates are fast and everything works, it goes quickly. If there's a big problem, you'll still have a lot of working nodes. If there's a small problem, you have a small problem.
It's gotta be automated though, but with an easy way for a person to pause if something is going wrong that the automation doesn't catch. If the pace is several updates a day, that's too much for people, IMHO.
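A minimal sketch of that loop, assuming a hypothetical fleet API that can push an update to a fraction of hosts and report how many of them are still checking in; `deploy_to_fraction`, `healthy_fraction` and `pause_requested` are made-up callables, not a real vendor interface.

```python
# Sketch of an automated exponential rollout with a health gate and a manual
# pause hook. The three callables are hypothetical fleet-management hooks.
import time

HEALTH_THRESHOLD = 0.98   # halt if more than ~2% of updated hosts disappear
SOAK_SECONDS = 300        # "still available in 5 minutes"

def staged_rollout(deploy_to_fraction, healthy_fraction, pause_requested):
    covered = 0.0
    step = 0.01                              # start with 1% of the fleet
    while covered < 1.0:
        if pause_requested():
            print("rollout paused by operator at %.0f%%" % (covered * 100))
            return False
        target = min(covered + step, 1.0)
        deploy_to_fraction(target)           # push to everything up to `target`
        time.sleep(SOAK_SECONDS)             # let the new canaries soak
        if healthy_fraction(target) < HEALTH_THRESHOLD:
            print("health gate tripped at %.0f%%, halting rollout" % (target * 100))
            return False
        covered = target
        step *= 2                            # 1% -> +2% -> +4% -> ...
    return True
```

The operator pause hook is the "easy way for a person to step in" part; everything else runs unattended.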
Which EDR vendor provides a mechanism for testing virus signatures? This is the first time I'm hearing it and I'd like to learn more to close that knowledge gap. I always thought they are all updated ASAP, no exceptions.
Microsoft Defender isn't the most sophisticated EDR out there, but you can manage its updates with WSUS. It's been a long time since I've been subject to a corporate imposed EDR or similar, but I seem to recall them pulling updates from a company owned server for bandwidth savings, if nothing else. You can trickle update those with network controls even if the vendor doesn't provide proper tools.
If corporate can't figure out how to manage software updates on their managed systems, then the EDR software is the very command-and-control malware it's supposed to prevent.
I work on a piece of software that is installed on a very large number of servers we do not own. The crowd strike incident is exactly our nightmare scenario. We are extremely cautious about updates, we roll it out very slowly with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the crowdstrike incident and share it with anyone who complains about how slow the update process is.
The two golden rules are to let host owners control when to update whenever possible, and when it isn’t to deploy very very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism. So your change gets all the same deployment safety guardrails and automated tests and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop) rollback or at least pause the deployment.
I don’t have much sympathy for CrowdStrike but deploying slowly seems mutually exclusive to protecting against emerging threats. They have to strike a balance.
In CrowdStrikes case, they could have rolled out to even 1 million endpoints first and done an automated sanity/wellness check before unleashing the content update on everyone.
In the past when I have designed update mechanisms I’ve included basic failsafes such as automated checking for a % failed updates over a sliding 24-hour window and stopping any more if there’s too many failures.
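Something like that failsafe is straightforward to sketch. The telemetry feed here is hypothetical (in practice it would be agent check-in or crash reports), but the sliding-window bookkeeping is the whole trick:

```python
# Sketch of a sliding-window failsafe: stop pushing updates if the failure
# rate over the last 24 hours crosses a threshold.
from collections import deque
from time import time

WINDOW_SECONDS = 24 * 3600
MAX_FAILURE_RATE = 0.02   # assumption: halt at >2% failed updates

class UpdateFailsafe:
    def __init__(self):
        self.reports = deque()              # (timestamp, succeeded: bool)

    def record(self, succeeded, now=None):
        self.reports.append((now or time(), succeeded))

    def updates_allowed(self, now=None):
        now = now or time()
        # Drop reports that have aged out of the 24-hour window.
        while self.reports and self.reports[0][0] < now - WINDOW_SECONDS:
            self.reports.popleft()
        if not self.reports:
            return True
        failures = sum(1 for _, ok in self.reports if not ok)
        return failures / len(self.reports) <= MAX_FAILURE_RATE
```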
yeah, I don't get the "we couldn't have tested it" crap, because "something happens to the payload after we tested it". Create a fake downstream company and put a bunch of machines in it. That's your final test before releasing to the rest of the world.
Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
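For illustration, here is what "never trust the content" can look like in a consumer. The real sensor is a kernel driver written in C, and this file format (magic header plus length-prefixed records) is entirely invented; the point is just the bounds checks and the fall-back to last-known-good content instead of a crash.

```python
# Illustration only: treat channel files as untrusted input, and fall back to
# the last known-good content rather than crashing. The format is made up.
import struct

MAGIC = b"CHNL"

class InvalidContentError(Exception):
    pass

def parse_channel_file(data: bytes) -> list[bytes]:
    if len(data) < 8 or data[:4] != MAGIC:
        raise InvalidContentError("bad header")
    (count,) = struct.unpack_from("<I", data, 4)
    records, offset = [], 8
    for _ in range(count):
        if offset + 4 > len(data):
            raise InvalidContentError("truncated record header")
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        if offset + length > len(data):      # bounds check before every read
            raise InvalidContentError("record overruns file")
        records.append(data[offset:offset + length])
        offset += length
    return records

def load_content(new_blob: bytes, last_known_good: list[bytes]) -> list[bytes]:
    try:
        return parse_channel_file(new_blob)
    except InvalidContentError:
        # Log and keep running on the previous content instead of crashing.
        return last_known_good
```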
> it sounds like they might have separate "validation" code
That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."
Lesson learned, a "Validator" that is not actually the same program that will be parsing/reading the file in production, is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.
I'd argue that it is completely useless. They have the actual parser that runs in production and then a separate "test parser" that doesn't actually reflect reality? Why?
Maybe they have the same parser in the validator and the real driver, but the vagaries of the C language mean that when undefined behavior is encountered, it may crash or it may work just by chance.
I understand what you're saying. But ~8.5 million machines in 78 minutes isn't a fluke caused by undefined behavior. All signs so far indicate that they would have caught this if they'd had even a modest test fleet. And that's setting aside the ways they could have prevented it before it reached that point.
That's beside the point. Of course they need a test fleet. But in the absence of that, there's a very real chance that the existing bug triggered on customer machines but not their validator. This thread is speculating on the reason why their existing validation didn't catch this issue.
> very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes"
That stood out to me as well.
Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.
This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.
That's a good comparison to add to the list for this topic, thanks. An example a non-techie can understand, where a client program is consuming data blobs produced by the creator of the program.
And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?
Indeed, the very first thing they should be doing is adding fuzzing of their sensor to the test suite, so that it's not possible (or astronomically unlikely) for any corrupt content to crash the system.
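Even a dumb mutation fuzzer would likely have caught this class of bug. A sketch, written against a hypothetical parser and its error type (for example a `parse_channel_file` that raises `InvalidContentError` on malformed input); a serious effort would use coverage-guided fuzzing against the real driver code rather than Python:

```python
# A crude fuzz-style loop: mutate valid content at random and assert the
# parser only ever fails with its own controlled error type, never an
# unhandled crash. The "CHNL" blob format is invented for illustration.
import random
import struct

def make_valid_blob(records):
    body = b"".join(struct.pack("<I", len(r)) + r for r in records)
    return b"CHNL" + struct.pack("<I", len(records)) + body

def fuzz(parse, expected_error, iterations=100_000, seed=0):
    rng = random.Random(seed)
    base = make_valid_blob([b"rule-one", b"rule-two", b"\x00" * 32])
    for _ in range(iterations):
        blob = bytearray(base)
        for _ in range(rng.randint(1, 16)):          # flip a handful of bytes
            blob[rng.randrange(len(blob))] = rng.randrange(256)
        if blob and rng.random() < 0.1:              # sometimes truncate
            del blob[rng.randrange(len(blob)):]
        try:
            parse(bytes(blob))
        except expected_error:
            pass                                     # controlled rejection is fine
        # Any other exception (or, in real driver code, a crash) escapes here
        # and fails the run -- that's the bug the fuzzer just found.
```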
If the rules are Turing-complete, then sure. I don't see enough in the report to tell one way or another; the way rules are made to sound as if filling templates about equally suggests either (if templates may reference other templates) and there is not a lot more detail. Halting seems relatively easy to manage with something like a watchdog timer, though, compared to a sound, crash- and memory-safe* parser for a whole programming language, especially if that language exists more or less by accident. (Again, no claim; there's not enough available detail.)
I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.
* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.
No matter what sort of static validation they attempt, they're still risking other unanticipated effects. They could stumble upon a bug in the OS or some driver, they could cause false positives, they could trigger logspew or other excessive resource usage.
Failure can happen in strange ways. When in a position as sensitive as deploying software to far-flung machines in arbitrary environments, they need to be paranoid about those failure modes. Excuses aren't enough.
Perhaps set a timeout on the operation then? Given this is kernel it's not as easy as userspace, but I'm sure you could request to set a interrupt on a timer.
> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
It compiled, so they shipped it to everyone all at once without ever running it themselves.
Architects likely do not have a choice. These things are driven by auditors and requirements for things like insurance or PCI and it’s expensive to protest those. I know people who’ve gone full serverless just to lop off the branches of the audit tree about general purpose server operating systems, and now I’m wondering whether anyone is thinking about iOS/ChromeOS for the same reason.
The more successful path here is probably demanding proof of a decent SDLC, use of memory-safe languages, etc. in contract language.
Architects don't have a choice, CTOs are well paid to golf with the CEO and delegate to their teams, auditors just audit but aren't involved in the technical implementation, developers just develop according to the spec, and the security team is just a pain in the ass. Nobody owns it...
Everybody gets well paid, and in the end we get "lessons learned"... It's a s*&^&t show...
Some industries are forced by regulation or liability to have something like crowdstrike deployed on their systems. And crowdstrike doesn't have a lot of alternatives that tick as many checkboxes and are as widely recognized.
PCI DSS v4.0 Requirements 5 and 6 speak very broadly about anti-malware controls, which Crowdstrike provides as EDR, and cybersecurity (liability, ransomware, etc.) insurance absolutely requires it, judging from the questionnaires I’ve completed and am required to attest to.
> In its first version, PCI DSS included controls for detecting, removing, blocking, and containing malicious code (malware). Until version 3.2.1, these controls were generically referred to as "anti-virus software", which was incorrect technically because they protect not just against viruses, but also against other known malware variants (worms, trojans, ransomware, spyware, rootkits, adware, backdoors, etc.). As a result, the term "antimalware" is now used not only to refer to viruses, but also to all other types of malicious code, more in line with the requirement's objectives.
> To avoid the ambiguities seen in previous versions of the standard about which operating systems should have an anti-malware solution installed and which should not, a more operational approach has been chosen: the entity should perform a periodic assessment to determine which system components should require an anti-malware solution. All other assets that are determined not to be affected by malware should be included in a list (req. 5.2.3).
> Updates of the anti-malware solution must be performed automatically (req. 5.3.1).
> Finally, the term "real-time scanning" is explicitly included for the anti-malware solution (this is a type of persistent, continuous scanning where a scan for security risks is performed every time a file is received, opened, downloaded, copied or modified). Previously, there was a reference to the fact that anti-malware mechanisms should be actively running, which gave rise to different interpretations.
> Continuous behavioral analysis of systems or processes is incorporated as an accepted anti-malware solution scanning method, as an alternative to traditional periodic (scheduled and on-demand) and real-time (on-access) scans (req. 5.3.2).
Besides things like FedRAMP mentioned in other comments, some large enterprise customers, especially banks, require terms in the contract stating the vendor uses some form of anti-malware software.
They don't care. CI/CD, like QA, is considered a cost center for some of these companies. The cheapest thing for them is to offload the burden of testing every configuration onto the developer, who is also going to be tasked with shipping as quickly as possible or getting canned.
Claw back executive pay, stock, and bonuses imo and you'll see funded QA and CI teams.
It sure sounds like the "Content Validator" they mention is a form of CI/CD. The problem is that it passed that validation, but was capable of failing in reality.
The content validator is a form of validation done in CI. Their CD pipeline is the bigger problem here: it was extremely reckless given the system it was used in (configuring millions of customer machines in unknown environments). A CD pipeline for a tiny startup's email service can just deploy straight away. Crowdstrike (as they finally realized) need a CD pipeline with much more rigorous validation.
This also becomes a security issue at some point. If these updates can go in untested, what's to stop a rogue employee from deliberately pushing a malicious update?
I know insider threats are very hard to protect against in general but these companies must be the most juicy target for state actors. Imagine what you could do with kernel space code in emergency services, transport infrastructure and banks.
CrowdStrike is more than big enough to have a real 2000’s-style QA team. There should be actual people with actual computers whose job is to break the software and write bug reports. Nothing is deployed without QA sign off, and no one is permitted to apply pressure to QA to sign off on anything. CI/CD is simply not sufficient for a product that can fail in a non-revertable way.
A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.
Besides missing the actual testing (!), the staged rollout (!), looks like they also weren't fuzzing this kernel driver that routinely takes instant worldwide updates. Oops.
This feels natural, though: the first time you do something you do it 10x more slowly because there's a lot more risk. Continuing to do things like that forever isn't realistic. Complacency is a double-edged sword: sometimes it gets us to avoid wasting time and energy on needless worry (the first time someone drives a car they go 5 mph and brake at anything surprising), sometimes it gets us to be too reckless (drivers forgetting to check blind spots or driving at dangerous speeds).
> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
I don't read it as _bypassing tests_. They have tested the interpreter (`template type`) when it was first released, and they have _validated_ the new template instance (via `content validator`) and assumed this is enough, because it was enough in the past. None of the steps in the usual process were bypassed, and everything was done by the (their) book.
But it looks to me there's no integration test in the process at all. They're effectively unit testing the interpreter (template type), unit testing (validating) the "code" (template instance), but their testing strategy never actually runs the code on the interpreter (or, executes the template instance against the template type).
You can't bypass the tests if you don't have them? <insert meme here>
They didn't even bother to do the simplest smoke test of all: running their software on a vanilla configuration. Remind me again, because I have trouble understanding what exactly we're trying to argue here.
In my experience with outages, usually the problem lies in some human error not following the process: Someone didn't do something, checks weren't performed, code reviews were skipped, someone got lazy.
In this post mortem there are a lot of words, but not one of them actually explains what the problem was, which is: what was the process in place and why did it fail?
They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?
> In my experience with outages, usually the problem lies in some human error not following the process
Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.
> what kind of bug? Could it have been prevented with proper testing or code review?
It doesn't matter what the exact details of the bug are. A validator being an imperfect match for the thing it's meant to protect is a failure mode. They happened to trip that failure mode spectacularly.
Also saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Short of a culture of rubber-stamping and yolo-merging where there is something to do, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer in code review. But they could also have been (and were) missed. "git gud" is not an incident prevention strategy, it's wishful thinking or blaming the devs unlucky enough to break it.
More useful as follow-ups are things like "this type of failure mode feels very dangerous, we can do something to make those failures impossible or much more likely to be caught"
> ...what was the process in place and why did it fail?
It appears the process was:
1. Channel files are considered trusted; so no need to sanity-check inputs in the sensor, and no need to fuzz the sensor itself to make sure it deals gracefully with corrupted channel files.
2. Channel files are trusted if they pass a Content Validator. No additional testing is needed; in particular, the channel files don't even need to be smoke-tested on a real system.
3. A Content Validator is considered 100% effective if it has been run on three previous batches of channel files without incident.
Now it's possible that there were prescribed steps in the process which were not followed; but those too are to be expected if there is no automation in place. A proper process requires some sort of explicit override to skip parts of it.
"Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."
So they did not test this update at all, even locally. Its going to be interesting how this plays out in courts. The contract they have with us limits their liability significantly, but this - surely - is gross negligence.
As I understand, it is incredibly difficult to prove "gross negligence". It is better to pressure them to settle in a giant class action lawsuit. I am curious what the total amount of settlements / fines will be in the end. I guess ~2B USD.
Same here. Our losses were quite significant - between lost productivity, inability to provide services, inability of our clients to actually use contracted services, and having to fix their mess - it's very easily in the millions.
And then there will be the costs of litigation. It was crazy in the IT department over the weekend, but not much less crazy in our legal teams, who were being bombarded with pitches from law firms offering help in recovery. It will be a fun space to watch, and this 'we haven't tested because we, like, did that before and nothing bad happened' statement in the initial report will be quoted in many lawsuits.
To be clear: I do not expect the settlement to bankrupt them, but I do expect it to be painful. And when you say "easily in the millions" -- good luck demonstrating that in a class action lawsuit and having the judge believe you. It is much harder than people think. You will be lucky to recoup 10% of those expenses after a settlement. Your company may also have cyber-security insurance. (Yes, the insurance companies will join the class action lawsuit, but you cannot get blood from a stone. There will be limits on the settlement size.)
It’s endemic in the tech security industry - they’ve been mentally colonised by ex-mil and ex-law enforcement (wannabe mil) folks for a long time.
I try to use social work terms and principles in professional settings, which blows these people’s minds.
Advocacy, capacity evaluation, community engagement, cultural competencies, duty of care, ethics, evidence-based intervention, incentives, macro-, mezzo- and micro-practice, minimisation of harm, respect, self concept, self control etc etc
It means that my teams aren’t focussed on “nuking the bad guys from orbit” or whatever, but building defence in depth and indeed our own communities of practice (hah!), and using psychological and social lenses as well as tech and adversarial ones to predict, prevent and address disruptive and dangerous actors.
Even computer security itself is a metaphor (at least in its inception). I often wonder: what if, instead of using terms like access, key, illegal operation, firewall, etc., we'd chosen metaphors from a different domain, for example plumbing? I'm sure a plumbing metaphor could be found for every computer security concern. Would we be so quick to romanticize, as well as militarize, a field dealing with "leaks," "blockages," "illegal taps," and "water quality"?
The sensor isn't a host, machine, or a client. It's the software component that detects threats. I guess maybe you could call it an agent instead, but I think sensor is pretty accepted terminology in the EDR space - it's not specific to Crowdstrike.
Because those things are different? I didn't see any "military" jargon. There is absolutely nothing unusual about their wording. It's like someone saying "why do these people use such nerdy words" about HN content.
This reads like a bunch of baloney to obscure the real problem.
The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed. The architectural changes are the more interesting bits, and they're covered reasonably well. Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code. Your fourth one is a fair point: building in watchdogs of some sort to prevent a crashloop would be good. Also having a remote killswitch that can be checked before turning the sensor on would have helped in containing the damage of a crashloop. Your last one I feel like is mostly redundant with a lot of the follow-ups they did commit to.
It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
>I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed.
I was not talking about the code that crashed.
I guess what I wrote was non-obvious enough that it needs an explanation:
— fixing whatever produced "problematic content":
The release doesn't talk about the subsystem that produced the "problematic content". The part that crashed was the interpreter (consumer of the content); the part that generated the "problematic content" might have worked as intended, for all we know.
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes:
I am not talking about fixing this particular crash.
I am talking about design choices that allow such crashes in principle.
In this instance, the interpreter seemed to have been reading memory addresses from a configuration file (or something that would be equivalent to doing that). Adding an additional check will fix this bug, but not the fundamental issue that an interpreter should not be doing that.
>The architectural changes are the more interesting bits, and they're covered reasonably well
They are not covered at all. Are we reading the same press release?
>Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code.
Yes, that's the problem I am pointing out: the "validator" and "interpreter" should be the same code. The "validator" can issue commands to a mock operating system instead of doing real API calls, but it should go through the input with the actual interpreter.
In other words, the interpreter should be a part of the validator.
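A sketch of that arrangement: validation constructs the same interpreter the sensor ships, swaps the OS-facing actions for a recording stub, and actually loads and executes every template instance. All of the class and method names here (`NullOSActions`, `load`, `execute`) are invented for illustration, not CrowdStrike's real internals.

```python
# Sketch of "the validator is the interpreter": validation runs the same
# interpreter code the sensor runs, only with OS-facing actions stubbed out.
class NullOSActions:
    """Stand-in for real kernel actions; records calls instead of acting."""
    def __init__(self):
        self.calls = []
    def block_process(self, pid): self.calls.append(("block_process", pid))
    def quarantine_file(self, path): self.calls.append(("quarantine_file", path))

def sample_events():
    # A small corpus of representative telemetry events to drive the rules.
    return [{"type": "process_start", "image": r"C:\Windows\notepad.exe"}]

def validate(template_instance_bytes: bytes, interpreter_cls) -> bool:
    """Return True only if the production interpreter can fully load and
    execute this instance against a mocked environment without error."""
    interpreter = interpreter_cls(actions=NullOSActions())
    try:
        program = interpreter.load(template_instance_bytes)   # same parser as prod
        interpreter.execute(program, events=sample_events())  # same dispatch as prod
    except Exception:
        return False
    return True
```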
>It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
Sure; that's my subjective assessment. Personally, I am very dissatisfied with their post-mortem. If you are happy with it, that's fair, but you'd need to say more if you want to make a point in addition to "the architectural changes are covered reasonably well".
Like, which specific changes those would be, for starters.
>Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
>Enhance existing error handling in the Content Interpreter.
They did write that they intended to fix the bugs in both the validator and the interpreter. Though it's a big mystery to me and most of the comments on the topic how an interpreter that crashes on a null template would ever get into production.
>They did write that they intended to fix the bugs
I strongly disagree.
"Add additional validation" and "enhance error handling" say about as much as "add band-aids and improve health" in response to a broken arm.
Which is not something you'd want to hear from a kindergarten that sends your kid back to you with shattered bones.
Note that the things I said were missing are indeed missing in the "mitigation".
In particular, additional checks and "enhanced" error handling don't address:
— the fact that it's possible for content to be "problematic" for interpreter, but not the validator;
— the possibility for "problematic" content to crash the entire system still remaining;
— nothing being said about what made the content "problematic" (spoiler: a bunch of zeros, but they didn't say it), how that content was produced in the first place, and the possibility of it happening in the future still remaining;
— the fact that their clients aren't in control of their own systems, have no way to roll back a bad update, and can have their entire fleet disabled or compromised by CrowdStrike in an instant;
— the business practices and incentives that didn't result in all their "mitigation" steps (as well as steps addressing the above) being already implemented still driving CrowdStrike's relationship with its employees and clients.
The latter is particularly important. This is less a software issue, and more an organizational failure.
Elsewhere on HN and reddit, people were writing that ridiculous SLA's, such as "4 hour response to a vulnerability", make it practically impossible to release well-tested code, and that reliance on a rootkit for security is little more than CYA — which means that the writing was on the wall, and this will happen again.
You can't fix bad business practices with bug fixes and improved testing. And you can't fix what you don't look into.
Hence my qualification of this "review" as a red herring.
> people were writing that ridiculous SLA's, such as "4 hour response to a vulnerability
I didn't see people explaining why this was ridiculous.
> make it practically impossible to release well-tested code
That falsely presumes the release must be code.
CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."
>I didn't see people explaining why this was ridiculous.
Because of how it affects priorities and incentives.
E.g.: as of 2024, CrowdStrike didn't implement staggered rollout of Rapid Response content. If you spend a second thinking why that's the case, you'll realize that rapid and staggered are literally antithetical.
>CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."
Well, they are lying.
The data that you feed into an interpreter is code, no matter what they want to call it.
> fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
Better not only fix this specific bug but continuously use fuzzing to find more places where external data (including updates) can trigger a crash (or worse RCE)
But it seems to me that putting the interpreter in a place in the OS where crashing the whole system is a behavior it's allowed to have is a fundamental design choice that is not at all addressed by fuzzing.
That includes a couple of bullet points under "Third Party Validation" (independent code/process reviews), which they added to the PIR on the hub page, but not on the dedicated PIR page.
Yup... now that all machines are internet connected, telemetry has replaced QA departments. There are actual people in positions of power that think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.
Such a disingenuous review; waffle and distraction to hide the important bits (or rather bit: bug in content validator) behind a wall of text that few people are going to finish.
If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.
> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability
Translation: we've filled this PIR with technobabble so that when you don't understand it you won't ask questions for fear of appearing slow.
I disagree; it's much longer than it needs to be, is filled with pseudo-technoese to hide that there's little of consequence in there, and the tiny bit of real information in there is couched with distractions and unnecessary detail.
As I understand it, they're telling us that the outage was caused by an unspecified bug in the "Content Validator", and that the file that was shipped was done so without testing because it worked fine last time.
I think they wrote what they did because they couldn't publish the above directly without being rightly excoriated for it, and at least this way a lot of the people reading it won't understand what they're saying but it sounds very technical.
No, it's one of the most well-written PIRs I've seen. It establishes terms and procedures after communicating that this isn't an RCA, then it details the timeline of tests and deployments and what went wrong. They were neither excessively verbose nor terse. This is the right way of communicating to the intended audience: technical people, executives and lawmakers alike will be reading this. They communicated their findings clearly without code, screenshots, excessive historical detail and other distractions.
If you think this is good, go look at a Cloudflare postmortem. The fly.io ones are good too.
Way less obscure language, way more detail and depth, actually owning the mistakes rather than vaguely waffling on. This write up from CrowdStrike is close to being functionally junk.
One of the first things they've stated is that this isn't an RCA (deep dive analysis) like cloudflare and fly.io's, that's not what this is. This is to brief customers and the public of their immediate post-mortem understanding of what happened. The standard for that is different than an RCA.
Do you see how they only talk about technical changes to prevent this from happening again?
To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could ever I trust them to prevent an insider from shipping a backdoor?
They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.
Well I'm glad they at least released a public postmortem on the incident. To be honest, I feel naive saying this, but having worked at a bunch of startups my whole life, I expected companies like CrowdStrike to do better than not testing it on their own machines before deploying an update without the ability to roll it back.
One lesson I've learned from this fiasco is to examine my own self when it comes to these situations. I am so befuddled by all the wild opinions, speculations and conclusions as well as observations of the PIR here. You can never have enough humility.
If I had to guess blindly based on their writeup, it would seem that if their Content Configuration System is given invalid data, instead of aborting the template, it generates a null template.
To a degree it makes sense, because it's not unusual for a template generator to return a null response when given invalid inputs; however, the Content Validator then took that null and published it instead of handling the null case as it should have.
Returning null instead of throwing an exception when an error occurs is the quality of programming I see from junior outsourced developers.
“if (corrupt digital signature) return null;”
is the type of code I see buried in authentication systems, gleefully converting what should be a sudden stop into a shambling zombie of invalid state and null reference exceptions fifty pages of code later in some controller that’s already written to the database on behalf of an attacker.
If I peer into my crystal ball I see a vision of CrowdStrike error handling code quality that looks suspiciously the same.
(If I sound salty, it’s because I’ve been cleaning up their mess since last week.)
>Returning null instead of throwing an exception when an error occurs is the quality of programming I see from junior outsourced developers.
This is kernel code, most likely written in C (and regardless of language, you don't really do exceptions in the kernel at all for various reasons).
Returning NULL or ERR_PTR (in the case of linux) is absolutely one of the most standard, common, and enforced ways of indicating an error state in kernel code, across many OS's.
So it's no surprise to see the pattern here, as you would expect.
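For what it's worth, the failure mode both comments are circling isn't the sentinel itself but the call site that forgets to check it. A language-agnostic illustration (Python standing in for kernel C, `None` standing in for NULL/ERR_PTR; none of this is CrowdStrike's actual code):

```python
# Illustration of the failure mode: an error sentinel is only as good as the
# check at every call site. All names here are hypothetical.
from typing import Optional

def parse_template(blob: bytes) -> Optional[dict]:
    if not blob or blob[:4] != b"TMPL":
        return None                       # sentinel: "this content is invalid"
    return {"rules": blob[4:]}

def publish(template: dict):
    print("publishing", len(template["rules"]), "bytes of rules")

def deploy_bad(blob: bytes):
    template = parse_template(blob)
    publish(template)                     # bug: None sails through unchecked and
                                          # blows up later, far from the real error

def deploy_good(blob: bytes):
    template = parse_template(blob)
    if template is None:                  # the check has to exist here...
        raise ValueError("refusing to publish invalid template")
    publish(template)                     # ...and at every other call site
```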
So this event is probably close to a worst case scenario for an untested sensor update. But have they never had issues with such untested updates before, like an update resulting in false positives on legitimate software? Because if they did, that should have been a clue that these types of updates should be tested too.
Crowdstrike issues false positives allll the time. They'll fix them and then they'll come back in a future update. One such false positive is an empty file. Crowdstrike hates empty files.
I feel like for a system that is this widely used and installed in such a critical position that upon a BSOD crash due to a faulting kernel module like this, the system should be able to automatically roll back to try the previous version on subsequent boot(s).
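The bookkeeping for that is old and well understood (a boot counter plus a last-known-good copy); the hard part is wiring it into early boot or the driver's own load path. A sketch of the logic only, with invented file names:

```python
# Sketch of "roll back after repeated crash-on-boot" logic. In reality this
# lives in early boot or the driver's load path; Python just shows the state
# machine. Paths and the state file are hypothetical.
import json, os, shutil

STATE_FILE = "channel_state.json"
MAX_FAILED_BOOTS = 2

def on_update_installed():
    # Called by the updater right after writing a new channel file.
    with open(STATE_FILE, "w") as f:
        json.dump({"pending": True, "failed_boots": 0}, f)

def on_boot(active="channel_291.bin", last_known_good="channel_290.bin"):
    state = {"pending": False, "failed_boots": 0}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    if state.get("pending") and state.get("failed_boots", 0) >= MAX_FAILED_BOOTS:
        # The new content never survived a boot: revert to the previous file.
        shutil.copyfile(last_known_good, active)
        state = {"pending": False, "failed_boots": 0}
    elif state.get("pending"):
        state["failed_boots"] += 1        # assume failure until proven otherwise
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def on_sensor_healthy():
    # Called once the sensor has run cleanly after boot: the update "took".
    with open(STATE_FILE, "w") as f:
        json.dump({"pending": False, "failed_boots": 0}, f)
```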
I really dislike reading website that take over half the screen and make me read off to the side like this. I can fix it by zooming in but I don't understand why they thought making the navigation take up that much of the screen or not be collapsable was a good move.
>When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception.
Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.
"On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”"
I get that canary rollout is tricky in this business, since it's all about stopping the spread of viruses and attacks.
That said, this incident review doesn't mention numbers, unless I missed it, conveying how colossal of a fuck up it was.
The reality is that they don't apologize ("bad shit just happens"), they work their engineers to the grave, make no real apology and completely screw up. This reads like a minor bump in processes.
Crowdstrike engineered the biggest computer attack the world has ever seen, with a sole purpose of preventing those. They're slowly becoming the Oracle of security and I see no sign of improvement here.
Fun post, but I'll state the obvious because I think many people do believe that every Windows machine BSOD'd. It was only ones with Crowdstrike software. Which is apparently very common but isn't actually pre-installed by Microsoft in Windows, or anything like that.
Source: work in a Windows shop and had a normal day.
True, and definitely worth a mention. This is only Microsoft's fault insofar as it was possible at all to crash this way, this broadly, with so little recourse via remote tooling.
* Their software reads config files to determine which behavior to monitor/block
* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"
* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions
* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully
> Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions
that's crazy. How costly can it be to test the file fully in a CI job? I fail to see how this wasn't implemented already.
> How costly can it be to test the file fully in a CI job?
It didn't need a CI job. It just needed one person to actually boot and run a Windows instance with the Crowdstrike software installed: a smoke test.
TFA is mostly an irrelevant discourse on the product architecture, stuffed with proprietary Crowdstrike jargon, with only a couple of paragraphs dedicated to the actual problem; and they don't mention the non-existence of a smoke test.
To me, TFA is not a signal that Crowdstrike has a plan to remediate the problem, yet.
You just got tricked by this dishonest article. The whole section that mentions dogfooding is only about actual updates to the kernel driver. This was not a kernel driver update, the entire section is irrelevant.
This was a "content file", and the first time it was interpreted by the kernel driver was when it was pushed to customer production systems worldwide. There was no testing of any sort.
It's worse than that -- if your strategy actually was to use the customer fleet as QA and monitoring, then it probably wouldn't take you an hour and a half to notice that the fleet was exploding and withdraw the update, as it did here. There was simply no QA anywhere.
> Why don't you just stand up and admit that you didn't bother testing everything you send to production?
The "What Happened on July 19, 2024?" section combined with the "Rapid Response Content Deployment" make it very clear to anyone reading that that is the case. Similarly, the discussion of the sensor release process in "Sensor Content" and lack of discussion of a release process in the "Rapid Response Content" section solidify the idea that they didn't consider validated rapid response content causing bad behavior as a thing to worry about.
It wasn't a file full of zeros that caused the problem.
While some affected users did have a file full of zeros, that was actually a result of the system in the process of trying to download an update, and not the version of the file that caused the crash.
Here is my summary with the marketing bullshit ripped out.
Falcon configuration is shipped with both direct driver updates ("sensor content"), and out of band ("rapid response content"). "Sensor Content" are scripts (*) that ship with the driver. "Rapid response content" are data that can be delivered dynamically.
One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.
"Sensor content", including the templates, are a part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.
"Rapid Response Content" is deployed through a different channel that customers do not have control over. Crowdstrike shipped a broken channel file that passed validation but was not tested.
They are going to fix this by adding testing of "rapid response" content updates and support the same rollout logic they do for the driver itself.
(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.
---
In other words, they have scripts that would crash given garbage arguments. The validator is supposed to check this before they ship, but the validator screwed it up (why is this a part of release and not done at runtime? (!)). It appears they did not test it, they do not do canary deployments or support rollout of these changes, and everything broke.
Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.
This would be the kind of vulnerability that would be worth millions of dollars and used for targeted attacks and/or by state actors. It could take years to uncover (like Pegasus, which took 5 years to be discovered) or never be uncovered at all.
Probably not, if you're implying remote code execution -- it was an out of bounds READ operation, not write, causing an immediate crash. Unlikely to be useful for anything other than taking systems offline (which can certainly be useful, but is not RCE).
It was a read operation during bytecode template initialization, in a driver that reads userland memory. An out of bound read operation to load code in a driver that maps user memory can easily lead to code execution and privilege escalation: if the attacker finds a way to get the out of bound read into memory they control, they could cause the driver to load a manufactured template and inject bytecode.
It's not clear that this specific vulnerability is exploitable, but it's exactly the kind of vulnerability that could be exploited for code execution.
> Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.
You would have to get into the supply chain to do much damage.
Otherwise, you would somehow need access to the hosts running the agent.
If you are a threat actor that already has access to hosts running CS, at a scale that would make the news, why would you blow your access on trying to ruin CS's reputation further?
Perhaps if you are a vendor of a competing or adjacent product that deploys an agent, you could deliberately try and crash the CS agent, but you would be caught.
> Copying my content from the duplicate thread[1] here
Please don't do this! It makes merging threads a pain because then we have to find the duplicate subthreads (i.e. your two comments) and merge the replies as well.
Instead, if you or anyone will let us know at hn@ycombinator.com which threads need merging, we can do that. The solution is deduplication, not further duplication!
Somehow, I never realized that duplicate threads were merged (instead of one of them being nuked), because it seems like a lot of work in the first place.
The thread is still wrong, since it was an OOB memory read, not a missing null pointer check as claimed. 0x9c is likely the value that just happened to be in the OOB read.
Not really; that thread showed only superficial knowledge and analysis, far from hitting the nail on the head for anyone used to assembly/reverse engineering. It then goes on to make provably wrong assumptions and comments. There is actually a null check (two, even!) just before the attempted memory access. The root cause is likely trying to access an address that comes from some uninitialized, wrongly initialized, or non-deterministically initialized array.
What it did well was explaining the basics nicely for a wide audience who knows nothing about a crash dump or invalid memory access, which I guess made the post popular. Good enough for a general public explanation, but doesn't pass the bar for an actual technical one to any useful degree.
"Incoming data triggered a out-of-bound memory access bug" is hardly a useful conclusion for a root cause investigation (even if you are of the faith of the single root cause).
How can these companies be certified and compliant, etc., and then in practice have horrible SDLC?
What was the impact of diverse teams (offshoring)? Often companies don’t have necessary checks to ensure disparateness of teams does not impact quality. Maybe it was zero or maybe it was more.
Standards generally don't mandate specifics and almost certainly nothing specific to SDLC. At least none I've heard of. Things like FIPS and ISO and SOC2 generally prescribe having a certain process, sometimes they can mandate some specifics (e.g. what ciphers for FIPS). Maybe there should be some release process standards that prescribe how this is done but I'm not aware of any. I think part of the problem is the standard bodies don't really know what to prescribe, this sort of has to come from the community. Maybe not unlike the historical development of other engineering professions. Today being compliant with FIPS doesn't really mean you're secure and being SOC2 compliant doesn't really mean customer data is safe etc. It's more some sort of minimal bar in certain areas of practice and process.
Sadly, I agree with your take. All it is is a minimum bar. Many who don't have the above are even worse - though not necessarily, but as a rule probably yes.
No, but their release process should catch major bugs such as this. After internal QA, you release to small internal dev team, then to select members of other depts willing to dog-food it, then limited external partners then GA? Or something like that so that you have multiple opportunities to catch weird software/hardware interactions before bringing down business critical systems for major and small companies around the planet?
> After internal QA, you release to small internal dev team, then to select members of other depts willing to dog-food it, then limited external partners then GA
What about AV definition update for 0day swimming in the tubes right now?
Sure, those have happened before, but nothing with an impact like last weekend. That's inexcusable. At least definitions can update themselves out of trouble.
What are you referring to with "those have happened before"?
Isn't that what happened? Not a software update, not an AV-definition update but more so an AV-definition "data" update. At least that's how I interpret "Rapid Response Content"
"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."
This is where they admit that:
1. They deployed changes to their software directly to customer production machines; 2. They didn’t allow their clients any opportunity to test those changes before they took effect; and 3. This was cosmically stupid and they’re going to stop doing that.
Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.