Preliminary Post Incident Review (crowdstrike.com)
200 points by cavilatrest 63 days ago | 210 comments



There’s only one sentence that matters:

"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."

This is where they admit that:

1. They deployed changes to their software directly to customer production machines;

2. They didn’t allow their clients any opportunity to test those changes before they took effect; and

3. This was cosmically stupid and they’re going to stop doing that.

Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.


Combined with this, presented as a change they could potentially make, it's a killer:

> Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

They weren't doing any test deployments at all before blasting the world with an update? Reckless.


> our staging environment, which consists of a variety of operating systems and workloads

they have a staging environment at least, but no idea what they were running in it or what testing was done there.


Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

That said, maybe Crowdstrike should consider validating every step of the delivery pipeline before pushing to customers.


> That said, maybe Crowdstrike should consider validating every step of the delivery pipeline before pushing to customers.

If they'd just had a lab of a couple dozen PCs acting as canaries they'd have caught this. Apparently that was too complicated or expensive for them.


Why can't they just do it more like Microsoft security patches, making them mandatory but giving admins control over when they're deployed?


That would be equivalent to asking "would you prefer your fleet to bluescreen now, or later" in this case.


Presumably you could roll out to 1% and report issues back to the vendor before the update was applied to the last 99%. So a headache but not "stop the world and reboot" levels of hassle.


With the slight difference that you can stop applying the update once you notice the bluescreens


Those eager would take it immediately, those conservative would wait (and be celebrated by C-suite later when SHTF). Still a much better scenario than what happened.


> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

I have a similar feeling.

At the very least perhaps have an "A" and a "B" update channel, where "B" is x hours behind A. This way if, in an HA configuration, one side goes down there's time to deal with it while your B-side is still up.


> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

Being chronically exposed may be the right call, in the same way that Roman cities didn't have walls.

Compare this perspective from Matt Levine:

https://archive.is/4AvgO

> So for instance if you run a ransomware business and shut down, like, a marketing agency or a dating app or a cryptocurrency exchange until it pays you a ransom in Bitcoin, that’s great, that’s good money. A crime, sure, but good money. But if you shut down the biggest oil pipeline in the U.S. for days, that’s dangerous, that’s a U.S. national security issue, that gets you too much attention and runs the risk of blowing up your whole business. So:

>> In its own statement, the DarkSide group hinted that an affiliate may have been behind the attack and that it never intended to cause such upheaval.

>> In a message posted on the dark web, where DarkSide maintains a site, the group suggested one of its customers was behind the attack and promised to do a better job vetting them going forward.

>> “We are apolitical. We do not participate in geopolitics,” the message says. “Our goal is to make money and not creating problems for society. From today, we introduce moderation and check each company that our partners want to encrypt to avoid social consequences in the future.”

> If you want to use their ransomware software to do crimes, apparently you have to submit a resume demonstrating that you are good at committing crimes. (“Hopeful affiliates are subject to DarkSide’s rigorous vetting process, which examines the candidate’s ‘work history,’ areas of expertise, and past profits among other things.”) But not too good! The goal is to bring a midsize company to its knees and extract a large ransom, not to bring society to its knees and extract terrible vengeance.

https://archive.is/K9qBm

> We have talked about this before, and one category of crime that a ransomware compliance officer might reject is “hacks that are so big and disastrous that they could call down the wrath of the US government and shut down the whole business.” But another category of off-limits crime appears to be “hacks that are so morally reprehensible that they will lead to other criminals boycotting your business.”

>> A global ransomware operator issued an apology and offered to unlock the data targeted in a ransomware attack on Toronto’s Hospital for Sick Children, a move cybersecurity experts say is rare, if not unprecedented, for the infamous group.

>> LockBit’s apology, meanwhile, appears to be a way of managing its image, said [cybersecurity researcher Chester] Wisniewski.

>> He suggested the move could be directed at those partners who might see the attack on a children’s hospital as a step too far.

> If you are one of the providers, you have to choose your hacker partners carefully so that they do the right amount of crime: You don’t want incompetent or unambitious hackers who can’t make any money, but you also don’t want overly ambitious hackers who hack, you know, the US Department of Defense, or the Hospital for Sick Children. Meanwhile you also have to market yourself to hacker partners so that they choose your services, which again requires that you have a reputation for being good and bold at crime, but not too bold. Your hacker partners want to do crime, but they have their limits, and if you get a reputation for murdering sick children that will cost you some criminal business.


> I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

Absolutely this is what will happen.

I don't know much about how AV-definition-like features are handled across the cybersecurity industry, but I would imagine no vendor does rolling updates today because that would involve opt-in/opt-out, which could slow down how quickly the vendor responds to an attack, which in turn affects their "reputation" as well.

"I bought Vendor-A solution but I got hacked and have to pay Ransomware" (with a side note: because I did not consume the latest critical update of AV definition) is what Vendors worried.

Now that this Global Outage happened, it will change the landscape a bit.


>Now that this Global Outage happened, it will change the landscape a bit.

I seriously doubt that. Questions like "why should we use CrowdStrike" will be met with "suppose they've learned their lesson".


I'm referring to the landscape how current Cybersecurity vendors deliver "detection definition" (for lack of better phrase) to their customers.

If you don't send them fast to your customer and your customer gets compromised, your reputation gets hit.

If you send them fast, this BSOD happened.

It's damned if you do, damned if you don't.


> If you don't send them fast to your customer and your customer gets compromised, your reputation gets hit.

> If you send them fast, this BSOD happened.

> It's damned if you do, damned if you don't.

What about notifications? If someone has an update policy that disables auto-updates on a critical piece of infrastructure, you can still let them know that a critical update is available. Then they can follow their own checklist to ensure everything goes well.


What if they're sleeping and won't read the notification until they wake up?

Wouldn't they get compromised?


most people will defer updates indefinitely if they are able to.


Okay, but who has more domain knowledge about when to deploy? The "security expert" that created the "security product" that operates with root privileges and full telemetry, or the IT staff member who looked at said "security expert's" value proposition and didn't have an issue with it?

Honestly, this reads as a suggestion that even more blame ought to be shifted to the customer.


The AV definition delivery is part of UX of the product.


> They deployed changes to their software directly to customer production machines; 2. They didn’t allow their clients any opportunity to test those changes before they took effect; and 3. This was cosmically stupid and they’re going to stop doing that.

Is it really all that surprising? This is basically their business model - it's a fancy virus scanner that is supposed to instantly respond to threats.


> They didn’t allow their clients any opportunity to test those changes before they took effect

I’d argue that anyone that agrees to this is the idiot. Sure they have blame for being the source of the problem, but any CXO that signed off on software that a third party can update whenever they’d like is also at fault. It’s not an “if” situation, it’s a “when”.


I felt exactly the same when I read about the outage. What kind of CTO would allow 3rd party "security" software to automatically update? That's just crazy. Of course, your own security team would do some careful (canary-like) upgrades locally... run for a bit... run some tests, then sign-off. Then upgrade in a staged manner.


Pretty sure many people see the point of having Falcon as a reason to not have an internal security team.

Outsource everything.


This is a great point that I never considered. Many companies subscribing to CrowdStrike services probably thought they took a shortcut to completely outsource their cyber-security needs. Oops, that was a mistake.


> They deployed changes to their software directly to customer production machines

This is part of the premise of EDR software.


>I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

If indeed this happens, I'd hail this event as a victory overall; but industry experience tells me that most of those companies will say "it'd never happen with us, we're a lot more careful", and keep doing what they're doing.


I really wish we would get some regulation as a result of this. I know people that almost died due to hospitals being down. It should be absolutely mandatory for users, IT departments, etc. to be able to control when and where updates happen on their infrastructure but *especially* so for critical infrastructure.


Does anyone test their antivirus updates individually as a customer? I thought they happen multiple times a day, who has time for that?


Some sort of comprehensive test is unlikely.

But canary / smoke tests, you can do, if the vendor provides the right tools.

It's a cycle: pick the latest release, do some small-cluster testing, including rollback testing, then roll out to 1%; if those machines are (mostly) still available in 5 minutes, roll out to 2%; if the cumulative 3% is (mostly) still available in 5 minutes, roll out to 4%, etc. If updates are fast and everything works, it goes quickly. If there's a big problem, you'll still have a lot of working nodes. If there's a small problem, you have a small problem.

It's gotta be automated though, with an easy way for a person to pause it if something is going wrong that the automation doesn't catch. If the pace is several updates a day, that's too much for people, IMHO.
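
A minimal sketch of that cycle, with the deploy and health-check hooks stubbed out (all names and thresholds here are illustrative, not any vendor's real API):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stubbed hooks: a real system would call the update service and the
       telemetry backend here. */
    static bool push_update_to(double fleet_fraction) {
        printf("pushing update to %.0f%% of the fleet\n", fleet_fraction * 100);
        return true;
    }

    static double healthy_fraction(void) {
        /* e.g. the share of hosts in the cohort still checking in 5 minutes later */
        return 0.999; /* stubbed value */
    }

    int main(void) {
        double fraction = 0.01;                 /* start with a 1% canary */
        const double required_health = 0.99;

        while (fraction < 1.0) {
            if (!push_update_to(fraction))
                return 1;
            if (healthy_fraction() < required_health) {
                printf("health check failed at %.0f%%, halting rollout\n",
                       fraction * 100);
                return 1;                       /* pause for a human, roll back, etc. */
            }
            fraction = (fraction * 2 > 1.0) ? 1.0 : fraction * 2;  /* 1%, 2%, 4%, ... */
        }
        push_update_to(1.0);
        printf("rollout complete\n");
        return 0;
    }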


Which EDR vendor provides a mechanism for testing virus signatures? This is the first time I'm hearing it and I'd like to learn more to close that knowledge gap. I always thought they are all updated ASAP, no exceptions.


Microsoft Defender isn't the most sophisticated EDR out there, but you can manage its updates with WSUS. It's been a long time since I've been subject to a corporate imposed EDR or similar, but I seem to recall them pulling updates from a company owned server for bandwidth savings, if nothing else. You can trickle update those with network controls even if the vendor doesn't provide proper tools.

If corporate can't figure out how to manage software updates on their managed systems, the EDR software is the command-and-control malware it's supposed to prevent.


Yes? Not consumers typically, but many IT departments with certain risk profiles absolutely do.


Now let's see if Microsoft listens and fixes Windows updates.


I work on a piece of software that is installed on a very large number of servers we do not own. The CrowdStrike incident is exactly our nightmare scenario. We are extremely cautious about updates; we roll them out very slowly with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the CrowdStrike incident and share them with anyone who complains about how slow the update process is.

The two golden rules are to let host owners control when to update whenever possible, and when it isn’t, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism. That way your change gets all the same deployment safety guardrails and automated tests and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.


I don’t have much sympathy for CrowdStrike but deploying slowly seems mutually exclusive to protecting against emerging threats. They have to strike a balance.


Even a staged rollout over a few hours would have made a huge difference here. "Slow" in the context of a rollout can still be pretty fast.


But it can also still be way too slow in the context of an exploit that is being abused globally.


Sure but GP is praising "deploy so slowly that people complain."


Seriously, rolling out on some exponential scale, even over the course of 10 minutes, would have stopped this dead in its tracks.


In CrowdStrike's case, they could have rolled out to even 1 million endpoints first and done an automated sanity/wellness check before unleashing the content update on everyone.

In the past, when I have designed update mechanisms, I’ve included basic failsafes such as automatically checking the percentage of failed updates over a sliding 24-hour window and stopping any more if there are too many failures.
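
Something like the following is enough to capture that failsafe; the window length, threshold, and in-memory storage are all assumptions for illustration (a real backend would query its telemetry store instead):

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    #define WINDOW_SECONDS (24 * 60 * 60)   /* sliding 24-hour window */
    #define MAX_EVENTS 100000

    /* One record per endpoint that reported back on an update attempt. */
    struct update_event {
        time_t when;
        bool   failed;
    };

    static struct update_event events[MAX_EVENTS];
    static size_t event_count;

    void record_update_result(bool failed) {
        if (event_count < MAX_EVENTS) {
            events[event_count].when = time(NULL);
            events[event_count].failed = failed;
            event_count++;
        }
    }

    /* Returns true if the rollout should stop: the failure rate over the
       window exceeds max_failure_rate (e.g. 0.01 for 1%). */
    bool should_halt_rollout(double max_failure_rate) {
        time_t cutoff = time(NULL) - WINDOW_SECONDS;
        size_t total = 0, failures = 0;

        for (size_t i = 0; i < event_count; i++) {
            if (events[i].when < cutoff)
                continue;                   /* outside the 24-hour window */
            total++;
            if (events[i].failed)
                failures++;
        }
        if (total == 0)
            return false;                   /* nothing to judge yet */
        return (double)failures / (double)total > max_failure_rate;
    }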


They need a lab full of canaries.


yeah, I don't get the "we couldn't have tested it" crap, because "something happens to the payload after we tested it". Create a fake downstream company and put a bunch of machines in it. That's your final test before releasing to the rest of the world.


> let [...] owners control when to update

The only acceptable update strategy for all software regardless of size or importance


Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".

> Enhance existing error handling in the Content Interpreter.

That's it.

Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.

> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.

Could it say any less? I hope the new check is a test fleet.

But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".


> it sounds like they might have separate "validation" code

That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."

Lesson learned: a "Validator" that is not the same program that will be parsing/reading the file in production is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.


I'd argue that it is completely useless. They have the actual parser that runs in production and then a separate "test parser" that doesn't actually reflect reality? Why?


Maybe they have the same parser in the validator and the real driver, but the vagaries of the C language mean that when undefined behavior is encountered, it may crash or it may work just by chance.
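
A toy illustration of that point, using an invented one-byte-count "channel file" format (nothing here reflects CrowdStrike's actual code): the same C parser can sail through a validator's run by sheer luck and still blow up elsewhere.

    #include <stdio.h>
    #include <stddef.h>

    /* Toy format: the first byte claims how many entries follow. If the
       count lies, the loop reads out of bounds -- undefined behavior in C.
       Depending on what happens to sit next to the buffer, this may print
       garbage, crash, or appear to work perfectly. */
    static int sum_entries(const unsigned char *file, size_t file_len) {
        unsigned char claimed_count = file[0];
        int sum = 0;
        for (size_t i = 0; i < claimed_count; i++)
            sum += file[1 + i];             /* never checked against file_len */
        (void)file_len;
        return sum;
    }

    int main(void) {
        unsigned char good[] = {3, 10, 20, 30};
        unsigned char bad[]  = {200, 1, 2};     /* claims 200 entries, has 2 */

        printf("good: %d\n", sum_entries(good, sizeof good));
        printf("bad:  %d\n", sum_entries(bad, sizeof bad)); /* result is luck */
        return 0;
    }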


I understand what you're saying. But ~8.5 million machines in 78 minutes isn't a fluke caused by undefined behavior. All signs so far indicate that they would have caught this if they'd had even a modest test fleet. Setting aside the ways they could have prevented it before it reached that point.


That's beside the point. Of course they need a test fleet. But in the absence of that, there's a very real chance that the existing bug triggered on customer machines but not their validator. This thread is speculating on the reason why their existing validation didn't catch this issue.



> very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes"

That stood out to me as well.

Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.

This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.


That's a good comparison to add to the list for this topic, thanks. An example a non-techie can understand, where a client program is consuming data blobs produced by the creator of the program.

And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?

Echoes of the Sony BMG rootkit.

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...


Focusing on the rollout and QA process is the right thing to do.

The bug itself is not particularly interesting, nor is the fix for it.

The astounding thing about this issue is the scale of the damage it caused, and that scale is all due to the rollout process.


Indeed, the very first thing they should be doing is adding fuzzing of their sensor to the test suite, so that it's not possible (or astronomically unlikely) for any corrupt content to crash the system.
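
For the content-parsing path that could be as small as a libFuzzer harness; this sketch assumes the parser can be built for user space, and parse_channel_file is a made-up name standing in for it:

    #include <stdint.h>
    #include <stddef.h>

    /* The parser under test -- a placeholder declaration here. Ideally it is
       the same code the sensor runs, just built for user space. */
    int parse_channel_file(const uint8_t *data, size_t size);

    /* libFuzzer entry point. Build with something like:
         clang -g -fsanitize=fuzzer,address harness.c parser.c
       The fuzzer then hammers the parser with mutated inputs and flags any
       crash or out-of-bounds access. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        parse_channel_file(data, size);
        return 0;
    }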


Is error handling enough? A perfectly valid rule file could hang (but not outright crash) the system, for example.


If the rules are Turing-complete, then sure. I don't see enough in the report to tell one way or another; the way rules are made to sound as if filling templates about equally suggests either (if templates may reference other templates) and there is not a lot more detail. Halting seems relatively easy to manage with something like a watchdog timer, though, compared to a sound, crash- and memory-safe* parser for a whole programming language, especially if that language exists more or less by accident. (Again, no claim; there's not enough available detail.)

I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.

* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.


No matter what sort of static validation they attempt, they're still risking other unanticipated effects. They could stumble upon a bug in the OS or some driver, they could cause false positives, they could trigger logspew or other excessive resource usage.

Failure can happen in strange ways. When in a position as sensitive as deploying software to far-flung machines in arbitrary environments, they need to be paranoid about those failure modes. Excuses aren't enough.


It's not paranoia if you can crash the kernel.


Perhaps set a timeout on the operation then? Given this is kernel code it's not as easy as userspace, but I'm sure you could set an interrupt on a timer.


Increase counter when you start loading

Have timeout

Decrement counter after successful load and parse

Check counter on startup. If it is like 3, maybe consider you are crashing
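
A minimal user-space sketch of that counter (the file path, the threshold of 3, and the fallback action are all placeholders; a real sensor would persist this somewhere boot-safe such as the registry):

    #include <stdio.h>

    #define COUNTER_FILE "load_attempts.txt"
    #define CRASH_LOOP_THRESHOLD 3

    static int read_counter(void) {
        FILE *f = fopen(COUNTER_FILE, "r");
        int n = 0;
        if (f) {
            if (fscanf(f, "%d", &n) != 1)
                n = 0;
            fclose(f);
        }
        return n;
    }

    static void write_counter(int n) {
        FILE *f = fopen(COUNTER_FILE, "w");
        if (f) {
            fprintf(f, "%d\n", n);
            fclose(f);
        }
    }

    int main(void) {
        int attempts = read_counter();

        if (attempts >= CRASH_LOOP_THRESHOLD) {
            printf("crash loop suspected, falling back to last known good content\n");
            /* load the previous channel files here instead of the newest ones */
            write_counter(0);
            return 0;
        }

        write_counter(attempts + 1);        /* bump the counter before the risky work */

        /* ... load and parse the new content here ... */
        int parsed_ok = 1;                  /* stubbed result */

        if (parsed_ok)
            write_counter(0);               /* clear the counter only on success */
        return 0;
    }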


> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

It compiled, so they shipped it to everyone all at once without ever running it themselves.

They fell short of "works on my machine".


> How Do We Prevent This From Happening Again?

> Software Resiliency and Testing

> * Improve Rapid Response Content testing by using testing types such as:

> * Local developer testing

So no one actually tested the changes before deploying?!


And why is it "local developer testing" and not CI/CD? This makes them look like absolute amateurs.


> This makes them look like absolute amateurs.

This also applies to all the architects and CTOs at these Fortune 500 companies who allowed these self-updating systems into their critical systems.

I would offer a copy of Antifragile to each of these teams: https://en.wikipedia.org/wiki/Antifragile_(book)

"Every captain goes down with every ship"


Architects likely do not have a choice. These things are driven by auditors and requirements for things like insurance or PCI and it’s expensive to protest those. I know people who’ve gone full serverless just to lop off the branches of the audit tree about general purpose server operating systems, and now I’m wondering whether anyone is thinking about iOS/ChromeOS for the same reason.

The more successful path here is probably demanding proof of a decent SDLC, use of memory-safe languages, etc. in contract language.


> Architects likely do not have a choice.

Architects don't have a choice, CTOs are well paid to golf with the CEO and delegate to their teams, auditors just audit but are not involved with the technical implementations, developers just develop according to the spec, and the security team is just a pain in the ass. Nobody owns it...

Everybody gets well paid, and at the end we have to do lessons learned... It's a s*&^&t show...


Some industries are forced by regulation or liability to have something like crowdstrike deployed on their systems. And crowdstrike doesn't have a lot of alternatives that tick as many checkboxes and are as widely recognized.


Please give me an example of that specific regulation.


There's a whole body of regulation around service providers to the U.S. Government making it an effective requirement to use this stuff, starting with the FedRAMP Authorization Act (https://www.congress.gov/117/bills/hr7776/BILLS-117hr7776enr...).

See also Section 4.2.4 of the FedRAMP Moderate Readiness Assessment Report (RAR) which can be found here: https://www.fedramp.gov/documents-templates/ as an example.

You cannot obtain an Authorization To Operate (ATO) unless you've satisfied the Assessor that you're in compliance.


PCI DSS v4.0 Requirements 5 and 6 speak very broadly about anti-malware controls, which Crowdstrike provides as EDR, and cybersecurity (liability, ransomware, etc.) insurance absolutely requires it, judging from the questionnaires I’ve completed and am required to attest to.

> In its first version, PCI DSS included controls for detecting, removing, blocking, and containing malicious code (malware). Until version 3.2.1, these controls were generically referred to as "anti-virus software", which was incorrect technically because they protect not just against viruses, but also against other known malware variants (worms, trojans, ransomware, spyware, rootkits, adware, backdoors, etc.). As a result, the term "antimalware" is now used not only to refer to viruses, but also to all other types of malicious code, more in line with the requirement's objectives.

> To avoid the ambiguities seen in previous versions of the standard about which operating systems should have an anti-malware solution installed and which should not, a more operational approach has been chosen: the entity should perform a periodic assessment to determine which system components should require an anti-malware solution. All other assets that are determined not to be affected by malware should be included in a list (req. 5.2.3).

> Updates of the anti-malware solution must be performed automatically (req. 5.3.1).

> Finally, the term "real-time scanning" is explicitly included for the anti-malware solution (this is a type of persistent, continuous scanning where a scan for security risks is performed every time a file is received, opened, downloaded, copied or modified). Previously, there was a reference to the fact that anti-malware mechanisms should be actively running, which gave rise to different interpretations.

> Continuous behavioral analysis of systems or processes is incorporated as an accepted anti-malware solution scanning method, as an alternative to traditional periodic (scheduled and on-demand) and real-time (on-access) scans (req. 5.3.2).

https://www.advantio.com/blog/analysis-of-pci-dss-v4.0-part-...


Besides things like FedRAMP mentioned in other comments, some large enterprise customers, especially banks, require terms in the contract stating the vendor uses some form of anti-malware software.


Seems like everyone thinks that execs play golf with other execs to seal the deal regardless of how b0rken the system is.

That CTO's job is on the line if the system can't meet the requirement, more so if the system is fucked.

To think that every CTO is a dumbass is like saying "everyone is stupid, except me, of course".


Not all CTOs... but you just saw hundreds of companies who could do better...


That is true, hundreds of companies have no backup process in place :D


They don't care, CI/CD, like QA, is considered a cost center for some of these companies. The cheapest thing for them is to offload the burden of testing every configuration onto the developer, who is also going to be tasked with shipping as quickly as possible or getting canned.

Claw back executive pay, stock, and bonuses imo and you'll see funded QA and CI teams.


It sure sounds like the "Content Validator" they mention is a form of CI/CD. The problem is that it passed that validation, but was capable of failing in reality.


The content validator is a form of validation done in CI. Their CD pipeline is the bigger problem here: it was extremely reckless given the system it was used in (configuring millions of customer machines in unknown environments). A CD pipeline for a tiny startup's email service can just deploy straight away. Crowdstrike (as they finally realized) need a CD pipeline with much more rigorous validation.


The fact that they even listed "local developer testing" is pretty weird.

That is just part of the basic process and is hardly the thing that ensures a problem like this doesn't happen.


This also becomes a security issue at some point. If these updates can go in untested, what's to stop a rogue employee from deliberately pushing a malicious update?

I know insider threats are very hard to protect against in general but these companies must be the most juicy target for state actors. Imagine what you could do with kernel space code in emergency services, transport infrastructure and banks.


CrowdStrike is more than big enough to have a real 2000’s-style QA team. There should be actual people with actual computers whose job is to break the software and write bug reports. Nothing is deployed without QA sign off, and no one is permitted to apply pressure to QA to sign off on anything. CI/CD is simply not sufficient for a product that can fail in a non-revertable way.

A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.


Besides missing the actual testing (!), the staged rollout (!), looks like they also weren't fuzzing this kernel driver that routinely takes instant worldwide updates. Oops.


check their developer github, "i write kernel-safe bytecode interpreters" :D, [link redacted]


He Codes With Honor(tm)


They bypassed the tests and staged deployment, because their previous update looked good. Ha.

What if they implemented a release process, and followed it? Like everyone else does. Hackers at the workplace, sigh.


Also it must have been a manual testing effort, otherwise there would be no motive to skip it. IOW, missing test automation.


This feels natural, though: the first time you do something you do it 10x more slowly because there's a lot more risk. Continuing to do things like that forever isn't realistic. Complacency is a double-edged sword: sometimes it gets us to avoid wasting time and energy on needless worry (the first time someone drives a car they go 5 mph and brake at anything surprising), sometimes it gets us to be too reckless (drivers forgetting to check blind spots or driving at dangerous speeds).


Where do you see that? It looks like there was a bug in the template tester. Or do you mean the manual tests?


> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.


I don't read it as _bypassing tests_. They have tested the interpreter (`template type`) when it was first released, and they have _validated_ the new template instance (via `content validator`) and assumed this is enough, because it was enough in the past. None of the steps in the usual process were bypassed, and everything was done by the (their) book.

But it looks to me there's no integration test in the process at all. They're effectively unit testing the interpreter (template type), unit testing (validating) the "code" (template instance), but their testing strategy never actually runs the code on the interpreter (or, executes the template instance against the template type).


> I don't read it as _bypassing tests_.

You can't bypass the tests if you don't have them? <insert meme here>

They didn't even bother to do the simplest smoke test of running their software on a vanilla configuration. Remind me again what exactly we're trying to argue here, because I'm having trouble following.


They know better obviously, transcending process and bureaucracy.


Same thing happened with Falcon on Debian before. Later they admitted that they didn't test some platforms they were releasing. Never heard of Docker?

How can you keep on with such a Q&R manager? He'll cost them billions


Docker wouldn't help with testing kernel modules. You'd need a VM.


In my experience with outages, usually the problem lies in some human error not following the process: Someone didn't do something, checks weren't performed, code reviews were skipped, someone got lazy.

In this post-mortem there are a lot of words, but not one of them actually explains what the problem was, which is: what was the process in place and why did it fail?

They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?


> In my experience with outages, usually the problem lies in some human error not following the process

Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

> what kind of bug? Could it have been prevented with proper testing or code review?

It doesn't matter what the exact details of the bug are. A validator and the thing it tries to defend being imperfect mates is a failure mode. They happened to trip that failure mode spectacularly.

Also saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Short of a culture of rubber-stamping and yolo-merging where there is something to do, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer in code review. But they could also have been (and were) missed. "git gud" is not an incident prevention strategy, it's wishful thinking or blaming the devs unlucky enough to break it.

More useful as follow-ups are things like "this type of failure mode feels very dangerous, we can do something to make those failures impossible or much more likely to be caught"


> Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

You can't reliably fix problems you don't understand.


> ...what was the process in place and why did it fail?

It appears the process was:

1. Channel files are considered trusted; so no need to sanity-check inputs in the sensor, and no need to fuzz the sensor itself to make sure it deals gracefully with corrupted channel files.

2. Channel files are trusted if they pass a Content Validator. No additional testing is needed; in particular, the channel files don't even need to be smoke-tested on a real system.

3. A Content Validator is considered 100% effective if it has been run on three previous batches of channel files without incident.

Now it's possible that there were prescribed steps in the process which were not followed; but those too are to be expected if there is no automation in place. A proper process requires some sort of explicit override to skip parts of it.


"Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."

So they did not test this update at all, even locally. It's going to be interesting how this plays out in the courts. The contract they have with us limits their liability significantly, but this - surely - is gross negligence.


As I understand, it is incredibly difficult to prove "gross negligence". It is better to pressure them to settle in a giant class action lawsuit. I am curious what the total amount of settlements / fines will be in the end. I guess ~2B USD.


Same here. Our losses were quite significant - between lost productivity, inability to provide services, inability of our clients to actually use contracted services, and having to fix their mess - it's very easily in the millions.

And then there will be the costs of litigation. It was crazy in the IT department over the weekend, but not much less crazy in our legal teams, who were being bombarded with pitches from law firms offering help in recovery. It will be a fun space to watch, and this 'we haven't tested because we, like, did that before and nothing bad happened' statement in the initial report will be quoted in many lawsuits.


To be clear: I do not expect the settlement to bankrupt them, but I do expect it to be painful. And, when you say "easily in the millions" -- good luck demonstrating that in a class action lawsuit and having the judge believe you. It is much harder than people think. You will be lucky to recoup 10% of those expenses after a settlement. Your company may also have cyber-security insurance. (Yes, the insurance companies will join the class action lawsuit, but you cannot get blood from a stone. There will be limits on the settlement size.)


Why do they insist on using what sounds like military pseudo jargon throughout the document?

E.g. "sensors"? I mean, how about hosts, machines, clients?


It’s endemic in the tech security industry - they’ve been mentally colonised by ex-mil and ex-law enforcement (wannabe mil) folks for a long time.

I try to use social work terms and principles in professional settings, which blows these people’s minds.

Advocacy, capacity evaluation, community engagement, cultural competencies, duty of care, ethics, evidence-based intervention, incentives, macro-, mezzo- and micro-practice, minimisation of harm, respect, self concept, self control etc etc

It means that my teams aren’t focussed on “nuking the bad guys from orbit” or whatever, but building defence in depth and indeed our own communities of practice (hah!), and using psychological and social lenses as well as tech and adversarial ones to predict, prevent and address disruptive and dangerous actors.

YMMV though.


Even computer security itself is a metaphor (at least in its inception). I often wonder what if instead of using terms like access, key, illegal operation, firewall, etc. we'd instead chosen metaphors from a different domain, for example plumbing. I'm sure a plumbing metaphor could also be found for every computer security concern. Would we be so quick to romanticize as well as militarize a field dealing with "leaks," "blockages," "illegal taps," and "water quality"?


“Fatbergs” expresses some things delivered by some teams very eloquently for me!


Alternate dimension PR comment: looks flushable to me


"military grade encryption!" aka just AES-256

always makes me laugh


The sensor isn't a host, machine, or a client. It's the software component that detects threats. I guess maybe you could call it an agent instead, but I think sensor is pretty accepted terminology in the EDR space - it's not specific to Crowdstrike.


Because those things are different? I didn't see a single piece of "military" jargon; there is absolutely nothing unusual about their wording. It's like someone saying "why do these people use such nerdy words" about HN content.


This reads like a bunch of baloney to obscure the real problem.

The only relevant part you need to see:

>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Problematic content? Yeah, that tells us exactly nothing.

Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.

Conspicuously absent:

— fixing whatever produced "problematic content"

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test

— allowing the sysadmins to roll back updates before the OS boots

— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients

This is a nothing sandwich, not an incident review.


I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed. The architectural changes are the more interesting bits, and they're covered reasonably well. Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code. Your fourth one is a fair point: building in watchdogs of some sort to prevent a crashloop would be good. Also having a remote killswitch that can be checked before turning the sensor on would have helped in containing the damage of a crashloop. Your last one I feel like is mostly redundant with a lot of the follow-ups they did commit to.

It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.


>I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed.

I was not talking about the code that crashed.

I guess what I wrote was non-obvious enough that it needs an explanation:

— fixing whatever produced "problematic content":

The release doesn't talk about the subsystem that produced the "problematic content". The part that crashed was the interpreter (consumer of the content); the part that generated the "problematic content" might have worked as intended, for all we know.

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes:

I am not talking about fixing this particular crash.

I am talking about design choices that allow such crashes in principle.

In this instance, the interpreter seemed to have been reading memory addresses from a configuration file (or something that would be equivalent to doing that). Adding an additional check will fix this bug, but not the fundamental issue that an interpreter should not be doing that.

>The architectural changes are the more interesting bits, and they're covered reasonably well

They are not covered at all. Are we reading the same press release?

>Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code.

Yes, that's the problem I am pointing out: the "validator" and "interpreter" should be the same code. The "validator" can issue commands to a mock operating system instead of doing real API calls, but it should go through the input with the actual interpreter.

In other words, the interpreter should be a part of the validator.
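
A sketch of what that could look like in C (every name here is invented, not anything from the PIR): the interpreter only reaches the OS through a table of callbacks, so "validation" is literally running the same interpret() routine against mocks.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stddef.h>

    struct os_ops {
        void (*block_process)(int pid);
        void (*log_event)(const char *msg);
    };

    /* The one and only interpreter. Production passes real kernel callbacks;
       the validator passes mocks. Returns false on malformed content. */
    static bool interpret(const unsigned char *content, size_t len,
                          const struct os_ops *os) {
        if (len < 2)
            return false;                   /* too short to be a valid rule */
        if (content[0] == 0xAA) {           /* toy opcode: "block this pid" */
            os->block_process(content[1]);
            return true;
        }
        os->log_event("unknown opcode");
        return false;
    }

    /* Mock backend used by the validator: records calls, touches nothing. */
    static void mock_block(int pid)       { printf("(mock) would block pid %d\n", pid); }
    static void mock_log(const char *msg) { printf("(mock) log: %s\n", msg); }

    int main(void) {
        const struct os_ops mock = { mock_block, mock_log };
        unsigned char candidate[] = { 0xAA, 42 };

        /* "Validation" is just running the real interpreter against the mocks. */
        if (interpret(candidate, sizeof candidate, &mock))
            printf("content accepted\n");
        else
            printf("content rejected\n");
        return 0;
    }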

>It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.

Sure; that's my subjective assessment. Personally, I am very dissatisfied with their post-mortem. If you are happy with it, that's fair, but you'd need to say more if you want to make a point in addition to "the architectural changes are covered reasonably well".

Like, which specific changes those would be, for starters.


>Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.

>Enhance existing error handling in the Content Interpreter.

They did write that they intended to fix the bugs in both the validator and the interpreter. Though it's a big mystery to me and most of the comments on the topic how an interpreter that crashes on a null template would ever get into production.


>They did write that they intended to fix the bugs

I strongly disagree.

Add additional validation and enhance error handling say as much as "add band-aids and improve health" in response to a broken arm.

Which is not something you'd want to hear from a kindergarten that sends your kid back to you with shattered bones.

Note that the things I said were missing are indeed missing in the "mitigation".

In particular, additional checks and "enhanced" error handling don't address:

— the fact that it's possible for content to be "problematic" for interpreter, but not the validator;

— the possibility for "problematic" content to crash the entire system still remaining;

— nothing being said about what made the content "problematic" (spoiler: a bunch of zeros, but they didn't say it), how that content was produced in the first place, and the possibility of it happening in the future still remaining;

— the fact that their clients aren't in control of their own systems, have no way to roll back a bad update, and can have their entire fleet disabled or compromised by CrowdStrike in an instant;

— the business practices and incentives that didn't result in all their "mitigation" steps (as well as steps addressing the above) being already implemented still driving CrowdStrike's relationship with its employees and clients.

The latter is particularly important. This is less a software issue, and more an organizational failure.

Elsewhere on HN and reddit, people were writing that ridiculous SLA's, such as "4 hour response to a vulnerability", make it practically impossible to release well-tested code, and that reliance on a rootkit for security is little more than CYA — which means that the writing was on the wall, and this will happen again.

You can't fix bad business practices with bug fixes and improved testing. And you can't fix what you don't look into.

Hence my qualification of this "review" as a red herring.


> people were writing that ridiculous SLA's, such as "4 hour response to a vulnerability

I didn't see people explaining why this was ridiculous.

> make it practically impossible to release well-tested code

That falsely presumes the release must be code.

CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."


>I didn't see people explaining why this was ridiculous.

Because of how it affects priorities and incentives.

E.g.: as of 2024, CrowdStrike didn't implement staggered rollout of Rapid Response content. If you spend a second thinking why that's the case, you'll realize that rapid and staggered are literally antithetical.

>CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."

Well, they are lying.

The data that you feed into an interpreter is code, no matter what they want to call it.


It's not your kid, so "improve health" is the industry standard response here.


True, but the question is why they can keep getting away with that.


What validates the Content Validator? A Content Validator Validator?


> fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

Better not only fix this specific bug but continuously use fuzzing to find more places where external data (including updates) can trigger a crash (or worse RCE)


That is indeed necessary.

But it seems to me that putting the interpreter in a place in the OS where crashing the whole system is a behavior it's even allowed to have is a fundamental design choice that fuzzing doesn't address at all.


An interpreter that handles data downloaded from the internet even. That's an exploit waiting to happen.


I guess "fight fire with fire" is great adage, so why not fight backdoors with backdoors. What can go wrong.


Also “using memory safe languages for critical components” and “detecting failures to load and automatically using the last-known-good configuration”


Direct link to the PIR, instead of the list of posts: https://www.crowdstrike.com/blog/falcon-content-update-preli...


The article link has been updated to that; it used to be the "hub" page at https://www.crowdstrike.com/falcon-content-update-remediatio...

Some updates from the hub page:

They published an "executive summary" in PDF format: https://www.crowdstrike.com/wp-content/uploads/2024/07/Crowd...

That includes a couple of bullet points under "Third Party Validation" (independent code/process reviews), which they added to the PIR on the hub page, but not on the dedicated PIR page.

> Updated 2024-07-24 2217 UTC

> ### Third Party Validation

> - Conduct multiple independent third-party security code reviews.

> - Conduct independent reviews of end-to-end quality processes from development through deployment.


I so hate it when people fill these postmortems with marketing speak. Don't they know it is counterproductive?


> How Do We Prevent This From Happening Again?

> * Local developer testing

Yup... now that all machines are internet connected, telemetry has replaced QA departments. There are actual people in positions of power that think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.


Such a disingenuous review; waffle and distraction to hide the important bits (or rather bit: bug in content validator) behind a wall of text that few people are going to finish.

If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.

> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability

Translation: we've filled this PIR with technobabble so that when you don't understand it you won't ask questions for fear of appearing slow.


> "behind a wall of text that few people are going to finish."

heh? it's not that long and very readable.


I disagree; it's much longer than it needs to be, is filled with pseudo-technoese to hide that there's little of consequence in there, and the tiny bit of real information in there is couched with distractions and unnecessary detail.

As I understand it, they're telling us that the outage was caused by an unspecified bug in the "Content Validator", and that the file that was shipped went out without testing because it worked fine last time.

I think they wrote what they did because they couldn't publish the above directly without being rightly excoriated for it, and at least this way a lot of the people reading it won't understand what they're saying but it sounds very technical.


No, it's one of the most well-written PIRs I've seen. It establishes terms and procedures after communicating that this isn't an RCA, then details the timeline of tests and deployments and what went wrong. It's neither excessively verbose nor terse. This is the right way of communicating to the intended audience: technical people, executives, and lawmakers alike will be reading this. They communicated their findings clearly without code, screenshots, excessive historical detail, and other distractions.


If you think this is good, go look at a Cloudflare postmortem. The fly.io ones are good too.

Way less obscure language, way more detail and depth, actually owning the mistakes rather than vaguely waffling on. This write up from CrowdStrike is close to being functionally junk.


One of the first things they've stated is that this isn't an RCA (deep dive analysis) like cloudflare and fly.io's, that's not what this is. This is to brief customers and the public of their immediate post-mortem understanding of what happened. The standard for that is different than an RCA.


In the current situation, it's better to be complete no?

This information is not just for _you_.


Do you see how they only talk about technical changes to prevent this from happening again?

To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could I ever trust them to prevent an insider from shipping a backdoor?

They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.


Well I'm glad they at least released a public postmortem on the incident. To be honest, I feel naive saying this, but having worked at a bunch of startups my whole life, I expected companies like CrowdStrike to do better than not testing it on their own machines before deploying an update without the ability to roll it back.


I see a path to this every day.

An actual scenario: Some developer starts working on pre deployment validation of config files. Let's say in a pipeline.

Most of the time the config files are OK.

Management says: "Why are you spending so long on this project, the sprint plan said one week, we can't approve anything that takes more than a week."

Developer: "This is harder than it looks" (heard that before).

Management: "Well, if the config file is OK then we won't have a problem in production. Stop working on it".

Developer: Stops working on it.

Config file with a syntax error slips through... The rest is history.


One lesson I've learned from this fiasco is to examine my own self when it comes to these situations. I am so befuddled by all the wild opinions, speculations and conclusions as well as observations of the PIR here. You can never have enough humility.


"We didn't properly test our update."

Should be the tl;dr. In other threads there's information about CrowdStrike slashing QA team numbers; whether that was a factor should be looked at.


They write perfect software. Why should they test it ? /s


"problematic content"? It was a file of all zero bytes. How exactly was that produced?


If I had to guess blindly based on their writeup, it would seem that if their Content Configuration System is given invalid data, instead of aborting the template, it generates a null template.

To a degree it makes sense, because it's not unusual for a template generator to provide a null response if given invalid inputs; however, the Content Validator then took that null and published it instead of handling the null case as it should have.


Returning null instead of throwing an exception when an error occurs is the quality of programming I see from junior outsourced developers.

“if (corrupt digital signature) return null;”

is the type of code I see buried in authentication systems, gleefully converting what should be a sudden stop into a shambling zombie of invalid state and null reference exceptions fifty pages of code later in some controller that’s already written to the database on behalf of an attacker.

If I peer into my crystal ball I see a vision of CrowdStrike error handling code quality that looks suspiciously the same.

(If I sound salty, it’s because I’ve been cleaning up their mess since last week.)


>Returning null instead of throwing an exception when an error occurs is the quality of programming I see from junior outsourced developers.

This is kernel code, most likely written in C (and regardless of language, you don't really do exceptions in the kernel at all for various reasons).

Returning NULL or ERR_PTR (in the case of linux) is absolutely one of the most standard, common, and enforced ways of indicating an error state in kernel code, across many OS's.

So it's no surprise to see the pattern here, as you would expect.


They've said the crash was not related to those zero bytes. https://www.crowdstrike.com/blog/falcon-update-for-windows-h...


Will managers continue to push engineers even when the engineers advise going slower, or not?


Always.


So this event is probably close to a worst case scenario for an untested sensor update. But have they never had issues with such untested updates before, like an update resulting in false positives on legitimate software? Because if they did, that should have been a clue that these types of updates should be tested too.


Crowdstrike issues false positives allll the time. They'll fix them and then they'll come back in a future update. One such false positive is an empty file. Crowdstrike hates empty files.


I feel like, for a system that is this widely used and installed in such a critical position, upon a BSOD crash due to a faulting kernel module like this, the system should be able to automatically roll back and try the previous version on subsequent boot(s).


I really dislike reading websites that take over half the screen and make me read off to the side like this. I can fix it by zooming in, but I don't understand why they thought making the navigation take up that much of the screen, and not be collapsible, was a good move.


>When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception.

Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.


They specifically denied that null bytes were the issue in an earlier update. https://www.crowdstrike.com/blog/falcon-update-for-windows-h...


Null pointers, not a null array


I'm not sure what you're saying, but note that a file fundamentally cannot contain a null pointer, a file can just contain various bytes.


Still have kernel access


1) Everything went mostly well

2) The things that did not fail went so great

3) Many many machines did not fail

4) macOS and Linux unaffected

5) Small lil bug in the content verifier

6) Please enjoy this $10 gift card

7) Every windows machine on earth bsod'd but many things worked


Regarding the gift card, TechCrunch says

"On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”"

https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-ap...


There's a KB up about this now. To use your voucher, reboot into safe mode and...


On another forum a person replied…

>The system to redeem the card is probably stuck in a boot loop


I get that canary rollout is tricky in this business, since it's all about stopping the spread of viruses and attacks.

That said, this incident review doesn't mention numbers (unless I missed it) showing how colossal of a fuck-up it was.

The reality is that they don't apologize ("bad shit just happens"); they work their engineers to the grave, make no apology, and completely screw up. This reads like a minor bump in their processes.

Crowdstrike engineered the biggest computer attack the world has ever seen, despite existing for the sole purpose of preventing them. They're slowly becoming the Oracle of security, and I see no sign of improvement here.


Fun post, but I'll state the obvious, because I think many people do believe that every Windows machine BSOD'd. It was only the ones with Crowdstrike software, which is apparently very common but isn't actually pre-installed by Microsoft in Windows or anything like that.

Source: work in a Windows shop and had a normal day.


True, and definitely worth a mention. This is only Microsoft's fault insofar as it was possible at all to crash this way, this broadly, with so little recourse via remote tooling.


Which is a non-trivial amount of fault, given that Apple disallows the equivalent behavior on macOS.


A summary, to my understanding:

* Their software reads config files to determine which behavior to monitor/block

* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"

* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions

* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully
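
A toy illustration of that last point, with entirely made-up structures: an interpreter that trusts a count field taken straight from the content file, and therefore reads past the end of its argument table instead of rejecting the file.

    /* Invented channel-entry layout; only the bug pattern is the point. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_ARGS 8

    struct channel_entry {
        uint32_t arg_count;              /* comes straight from the file */
        uint32_t args[MAX_ARGS];
    };

    static uint32_t interpret(const struct channel_entry *e)
    {
        uint32_t sum = 0;
        /* BUG: no bounds check on arg_count, so a corrupt or malformed file
         * walks off the end of args[]. In user space that's a segfault; in a
         * kernel driver it's a bugcheck, i.e. a BSOD. */
        for (uint32_t i = 0; i < e->arg_count; i++)
            sum += e->args[i];
        return sum;
    }

    int main(void)
    {
        struct channel_entry bad = { .arg_count = 1u << 20 };  /* bogus count */
        printf("%u\n", interpret(&bad));     /* out-of-bounds read: expect a crash */
        return 0;
    }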


> Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions

That's crazy. How costly can it be to fully test the file in a CI job? I fail to see how this wasn't already implemented.


> How costly can it be to test the file fully in a CI job?

It didn't need a CI job. It just needed one person to actually boot and run a Windows instance with the Crowdstrike software installed: a smoke test.

TFA is mostly an irrelevant discourse on the product architecture, stuffed with proprietary Crowdstrike jargon, with only a couple of paragraphs dedicated to the actual problem; and they don't mention the non-existence of a smoke test.

To me, TFA is not a signal that Crowdstrike has a plan to remediate the problem, yet.


They mentioned they do dogfooding. Wonder why it did not work for this update.


They discuss dogfooding “Sensor Content”, which isn’t “Rapid Response Content”.

Overall the way this is written up suggests some cultural problems.


You just got tricked by this dishonest article. The whole section that mentions dogfooding is only about actual updates to the kernel driver. This was not a kernel driver update, the entire section is irrelevant.

This was a "content file", and the first time it was interpreted by the kernel driver was when it was pushed to customer production systems worldwide. There was no testing of any sort.


All these people claiming they didn't have canaries. They actually did, but people are in denial that they themselves were the canary for Crowdstrike, lol.


It's worse than that -- if your strategy actually was to use the customer fleet as QA and monitoring, then it probably wouldn't take you an hour and a half to notice that the fleet was exploding and withdraw the update, as it did here. There was simply no QA anywhere.
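
For contrast, here's a toy C sketch of what a staged rollout with an automatic halt looks like. Everything in it (the stage sizes, the crash-rate threshold, the telemetry function) is invented; the point is simply that the blast radius stops growing the moment the fleet starts reporting crashes.

    /* Toy staged-rollout loop with an automatic halt; all values invented. */
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical telemetry: fraction of hosts in the current stage that
     * crashed shortly after receiving the update. Hard-coded here to
     * simulate a catastrophically bad update. */
    static double crash_rate_for_stage(double stage_fraction)
    {
        (void)stage_fraction;
        return 0.95;
    }

    int main(void)
    {
        const double stages[] = { 0.001, 0.01, 0.10, 0.50, 1.00 };
        const double max_crash_rate = 0.001;

        for (size_t i = 0; i < sizeof(stages) / sizeof(stages[0]); i++) {
            printf("deploying to %.1f%% of the fleet\n", stages[i] * 100.0);
            if (crash_rate_for_stage(stages[i]) > max_crash_rate) {
                printf("halting rollout and withdrawing the update\n");
                return 1;        /* 0.1% of hosts hurt instead of 100% */
            }
        }
        printf("rollout complete\n");
        return 0;
    }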


Just reeks of incompetence. Do they not have e2e smoketests of this stuff?


Cowards. Why don't you just stand up and admit that you didn't bother testing everything you send to production?

Everything else is smoke and the smell of sulfur.


> Why don't you just stand up and admit that you didn't bother testing everything you send to production?

The "What Happened on July 19, 2024?" section combined with the "Rapid Response Content Deployment" make it very clear to anyone reading that that is the case. Similarly, the discussion of the sensor release process in "Sensor Content" and lack of discussion of a release process in the "Rapid Response Content" section solidify the idea that they didn't consider validated rapid response content causing bad behavior as a thing to worry about.


Because producing smoke and the smell of sulfur is how you keep your business afloat after an incident like this

Getting on your knees and admitting terrible fault with apologies galore isn't going to garner you any more sympathy.


A file full of zeros is an "undetected error"? Good grief.


It wasn't a file full of zeros that caused the problem.

While some affected users did have a file full of zeros, that was actually the result of the system being in the middle of downloading an update; it was not the version of the file that caused the crash.


Here is my summary with the marketing bullshit ripped out.

Falcon configuration ships both with direct driver updates ("Sensor Content") and out of band ("Rapid Response Content"). "Sensor Content" is scripts (*) that ship with the driver. "Rapid Response Content" is data that can be delivered dynamically.

One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.

"Sensor content", including the templates, are a part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.

"Rapid Response Content" is deployed through a different channel that customers do not have control over. Crowdstrike shipped a broken channel file that passed validation but was not tested.

They are going to fix this by adding testing of "Rapid Response Content" updates and supporting the same rollout logic they use for the driver itself.

(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.

---

In other words, they have scripts that will crash when given garbage arguments. The validator is supposed to check for this before they ship, but the validator screwed it up (why is this part of the release process rather than done at runtime?!). It appears they did not test the file, they do not do canary deployments or staged rollouts of these changes, and everything broke.
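
On the parenthetical, here's the kind of load-time sanity checking the interpreter itself could do, so a bad channel file gets rejected on the endpoint instead of crashing it. The header layout, magic value, and limits below are all invented for illustration.

    /* Hypothetical channel-file header and load-time sanity checks. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CHANNEL_MAGIC 0xC591u          /* invented magic value */
    #define MAX_ARGS      8

    struct channel_header {
        uint16_t magic;
        uint16_t version;
        uint32_t arg_count;
    };

    /* Returns 0 if the blob is safe to hand to the interpreter, -1 if not. */
    static int channel_file_ok(const uint8_t *buf, size_t len)
    {
        struct channel_header h;

        if (buf == NULL || len < sizeof(h))
            return -1;                     /* missing or truncated file */
        memcpy(&h, buf, sizeof(h));
        if (h.magic != CHANNEL_MAGIC)
            return -1;                     /* e.g. an all-zero blob */
        if (h.arg_count > MAX_ARGS)
            return -1;                     /* would index out of bounds */
        if (len < sizeof(h) + (size_t)h.arg_count * sizeof(uint32_t))
            return -1;                     /* args don't fit in the blob */
        return 0;
    }

    int main(void)
    {
        uint8_t zeros[64] = { 0 };
        /* A zero-filled blob is rejected up front instead of being interpreted. */
        return channel_file_ok(zeros, sizeof(zeros)) == -1 ? 0 : 1;
    }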

Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.


> Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.

Wouldn't it have happened a long time ago if it were that easy?


How do we know it hasn't?


If it had happened, the industry would have known by now.

The group behind it would have come out to the public.


This would be the kind of vulnerability that would be worth millions of dollars and used for targeted attacks and/or by state actors. It could take years to uncover (like Pegasus, which took 5 years to be discovered) or never be uncovered at all.


Probably not, if you're implying remote code execution -- it was an out of bounds READ operation, not write, causing an immediate crash. Unlikely to be useful for anything other than taking systems offline (which can certainly be useful, but is not RCE).


It was a read operation during bytecode template initialization, in a driver that reads userland memory. An out-of-bounds read used to load code, in a driver that maps user memory, can easily lead to code execution and privilege escalation: if the attacker finds a way to point the out-of-bounds read at memory they control, they could cause the driver to load a manufactured template and inject bytecode.

It's not clear that this specific vulnerability is exploitable, but it's exactly the kind of vulnerability that could be exploited for code execution.


> Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.

You would have to get into the supply chain to do much damage.

Otherwise, you would somehow need access to the hosts running the agent.

If you're a threat actor that already has access to hosts running CS, at a scale that would make the news, why would you blow your access on trying to ruin CS's reputation further?

Perhaps if you are a vendor of a competing or adjacent product that deploys an agent, you could deliberately try and crash the CS agent, but you would be caught.


Copying my content from the duplicate thread[1] here:

This reads like a bunch of baloney to obscure the real problem. The only relevant part you need to see:

>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Problematic content? Yeah, this is telling exactly nothing.

Their mitigation is "ummm, we'll test more and maybe not roll the updates out to everyone at once", without any direct explanation of how that would prevent this from happening again.

Conspicuously absent:

— fixing whatever produced "problematic content"

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test

— allowing the sysadmins to roll back updates before the OS boots

— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients

This is a nothing sandwich, not an incident review.

[1] https://news.ycombinator.com/item?id=41053703


> Copying my content from the duplicate thread[1] here

Please don't do this! It makes merging threads a pain because then we have to find the duplicate subthreads (i.e. your two comments) and merge the replies as well.

Instead, if you or anyone will let us know at hn@ycombinator.com which threads need merging, we can do that. The solution is deduplication, not further duplication!


Oops. Noted!

Apologies for inadvertently adding work!

Somehow, I never realized that duplicate threads were merged (instead of one of them being nuked), because it seems like a lot of work in the first place.

Thanks for doing it!


Appreciated!


[flagged]


The thread is still wrong, since it was an OOB memory read, not a missing null-pointer check as claimed. 0x9c is likely just a value that happened to be at the out-of-bounds location.


Not really; that thread showed only superficial knowledge and analysis, far from hitting the nail on the head for anyone used to assembly/reverse engineering. It then goes on to make provably wrong assumptions and comments. There is actually a null check (two, even!) just before the faulting memory access. The root cause is more likely an attempt to access an address coming from some uninitialized, wrongly initialized, or non-deterministically initialized array.

What it did well was explain the basics nicely for a wide audience that knows nothing about crash dumps or invalid memory accesses, which I guess is what made the post popular. Good enough as an explanation for the general public, but it doesn't pass the bar for an actual technical analysis to any useful degree.

I humbly concur with Tavis' take

https://x.com/taviso/status/1814762302337654829

Here are some others with more technically correct details:

- https://x.com/patrickwardle/status/1814343502886477857

- https://x.com/tweetingjose/status/1814785062266937588


"Incoming data triggered a out-of-bound memory access bug" is hardly a useful conclusion for a root cause investigation (even if you are of the faith of the single root cause).


No, the Twitter poster is still wrong.



How can these companies be certified and compliant, etc., and then in practice have horrible SDLC?

What was the impact of diverse teams (offshoring)? Companies often don't have the necessary checks to ensure that the disparateness of teams does not impact quality. Maybe the impact was zero, or maybe it was more.


Standards generally don't mandate specifics, and almost certainly nothing specific to the SDLC; at least none I've heard of. Things like FIPS, ISO, and SOC 2 generally prescribe having a certain process, and sometimes they mandate some specifics (e.g. which ciphers for FIPS). Maybe there should be release-process standards that prescribe how this is done, but I'm not aware of any. I think part of the problem is that the standards bodies don't really know what to prescribe; this sort of thing has to come from the community, maybe not unlike the historical development of other engineering professions. Today, being FIPS compliant doesn't really mean you're secure, and being SOC 2 compliant doesn't really mean customer data is safe, etc. It's more a sort of minimal bar in certain areas of practice and process.


Sadly, I agree with your take. All it is is a minimum bar. Many who don't have the above are even worse -- though not necessarily; as a rule, probably yes.


> How can these companies be certified and compliant, etc., and then in practice have horrible SDLC?

Checklists?


You're saying there exists a complex software system without bugs just because it follows best practices to the dot and is certified and compliant?


No, but their release process should catch major bugs such as this one. After internal QA, you release to a small internal dev team, then to select members of other departments willing to dog-food it, then to limited external partners, then GA? Or something like that, so that you have multiple opportunities to catch weird software/hardware interactions before bringing down business-critical systems for major and small companies around the planet.


> After internal QA, you release to small internal dev team, then to select members of other depts willing to dog-food it, then limited external partners then GA

What about an AV-definition update for a 0day swimming in the tubes right now?


Sure, those have happened before, but never with an impact like last weekend's. That's inexcusable. At least definitions can update themselves out of trouble.


What are you referring to with "those have happened before"?

Isn't that what happened here? Not a software update, not an AV-definition update, but more of an AV-definition "data" update. At least that's how I interpret "Rapid Response Content".


It's called Twitter.


No, the name’s been changed.


No it hasn't.


I'm not a fan of Musk or of the-platform-formerly-known-as-Twitter, but I'm not sure how you can insist that the name hasn't been changed.



