Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
> it sounds like they might have separate "validation" code
That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."
Lesson learned: a "Validator" that is not the same program that will be parsing/reading the file in production is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.
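To be clear about what "the same program" means here: the only validator I'd put much faith in is a thin wrapper that links the exact parser the sensor uses in production and runs it over the candidate file. Rough sketch only; parse_channel_file and everything else below is an invented name, not anything from their report:

    /* Sketch: the "Validator" as a thin wrapper around the production
     * parser. All identifiers are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed to be the exact parsing entry point the sensor driver uses. */
    extern int parse_channel_file(const unsigned char *data, size_t len);

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <channel-file>\n", argv[0]);
            return 2;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 2; }
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        rewind(f);
        unsigned char *buf = malloc((size_t)len);
        if (!buf || fread(buf, 1, (size_t)len, f) != (size_t)len) {
            fprintf(stderr, "read failed\n");
            return 2;
        }
        fclose(f);
        /* Block the deploy if the real parser rejects -- or crashes on --
         * the file. Anything less is testing a different program. */
        int rc = parse_channel_file(buf, (size_t)len);
        free(buf);
        return rc == 0 ? 0 : 1;
    }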
I'd argue that it is completely useless. They have the actual parser that runs in production and then a separate "test parser" that doesn't actually reflect reality? Why?
Maybe they have the same parser in the validator and the real driver, but the vagaries of the C language mean that when undefined behavior is encountered, it may crash or it may work just by chance.
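A contrived illustration (nothing to do with their actual code) of the kind of thing that can pass cleanly in one process and fault in another, purely by accident of memory layout:

    /* Out-of-bounds read driven by a value taken from the content file.
     * Whether it crashes depends on what happens to sit past the end of
     * the array in that particular process. */
    #include <stddef.h>

    #define NUM_FIELDS 20

    struct template_instance {
        const char *fields[NUM_FIELDS];
    };

    /* index comes straight from the content. With no bounds check, an
     * index of NUM_FIELDS or more is undefined behavior: the validator
     * process may read harmless stale memory and carry on, while the
     * driver reads garbage and faults. */
    const char *get_field(const struct template_instance *t, size_t index)
    {
        return t->fields[index];   /* no bounds check */
    }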
I understand what you're saying. But ~8.5 million machines in 78 minutes isn't a fluke caused by undefined behavior. All signs so far indicate that they would have caught this if they'd had even a modest test fleet. And that's setting aside all the ways they could have prevented it from ever reaching that point.
That's beside the point. Of course they need a test fleet. But in the absence of one, there's a very real chance that the existing bug triggered on customer machines but not on their validator. This thread is speculating on the reason why their existing validation didn't catch this issue.
> very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes"
That stood out to me as well.
Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.
This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.
That's a good comparison to add to the list for this topic, thanks. An example a non-techie can understand, where a client program is consuming data blobs produced by the creator of the program.
And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?
Indeed, the very first thing they should be doing is adding fuzzing of their sensor to the test suite, so that it's not possible (or astronomically unlikely) for any corrupt content to crash the system.
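Even a bare-bones libFuzzer harness around the content parser would encode the property "arbitrary bytes must never crash the sensor". Sketch only; parse_channel_file is a made-up stand-in for whatever entry point the Content Interpreter actually uses:

    /* Minimal libFuzzer harness. The contract under test: any byte
     * sequence must be handled or rejected cleanly, never crash. */
    #include <stddef.h>
    #include <stdint.h>

    extern int parse_channel_file(const uint8_t *data, size_t len);

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        /* Return value deliberately ignored: rejecting the input is fine,
         * crashing is not. */
        parse_channel_file(data, size);
        return 0;
    }

Build it with clang -fsanitize=fuzzer,address against the production parser sources, run it in CI, and crashes get found by the fuzzer instead of by customers.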
If the rules are Turing-complete, then sure. I don't see enough in the report to tell one way or another; the way the rules are described, as if they just fill in templates, could suggest either (particularly if templates may reference other templates), and there is not a lot more detail. Halting seems relatively easy to manage with something like a watchdog timer, though, compared to a sound, crash- and memory-safe* parser for a whole programming language, especially if that language exists more or less by accident. (Again, no claim; there's not enough available detail.)
I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.
* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.
No matter what sort of static validation they attempt, they're still risking other unanticipated effects. They could stumble upon a bug in the OS or some driver, they could cause false positives, they could trigger logspew or other excessive resource usage.
Failure can happen in strange ways. When in a position as sensitive as deploying software to far-flung machines in arbitrary environments, they need to be paranoid about those failure modes. Excuses aren't enough.
Perhaps set a timeout on the operation, then? Given this is kernel code it's not as easy as in userspace, but I'm sure you could request an interrupt on a timer.
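Something along these lines, though strictly a userspace sketch of the shape; a kernel version would use the platform's own timer machinery, and every name below is invented:

    /* Per-evaluation time budget: drop the content rather than hang. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    #define RULE_BUDGET_NS (5u * 1000u * 1000u)   /* 5 ms per content pass */

    struct rule { int id; };                                /* placeholder */
    static bool evaluate_rule(const struct rule *r) { (void)r; return true; }

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
    }

    /* Returns false if the content fails a rule or blows its time budget;
     * either way the caller discards the content instead of spinning. */
    bool evaluate_rules_with_deadline(const struct rule *rules, size_t n)
    {
        const uint64_t deadline = now_ns() + RULE_BUDGET_NS;
        for (size_t i = 0; i < n; i++) {
            if (!evaluate_rule(&rules[i]))
                return false;
            if (now_ns() > deadline)
                return false;
        }
        return true;
    }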