Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
> it sounds like they might have separate "validation" code
That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."
Lesson learned: a "Validator" that is not the same program that will be parsing/reading the file in production is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.
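To be clear about what "the same program" means here: the only validator I'd put much faith in is a thin wrapper that links the exact parser the sensor uses in production and runs it over the candidate file. Rough sketch only; parse_channel_file and everything else below is an invented name, not anything from their report:

    /* Sketch: the "Validator" as a thin wrapper around the production
     * parser. All identifiers are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed to be the exact parsing entry point the sensor driver uses. */
    extern int parse_channel_file(const unsigned char *data, size_t len);

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <channel-file>\n", argv[0]);
            return 2;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 2; }
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        rewind(f);
        unsigned char *buf = malloc((size_t)len);
        if (!buf || fread(buf, 1, (size_t)len, f) != (size_t)len) {
            fprintf(stderr, "read failed\n");
            return 2;
        }
        fclose(f);
        /* Block the deploy if the real parser rejects -- or crashes on --
         * the file. Anything less is testing a different program. */
        int rc = parse_channel_file(buf, (size_t)len);
        free(buf);
        return rc == 0 ? 0 : 1;
    }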
I'd argue that it is completely useless. They have the actual parser that runs in production and then a separate "test parser" that doesn't actually reflect reality? Why?
Maybe they have the same parser in the validator and the real driver, but the vagaries of the C language mean that when undefined behavior is encountered, it may crash or it may work just by chance.
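A contrived illustration (nothing to do with their actual code) of the kind of thing that can pass cleanly in one process and fault in another, purely by accident of memory layout:

    /* Out-of-bounds read driven by a value taken from the content file.
     * Whether it crashes depends on what happens to sit past the end of
     * the array in that particular process. */
    #include <stddef.h>

    #define NUM_FIELDS 20

    struct template_instance {
        const char *fields[NUM_FIELDS];
    };

    /* index comes straight from the content. With no bounds check, an
     * index of NUM_FIELDS or more is undefined behavior: the validator
     * process may read harmless stale memory and carry on, while the
     * driver reads garbage and faults. */
    const char *get_field(const struct template_instance *t, size_t index)
    {
        return t->fields[index];   /* no bounds check */
    }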
I understand what you're saying. But ~8.5 million machines in 78 minutes isn't a fluke caused by undefined behavior. All signs so far indicate that they would have caught this if they'd had even a modest test fleet. And that's setting aside all the ways they could have prevented it from ever reaching that point.
That's beside the point. Of course they need a test fleet. But in the absence of one, there's a very real chance that the existing bug triggered on customer machines but not on their validator. This thread is speculating on the reason why their existing validation didn't catch this issue.
> very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes"
That stood out to me as well.
Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.
This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.
That's a good comparison to add to the list for this topic, thanks. An example a non-techie can understand, where a client program is consuming data blobs produced by the creator of the program.
And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?
Indeed, the very first thing they should be doing is adding fuzzing of their sensor to the test suite, so that it's not possible (or astronomically unlikely) for any corrupt content to crash the system.
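Even a bare-bones libFuzzer harness around the content parser would encode the property "arbitrary bytes must never crash the sensor". Sketch only; parse_channel_file is a made-up stand-in for whatever entry point the Content Interpreter actually uses:

    /* Minimal libFuzzer harness. The contract under test: any byte
     * sequence must be handled or rejected cleanly, never crash. */
    #include <stddef.h>
    #include <stdint.h>

    extern int parse_channel_file(const uint8_t *data, size_t len);

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        /* Return value deliberately ignored: rejecting the input is fine,
         * crashing is not. */
        parse_channel_file(data, size);
        return 0;
    }

Build it with clang -fsanitize=fuzzer,address against the production parser sources, run it in CI, and crashes get found by the fuzzer instead of by customers.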
If the rules are Turing-complete, then sure. I don't see enough in the report to tell one way or another; the way the rules are described, as if they just fill in templates, could suggest either (particularly if templates may reference other templates), and there is not a lot more detail. Halting seems relatively easy to manage with something like a watchdog timer, though, compared to a sound, crash- and memory-safe* parser for a whole programming language, especially if that language exists more or less by accident. (Again, no claim; there's not enough available detail.)
I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.
* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.
No matter what sort of static validation they attempt, they're still risking other unanticipated effects. They could stumble upon a bug in the OS or some driver, they could cause false positives, they could trigger logspew or other excessive resource usage.
Failure can happen in strange ways. When in a position as sensitive as deploying software to far-flung machines in arbitrary environments, they need to be paranoid about those failure modes. Excuses aren't enough.
Perhaps set a timeout on the operation, then? Given this is kernel code it's not as easy as in userspace, but I'm sure you could request an interrupt on a timer.
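Something along these lines, though strictly a userspace sketch of the shape; a kernel version would use the platform's own timer machinery, and every name below is invented:

    /* Per-evaluation time budget: drop the content rather than hang. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    #define RULE_BUDGET_NS (5u * 1000u * 1000u)   /* 5 ms per content pass */

    struct rule { int id; };                                /* placeholder */
    static bool evaluate_rule(const struct rule *r) { (void)r; return true; }

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
    }

    /* Returns false if the content fails a rule or blows its time budget;
     * either way the caller discards the content instead of spinning. */
    bool evaluate_rules_with_deadline(const struct rule *rules, size_t n)
    {
        const uint64_t deadline = now_ns() + RULE_BUDGET_NS;
        for (size_t i = 0; i < n; i++) {
            if (!evaluate_rule(&rules[i]))
                return false;
            if (now_ns() > deadline)
                return false;
        }
        return true;
    }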