Technical Details on Today's Outage

dang · 2024-07-20T14:20:15 1721485215

Related ongoing thread:

CrowdStrike Update: Windows Bluescreen and Boot Loops - https://news.ycombinator.com/item?id=41002195 - July 2024 (3590 comments)

PedroBatista · 2024-07-20T02:50:11 1721443811

Light on technical and light on details.

Putting the actual blast radius aside, this whole thing seems a bit amateurish for a "security company" that pulls the contracts they do.

hi-v-rocknroll · 2024-07-20T05:58:05 1721455085

Yup.

- They don't do enough or the right kind of smoke tests.

- They don't do exponential-canary deployments with an ability to rollback, and instead just YOLO it.

- They don't appear to have a customer-side security / client platform team update approval gating change control process for software updates or for definitions (or whatever they use).

This is fundamentally laziness and/or incompetency.

tootie · 2024-07-20T15:43:03 1721490183

They say they run this process multiple times a day. Must be tens of thousands of deployments. I'd guess complacency set in at some point. Just completely inured to the risks they were taking.

nine_zeros · 2024-07-20T13:44:21 1721483061

> This is fundamentally laziness and/or incompetency.

From management - that couldn't care less about low visibility, low impact projects

mattmaroon · 2024-07-20T15:49:17 1721490557

I have been wondering why they didn’t do canary deployments. Seems like quite an obvious thing to do.

_dp9d · 2024-07-20T03:07:02 1721444822

You assume the most competent company got the contracts, which is simply not the world we live in.

The company that lobbied the hardest and paid the most in bribes got the contracts.

tootie · 2024-07-20T15:49:15 1721490555

Having been through enough procurement cycles as both a buyer and seller there does not need to be a whit of malfeasance for a bad decision to occur. It's aggressive sales, price wars, poorly informed decision makers, gut instinct, favoritism, familiarity, incumbency, network effects.

You notice how this outage affected hospitals and airlines? There is a strong tendency in software sales for industries to align around one or two leaders. Oh, American chose Crowdstrike? Maybe we at Delta should just do what they did. Or literally Delta hires the VP from American to be their CISO and he just does what he did before.

Vendor selection is hard and buyer's remorse is frequently hard to deal with once you've sunk cost into a migration.

manquer · 2024-07-20T04:27:48 1721449668

Rather point I think is there technical and evaluation gates companies of this nature regularly go through while contracting, part of that is being able to talk the language of the industry properly .

This seems very amateurish for companies who regularly talk professionally to win said contracts , whether the best product or not.

My guess is C-suite, crisis consultants and lawyers are involved heavily so the actual engineering folks have little voice now in any communication and we get stuff like this.

dev-jayson · 2024-07-20T04:22:54 1721449374

Yeah, I think I'm getting more detailed analysis on Social Media from strangers, which I know I should take with a grain of salt. But I guess I'm expecting a lot more than "a filed caused this" from the company that caused this havoc.

tail_exchange · 2024-07-20T02:06:56 1721441216

Can someone who actually understands what CrowdStrike does explain to me why on earth they don't have some kind of gradual rollout for changes? It seems like their updates go out everywhere all at once, and this sounds absolutely insane for a company at this scale.

hatsunearu · 2024-07-20T02:15:25 1721441725

It sounds like Channel files are just basically definition updates in normal antivirus software; it's not actually code, just some stuff on what the software should "look out for".

And it sounds like they shipped some malformed channel file and the software that interprets it can't handle malformed inputs and ate shit. That software happened to be kernel mode, and also marked as boot-critical, so it if falls over, it causes a BSOD and inability to boot.

and it's kind of understandable that channel files might seem safe to update constantly without oversight, but that's just assuming that the file that interprets the channel file isn't a bunch of dogshit code.

omoikane · 2024-07-20T03:05:52 1721444752

Configuration files should be treated like code and follow the same gradual rollout practices. See also:

https://sre.google/workbook/canarying-releases/

Which starts with "a majority of incidents are triggered by binary or configuration pushes". The stats for config related failures is one link away at

https://sre.google/workbook/postmortem-analysis/

Where it says 31% of outages in 2010-2017 are caused by "configuration push".

Murky3515 · 2024-07-20T02:19:03 1721441943

It's not understandable imo. At the very least they should have tests for the loader component that shows it can handle corrupted input. Amateur hour.

numbsafari · 2024-07-20T02:42:22 1721443342

Agreed. We all know about a really interesting vector for infecting the kernel now. One that is poorly tested, poorly implemented, and poorly secured.

SoftTalker · 2024-07-20T02:36:52 1721443012

And though I don't know, I'm guessing it's not a certainty to say they don't contain "code." It would seem to me that they would have to, otherwise novel attacks that weren't caught by one of their existing algorithms could never be detected.

I'm guessing they contain some combination of pattern/regexp type stuff, and interpreted code/scripting with trigger criteria, etc. that all gets loaded into the "engine" that actually runs the threat detection.

throwaway346434 · 2024-07-20T02:34:52 1721442892

Halting problem is undecidable.

On the scale of "no one bothered to put error handling or validation in" to "a subtle problem exists for this given input"; you and I lack the information to make a judgement.

acdha · 2024-07-20T02:43:52 1721443432

> you and I lack the information to make a judgement.

Think about this a little harder: what do you know about the number of customers affected? We do actually have enough information to make a judgement - bricking millions of critical systems, a very high percentage of their total Windows customer population, tells us that they don’t have progressive rollouts, don’t fail into a safe mode, and that if they do have tests those tests are catastrophically unlike anything their customers run – all they had to do was launch an EC2 instance and see if it kept running.

liuliu · 2024-07-20T02:59:53 1721444393

Not doing fuzzing on user-input supported feature, especially for AV, is damning.

vanillax · 2024-07-20T02:38:23 1721443103

I mean, the whole world was impacted. All they had to do was test this change in a lab with several pcs. Clearly this wasn't a edge case nor a subtle problem. This was clearly a lack of testing.

andrewinardeer · 2024-07-20T03:51:26 1721447486

It was a Friday. Devs just wanted to go home for the weekend.

acdha · 2024-07-20T13:17:04 1721481424

Leave the spin to the PR people. Their customers pay a great deal of money for 24x7 service, and this wasn’t even a code change but a definition update – a process which should be as well defined and tested as McDonald’s making a hamburger. You wouldn’t excuse getting E. coli from your lunch with “the cook just wanted to go home for the weekend”, and this is a much more expensive service.

hatsunearu · 2024-07-20T06:05:18 1721455518

Yeah, I re-read my comment and it sounds like I am understanding of them.

But no, saying "channel files aren't kernel code" is just hilarious, considering the channel files define how the actual kernel code is supposed to behave, so it might as well be kernel code. Especially when the bad behavior in question is triggered by bad channel files!!

timbelina · 2024-07-20T06:09:30 1721455770

I was reading these two threads:

https://x.com/perpetualmaniac/status/1814376668095754753?s=4...

https://x.com/ananayarora/status/1814269058088304760

The authors explain the coding error and coredump well, but I'm lost: Is the buggy code that they're describing the channel file, or some kernel code that consumes the channel file? Is there a way to tell?

timbelina · 2024-07-20T06:24:35 1721456675

OK, and another question:-) Can tools like Valgrind and ASan pick up the kinds of errors that are described in those two links from my previous post?

ananayarora · 2024-07-20T07:19:00 1721459940

Author of the second post here. The first author's stack trace seems to show a fault on csagent.sys which is a bad read on 0x9c. There are some other .sys files loaded up by csagent.sys, and that's where the crash seems to happen, apparently.

As for detection, Zach mentions that modern tooling could've been used to find this, so I'm assuming Valgrind can find this: https://x.com/Perpetualmaniac/status/1814376690958868979

Hope this helps!

timbelina · 2024-07-20T08:36:11 1721464571

Cheers Ananay!

So if I put this all together:

a) The driver (sensor) csagent.sys includes code that hasn't checked with a tool like Valgrind or ASan or something and so includes some kind of memory management bug.

b) Since n, n-1 and n-2 versions of the sensor all died equally spectacularly, that bug as been around for at least three versions of csagent.sys.

c) The bug can be triggered by getting the csagent.sys to swallow a shitty channel file and since csagent runs in kernel mode, when it crashes it BSOD's the system.

d) Someone at Crowdstrike uploaded a shitty channel file as part of an update process that apparently happens many times a day.

Am I on the right track so far? If so, there's no/inadequate memory management checks in the csagent driver, and either:

1)There were also no checks before the borked channel file was uploaded because of a failure to follow process, or because there was no process, but whatever the case it was an accident.

or

2) Someone uploaded on purpose, not by accident, the borked channel file intending for a nasty outcome (probably not BSOD)

I can't believe that there are not a million checks and balances in place to let (1) happen, but as my grandma used to say, "Don't assume malice where stupidity will do" :-)

Hizonner · 2024-07-20T02:44:42 1721443482

> it's not actually code, just some stuff on what the software should "look out for"

If it controls the behavior of a computer, then it's code.

> and it's kind of understandable that channel files might seem safe to update constantly without oversight

Yeah, no, it's not. They pushed an update that crashed the majority of their Windows installed base in a way that couldn't be fixed remotely. It doesn't matter what the update was to. It needed to be tested. There is no way that any deployment pipeline that could fail to catch something that blatant could possibly be "understandable".

... and that kernel mode code shouldn't have been parsing anything with any complexity to begin with. And should have been tested into oblivion, and possibly formally verified.

This is amateur-hour nonsense. Which is what you expect from most of these "Enterprise Cyber Security(TM)" vendors.

... AND the users shouldn't have just gone and shoved that kind of thing into every critical path they could think of.

notepad0x90 · 2024-07-20T02:50:32 1721443832

This "channel file" is equivalent to an AV signature file. Crowdstrike is the company, the product here is "Falcon" which does behavioral monitoring of processes both on the device and using logs collected from the device in the cloud.

I can see your perspective, but you should consider this: They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation until this one outage.

You can't take days testing gradual roll outs for this type of content, because that's how long customers are left unprotected by that content. Although the root cause is on the channel files, I feel like the driver that processes them should have been able to handle the "logic bug" in question so we'll find out more over time I guess.

For example, with windows defender which runs on virtually all windows systems, the signature updates on billions of devices are pushed immediately (with exception to enterprise systems, but even then there is usually not much testing on signature files themselves, if at all). As far as the devops process Crowdstrike uses to test the channel files, I think it's best to leave commentary on that to actual insiders but these updates happen several times a day sometimes and get pushed to every Crowdstrike customer.

joaomacp · 2024-07-20T04:45:45 1721450745

> They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation

I certainly don't want to know (through disaster news) about the construction company that built the bridge I drive through everyday, not for another 15 years, not ever!

This kind of software simply should not fail, with such a massive install base on so many sensitive industries. We're better than that, the software industry is starting to mature and there are simple and widely-known procedures that could have been used to prevent it.

I have no idea how CrowdStrike stock has only dropped 10% to the values of 2 months ago. Actually, if the financial troubles you get into are only these, take back what I said, software should be failing a lot (why spend money on robustness when you don't lose money on bugs?)

notepad0x90 · 2024-07-20T05:01:55 1721451715

working in software, you should know how insanely complex software is, even google, amazon, microsoft, cloudflare and such have outages. mistakes happen because humans are involved. it is the nature and risk of depending complex systems. bridges by comparison are not that complicated.

I actually expected their stock to drop a lot more than this, but goes to show you how valuable they are. investors know that any dip is only temporary because no one is getting rid of crowdstrike.

Think of the security landscape as early 90's new york city at night and crowdstrike as the big bulky guy with lots of guns who protects you for a fee, if he makes a mistakes and hurts you, you will be mad but in the end your need for protection does not suddenly go away and it was a one time mistake.

TheOtherHobbes · 2024-07-20T15:15:01 1721488501

In which case "Are you awake and sane?" would be a sensible reality check before heading out.

You're trying to hand-wave away the inexcusable. The outage is a symptom. The problem is the lack of even the most basic testing.

Clearly these files are sent out without even a minimal sanity check. That is a problem, and it's not something that can be hand-waved away.

notepad0x90 · 2024-07-20T20:25:46 1721507146

In the 3-4 decades of the security industry, testing signature files to see if they trigger a corner case system crash has never been practiced. You and others are proclaiming yourselves to be experts in an area of technology you have no experience in. This was not a software update!!

lores · 2024-07-20T21:01:13 1721509273

Then that's 3-4 decades of massive incompetence, isn't it? "Testing before pushing an update" is basic engineering, they have a huge scale so huge responsibility, and they have the money to perform the tests and hire people who aren't entirely stupid. That's gross malpractice.

notepad0x90 · 2024-07-20T22:11:01 1721513461

testing for software, not for content. you test, and fuzz the software that processes the updates, not the content files themselves. it's like a post on HN crashing HN and you claiming HN should have tested each post before allowing it to be displayed. you test code not data, and I dare you to back up any claim that data processed by software should also be tested in the same way. Everyone is suddenly an expert in AV content updates lol.

ciuokan · 2024-07-20T22:17:10 1721513830

I used to work for Microsoft in a team adjacent to the Defender team that worked on signature updates and I know for sure that these were tested before being rolled out - I saw the Azure Devops pipelines they used to do this. If other companies aren't doing this then that's their incompetence but be assured that it's not industry-wide.

notepad0x90 · 2024-07-21T01:31:54 1721525514

I'm not saying they don't test them, I'm saying they don't do code tests, as in unit tests and all that. I have no idea what they do, I'm just speculating here, but if in fact they do no testing at all, then I agree that would be pretty bad.I would think their testing would be for how well it detects things and/or performance impact and I'd expect it to be automated deployment (i.e.: test cases are passing = gets deployed), i guess they don't have "did the system crash" check in their pipelines? In your experience at MS, did they test for system/sensor availability impact?

lores · 2024-07-21T00:02:31 1721520151

A config file IS code. And yes, even a post can theoretically break a site (SQL injection, say), so if you're pushing data to a million PCs you'd better be testing it.

notepad0x90 · 2024-07-21T01:33:05 1721525585

You're right, but "testing" could mean anything, you'd need to have the foresight to anticipate the config crashing the program. Is it common to test for that scenario with config files?

zug_zug · 2024-07-20T03:05:24 1721444724

>> You can't take days testing gradual roll outs for this type of content, because that's how long customers are left unprotected by that content.

If you can't take days to do it then do a gradual rollout in hours. It's not a high bar.

notepad0x90 · 2024-07-20T03:21:25 1721445685

they reverted it after about one hour. but sure, they didn't need to target all customers all at once, that's a good point.

Marsymars · 2024-07-20T15:15:51 1721488551

> They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation until this one outage.

They certainly run their software on those many customers' systems, but but based on my experience with them, "protect" isn't a descriptor I'm willing to grant them.

We don't have the counter-factual where Crowdstrike doesn't exist, but I'm not convinced that they've been a net economic or security benefit to the world over the span of their existence.

notepad0x90 · 2024-07-20T20:29:55 1721507395

Yes, we do have a counter factual, they catch actual APT's they investigated the DNC hack in the 2016 elections and stopped many more attacks. You are utterly clueless in this area to make a comment like that honestly, I don't mean that as an insult but you are talking about a world they don't exist in as if every company has them. Most of their customers get them after getting pwned and learning their lesson the hard way. And availability isn't the only security property their customers desire, keeping information out of threat actors' hands and preventing them from tampering things is also desirable. I really hope you understand that in your hypothetical world without crowdstrike, threat actors still exist.

Marsymars · 2024-07-22T20:40:40 1721680840

> Most of their customers get them after getting pwned and learning their lesson the hard way.

Sure, that applies to my company, but the counter-factual isn't "nothing is done and we keep getting pwned", the counter-factual is that instead of the resources spent on crowdstrike and their various problems (which have been regular since we adopted them, the recent mess was just the biggest), those resources are spent on improving security infrastructure without crowdstrike.

tail_exchange · 2024-07-20T14:35:38 1721486138

Another commenter said that this change was a malformed configuration that crashed the application. If this is the case, you wouldn't need days to see this problem manifest, but only a few minutes. If they had rolled it out to 1% of their customers and waited for a couple hours before releasing it everywhere, they probably would have caught it.

dalyons · 2024-07-20T14:54:15 1721487255

A couple of hours is a long time in the world of automated attacks

RaftPeople · 2024-07-20T15:44:21 1721490261

It only takes a couple of minutes if you first update your on-site set of LIVE systems sitting there to detect a problem.

If problem encountered, don't send it out to everyone else.

mynameisvlad · 2024-07-20T15:28:20 1721489300

A couple of hours is absolutely nothing compared to the massive worldwide effort that many people have to put in to fix the problem of a company’s shitty product and release practices.

This is inexcusable, point blank. “A couple of hours is a long time” is not a valid excuse when the alternative, as clearly evidenced, is millions of computers and critical systems simultaneously failing hard.

This might have been different if it was a small subset of computers, but this clearly could have been caught in minutes with any sort of sensible testing or canary rollout practices.

notepad0x90 · 2024-07-21T01:35:56 1721525756

I'm guessing they didn't expect content updates to cause such an impact, they've been doing this for 15 years, it is that uncommon. a couple of hours in their world is a long time because their concern is protecting customers as soon as possible. I'm sure they'll do all kinds of tests going forward and be transparent about it. Keep in mind how easy it is for you or I to come to conclusions without understanding or knowing the context they operate in, maybe it will be more clear soon enough.

mardifoufs · 2024-07-20T17:43:51 1721497431

Then they should make their testing pipelines even faster, and make sure that they can go from detecting a new threat->tested definition file as quickly as possible. You genuinely cannot skimp on testing in this case. It's inherent to the update, threat protection and not breaking their consumers systems should be non-negotiable for a release. That means testing before deploying. If they can't do it fast enough, their product is broken.

dasil003 · 2024-07-20T20:05:28 1721505928

An automated attack would struggle to reach the level of destruction that this failure had due the scale of Crowdstrike deployment and the direct update vector and kernel mode failure. Even with the most critical type of remote vulnerability it would be difficult to achieve anything approaching this level of damage, and for all we know (and by all probabilities) this update was addressing a much less severe vulnerability.

TheOtherHobbes · 2024-07-20T15:09:43 1721488183

Not as long as the weeks it's going to take to undo this.

lotr5 · 2024-07-20T03:42:26 1721446946

they are dumb enough to process their "channel files" in kernel, this should be only done in usermode

vladvasiliu · 2024-07-20T03:52:09 1721447529

While I can understand both arguments for and against a gradual rollout, this is the main issue: why do these things need to be processed in kernel? And if there’s a good reason to do it, why isn’t there some kind of circuit breaker?

notepad0x90 · 2024-07-20T05:06:03 1721451963

because the thing that uses them is in kernel mode, and the sensor needs to be performant. at some point, the content must be consumed by the kernel mode sensor. user mode edr's exist but bypassing them is trivial, intercepting syscalls rootkit style and monitoring kernel+usermode memory is the best and most performant way to monitor the whole system.

calf · 2024-07-20T05:45:49 1721454349

Apple documentation argues the opposite:

"Developers can use frameworks such as DriverKit and NetworkExtension to write USB and human interface drivers, endpoint security tools (like data loss prevention or other endpoint agents), and VPN and network tools, all without needing to write kexts. Third-party security agents should be used only if they take advantage of these APIs or have a robust road map to transition to them and away from kernel extensions."

Specifically the 2nd sentence above says security software should use the APIs, not Apple's kernel extensions.

notepad0x90 · 2024-07-20T06:00:03 1721455203

well, this is windows not macos. I don't know what you can do with driverkit for example. maybe microsoft should learn from apple?

calf · 2024-07-23T06:26:00 1721715960

Your prior argument was about sensors being performant having to reside within the kernel -- a very general argument -- of which the macos provides one counterexample in its official documentation. So the problem is in your original argument.

lotr5 · 2024-07-20T03:55:01 1721447701

probably they didn't find solution where they fully trust information coming from usermode process

notepad0x90 · 2024-07-20T05:04:12 1721451852

they need to be processed in kernel mode where the monitoring happens, user mode EDRs are trivial to bypass. they have to be processed by whatever is going to use them, and in this case it is the "lightweight" sensor code in kernel mode.

acdha · 2024-07-20T13:28:08 1721482088

They need to load data into the kernel eventually but that doesn’t mean that the first time the file is parsed should be in the kernel. For example, on Linux they don’t have this problem because they use the eBPF subsystem and so what’s running in the kernel is validated byte code. Even if they didn’t want to do something that sophisticated they could simply include a validator into the update process, as has been common since the 1980s.

SkyPuncher · 2024-07-20T02:21:29 1721442089

My understanding is they basically deployed a configuration file. It seems like these files might be akin to virus signatures or other frequently updated run-time configuration.

I actually don't think it's outrageous that these files are rolled out globally, simultaneously. I'm guessing they're updated frequently and _should_ be largely benign.

What stands out to me is the fact that a bad config file can crash the system. No rollback mechanism. No safety checks. No safe, failure mode. Just BSOD.

Given the fix is simply deleting the broken file, it's astounding to me that the system's behavior is BSOD. To me, that's more damning that a bad "software update". These files seem to change often and frequently. Given they're critical path, they shouldn't have the ability to completely crash the system.

joshka · 2024-07-20T04:29:18 1721449758

> I actually don't think it's outrageous that these files are rolled out globally, simultaneously.

Anyone competent that manages software at scale should generally hold the opposite opinion to this.

Analemma_ · 2024-07-20T02:26:09 1721442369

That’s the danger of running in kernel mode. I’ve seen some people claim this is because the bad file starts a chain of events which concludes in trying to page an unpageable file, which is an application crash in user space but brings down the whole system if it happens in the kernel.

SkyPuncher · 2024-07-20T02:31:32 1721442692

That seems like programming 101 for these systems.

In the past, I've worked around this by validating the configuration of a file before attempting to run it. You bail out in a safe way during validation, but still allow a hard error during run time.

Doesn't prevent all misconfigured files, but prevents the stuff like.

acdha · 2024-07-20T13:32:27 1721482347

I think it was in the early 90s when I first saw something do A/B style loading where it would record the attempt to load something, recognize that it hadn’t finished, and use the last known good config instead. Anyone studying high-availability systems has a wealth of prior art to learn from.

userbinator · 2024-07-20T03:10:24 1721445024

I think all programmers should have the experience of using and developing on a single-address-space OS with absolutely no protections like DOS, just to encourage them to improve their skills at writing better, actually correct code. When the smallest bugs will crash your system and cause you to lose work, you tend to be a lot more careful with thinking about what your code does instead of just running it to see what happens.

Gigachad · 2024-07-20T03:36:45 1721446605

Suggesting “Being more careful” never solves these issues because eventually someone somewhere will have a momentary slip up that causes this.

The real takeaway is that we need to design systems so this kind of issue is less possible. Put less code in the kernel, use tools that prevent these kinds of issues, design computers that can roll back the system if they crash.

YZF · 2024-07-20T03:04:35 1721444675

Perfect example of where instrumentation guided fuzzing like AFL would almost certainly have found a problem.

I agree with the amateur hour observation. But then most things seem to be.

opello · 2024-07-20T04:38:16 1721450296

Entertainingly enough I got to see a similar thing happen, where a configuration file was killing hardware in the field. After the failure and remediation multiple CI jobs were put in place (some months later) to do basic validity checks on the files.

The lesson of "multiple parser implementations for the same thing seems bad" and "sanity checks to prevent breaking things are hard heuristics to define" such that further changes were deferred.

All that to say that I can appreciate circumstances in which satisfying "don't crash the system" in response to configuration data can actually be fairly hard to realize. It can very significantly depend on the design of the pieces in question. But I also agree that it's pretty damning.

userbinator · 2024-07-20T02:20:28 1721442028

I'm more surprised at the fact that they didn't appear to have tested it on themselves first.

FWIW, at least Microsoft still "dogfoods" (and it's what coined that term), and even if the results of that aren't great, I'm sure they would've caught something of this severity... but then again, maybe not[1].

[1] https://news.ycombinator.com/item?id=18189139

Ekaros · 2024-07-20T05:36:13 1721453773

This is what really would concern me too. With this wide spread issue any reasonable testing should have detected it. Having a few dozen machines with different configurations for an few hours should have detected this. This should have been in a smoke test.

Push update to machines, observe, power cycle them, observe...

I could understand error in some rarer setup, but this was so common that it should have been obvious error.

Zamiel_Snawley · 2024-07-20T02:13:58 1721441638

Truly, how the extent the damage was so widespread is my main question at this point.

Everyone has a buggy release at some point, but impacting global customers at this level is damn near unforgivable.

Heads need to roll for this oversight.

Murky3515 · 2024-07-20T02:16:50 1721441810

Because if release immediately, velocity go up

jefurii · 2024-07-20T07:05:40 1721459140

I have a friend who is a security guard at a bank in Hollywood, CA, who told me the computers at his location started going down between 12:00 and 13:00PDT (19:00-20:00UTC).

I don't understand CrowdStrike's rollout system, but given that people started seeing trouble earlier in the day, surely by that time they could have shut down the servers that were serving the updates, or something??

He also told me that soon after that the street outside the bank (another bank across the street, a hospital several blocks down) was lined with police who started barring entry to the buildings unless people had bank cards. By the time I woke up this morning technical people already knew basically what was going on, but I really underestimated how freaked out the average person must have been today.

rdtsc · 2024-07-20T14:32:50 1721485970

> The update that occurred at 04:09 UTC was designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks

The obvious joke here is CS runs the malicious C2 framework. So the system worked as designed: it prevented further execution and quarantined the affected machines.

But given they say that’s just a configuration file (then why the hell is it suffixed with .sys?), it’s actually plausible. A smart attacker could disguise themselves and use the same facilities as the CS. CS will try to block them and blocks itself in the process?

nonfamous · 2024-07-20T14:50:53 1721487053

>>> Systems that are not currently impacted will continue to operate as expected, continue to provide protection, and have no risk of experiencing this event in the future.

Given that this incident has now happened twice in the space of months (first on Linux, then on Windows), and that as stated in this very post the root cause analysis is not yet complete, I find that statement of “NO RISK” very hard to believe.

ungreased0675 · 2024-07-20T02:44:35 1721443475

This seems very unsatisfying. Not sure if I was expecting too much, but that’s a lot of words for very little information.

I’d like more information on how these Channel Files are created, tested, and deployed. What’s the minimum number of people that can do it? How fast can the process go?

hatsunearu · 2024-07-20T02:05:20 1721441120

I'm not a big expert but honestly this read like a bunch of garbage.

> Although Channel Files end with the SYS extension, they are not kernel drivers.

OK, but I'm pretty sure usermode software can't cause a BSOD. Clearly something running in kernel mode ate shit and that brought the system down. Just because a channel file not in kernel mode ate shit doesn't mean your kernel mode software isn't culpable. This just seems like a sleezy dodge.

gjm11 · 2024-07-20T02:14:08 1721441648

It doesn't read to me as trying to dodge anything. They aren't saying "they're not kernel drivers, so everything is OK", they're saying "seeing the .sys on the filenames, you might think they're kernel drivers, but as it happens they're something else".

(Maybe there's some subtext that I'm missing, but I don't see how saying "these aren't kernel drivers" makes them look any better, and I do see why they might say it to be informative, so it looks like to me like they're doing the latter.)

hatsunearu · 2024-07-20T02:18:18 1721441898

> It doesn't read to me as trying to dodge anything.

It absolutely reads like this. They are getting blasted online for shipping kernel mode driver updates without proper QA and release engineering. Which just from face value just seems like some insano style engineering. They are saying "it's not actually a kernel mode value" to deflect blame.

I mean, I really don't understand why they would make this statement otherwise. If they are innocently just trying to say "this is just a channel file", there are other ways to say this, and it really isn't relevant enough to underline and emphasize.

bostik · 2024-07-20T15:43:06 1721490186

Friend does incident response and Windows forensics, and pointed something (in retrospect) rather obvious out yesterday: the instructions for cleaning up simply told people to "delete .SYS files according to this wildcard". No additional context.

That caught his eye, because to him it sounded like madness. Apparently deleting random driver files is a fairly well known way to screw a Windows system up even more than it already was.

This statement from CS must have gone through legal and PR review, so we have to assume every word and statement has been carefully vetted from a cover-your-backside perspective. It is light on information content, but there must be reason for them to so forcefully telegraph that the files being deployed (and removed) are not themselves drivers.

dizhn · 2024-07-20T20:30:32 1721507432

They said to delete a single specific file. Did what you're saying happen before that or something?

bostik · 2024-07-20T21:12:21 1721509941

The original instructions were "delete [something]00291-*.sys"

gjm11 · 2024-07-20T14:20:49 1721485249

They're getting blasted for causing a massive worldwide outage due to what is clearly inadequate quality control. I don't see why this is any better if it's "pushed a kernel-mode driver update with bugs in it" than if it's "released a product with buggy kernel-mode stuff that can be made to crash by an innocuous-looking data file, and then pushed a data file that made it crash". Same result either way. Same demonstration of inadequate quality control either way.

I think the story they're telling now, which so far as I know is the truth, looks worse for them, because it requires them to have screwed up their QC twice. Once when they made a product that do such bad things, and once when they pushed the data file to millions of PCs without checking what it did.

So I still don't see how "this particular file happens not to be kernel-mode code" makes them look any better, and therefore I don't see why they'd be saying it "to deflect blame". It doesn't deflect blame; they look just as bad either way.

mynameisvlad · 2024-07-20T15:38:14 1721489894

You may understand it that way, but you also have a much deeper knowledge of this than the targeted audience of the RCA.

Make no mistake, this RCA was not published for technical folks. The only reason it’s even published is to make their customers feel more secure. You and I are not their customers; high level management and executives are.

SoftTalker · 2024-07-20T02:11:06 1721441466

The kernel driver reads the channel files. It choked on this one, and crashed.

SAI_Peregrinus · 2024-07-20T02:25:50 1721442350

Which implies that any malware capable of replacing these channel files can crash their kernel driver. I wonder if there's a non-crashing way to exploit this & get kernel-space code execution.

amiga386 · 2024-07-20T15:56:37 1721490997

Any malware capable of modifying files under C:\Windows\System32 has no need to fiddle with these files because to have that capability means it already got the keys to the kingdom and could wreck the system in a billion different ways.

See "It rather involved being on the other side of this airtight hatchway" https://devblogs.microsoft.com/oldnewthing/20200420-00/?p=10...

SAI_Peregrinus · 2024-07-23T16:14:33 1721751273

The directory they're stored in needs Administrator access, but the kernel runs with SYSTEM level permissions. Administrator is an account, SYSTEM is a security principal. SYSTEM level processes can access domain servers in the context of the computer's domain account, while Administrators can't do so unless they provide explicit credentials (or share a password with an Administrator account on the domain). So this could be used as a way to elevate access from local Administrator up to whatever that computer can do on a connected domain server!

SoftTalker · 2024-07-20T17:42:50 1721497370

And yet the cleanup instructions were for the user to delete a file in that directory. That requires booting into safe mode, but if any random user is able to do that, kiss your systems goodbye, a good social engineer (or disgruntled employee) will own any desktop in your organization if he wants to.

amiga386 · 2024-07-20T20:18:15 1721506695

The point is, malware can't get into that directory without user consent. Having physical access to the machine, rebooting into safe mode and running commands is a stonking big user consent.

I can pwn my own desktop, yes, all I have to do is say "run as administrator". But the point of the security boundary is to make it impossible for software to get these privileges without me actively giving it to them.

If you're shifting the goalposts and imagining the computer does not belong to me, but to an organisation that I'm a mere employee of, they'll be using AD Group Policy to control what I can and can't do, and Bitlocker to encrypt the boot drive. I cannot boot into safe mode without having the tech support department give me a special code to unlock the computer. Again, that's how you get on the other side of the airtight hatch.

naib0930 · 2024-07-21T20:11:20 1721592680

In my organizations any user couldn't do it, we have to manually touch every computer and enter the bit locker key. We lost in the neighborhood of 14,000 end points, every single one needs touched. My team of 10 did about 800 in 5 hours. Pulling and entering the bitlocker key was what took the longest.

joshka · 2024-07-20T04:35:51 1721450151

If you have write access a path like C:\Windows\System32\drivers\CrowdStrike\ (and I'd assume the parent directory), then you pretty much can crash the kernel many ways.

If you have the means to insert an AV config file update in between the config servers and the user's host then you probably can PWN the system pretty easily as well.

What this probably does mean is that Crowdstrike will be receiving some attention from hackers of both hat colors. Here's the bug bounty page ... https://hackerone.com/crowdstrike?type=team

numbsafari · 2024-07-20T02:43:52 1721443432

I can guarantee you that you aren't the only one thinking this right now.

ytch · 2024-07-20T04:32:45 1721449965

IIRC, there were some security software exploits, that trigger RCE (or DoS) of the scanning engine by malicious file?

teeray · 2024-07-20T02:23:02 1721442182

The kernel driver is an interpreter that executed an HCF instruction from the channel file.

hatsunearu · 2024-07-20T02:13:51 1721441631

That's what I thought. So saying "it's not a kernel mode driver" is technically true, but I don't need to explain why it's a bunch of nonsense to try to damage control their incompetence.

epcoa · 2024-07-20T02:47:58 1721443678

No idea why you’re getting downvoted. A configuration file for code that runs in kernel space is usually effectively kernel code (it certainly was in this case) - obviously there are formal methods to allow kernel code to be configured in a “safe” fashion, but it’s obvious that’s not going on here.

patrickthebold · 2024-07-20T02:33:21 1721442801

>The configuration update triggered a logic error that resulted in an operating system crash.

> We understand how this issue occurred and we are doing a thorough root cause analysis to determine how this logic flaw occurred.

There's always going to be flaws in the logic of the code, the trick is to not have single errors be so catastrophic.

chris_nielsen · 2024-07-20T02:38:33 1721443113

Yeah “how this logic flaw occurred” is the wrong question.

How a common bug was rolled out globally with no controls, testing, or rollback strategy is the right question

YZF · 2024-07-20T03:09:00 1721444940

They're all good questions. The thing that reads the config should have been fuzz tested with something like AFL. Likely should have a lot more tests. Maybe shouldn't run in a device driver. There's almost no doubt there are engineering process and culture issues here.

And then absolutely the release process.

Rollback is hard I guess once your OS can't boot.

cube00 · 2024-07-20T16:10:32 1721491832

> Rollback is hard I guess once your OS can't boot.

This is why the client needs have enough error handling to realise it's latest update has now caused unsuccessful boot and roll that update back locally to the last known good configuration (or completely back to factory and pull all updates again).

pneumonic · 2024-07-20T03:09:04 1721444944

> we are doing a "root cause analysis to determine how this logic flaw occurred"

That's going to find a cause: a programmer made an error. That's not the root of the problem. The root of the problem is allowing such an error to be released (especially obvious because of its widespread impact).

kyriakos · 2024-07-20T05:23:10 1721452990

Why is everyone blaming Microsoft? Is this something of an oversight in their side too? Can someone explain?

cyrnel · 2024-07-20T16:24:27 1721492667

I'm no kernel expert, but people are saying Microsoft deserves some blame for not exposing necessary functionality to user space, requiring the use of a very-unsafe kernel driver.

Linux provides eBPF and macOS provides system extensions.

I'll also add that Windows itself heavily prioritizes backwards-compatibility over security, which leads companies to seek out third-party solutions for stopping malware instead of design-based mitigations being built into Windows.

mardifoufs · 2024-07-20T17:50:15 1721497815

I don't agree. I'm glad Microsoft doesn't provide the functionality to do what crowdstrike does to user space. Crowdstrike acts in a similar way to deeply seated malware, except that it is usually installed voluntarily. But the behavior and capabilities that it has are basically what any malware would dream of, and exposing them to user space would imo create a mess (especially on windows). If anything, this is good as it will make people even more weary of kernel mode software.

And I'm not sure epbf actually allows you to do a lot of the stuff crowdstrike-like software does. I know they use it on Linux though so maybe eBPF has evolved a lot since I last looked at it.

cyrnel · 2024-07-20T17:54:23 1721498063

I generally agree with you. It's an either-or thing: either Microsoft secures their OS, or they provide safe ways for users to secure their OS. The first option is a million times better, but having neither option leads us to this mess.

sgammon · 2024-07-20T05:28:29 1721453309

For letting a failure of this magnitude be possible, I suspect

jchiu1106 · 2024-07-20T02:08:20 1721441300

Where are the technical details?

augustk · 2024-07-22T10:50:24 1721645424

This is proprietary (closed-source) software as far as I understand.

Zamiel_Snawley · 2024-07-20T02:14:44 1721441684

Yeah, this PR statement is pretty much devoid of information other than it is not a cyber attack.

isthisreallife2 · 2024-07-20T04:18:05 1721449085

So - a malformed configuration is capable of crashing a kernel process. Sounds very exploitable. Very

canistel · 2024-07-20T03:15:35 1721445335

> This issue is not the result of or related to a cyberattack.

Must be corrected to "the issue is not the result of or related to a cyberattack by external agents".

geuis · 2024-07-20T03:09:15 1721444955

Weak.

Very weak and over corporate level of ass covering. And it doesn't even come close to doing that.

They should just let the EM of the team involved provide a public detailed response that I'm sure is floating around internally. Just own the problem and address the questions rather than trying to play at politics, quite poorly.

0nate · 2024-07-20T23:11:02 1721517062

The lower you go in system architecture, the greater the impact when defects occur. In this instance, the Crowdstrike agent is embedded within the Windows Kernel, and registered with the Kernel Filter Engine illustrated in the diagram below.

https://www.nathanhandy.blog/images/blog/OSI%20Model%20in%20...

If the initial root cause analysis is correct, Crowdstrike has pushed out a bug that could have been easily stopped had software engineering best practices been followed: Unit Testing, Code Coverage, Integration Testing, Definition of Done.

automatoney · 2024-07-20T03:01:40 1721444500

To my biased ears it sounds like these configuration-like files are a borderline DSL that maybe isn't being treated as such. I feel like that's a common issue - people assume because you call it a config file, it's not a language, and so it doesn't get treated as actual code that gets interpreted.

bryan_w · 2024-07-20T06:18:22 1721456302

It kinda feels like someone added a watch for c:\COM\COM like we did back in the day on AOL

timbelina · 2024-07-20T06:06:44 1721455604

Can someone aim me at some RTFM that describes the sensor release and patching process, please? I'm lost trying to understand: When a new version 'n' of the sensor is released, we upgrade a selected batch of machines and do some tests (mostly waiting around :-)) to see that all is well. Then we upgrade the rest of the fleet by OU. However, 'cause we're scaredy cats, we leave some critical kit on n-1 for longer. And some really critical kit even on n-2. (Yeah, there's a risk in not applying patches I know but there are other outage-related risks that we balance; forget that for now) Our assumption is that n-1, n-2, etc are old, stable releases, and so when fan and shit collided yesterday, we just hopped on the console and did a policy update to revert to n-2 and assumed we'd dodged the bullet. But of course, that failed... you know what they say about assumptions :-) So in a long-winded way that leads to my three questions: Why did the 'content update' take out not just n but n-whatever sensors equally as effectively? Are the n-whatever versions not actually stable? And if the n-whatever versions are not actually stable and are being patched, what's the point of the versioning? Cheers!

xyst · 2024-07-20T02:44:55 1721443495

“Technical” detail report reads more like a lawyer generated report. This company is awful.

If I ever get a sales pitch from these shit brains, they will get immediately shut down.

Also fuck MS and their awful operating system that then spawned this god awful product/company known as “CrowdStike Falcon”

robjan · 2024-07-20T03:55:19 1721447719

You are probably not the target market of this product then. The real product CrowdStrike Falcon sells is regulatory compliance and it's a defacto requirement in many regulated industries including banking.

By the way, Falcon can be and is deployed to Linux and MacOS hosts in these organisations too it's just that this particular incident only affected Windows.

hello_moto · 2024-07-20T14:59:26 1721487566

2 things:

1. critical infrastructure around the globe seemed to depend on CrowdStrike

2. "If I ever get a sales pitch from..." suggested you are in an environment that is far from critical infrastructure.

userbinator · 2024-07-20T02:50:55 1721443855

If Windows wasn't as popular, then this might've happened to Linux to macOS instead. Blame CrowdStrike's incompetence, not MS.

acdha · 2024-07-20T13:37:09 1721482629

It couldn't happen on macOS: Apple stopped letting third-parties run code in the kernel after years of failures like this.

It also wouldn't happen on Linux: they use eBPF there which was designed by grownups and validates its inputs.

mynameisvlad · 2024-07-20T15:46:51 1721490411

eBPF exists on Windows, too: https://microsoft.github.io/ebpf-for-windows/

They’re just not using it. They could have not used it for Linux too. The presence of the feature is not enough to guarantee this would’ve never happened in a hypothetical.

acdha · 2024-07-20T17:00:38 1721494838

No, the fact that they’re actually using eBPF on Linux is what makes it safer. None of this is magic, it’s just a question of following decades of engineering experience.

Similarly, Microsoft clearly sees the benefits but note that they themselves say that’s not production ready yet. I’m certain that this incident will cause people to consider migrating as soon as that changes.

mynameisvlad · 2024-07-20T17:41:18 1721497278

You’re responding to a hypothetical, not what happened.

Let’s say Linux is the leading OS around the world. How can we be sure that they would actually use eBPF if this was the case?

They would likely choose the fastest option in order to support the platform as quickly as possible. Perhaps eBPF didn’t even exist if they prioritized Linux support and implemented that first, since Falcon was first released in 2013 and eBPF in 2014.

Switching from kernel mode to eBPF would be quite a lift, so if it wasn’t baked in from the start it likely wouldn’t have been added in after the fact.

A decade worth of changes is a lot to confidently say what would have happened. If Linux and MacOS were more popular than Windows, it could have been completely different.

This doesn’t even touch on the massive Debian incident CS had earlier this year, which is not a hypothetical.

acdha · 2024-07-20T18:57:57 1721501877

They are using eBPF right now. That suggests that they, like everyone else, see benefits in using a platform feature when it exists.

kchr · 2024-07-20T20:46:11 1721508371

Last time I checked, CS primarily runs in kernel mode on Linux and only fall back to eBPF if the kernel version is not supported. When in eBPF mode, they call it "Reduced Functionality Mode (RFM)".

Has this changed?

j16sdiz · 2024-07-20T15:09:38 1721488178

Kext still exist on macos

acdha · 2024-07-20T17:07:16 1721495236

Kind of: they’ve been deprecated for 4 years and you have to disable SIP to load them.

https://developer.apple.com/support/kernel-extensions/

They’ve added system extension mechanisms for the most common needs trying to balance the various things people use kexts for against the impact on security, performance, and reliability many kexts had.