
This "channel file" is equivalent to an AV signature file. Crowdstrike is the company, the product here is "Falcon" which does behavioral monitoring of processes both on the device and using logs collected from the device in the cloud.

I can see your perspective, but you should consider this: they protect so many companies, industries and even countries at such a global scale, and you haven't even heard of them in the last 15 years of their operation until this one outage.

You can't take days testing gradual rollouts for this type of content, because that's how long customers are left unprotected by that content. Although the root cause is in the channel files, I feel like the driver that processes them should have been able to handle the "logic bug" in question, so we'll find out more over time, I guess.

For example, with Windows Defender, which runs on virtually all Windows systems, the signature updates on billions of devices are pushed immediately (with the exception of enterprise systems, but even then there is usually not much testing on signature files themselves, if at all). As for the DevOps process CrowdStrike uses to test the channel files, I think it's best to leave commentary on that to actual insiders, but these updates sometimes happen several times a day and get pushed to every CrowdStrike customer.




> They protect so many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation

I certainly don't want to learn (through disaster news) about the construction company that built the bridge I drive over every day, not for another 15 years, not ever!

This kind of software simply should not fail, with such a massive install base in so many sensitive industries. We're better than that; the software industry is starting to mature, and there are simple and widely known procedures that could have been used to prevent this.

I have no idea how CrowdStrike stock has only dropped 10%, back to its value of 2 months ago. Actually, if these are the only financial troubles you get into, I take back what I said: software should be failing a lot (why spend money on robustness when you don't lose money on bugs?).


Working in software, you should know how insanely complex software is; even Google, Amazon, Microsoft, Cloudflare and the like have outages. Mistakes happen because humans are involved; that is the nature and risk of depending on complex systems. Bridges, by comparison, are not that complicated.

I actually expected their stock to drop a lot more than this, but it goes to show how valuable they are. Investors know that any dip is only temporary because no one is getting rid of CrowdStrike.

Think of the security landscape as early-'90s New York City at night and CrowdStrike as the big bulky guy with lots of guns who protects you for a fee. If he makes a mistake and hurts you, you will be mad, but in the end your need for protection does not suddenly go away, and it was a one-time mistake.


In which case "Are you awake and sane?" would be a sensible reality check before heading out.

You're trying to hand-wave away the inexcusable. The outage is a symptom. The problem is the lack of even the most basic testing.

Clearly these files are sent out without even a minimal sanity check. That is a problem, and it's not something that can be hand-waved away.


In the 3-4 decades of the security industry, testing signature files to see if they trigger a corner-case system crash has never been practiced. You and others are proclaiming yourselves to be experts in an area of technology you have no experience in. This was not a software update!


Then that's 3-4 decades of massive incompetence, isn't it? "Testing before pushing an update" is basic engineering. They operate at huge scale, so they have huge responsibility, and they have the money to perform the tests and hire people who aren't entirely stupid. That's gross malpractice.


Testing is for software, not for content. You test and fuzz the software that processes the updates, not the content files themselves. It's like a post on HN crashing HN and you claiming HN should have tested each post before allowing it to be displayed. You test code, not data, and I dare you to back up any claim that data processed by software should also be tested in the same way. Everyone is suddenly an expert in AV content updates, lol.


I used to work for Microsoft in a team adjacent to the Defender team that worked on signature updates, and I know for sure that these were tested before being rolled out; I saw the Azure DevOps pipelines they used to do this. If other companies aren't doing this, then that's their incompetence, but be assured that it's not industry-wide.


I'm not saying they don't test them; I'm saying they don't do code tests, as in unit tests and all that. I have no idea what they do, I'm just speculating here, but if in fact they do no testing at all, then I agree that would be pretty bad. I would think their testing would cover how well it detects things and/or the performance impact, and I'd expect deployment to be automated (i.e., test cases pass = it gets deployed). I guess they don't have a "did the system crash?" check in their pipelines? In your experience at MS, did they test for system/sensor availability impact?


A config file IS code. And yes, even a post can theoretically break a site (SQL injection, say), so if you're pushing data to a million PCs you'd better be testing it.
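
For what it's worth, the kind of test I mean doesn't have to be fancy. Here's a minimal sketch in Python, assuming a hypothetical user-mode harness binary that parses a content file the same way the production component does (the harness name and exit-code convention are made up for the example, not anything CrowdStrike actually ships):

    import subprocess
    import sys

    # Hypothetical harness: a user-mode build of the same parser the kernel
    # driver uses; it should exit non-zero (or crash) on a malformed file.
    PARSER_HARNESS = "./channel_file_harness"

    def smoke_test(channel_file: str) -> bool:
        """Return True if the parser survives the new content file."""
        try:
            result = subprocess.run(
                [PARSER_HARNESS, channel_file],
                timeout=30,            # a hung parser is also a failed test
                capture_output=True,
            )
        except subprocess.TimeoutExpired:
            return False
        # A negative return code means the harness died on a signal
        # (segfault, etc.) -- exactly the class of bug at issue here.
        return result.returncode == 0

    if __name__ == "__main__":
        if not smoke_test(sys.argv[1]):
            print("channel file failed the parse smoke test; blocking release")
            sys.exit(1)

Wire something like that into the release pipeline and a file that crashes the parser never leaves the building.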


You're right, but "testing" could mean anything; you'd need the foresight to anticipate the config crashing the program. Is it common to test for that scenario with config files?


>> You can't take days testing gradual rollouts for this type of content, because that's how long customers are left unprotected by that content.

If you can't take days to do it then do a gradual rollout in hours. It's not a high bar.


They reverted it after about an hour. But sure, they didn't need to target all customers at once; that's a good point.


> They protect so many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation until this one outage.

They certainly run their software on those many customers' systems, but based on my experience with them, "protect" isn't a descriptor I'm willing to grant them.

We don't have the counter-factual where Crowdstrike doesn't exist, but I'm not convinced that they've been a net economic or security benefit to the world over the span of their existence.


Yes, we do have a counterfactual: they catch actual APTs, they investigated the DNC hack in the 2016 elections, and they've stopped many more attacks. You are utterly clueless in this area to make a comment like that, honestly; I don't mean that as an insult, but you are talking about a world where they don't exist as if every company has them. Most of their customers get them after getting pwned and learning their lesson the hard way. And availability isn't the only security property their customers desire; keeping information out of threat actors' hands and preventing them from tampering with things is also desirable. I really hope you understand that in your hypothetical world without CrowdStrike, threat actors still exist.


> Most of their customers get them after getting pwned and learning their lesson the hard way.

Sure, that applies to my company, but the counter-factual isn't "nothing is done and we keep getting pwned". The counter-factual is that instead of the resources spent on CrowdStrike and its various problems (which have been regular since we adopted it; the recent mess was just the biggest), those resources are spent on improving security infrastructure without CrowdStrike.


Another commenter said that this change was a malformed configuration that crashed the application. If this is the case, you wouldn't need days to see the problem manifest, only a few minutes. If they had rolled it out to 1% of their customers and waited a couple of hours before releasing it everywhere, they probably would have caught it.
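
To make the idea concrete, here is a rough Python sketch of that gate, with entirely made-up deployment and telemetry hooks (nothing here reflects CrowdStrike's actual tooling):

    import time
    from typing import Callable, Sequence

    # Everything below is hypothetical; it only sketches the gate described
    # above: push to ~1% of hosts, watch crash telemetry for a soak period,
    # and only then release to the rest of the fleet.

    CANARY_FRACTION = 0.01        # roughly 1% of customers first
    SOAK_SECONDS = 2 * 60 * 60    # "a couple of hours"
    MAX_CRASH_RATE = 0.001        # abort above 0.1% of canary hosts crashing

    def canary_rollout(
        hosts: Sequence[str],
        push: Callable[[Sequence[str]], None],          # deploy the update
        crash_rate: Callable[[Sequence[str]], float],   # fraction crashing
        rollback: Callable[[Sequence[str]], None],
    ) -> None:
        canary = hosts[: max(1, int(len(hosts) * CANARY_FRACTION))]
        push(canary)

        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if crash_rate(canary) > MAX_CRASH_RATE:
                rollback(canary)
                raise RuntimeError("canary crash rate exceeded; halting rollout")
            time.sleep(60)

        push(hosts[len(canary):])   # canary soaked cleanly; release the rest

The exact thresholds don't matter much; the point is that the blast radius of a bad file stays at roughly 1% of the fleet.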


A couple of hours is a long time in the world of automated attacks


It only takes a couple of minutes if you first push the update to an on-site set of LIVE systems sitting there to detect a problem.

If a problem is encountered, don't send it out to everyone else.


A couple of hours is absolutely nothing compared to the massive worldwide effort that many people have to put in to fix the problem of a company’s shitty product and release practices.

This is inexcusable, point blank. “A couple of hours is a long time” is not a valid excuse when the alternative, as clearly evidenced, is millions of computers and critical systems simultaneously failing hard.

This might have been different if it was a small subset of computers, but this clearly could have been caught in minutes with any sort of sensible testing or canary rollout practices.


I'm guessing they didn't expect content updates to cause such an impact; they've been doing this for 15 years, and it is that uncommon. A couple of hours in their world is a long time, because their concern is protecting customers as soon as possible. I'm sure they'll do all kinds of tests going forward and be transparent about it. Keep in mind how easy it is for you or me to come to conclusions without understanding or knowing the context they operate in; maybe it will become clearer soon enough.


Then they should make their testing pipelines even faster and make sure they can go from detecting a new threat to a tested definition file as quickly as possible. You genuinely cannot skimp on testing in this case; it's inherent to the update. Threat protection and not breaking their customers' systems should both be non-negotiable for a release. That means testing before deploying. If they can't do it fast enough, their product is broken.


An automated attack would struggle to reach the level of destruction that this failure had, due to the scale of the CrowdStrike deployment, the direct update vector, and the kernel-mode failure. Even with the most critical type of remote vulnerability it would be difficult to achieve anything approaching this level of damage, and for all we know (and in all probability) this update was addressing a much less severe vulnerability.


Not as long as the weeks it's going to take to undo this.


They are dumb enough to process their "channel files" in the kernel; this should only be done in user mode.


While I can understand the arguments both for and against a gradual rollout, this is the main issue: why do these things need to be processed in the kernel? And if there's a good reason to do it, why isn't there some kind of circuit breaker?


Because the thing that uses them is in kernel mode, and the sensor needs to be performant. At some point, the content must be consumed by the kernel-mode sensor. User-mode EDRs exist, but bypassing them is trivial; intercepting syscalls rootkit-style and monitoring kernel and user-mode memory is the best and most performant way to monitor the whole system.


Apple documentation argues the opposite:

"Developers can use frameworks such as DriverKit and NetworkExtension to write USB and human interface drivers, endpoint security tools (like data loss prevention or other endpoint agents), and VPN and network tools, all without needing to write kexts. Third-party security agents should be used only if they take advantage of these APIs or have a robust road map to transition to them and away from kernel extensions."

Specifically, the second sentence above says security software should use the APIs, not kernel extensions.


Well, this is Windows, not macOS. I don't know what you can do with DriverKit, for example. Maybe Microsoft should learn from Apple?


Your prior argument was that sensors have to reside within the kernel to be performant, a very general argument, of which macOS provides one counterexample in its official documentation. So the problem is in your original argument.


Probably they didn't find a solution where they could fully trust information coming from a user-mode process.


They need to be processed in kernel mode, where the monitoring happens; user-mode EDRs are trivial to bypass. They have to be processed by whatever is going to use them, and in this case that is the "lightweight" sensor code in kernel mode.


They need to load data into the kernel eventually, but that doesn't mean the first time the file is parsed should be in the kernel. For example, on Linux they don't have this problem because they use the eBPF subsystem, so what's running in the kernel is validated bytecode. Even if they didn't want to do something that sophisticated, they could simply include a validator in the update process, as has been common since the 1980s.
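
As a toy illustration of that last point, here is what a user-mode pre-load validator could look like in Python, for a hypothetical length-prefixed binary format (this is not CrowdStrike's real channel file layout, just an assumed structure for the example):

    import struct

    # Hypothetical layout, purely for illustration: a 4-byte magic, a
    # 4-byte little-endian record count, then length-prefixed records.
    # A user-mode check like this can reject a truncated or corrupt file
    # before the kernel component ever parses it.

    MAGIC = b"CHNL"

    def validate_channel_file(data: bytes) -> bool:
        if len(data) < 8 or data[:4] != MAGIC:
            return False
        (count,) = struct.unpack_from("<I", data, 4)
        offset = 8
        for _ in range(count):
            if offset + 4 > len(data):
                return False        # header claims more records than exist
            (length,) = struct.unpack_from("<I", data, offset)
            offset += 4
            if length == 0 or offset + length > len(data):
                return False        # record runs past the end of the file
            offset += length
        return offset == len(data)  # no trailing garbage

    # Only hand the file to the kernel driver if this check passes;
    # otherwise keep the previous known-good version loaded.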



