Not sure what questions Microsoft have to answer. A third-party vendor shipped d...

Daviey · 2024-07-19T15:45:04.000000Z

Because an essential enterprise security application was /able/ to bring down an entire OS like this. The issue is that Microsoft doesn't provide an interface that lets an application get the functionality it requires while operating in user-space.

Linux has eBPF which can provide most of the capability that Crowdstrike needs, by using an "in-kernel verifier which performs static code analysis and rejects programs which crash, hang or otherwise interfere with the kernel negatively". If MS had this functionality, it is likely this incident would not have happened.

That said, from personal experience on Linux it's been an extremely long time since a bad kernel module has rendered a system entirely FUBAR'd.

(To Microsoft's credit, they have begun copying the eBPF methodology to Windows, but it is still in its infancy https://github.com/Microsoft/ebpf-for-windows/ ).

jcranmer · 2024-07-19T15:57:19.000000Z

It's possible for a badly-written eBPF policy to prevent any application from starting up, AIUI, so that's more or less the same situation isn't it?

keneda7 · 2024-07-19T16:22:42.000000Z

Crowdstrike brought Linux machines down earlier this year in April. There are several posts in this thread about it.

netdevnet · 2024-07-24T08:45:10.000000Z

> Linux has eBPF which can provide most of the capability that Crowdstrike needs, by using an "in-kernel verifier which performs static code analysis and rejects programs which crash, hang or otherwise interfere with the kernel negatively". If MS had this functionality, it is likely this incident would not have happened.

It didn't stop Linux machines from going down, so it is clearly not as easy as you put it. The reality is that writing software is hard, yet devs often trivialise it to their own detriment.

Daviey · 2024-07-24T08:55:20.000000Z

The issue I am raising is /design/, not /development/. The current model of an unconstrained, unforgiving, highly privileged execution space is a bad design; that is what eBPF tries to address.

netdevnet · 2024-07-24T12:42:27.000000Z

It didn't make a difference though. Linux still went down, so clearly the design is not enough.

Daviey · 2024-07-24T13:22:22.000000Z

It is a different issue[0]. The Linux issue from April was a Linux Kernel bug[1], that CS Falcon happened to trigger. The design to use eBPF is sound, but the implementation on the kernel side had a bug.

Also, CS Falcon didn't support RHEL 9.4 (only up to 9.3), so for this specific bug you highlighted, CS should not be held accountable for regression testing, because it was a platform they did not support.

With Windows, the design is currently poor in that there is no way to run such code in a safe manner. Most recently, it appears MS is blaming the EU for forcing them to create an interface for services such as CS to run[2]. Rather than lean into the problem and create a good design, they didn't create security boundaries - risking the entire system.

Bugs happen, and Linux will continue to harden and be more resilient - but unless MS focuses on secure design in this area, things like this will continue to happen (same as they have with AV before).

  [0] https://access.redhat.com/solutions/7068083
  [1] https://access.redhat.com/errata/RHSA-2024:3306
  [2] https://www.forbes.com/sites/davidphelan/2024/07/22/crowdstrike-outage-microsoft-blames-eu-while-macs-remain-immune/

politelemon · 2024-07-19T15:48:24.000000Z

Might be editorialised by OP, or Sky changed the title; it is currently:

"Serious questions to answer after what could be the biggest IT outage in history"

landr0id · 2024-07-19T15:51:16.000000Z

Assuming Sky since the URL slug shows "Microsoft"

landr0id · 2024-07-19T15:50:09.000000Z

>Not sure what questions Microsoft have to answer.

The only thing I could think of is if it was a driver update, the driver has to be "WHQL" signed. WHQL stands for "Windows Hardware Quality Lab" -- what quality are they ensuring? (spoiler alert from my time at Microsoft: it's not terribly robust :p )

It's not realistic for Microsoft to test drivers in a manner that represents real-world usage, but perhaps they need to start doing some basic "it works with whatever integrated agent/etc is required" testing as a requirement for signing a driver.

If it was a user-mode update? Yeah no real fault on Microsoft here.

KHRZ · 2024-07-19T15:59:13.000000Z

From what I heard, Crowdstrike just updated their DB file, which means the bug was already there, waiting for someone to trigger it with a "low risk" quick rollout.

ghthor · 2024-07-20T05:10:01.000000Z

So kind of like the xz exploit, carefully placed and lying in wait.

I only hope this was a good guy move by someone to knock a placed chess piece off the board.

drpossum · 2024-07-19T15:43:58.000000Z

You're confusing the Crowdstrike issue with Azure being down. Microsoft is ultimately responsible for anything regarding Azure, even if it was a vendor that did something wrong, because they choose their vendors.

danbruc · 2024-07-19T15:45:56.000000Z

The article is about the CrowdStrike incident and not the Azure configuration issue.