> We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage.
Why would multiple reboots make sense? I can accept three reboots triggering some condition that tries three times before it stops trying, but fifteen?
It sounds like a race condition; you want the CrowdStrike updater to start and pull down the fix before the affected virus definition file is loaded and kills the box.
If you keep rebooting, you may eventually get those to load in the right order.
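To put a rough number on that (purely a back-of-the-envelope sketch; the per-boot success probability here is made up, not anything CrowdStrike published): if the updater wins the race with probability p on each boot, the number of reboots needed follows a geometric distribution, so while the average machine recovers quickly, a noticeable tail needs 15 or more attempts.

```python
import random

def boots_until_fix(p_updater_wins, rng, max_boots=1000):
    """Count reboots until the updater pulls down the fix before the
    bad channel file loads (assumed success probability p per boot)."""
    boots = 1
    while rng.random() >= p_updater_wins:
        boots += 1
        if boots >= max_boots:
            break
    return boots

rng = random.Random(42)
samples = [boots_until_fix(0.2, rng) for _ in range(10_000)]
# With an assumed 20% chance per boot, the mean is about 1/0.2 = 5
# reboots, but the geometric tail is long: a few percent of machines
# need 15 or more.
print(sum(samples) / len(samples))
print(sum(1 for s in samples if s >= 15) / len(samples))
```

That long tail is consistent with "several reboots, as many as 15" showing up in customer reports without any retry counter existing in the software at all.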
Anyway, I had that confirmed by whoever is in charge of security and whatnot: we have an internal StackOverflow clone, and someone asking the same question was pointed to a PowerPoint presentation of dos and don'ts.
I'm not sure if this has anything to do with it, but I got an email this morning at 2 AM saying that my Microsoft account password was changed. The email was authentic: it came from Microsoft's servers and had no buttons for me to click. It said that the IP address of the password change was my own ISP in my area... and that the reset came from a phone number I didn't recognize in my security info, one that even now does not show on my account dashboard.
I ran in a panic to reset my password... only to discover the password had been "changed" to the exact same password? And how would they even have gotten into the account without 2FA? My "sign-in history" also showed no trace of anything unusual. At this point, reading these headlines, I feel more confident something's broken.
The Crowdstrike issue is causing the Azure issue, as I understand it; lots of machines using that software all updated to the blue screen feature around 6 PM, apparently.
Furthermore, the article states that it was "affecting customers across the Central US region". Doesn't it seem like they should have used "5:00 PM CDT" in the article?
This entire thing is not surprising, but it's fascinating to see unfold, in many dimensions; even just listening to the wild speculation and weird reporting around it.
Antivirus software always works as a driver in the kernel; there's no other way. You'll get the same on Linux, for example. On macOS it may be slightly better (if I remember right, Darwin is a microkernel), but in fact a broken driver can still crash the system there.
This page is only about some APIs that are now supposed to be called through wrappers. I would say it significantly limits developers, and it may also introduce additional flaws.
Seems like microsoft is getting put into the headlines because the Crowdstrike versions for mac and linux aren't affected. Plus Crowdstrike has a history of pushing people onto microsoft tech support when their software causes problems.
Microsoft should have a certification program for software that messes around with kernel modules and could cause a BSOD. I guess they already do this for hardware drivers.
Once software is affecting a certain number of endpoints, the vendor should have to prove they are testing sufficiently.
Crowdstrike has caused kernel panics on Linux systems in the past too. I push FOSS hard and will hold that this is an example where using FOSS for all critical business software would have saved companies, but I don't particularly blame Microsoft directly for this outage. Incompetent IT managers buying software from their golf buddies is the real heart of the issue.
I don't blame Microsoft in the tactical sense for this outage, but I blame them in the strategic sense. Here's what I mean.
Microsoft is all in on kernel extensions, when Apple has shown that you can deprecate them and move the most important use cases outside the kernel. I blame Microsoft for not starting the herculean task of deprecating kernel extensions. Remember, Satya Nadella said recently that Microsoft would put security above everything else, even backward compatibility. Then the Recall fiasco happened and Nadella was caught with his pants down, his word shown to be worthless.
Apple gets away with a lot of things MS would never get away with, so I am not sure that saying "Apple did it" is a good argument. It still may be a good idea to move away from kernel modules; I would tend to agree with that in general. But there is a whole class of problems that comes from allowing third parties to push updates to your critical IT systems, kernel or not.
You’re right that Apple can get away with things Windows can’t. But in the case of kernel extensions, I think Apple has shown a valid path Microsoft should take.
Blaming IT managers is akin to blaming a bus driver when the bridge collapses.
IT managers are required to implement AV for compliance. They'd lose their jobs if they refused to do it and either failed an audit or got hit by common malware.
It's ironic to see so much praise this morning for the macOS model for kernel extensions. The fact of the matter is Microsoft allows 3rd party code to do exactly what happened today, and the industry leader in endpoint security completely failed to QA their updates.
Bus drivers allowed themselves to be assigned routes over bridges that were not engineered or built well enough.
It's ridiculous to put the blame on IT managers for this. The fact that Microsoft Azure itself has been affected shows that this is not a problem any middle manager at a small firm should be held responsible for.
It's obvious that you have never served in corporate IT management, so this is a pointless conversation. I'll just leave this here for anyone else who may be interested.
In many/most orgs, the windows sysadmins and IT management will not have access to the crowdstrike console. The CISO and security teams are completely separate, and operate independently. They will always push for the business need to deploy policy updates at any time, as the threat landscape will not wait for the next patch cycle. And actually, this has been the right call and has objectively prevented compromise many times over.
You were so close to getting this right, too. It's crowdstrike that didn't follow proper software deployment practices when they pushed their update to the channel. Where was their QA? They know how many of their customers have auto-updates enabled (every one of them).
I once worked in an online cloud environment, and the security team had crowdstrike installed everywhere, with the express promise that in production it would always be in monitor-only mode. Well, suddenly our monitoring pods began to get SIGKILL'ed before they could even init, and it took days of investigation.
Root cause? Infosec turned on enforcement in crowdstrike and didn't tell anyone. We didn't have access to their splunk logs, nothing got logged locally, and naturally nobody on my side of the org even knew crowdstrike was running, because we had to fight just to get SSH access to the k8s nodes.
But sure, blame the IT manager who is now desperately trying to bring up their entire AD infra. That's helpful.
"IT managers" is an awfully big umbrella term. It can refer to anyone making decisions like this: the CIO, the CSO, the CFO (if over IT, as in a lot of organizations), and even the CEO. Responsibility may also extend, to a lesser extent, to other mid- and senior-level managers within the chain of decision making who didn't speak up. In my last company I was an individual contributor, and if someone had proposed placing an auto-updating, closed-source, kernel-modifying program in my critical path, you can bet I would have loudly spoken out against it.
You don't like the term "IT managers"? Let's call them "company managers"; does that suit better?
You as a company allowed an external entity to push a completely uncontrolled update to all your production envs. What if crowdstrike had been hacked and instead of a BSOD you got a cryptolocker?
If you don't realize how crazy this is, then I have nothing to add.
I am sorry but I don't really understand what you are arguing for.
I can see what you are arguing against: the unchecked autoupdate policy for critical software. The problem here is that almost nobody does that anymore, especially because of the overhead it caused across the industry. To replace that overhead, there are contracts in place so that if a supplier messes up, they are held financially accountable. It's called an SLA.
Now, as for virus protection: AFAIK nobody ever gated AV updates. OS updates, yes, OS upgrades, even more so. But AV? Not to my knowledge.
What you seem to be arguing for is unrealistic. Consider a 0-day exploit, with AV vendors frantically pushing updates to fight it, while IT fails to gate the update in time. Time and time again, auto-updates have saved our collective a*es.
The IT managers are definitely held accountable, so they will definitely insist on NOT gating updates.
Crowdstrike should be held accountable for this fiasco, not the individuals who manage each and every company's IT infrastructure. If a company doesn't survive this, that is then a failure of that company's leadership.
AV is still software, special kind of software but still software.
If you look at how "normal" software updates are handled around the world, you will see a recurring pattern:
- Updates are first done in test envs and then in production.
- Large production envs are updated in "waves".
- Critical updates may go directly into production, but when that happens they need extra authorization and awareness.
Please tell me why this pattern can't be applied to AV updates.
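For what it's worth, the wave pattern above takes only a few lines to sketch. This is an illustrative toy (every name here is hypothetical, not any vendor's actual deployment API): deploy wave by wave, and halt the moment a wave's health check fails, so production never sees an update that crashed the test env.

```python
# Toy sketch of a staged ("wave") rollout policy for content updates.
# All names are hypothetical illustrations, not a real vendor API.
from dataclasses import dataclass

@dataclass
class Wave:
    name: str
    hosts: list

def rollout(waves, deploy, health_check):
    """Deploy to each wave in order; halt on the first unhealthy wave."""
    completed = []
    for wave in waves:
        for host in wave.hosts:
            deploy(host)
        if not health_check(wave):
            return completed, wave  # stop: later waves are never touched
        completed.append(wave.name)
    return completed, None

waves = [
    Wave("test-env", ["test-1", "test-2"]),
    Wave("prod-canary", ["prod-1"]),
    Wave("prod-rest", ["prod-2", "prod-3"]),
]
# Simulate the update crashing the test environment:
done, halted = rollout(waves, deploy=lambda h: None,
                       health_check=lambda w: w.name != "test-env")
print(done, halted.name)  # [] test-env -- production never got the update
```

The whole argument in the comment above boils down to that one early-return: a bad update burns the test env (or a canary wave), not the entire fleet at once.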
> Now, as for virus protection: AFAIK nobody ever gated AV updates.
What happened today tells us that it is a bad practice. By the way, I'm aware of a customer of mine that didn't incur any issue in production, because they updated their test env first and spotted the problem.
> The Crowdstrike should be held accountable for this fiasco and not the individuals
We agree that crowdstrike should be held accountable, but they are not the only one.
What happened today tells us that there is a big hole to be plugged, and it can be fixed only by individual companies. Crowdstrike, if it survives, can improve its QA process, but who will guarantee that it won't happen again? What about other vendors? You should always assume that everything can fail, and adopt processes that help prevent and mitigate these failures.
Again, think for just a minute: what would the consequences of today's fiasco have been if, instead of a bad file, they had pushed a cryptolocker or a trojan?
Solarwinds tells us that this is not a hypothetical scenario but a real risk.
Not just take down the OS, but render the machine completely unbootable until a config file gets manually removed from the HD by starting in Safe Mode from the console.
Putting aside the semantic swamp of figuring out what an app actually refers to, this seems to be exactly the problem. They should have their signing keys revoked.
https://azure.status.microsoft/en-us/status/history/ doesn't seem to have links to the individual incidents. Some reports claim the Azure / Microsoft 365 outages were related to crowdstrike, but this sounds like an entirely separate incident.
AFAIK the broken crowdstrike channel update happened at 2024-07-19 06:05 UTC and was "fixed" (rolled back) at 06:47 UTC, but I don't have a proper source for that timeline?
EDIT: https://azure.status.microsoft/en-gb/status claims 2024-07-18 19:00 UTC as the approximate start of impact for the crowdstrike update. It would be nice to find a proper source for the start and mitigation timelines...
EDIT: reddit threads reporting symptoms start at approx 2024-07-19 05:00 UTC. That would mean the crowdstrike impact started soon after the azure recovery.
---
What happened?
Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services. A storage incident impacted the availability of Virtual Machines which may have also restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.
What do we know so far?
We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.
How did we respond?
21:56 UTC on 18 July 2024 – Customer impact began
22:13 UTC on 18 July 2024 – Storage team started investigating
22:41 UTC on 18 July 2024 – Additional Teams engaged to assist investigations
23:27 UTC on 18 July 2024 – All deployments in Central US stopped
23:35 UTC on 18 July 2024 – All deployments paused for all regions
00:45 UTC on 19 July 2024 – A configuration change as the underlying cause was confirmed
01:10 UTC on 19 July 2024 – Mitigation started
01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery
02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered
03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery
03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources
Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.
The German BSI (Federal Office for IT Security) quotes the advisory from CrowdStrike (which is behind their customer login portal) as saying you need to roll-back to snapshots prior to 04:09 UTC:
With some of the worst engineers. I had the misfortune of working with a couple of Microsoft principal engineers, and I still shudder at how bad they were and how many layers of abstraction they could write.
> We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage.
https://azure.status.microsoft/en-ca/status