Major Microsoft 365 outage caused by Azure configuration change (bleepingcomputer.com)
102 points by doener 76 days ago | 64 comments



That's an interesting recommendation...

> We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage.

https://azure.status.microsoft/en-ca/status


Why would multiple reboots make sense? I can accept three reboots triggering some condition that tries three times before it stops trying, but fifteen?


In my experience, dating back to 286 machines, sometimes even kicking the box (literally) solves the problem.


Put on yer “re-bootin’” boots !


They don't call them reboots for nothing.


Ahaha what nice memories you brought back!


Have you tried turning it off and on and off and on again?


It sounds like a race condition; you want the CrowdStrike updater to start and pull down the fix before the affected virus definition file is loaded and kills the box.

If you keep rebooting, you may eventually get those to load in the right order.


Sounds a lot like my work laptop's antivirus software, which gives me a total of two minutes of internet from within WSL2 before blocking it entirely.


Out of curiosity, are you sure it's the antivirus? WSL networking is weird and can be a pain in the ass, especially if VPNs are used.

Are you familiar with wsl-vpnkit?


Not familiar with this tool, sorry.

Anyway, I had it confirmed by whoever is in charge of security and whatnot: we have an internal Stack Overflow clone, and someone asking the same question was pointed to a PowerPoint presentation of dos and don'ts.


I see. For me it was worse with WSL1, where all file operations were scanned by slow enterprise DLP software.

Now with WSL2, most of the pain I see is related to broken routing when also using a VPN on the host. That is where wsl-vpnkit shines.


I've found disabling/enabling the virtual network bridge fixes VPN issues.

Of course, in Windows 11 Microsoft decided to hide the bridge for some unknown reason, so good luck figuring that out...


That's what I thought - it would have to be something probabilistic with a 99% chance within 15 reboots or so.
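
For a rough sense of the numbers: if you model each reboot as an independent coin flip where the updater wins the race with some fixed probability p (a simplifying assumption on my part, not anything Microsoft or CrowdStrike have stated), the per-boot odds needed to hit 99% within 15 attempts are surprisingly modest:

    # Minimal sketch: treat each reboot as an independent Bernoulli trial where
    # the updater pulls down the fix before the bad update loads. Purely
    # illustrative; real boots are unlikely to be independent like this.

    def p_recovered(per_boot_p: float, reboots: int) -> float:
        """Chance of at least one successful boot within `reboots` attempts."""
        return 1 - (1 - per_boot_p) ** reboots

    def per_boot_p_needed(target: float, reboots: int) -> float:
        """Per-boot success probability needed to reach `target` within `reboots` attempts."""
        return 1 - (1 - target) ** (1 / reboots)

    print(per_boot_p_needed(0.99, 15))  # ~0.26: a ~26% chance per boot gives 99% within 15
    print(p_recovered(0.26, 15))        # ~0.989

So even a fairly unreliable per-boot race would explain the "as many as 15 reboots" guidance.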


I'm not sure if this has anything to do with it, but I got an email this morning at 2 AM saying that my Microsoft account password was changed. The email was authentic - it came from Microsoft's servers and had no buttons for me to click. It said that the IP address of the password change was my own ISP in my area... and that the reset came from a phone number I didn't recognize in my security info, which even now does not show on my account dashboard.

I ran in a panic to reset my password... only to discover that the password had been "changed" to the exact same password? And how would they even get into the account without 2FA? My "sign-in history" also showed no trace of anything unusual. At this point, and reading these headlines, I feel more confident something's broken.


And CrowdStrike hitting them at the same time must have been just unbelievable.


As I understand it, the CrowdStrike issue is causing the Azure issue: lots of machines running that software all updated to the "blue screen feature" around 6 PM, apparently.


Nope. Azure does not run CrowdStrike. It doesn't even run Windows as you know it.


Ah, you are right, I misread the Microsoft notification.


Also, Microsoft using CrowdStrike instead of their own Defender for Endpoint?

I guess it could be part of a risk mitigation strategy to not put all their eggs in that product.


OK, it's a pedantic nit, but doesn't anybody have editors anymore?

> This massive outage started around 6:00 PM EST

We're currently in daylight saving time, so the time should read "EDT".


Furthermore, the article states that it was "affecting customers across the Central US region". Doesn't it seem like they should have used "5:00 PM CDT" in the article?


This entire thing is not surprising, but it's fascinating to see it unfold in many dimensions, even listening to the wild speculation and weird reporting around it.


I guess we can call it a "black Thursday" for Microsoft.


Why is MS getting any blame for the Crowdstrike thing?


They allow a third party to change kernel-level code with an update. Apple banned this a while ago.


It is supposed to be like that.

Antivirus software always works as a driver in the kernel; there is no other way. You'll get the same on Linux, for example. On macOS it may be slightly better (if I remember right, Darwin is a micro-kernel), but a broken driver can still crash the system there.


> Antivirus software always works as a driver in the kernel, no other way.

You're confidently wrong: https://developer.apple.com/support/kernel-extensions/


This page is only about some APIs that are now supposed to be called through wrappers. I would say it significantly limits developers, and may also introduce additional flaws.


Yet it is how antivirus works on Mac now.


*banned

Made it a lot harder for everyone involved, but still possible, as it’s a very useful technique.

PS: Since I'm being downvoted, here is the link showing that it is still possible using Reduced Security:

https://support.apple.com/en-gb/guide/mac-help/mchl768f7291/...

I doubt Apple will completely disable kexts in the near future. Making it hard enough to be impractical has most of the benefits already.


Seems like Microsoft is getting put into the headlines because the CrowdStrike versions for Mac and Linux aren't affected. Plus, CrowdStrike has a history of pushing people onto Microsoft tech support when their software causes problems.


Microsoft should have a certification program for software that messes around with kernel modules that could cause a BSOD. I guess they already do this for hardware drivers.

Once software is affecting a certain number of endpoints, the vendor has to prove they are testing sufficiently.


I would strongly oppose such restrictions. Of course, I use Linux, so I can install whatever kernel modules I want and no one can stop me :)


Hot take, but I think you should have to build from source to be able to modify the kernel. Kernel extensions have no place in our modern world.


Crowdstrike has caused kernel panics on Linux systems in the past too. I push FOSS hard and will hold that this is an example where using FOSS for all critical business software would have saved companies, but I don't particularly blame Microsoft directly for this outage. Incompetent IT managers buying software from their golf buddies is the real heart of the issue.


I don't blame Microsoft in the tactical sense for this outage, but I blame them in the strategic sense. Here's what I mean.

Microsoft is all in on kernel extensions, when Apple has shown that you can deprecate them and move the most important use cases outside the kernel. I blame Microsoft for not starting the herculean task of deprecating kernel extensions. Remember, Satya Nadella said recently that Microsoft would put security above everything else, even backward compatibility. Then the Recall fiasco happened, and Nadella was caught with his pants down, showing how little his word is worth.


Apple gets away with a lot of things MS would never get away with, so I am not sure that saying "Apple did it" is a good argument. It still may be a good idea to move away from kernel modules; I would tend to agree with that in general. But there is a whole class of problems that come from allowing third parties to push updates to your critical IT systems, kernel or not.


You’re right that Apple can get away with things Windows can’t. But in the case of kernel extensions, I think Apple has shown a valid path Microsoft should take.


An HN user in another thread claimed the same issue happened last week with either Mac or GNU/Linux; I don't recall which.


Well, if you look at all the news sites, they highlight that it is a Windows issue.

I personally think that IT managers should be blamed for the disaster, but in the collective imagination this will be a Microsoft/Windows problem.


Blaming IT managers is akin to blaming a bus driver when the bridge collapses.

IT managers are required to implement AV for compliance. They'd lose their jobs if they refused to do it and either failed an audit or got hit by common malware.

It's ironic to see so much praise this morning for the macOS model for kernel extensions. The fact of the matter is Microsoft allows 3rd party code to do exactly what happened today, and the industry leader in endpoint security completely failed to QA their updates.


IT managers allowed an external entity to do a massive update on all their devices and servers.

I'm not saying don't install AV or don't update it, just update system-test envs first and then roll out patches gradually, not all at once.


Bus drivers allowed themselves to be assigned routes over bridges that were not engineered or built well enough.

It's ridiculous to put the blame on IT managers for this. The fact that Microsoft Azure itself has been affected by this shows that this is not a problem that any middle manager at a small firm should be held responsible for.


Are you serious?

If I push a code change to production before going through the test environment and it causes trouble, I'll be fired.

We already have rules and procedures that work for "normal" software; why can't they be applied to AV software?


It's obvious that you have never served in corporate IT management, so this is a pointless conversation. I'll just leave this here for anyone else who may be interested.

In many/most orgs, the windows sysadmins and IT management will not have access to the crowdstrike console. The CISO and security teams are completely separate, and operate independently. They will always push for the business need to deploy policy updates at any time, as the threat landscape will not wait for the next patch cycle. And actually, this has been the right call and has objectively prevented compromise many times over.

You were so close to getting this right, too. It's crowdstrike that didn't follow proper software deployment practices when they pushed their update to the channel. Where was their QA? They know how many of their customers have auto-updates enabled (every one of them).

I once worked in an online cloud environment, and the security team had CrowdStrike installed everywhere, with the express promise that in production it would always be in monitor-only mode. Well, suddenly our monitoring pods began to get SIGKILLed before they could even init, and it took days of investigation.

Root cause? Infosec turned on enforcement in CrowdStrike and didn't tell anyone. We didn't have access to their Splunk logs, nothing got logged locally, and naturally nobody on my side of the org even knew CrowdStrike was running, because we had to fight just to get SSH access to the k8s nodes.

But sure, blame the IT manager who is now desperately trying to bring up their entire AD infra. That's helpful.


"IT managers" is an awfully big umbrella term. It can refer to anyone making decisions like this. The CIO, the CSO, the CFO (if over IT as in a lot of organizations) and even the CEO. Responsibility may also extend to other mid and senior level managers within the chain of decision making who didn't speak up, to a lesser extent. In my last company I was an individual contributor, and if some one proposed placing an autoupdating, closed source kernel changing program in my critical path, you can bet I would have loudly spoken out against it.


You don't like the term "IT managers"? Let's call them "company managers"; does that suit better?

You as a company allowed an external entity to push a completely uncontrolled update to all your production envs. What if CrowdStrike had been hacked and instead of a BSOD you got a cryptolocker?

If you don't realize how crazy this is, then I have nothing to add.


I am sorry but I don't really understand what you are arguing for.

I can see what you are arguing against: the unchecked auto-update policy for critical software. The problem here is that almost nobody does that anymore, especially because of the overhead this caused across the industry. To replace that overhead, there are contracts in place so that if a supplier messes up, they will be held financially accountable. It's called an SLA.

Now, as for virus protection: AFAIK nobody ever gated AV updates. OS updates, yes, OS upgrades, even more so. But AV? Not to my knowledge.

What you seem to be arguing for is unrealistic. Consider a 0-day exploit that AV vendors are frantically pushing updates to fight against, but IT fails to get the update through its gates in time. Time and time again, auto-updates have saved our collective a*es.

The IT managers are definitely held accountable, so they will definitely insist on NOT gating updates.

CrowdStrike should be held accountable for this fiasco, not the individuals who manage each and every company's IT infrastructure. If said company survives this, then that is a failure of our companies' leadership.


AV is still software; a special kind of software, but still software.

If you look at how "normal" software updates are handled around the world, you will see a recurring pattern:

- updates are first done in test envs and then in production;
- large production envs are updated in "waves";
- critical updates may go directly into production, but when it happens they need extra authorization and awareness.

Please tell me why this pattern cannot be applied to AV updates?
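
For what it's worth, here's a minimal sketch of what that wave pattern could look like for an AV content update; the ring names, fleet shares, bake times, and health check are all made up for illustration, not anything CrowdStrike or any other vendor actually does:

    # Hypothetical staged ("wave") rollout for an AV content update.
    # Ring names, fleet shares, and bake times are invented; the health check
    # stands in for real crash/BSOD telemetry.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Ring:
        name: str
        fleet_share: float   # cumulative fraction of the fleet once this ring is done
        bake_minutes: int    # how long to watch health signals before promoting

    RINGS = [
        Ring("test-env", 0.001, 60),   # lab/test machines first
        Ring("canary",   0.01,  60),
        Ring("early",    0.10, 120),
        Ring("broad",    0.50, 240),
        Ring("all",      1.00,   0),
    ]

    def rollout(update_id: str, healthy: Callable[[str, Ring], bool]) -> bool:
        """Push `update_id` ring by ring; halt and roll back on bad health signals."""
        for ring in RINGS:
            print(f"deploying {update_id} to {ring.name} "
                  f"({ring.fleet_share:.1%} of fleet), baking {ring.bake_minutes} min")
            if not healthy(update_id, ring):
                print(f"health regression in {ring.name}: halting rollout, rolling back")
                return False
        return True

    # Example: an update that only misbehaves outside the lab stops at the canary
    # ring and never reaches 99% of the fleet.
    rollout("content-update-42", lambda _id, ring: ring.name == "test-env")

Critical 0-day responses could still jump rings with explicit sign-off, which keeps the "extra authorization and awareness" part of the pattern intact.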

> Now, as for virus protection: AFAIK nobody ever gated AV updates.

What happened today tells us that it is a bad practice, and btw I'm aware of some customers of mine that didn't have any issue in production because they updated the test env first and spotted the problem.

> The Crowdstrike should be held accountable for this fiasco and not the individuals

We agree that CrowdStrike should be held accountable, but they are not the only ones.

What happened today tells us that there is a big hole to be plugged, and it can be fixed only by individual companies. CrowdStrike, if it survives, can improve its QA process, but who will guarantee that it won't happen again? What about other vendors? You should always assume that everything can fail, and adopt processes that help prevent and mitigate these failures.

Again, think just for a minute: what would the consequences of today's fiasco have been if, instead of a bad file, they had pushed a cryptolocker or a trojan? SolarWinds tells us that this is not a hypothetical scenario but a real risk.


Because their OS allowed a third-party tool to crash it, worldwide, with no automated way of recovery?


Letting an app take down your OS, perhaps.


Not just take down the OS, but render the machine completely unbootable until a config file gets manually removed from the HD by starting in Safe Mode from the console.


It's not an "app", it's a kernel driver. A serious driver bug is catastrophic in nearly any OS.


Putting aside the semantic swamp of figuring out what an app actually refers to, this seems to be exactly the problem. They should have their signing keys revoked.


The Central US outage really exposed a lot of shortcomings, both at my company and, apparently, at MSFT.

Even having regional failover available wouldn't have helped, because the control plane was unreliable, meaning failover couldn't be triggered by anyone.


https://azure.status.microsoft/en-us/status/history/ doesn't seem to have links to the individual incidents. Some reports claim the Azure / Microsoft 365 outages were related to crowdstrike, but this sounds like an entirely separate incident.

AFAIK the broken crowdstrike channel update happened at 2024-07-19 06:05 UTC and was "fixed" (rolled back) at 06:47 UTC, but I don't have a proper source for that timeline?

EDIT: https://azure.status.microsoft/en-gb/status claims 2024-07-18 19:00 UTC as the approximate start of impact for the crowdstrike update. It would be nice to find a proper source for the start and mitigation timelines...

EDIT: reddit threads reporting symptoms start at approx 2024-07-19 05:00 UTC. That would mean the crowdstrike impact started soon after the azure recovery.

---

What happened?

Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services. A storage incident impacted the availability of Virtual Machines which may have also restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.

What do we know so far?

We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.

How did we respond?

21:56 UTC on 18 July 2024 – Customer impact began

22:13 UTC on 18 July 2024 – Storage team started investigating

22:41 UTC on 18 July 2024 – Additional Teams engaged to assist investigations

23:27 UTC on 18 July 2024 – All deployments in Central US stopped

23:35 UTC on 18 July 2024 – All deployments paused for all regions

00:45 UTC on 19 July 2024 – A configuration change was confirmed as the underlying cause

01:10 UTC on 19 July 2024 – Mitigation started

01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery

02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered

03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery

03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources

Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.


The German BSI (Federal Office for Information Security) quotes the advisory from CrowdStrike (which is behind their customer login portal) as saying you need to roll back to snapshots taken prior to 04:09 UTC:

https://www.bsi.bund.de/SharedDocs/Cybersicherheitswarnungen...

So your Redditor saying 05:00 UTC seems to be close.


https://old.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_e... quotes the CrowdStrike advisory verbatim, which has timestamps of 04:09 UTC for the problematic version and 05:27 for the reverted (good) version.
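
Putting those timestamps next to the Azure window quoted upthread (21:56 UTC on 18 July to 12:15 UTC on 19 July, with compute mitigation confirmed at 03:41 UTC), a quick datetime check of the deltas; nothing here beyond the numbers already posted in this thread:

    # Timestamps are the ones quoted in this thread (all UTC, 18-19 July 2024).
    from datetime import datetime, timezone

    UTC = timezone.utc
    azure_impact_start  = datetime(2024, 7, 18, 21, 56, tzinfo=UTC)
    azure_compute_fixed = datetime(2024, 7, 19,  3, 41, tzinfo=UTC)  # "mitigation confirmed"
    azure_full_recovery = datetime(2024, 7, 19, 12, 15, tzinfo=UTC)
    cs_bad_update_live  = datetime(2024, 7, 19,  4,  9, tzinfo=UTC)
    cs_reverted         = datetime(2024, 7, 19,  5, 27, tzinfo=UTC)

    print(cs_bad_update_live - azure_compute_fixed)  # 0:28:00 -- CrowdStrike hit ~28 min after Azure compute mitigation
    print(cs_reverted - cs_bad_update_live)          # 1:18:00 -- the bad update was live for ~78 minutes
    print(azure_impact_start < cs_bad_update_live < azure_full_recovery)  # True -- it lands in Azure's recovery tail

Which matches the parent's read that these were two separate incidents, with the CrowdStrike one starting shortly after Azure's compute mitigation was confirmed.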


Honestly, is anyone surprised that Microsoft is having these issues? They still run the same sloppy software they always have.


With some of the worst engineers. I had the misfortune of working with a couple of Microsoft principal engineers, and I still shudder at how bad they were and how many layers of abstraction they could write.


Not at MS, but I had the misfortune of working with such "engineers" as well.

I believe the term is "architecture astronauts".


[flagged]


What's so interesting about that?


Two calamities so close together in space and time. It makes you wonder if there's a connection.

They are both universally distracting. That's one special thing that they share.

A conspiracist might suggest that our attention is being handled.



