It feels like there are some missing dots and connections here: I see how a concurrency or memory safety bug can accidentally exfil a private key into a debugging artifact, easily, but presumably the attacker here had to know about the crash, and the layout of the crash dump, and also have been ready and waiting in Microsoft's corporate network? Those seem like big questions. "Assume breach" is a good network defense strategy, but you don't literally just accept the notion that you're breached.
> but presumably the attacker here had to know about the crash, and the layout of the crash dump
If I were an advanced persistent threat attacker working for China who had compromised Microsoft's internal network via employee credentials (and I'm not), the first thing I'd do is figure out where they keep the crash logs and quietly exfil them, alongside the debugging symbols.
Often, these are not stored securely enough relative to their actual value. Having spent some time at a FAANG, every single new hire, with the exception of those who have worked in finance or corporate regulation, assumes you can just glue crash data onto the bugtracker (that's what bugtrackers are for, tracking bugs, which includes reproducing them, right?). You have to detrain them of that and you have to have a vault for things like crashdumps that is so easy to use that people don't get lazy and start circumventing your protections because their job is to fix bugs and you've made their job harder.
With a compromised engineer's account, we can assume the attacker at least has access to the bugtracker and probably the ability to acquire or generate debug symbols for a binary. All that's left then is to wait for one engineer to get sloppy and paste a crashdump as an attachment on a bug, then slurp it before someone notices and deletes it (assuming they do; even at my big scary "We really care about user privacy" corp, individual engineers were loath to make a bug harder to understand by stripping crashlogs off of it unless someone in security came in and whipped them. Proper internal opsec can really slow down development here).
> but presumably the attacker here had to know about the crash, and the layout of the crash dump
Another statement from the article:
> Our credential scanning methods did not detect its presence (this issue has been corrected).
The article does not give any timeline when things happened.
Imagine the following timeline:
- hacker gets coredump in 2021, doesn't know that it contains valuable credentials.
- For data retention policy reasons, Microsoft deletes their copy of the coredump — but hacker just keeps it.
- Microsoft updates its credential scanning methods.
- Microsoft runs updated credential software over their reduced archive (retention policy) of coredumps. As that particular coredump doesn't exist anymore at Microsoft, they are not aware of the issue.
- hacker gets scanner update.
- hacker runs updated credential scanner software over their archive of coredumps. Jackpot.
>... you have to have a vault for things like crashdumps that is so easy to use that people don't get lazy...
Let's assume a crash dump can be megabytes up to gigabytes big.
How could a vault handle this securely?
The moment it is copied from the vault to the developer's computer, you introduce data remanence (undelete from file system).
Keeping such a coredump purely in RAM makes it accessible on a compromised developer machine (GNU Debugger), and if the developer machine crashes, its coredump contains/wraps the sensitive coredump.
A vault that doesn't allow direct/full coredump download, but allows queries (think "SQL queries against a vault REST API") could still be queried for e.g. "select * from coredump where string like '%secret_key%'".
So without more insight, a coredump vault sounds like security theater that makes the intended work tremendously more difficult.
Everything is imperfect, but where I work crashdumps are uploaded straight to a secure vault and then deleted from the origin system. The dumps are processed, and insensitive data is extracted and published with relatively lenient access controls. Sensitive data, such as raw memory dumps, require a higher tier of permissions. In order to be eligible for that higher tier, your developer machine has to be more locked down than that of people who are not in the secure group. (You also need to have a reason to need more access.)
Given that stack traces, crash addresses, and most register contents are considered to be security insensitive, most people don't really need access to the raw dumps.
It's far from perfect, but it would be unfair to call it "security theater". It seems like a pretty decent balance in practice. Admittedly, we have the slight advantage of several hundred million installs, so the actual bugs that are causing crashes are likely to happen quite a few times and statistical analysis will often provide better clues than diving deep into an individual crash dump.
> Everything is imperfect, but where I work crashdumps are uploaded straight to a secure vault and then deleted from the origin system. The dumps are processed, and insensitive data is extracted and published with relatively lenient access controls. Sensitive data, such as raw memory dumps, require a higher tier of permissions. In order to be eligible for that higher tier, your developer machine has to be more locked down than that of people who are not in the secure group. (You also need to have a reason to need more access.)
From my understanding, this is more or less how the Microsoft system was designed with credential scanning and redaction over coredumps, but a chain of bugs and negligence broke the veil.
While your points are all valid theoretically, keeping stuff off of developer filesystems can still help a lot practically.
This attacker probably (it's unclear, since the write-up doesn't tell us) scanned compromised machines for key material using some kind of dragnet scanning tool. If the data wasn't on the compromised filesystem, they wouldn't have found it. Even though perhaps in theory they could have sat on a machine with debug access (depending on the nature of the compromise, this is even a stretch - reading another process's RAM usually requires much higher privilege than filesystem access) and obtained a core dump from RAM.
Security is always a tension between the theoretical and the practical and I think "putting crash dumps in an easy-to-use place that isn't a developer's Downloads folder" isn't a bad idea.
Ephemeral compute/VM/debug environment with restricted access.
Tear down the environment after the debugging is done.
Keeping the crash dumps in a vault presumably allows more permission/control than an internal issue tracker (usually anyone can access the issue tracker). At least a vault can apply RBAC or even time-based policies so these things aren't lying around forever.
The article says that the employee compromise happened some time after the crash dump had been moved to the corporate network. It says that MS don't have evidence of exfil, but my reading is that they do have some evidence of the compromise.
The article also says that Microsoft's credential scanning tools failed to find the key, and that issue has now been corrected. This makes me think that the key was detectable by scanning.
Overall, my reading of this is that the engineer moved the dump containing the key into their account at some point, and it just sat there for a time. At a later point, the attacker compromised the account and pulled all available files. They then scanned for keys (with better tooling than MS had; maybe it needed something more sophisticated than looking for BEGIN PRIVATE KEY), and hit the jackpot.
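For what it's worth, a scanner that goes one step beyond grepping for BEGIN PRIVATE KEY is not much code. A minimal sketch, assuming the dump is available as raw bytes; `scan_dump`, the 256-byte window, and the 7.0 bits-per-byte threshold are made up for illustration and say nothing about how Microsoft's or the attacker's tooling actually works:

```python
import math
from collections import Counter

PEM_MARKER = b"-----BEGIN PRIVATE KEY-----"


def shannon_entropy(chunk: bytes) -> float:
    """Entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_dump(data: bytes, window: int = 256, threshold: float = 7.0):
    """Return a list of (offset, reason) findings for a raw dump."""
    findings = []

    # Pass 1: the naive check, literal PEM headers.
    start = data.find(PEM_MARKER)
    while start != -1:
        findings.append((start, "pem-marker"))
        start = data.find(PEM_MARKER, start + 1)

    # Pass 2: flag non-overlapping windows whose bytes look random,
    # which is what raw (non-PEM) key material tends to look like.
    for offset in range(0, len(data) - window + 1, window):
        if shannon_entropy(data[offset:offset + window]) >= threshold:
            findings.append((offset, "high-entropy"))

    return findings
```

Compressed or encrypted regions trip the entropy check too, and a key that straddles a window boundary can be missed, so real tooling would have to be smarter than this.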
Red teams and malicious actors have plenty of tools that automate the looting and look for juicy things: crash dumps, logs, and many others. The bottom line is that if there is a secret stored on disk somewhere, it won't take long for a proper actor to find it.
Oh, "jackpot" was just a figure of speech, I didn't intend to imply any particular probability. Not sure what the chance of finding sensitive information in the private files of an engineer is, but I would guess a lot better than one in a million. One in a hundred, maybe? One in ten?
I think the most likely explanation is that this actor routinely attempts to compromise big-tech engineers using low-sophistication means, then grabs whatever they can get. Keep doing that often enough, for long enough, and you get something valuable -- that's the "persistent" in APT.
It brings a lot of questions to the table about which employee knew what, and when. A real question is: under a "zero trust" environment, how many motivated insiders have they accumulated through their IT employment and contracting?
'After April 2021, when the key was leaked to the corporate environment in the crash dump, the Storm-0558 actor was able to successfully compromise a Microsoft engineer’s corporate account. This account had access to the debugging environment containing the crash dump which incorrectly contained the key.'
So either the attacker was already in the network and happened to find the dump while doing some kind of scanning that wasn't detected, or they knew to go after this specific person's account.
Or they knew/discovered that there was a repository of crash dumps - likely a widely known piece of information - and just grabbed as much as they could. Nothing in the write-up indicates any connection between the compromised engineer and this particular crash dump, other than they had access.
I believe there are somewhat standard tools for scanning memory dumps for cryptographic material, which have been around since the cold boot attack era. And I can imagine attackers opportunistically looking for crash dumps with that in mind. But it does seem like an awfully lucky (for the attacker) sequence of events...
I just checked the source and openssh doesn't appear to set madvise(MADV_DONTDUMP) anywhere :-( That seems like an oversight? For comparison openssl has a set of "secure malloc" functions (for keys etc) which uses MADV_DONTDUMP amongst other mitigations.
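To make the mitigation concrete, here is a minimal sketch of the same kernel hint issued from Python (3.8+) rather than C. `alloc_secret_buffer` and `wipe_and_close` are hypothetical helpers; `mmap.MADV_DONTDUMP` is only defined on Linux, so the code does nothing special elsewhere. It illustrates the idea and is not a description of what openssl's secure malloc does internally:

```python
import mmap


def alloc_secret_buffer(size: int = mmap.PAGESIZE):
    """Return (buffer, excluded).

    `excluded` is True when the kernel was asked to leave this
    mapping out of core dumps via madvise(MADV_DONTDUMP).
    """
    buf = mmap.mmap(-1, size)  # anonymous mapping, not backed by a file
    excluded = False
    # The constant and the madvise() method only exist on some platforms.
    if hasattr(mmap, "MADV_DONTDUMP") and hasattr(buf, "madvise"):
        try:
            buf.madvise(mmap.MADV_DONTDUMP)
            excluded = True
        except OSError:
            pass  # the kernel refused the hint; carry on without it
    return buf, excluded


def wipe_and_close(buf) -> None:
    """Overwrite the buffer with zeros before releasing it."""
    buf[:] = bytes(len(buf))
    buf.close()
```

Any copy made while the key is first loaded (file buffers, ordinary `bytes` objects) is not covered by the hint, which is one reason this is only a partial mitigation.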
sshd runs as root, so the core dumps would be readable as root-only, no? If you have root access already you could dump it even while it's still running with ptrace anyways
>sshd runs as root, so the core dumps would be readable as root-only, no
Yes, although the article we're discussing shows that you can't rely on that, the dump could be subsequently moved to a developer machine for investigation, and unencrypted key material left in could be compromised that way... defense in depth would make sense here.
Secret materials for ssh keys won’t be in sshd. They stay client side. Granted, host keys could be compromised, so you could impersonate a server, but an sshd key leak won’t give direct access.
MADV_DONTDUMP or MAP_CONCEAL don't appear anywhere in the source, client or server (with the exception of the seccomp filter where they're just used to filter potential system calls).
Key material aside, such a coredump could give some hints towards someone else’s capabilities, and point you in an interesting direction for finding new and exciting ways to own more shit.
There are entire systems engineering courses focused on failure resulting from a series of small problems that eventually in the right succession result in catastrophic failure. And I think we can say this was a catastrophic failure.
Think about it, first you need a race condition, and that race condition has to result in the unexpected result. That right there, assuming this code has been tested and is frequently used, is probably a less than 10% chance (if it was frequently happening someone would have noticed.) Then you need an engineer to decide they need this particular crash dump. Then you need your credential scanning software (which again, presumably usually catches stuff) to not be able to detect this particular credential. Now you need an account compromised to get network access and that user has access to this crash dump and the hacker happens to get to it and grabs it.
But even then, you should be safe because the key is old and is only good to get into consumer email accounts...except you have a bug that accepts the old key AND a bug that didn't reject this signing key for a token accessing corporate email accounts.
This is a really good system engineering lesson. Try all you want; eventually enough small things will add up to cause a catastrophic result. The lesson is, to the extent you can, engineer things so that when they blow up the blast radius is limited.
> that eventually in the right succession result in catastrophic failure.
With a caveat that when it comes to security the eventual succession doesn't come as a random process but will be actively targeted and exploited. The attackers are not random processes flipping coins, rather they can flip a coin that often lands on "heads", in their favor.
The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft, there just happened to be a race condition, and then a crash randomly happened, and then the attacker just happened to find the crash dump somewhere. We should consider even starting with the initial "race condition" bug, that it might have been inserted deliberately. The crash could have been triggered deliberately. An attacker may have been expecting the crash dump to appear in a particular place to grab it. The attacker may have had accomplices.
The other frightening possibility is that the attack surface targeted by persistent threat actors is so large that a breach becomes certain (the law of large numbers): when you have so many accounts owned that one of them will have the right access rights; when you have so many dumps one of them will have the key; etc ...
> The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft
Does it say that?
> the Storm-0558 actor was able to successfully compromise a Microsoft engineer’s corporate account
Race condition is the reason we all use to explain to management why we wrote a stupid bug. Everything is a race condition: "the masker is asynchronous so the writer starts writing dumps before the masker is set up" sounds like a completely moronic thing to do. Say there is a race condition, and people say "a less than 10% chance of happening", but what do we know, maybe it happens on each big crash, and it just doesn't crash that often.
Why isn't it masking before writing to disk? God only knows.
Crash handlers don't know what state the system will be in when they're called. Will we be completely out of memory, so even malloc calls have started failing and no library is safe to call? Are we out of disk space, so we maybe can't write our logs out anyway? Is storage impaired, so we can write but only incredibly slowly? Is there something like a garbage collector that's trying to use 100% of every CPU? Are we crashing because of a fault in our logging system, which we're about to log to, giving us a crash in a crash? Does the system have an alarm or automated restart that won't fire until we exit, which our crash handler delays?
It's pretty common to keep it simple in the crash handler.
Unknown unknown catastrophic failures like this one have always happened and will continue to happen, that's why we need resilience which, probably, means a less centralised worldview.
Which should probably mean that half (or more) of the Western business world relying on Outlook.com is a very wrong thing to have in place. But the current money incentives are not focused on resilience, nor on stuff like breaking super-centralized Outlook.com-like entities down, so I'm pretty sure we'll continue having events like this one well into the future.
Indeed. While reading that I thought to myself “gosh, that’s a lot of needles that got threaded right there”. It feels like the Voyager Grand Tour gravitationally-assisted trajectory… happening by mistake.
A lot of accident analysis reads like this (air accident reports especially tend to read like they've come from a writer who's just discovered foreshadowing). And often there's a few points where it could have been worse. There's a reason for the "Swiss cheese" model of safety. The main thing to remember is there's not just one needle: it's somewhere between a bundle of spaghetti and water being pushed up against the barriers, and that's before you assume malicious actors.
Yeah I get that, it’s not a single Voyager, it’s millions of them sent out radially in random directions and random speeds and one or two of them just happen to thread the needle and go on the Grand Tour. It’s just an impression. Plus as you say there’s the selective element of an intelligence deliberately selecting for an outcome at the end (which confusingly is also a beginning).
"reducing your blast radius" is never truly finished, so how do you know what is sufficient, or when the ROI on investing time/money is still positive?
* July 11 2023 this was caught, April 2021 it was suspected to have happened. So, 2+ years they had this credential, and 2 months from detection until disclosure.
* How many tokens were forged, how much did they access? I'm assuming bad if they didn't disclose.
* No timetable from once detected to fix implemented. Just "this issue has been corrected". Hope they implemented that quickly...
* They've fixed 4 direct problems, but obviously there's some systemic issues. What are they doing about those?
It's valid in a civil court where discovery processes exist. It doesn't really apply to public relations, where information could be withheld for a number of unknowable reasons. Of course everyone is free to speculate, but it's not supported by a link to theories of common law in civil torts.
A breach like that requires a very good understanding of Microsoft's internal infrastructure. It's safe to assume that the breach was a coordinated effort of a team of hackers. This is not a cheap effort, but the payback is enormous. Hyper-centralization leads to a situation when hackers concentrate their efforts on a few high-value targets because once they are successful, the catch is enormous. I'm pretty much sure that there are teams of (state-sponsored) hackers that are already doing deep research and analysis of the internal infrastructure of Google, Microsoft, Amazon, etc. The breach gives an idea of how well already the hackers understand it.
I would argue, it's time to decentralize inside a wider security perimeter.
You have to assume that you have nation state actors working at your organization at sufficient size. Unfortunately, it’s difficult to work around this assumption, because anyone can be compromised at any time.
I find it somewhat amusing that companies like Microsoft and Google that have pivoted a large portion of their business model to collecting, keylogging, recording, scanning, exfiltrating, telemetrizing, collating, inferring, and analyzing every last iota of data they can about as many people as possible under the guise of improving their products or personalizing ads...
... can't identify nation state actors within their own company.
I suppose that would be illegal. Whereas using it to improve AdSense CTR or selling it to brokers is perfectly acceptable.
If you are up against an adversary with an unlimited budget and organisational event horizon measured in years, your quarter-to-quarter thinking will always kneecap you.
I dunno, I reckon the amount Microsoft pay in defensive security and the amount China pay for offensive cyber security are going to be in the same order of magnitude.
The real advantage is that MS has to play at least somewhat inside the legal system.
I have to disagree, because it's more subtle than just (im)balance of budgets. It's about the highly asymmetric nature of the ongoing conflict.
A nation state has effectively unlimited budget in money, but more than that, their incentives are different. Any defender has to maintain an increasingly complex system with an evolving attack surface. A nation state attacker has to maintain ongoing access to any parts of that system. Access grants opportunities. They can wait, and can afford to do so. They have a massive time budget to tap. A company that does not prioritise or budget ongoing maintenance will eventually reassign its expensive resources to projects that do produce visible or at least measurable results. And in doing so, they neglect the unmeasurable outcomes from the prior projects that are now starved of proper resources. (Or even just attention.)
Compromises will happen. That's the ground truth. The important part is the blast radius.
In this particular case the impact was magnified by a string of failures. Missing or ineffective revocation of a signing certificate was a big factor, but the failure was further compounded by the applicable scope of what that certificate could sign things for. Those two process failures caused this incident - everything else is attributable to bugs.
In short, MS dropped the ball on an organisational level.
Because for them these operations are part of their military and intelligence spend. And in terms of allocation from the tax pot, both are highly privileged.
Right, but take Mossad, a well-funded intelligence agency. Their annual budget is estimated to be about $2.73bn[0] ... how much do you think Big Tech spend on cyber-security?
In total? Maybe an order of magnitude more - so across all of Big Tech, in all projects and ongoing maintenance activities that are directly enabling or powering their cybersecurity aspects... I'd say $25B. Altogether.
Let's look at the other side of the equation and see what they are up against.
Now, obviously Mossad won't be spending all of their $2B on offensive stuff in this space, but on a ballpark estimate I'd say their spend on offensive cyber[tm] is between $300M and $500M.
From vulnerability equity programs we know that the going rate to acquire a reliable exploit against a high-value, hard target is between $1M and $3M. So we can safely infer that it would cost at least that much to develop one from scratch. Let's be charitable and say that on average it costs around $2M to develop a bespoke exploit against a hardened, niche target. These exploits also have their shelf-life and will eventually get burned. Again, we can be charitable and say that on average an exploit remains useful for maybe 18 months. Over that time the adversaries will either have developed their own parallel methods, or will be buying another one to replace their now-expired product.
That puts the expense floor, before operation staffing costs, to somewhere around $750k per year just to stay in place with regards to access technology. For high-value, bespoke targets where no exploits are readily available from the brokers, you can still expect to spend about $2M per year to develop and maintain a matching capability.
Then let's consider the operation personnel costs. A well run intelligence operation is probably not a sweatshop. With our venerable Stetson-Harrison estimation method we can put an average headcount per operation to 9 people. An operation lead, two analysts, three software engineers, one project manager, and two support staff. Let's say the fully loaded cost for lead and manager is $300k each, for software engineers $250k each, and for analysts/support $190k each.
So a single operation could expect to have annual, ongoing costs just a hair under $4M. Around half for maintaining the access technology and the rest to keep the operation going. Let's also say that the intelligence agency maintains a discretionary buffer budget to absorb occasional one-off cost runs, so that if a project for some reason generates an extra $1.5M charge in one year, it'll be expected slippage and already accounted for.
At $4M per year, per project, and a minimum of $300M to spend on such projects, you can maintain a lot of access operations. For just one intelligence agency.
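The arithmetic above is easy to check. A quick sketch that just plugs in the same guesses (none of this is real budget data, and the function names are mine):

```python
# All inputs are the rough guesses from the estimate above, in USD.
ACCESS_TECH_PER_YEAR = 2_000_000  # develop and maintain a bespoke capability

# role -> (headcount, fully loaded annual cost per person)
STAFF = {
    "operation lead": (1, 300_000),
    "project manager": (1, 300_000),
    "software engineer": (3, 250_000),
    "analyst": (2, 190_000),
    "support": (2, 190_000),
}


def annual_operation_cost() -> int:
    """Yearly cost of one access operation: technology plus people."""
    staff_cost = sum(count * cost for count, cost in STAFF.values())
    return ACCESS_TECH_PER_YEAR + staff_cost


def operations_affordable(budget: int) -> int:
    """How many such operations a given yearly budget can sustain."""
    return budget // annual_operation_cost()
```

With these inputs one operation comes to about $4.1M a year, so the $300M low end funds roughly 72 of them and the $500M high end roughly 121.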
When I said that nation state adversaries have effectively unlimited budgets, I meant that they can keep paying those sums just to maintain their access. For them, the ongoing access itself is an essential means to an end. They don't attack systems because they need to get through a system. They do that to achieve their objectives. Those are not "breach systems X and Y" - they are more along the lines of "collect and exfiltrate information of type M". How they do that is irrelevant. And as long as they can maintain their access to relevant source(s) of valuable enough intelligence data, they can keep paying for it, year after year.
Not just in money. They can keep the operation staffed. And they can afford to wait.
As a defender against such adversaries, you will ALWAYS be at a disadvantage. You need to keep a complex, ever evolving system secure and shut attackers out, plus you can not afford to make mistakes. Your adversaries only need you to make one. If not this year, then maybe one after the next. They can sustain their ongoing operations, while you, as a defender in a corporation subject to quarter-to-quarter thinking, have to keep justifying your work (and the ongoing expense) on systems that only show measurable results when they fail in their purpose.
So while my original statement may not be technically true, for all intents and purposes nation-state adversaries do have unlimited budgets. And no amount of expenditure will make you invulnerable.
There was a security personality who said (roughly, paraphrased), the following:
> The biggest weakness and strength in security is loyalty and ego.
Nation states can do things that private corporations can't. Appealing to ego is a big one. National loyalty is another. That, in addition to blackmail, bribes, offering a safe haven, etc... are hard to compete with.
That being said, defense mechanisms can be built that make insider compromise difficult to pull off. For example, HSMs are one key tool that makes insider compromise much more difficult. I've worked at a non-zero number of companies that hired external firms (think the Xerox and ATTs of the world) to perform security critical activities, or audit employee requests -- that didn't work so well, but there are tools here. Most of the time they aren't just "Limit access to the people that need it" -- in fact, if you rely really heavily on ACLs, and not an inherent built-in security model for your system, I'd say the risk of something going wrong is much higher.
So if we remove the careful wording, someone downloaded a minidump onto a dev workstation from production and then it was probably left rotting in corporate OneDrive until that developer's account was compromised. Someone took the dump, found a key in it and hit the jackpot.
And, crucial to this exploit actually working to the extent it did, Microsoft's own developers failed to implement a secure authentication check on top of their own libraries and infrastructure.
Also completely failing to check the scope of the request before validating it!
> Microsoft provided an API to help validate the signatures cryptographically but did not update these libraries to perform this scope validation automatically
And there was a redaction system, which did not redact the key ("race condition"). Then a detection system, which didn't detect the key. And then the key was used to access an entirely different system with an entirely different access level and it just worked anyway.
The phrasing as "some obscure bugs were carefully exploited" seems a bit off, it looks more like a comedy of errors where none of the security systems served its purpose at all.
That's because the whole idea of a redaction system is stupid.
You can't start out with something unconstrained and expect to patch all the holes in it. Mathematically speaking, you start with set A, which is all possibilities, and set B, which is all the things you know to remove, and you end up with A \ B, which is not {}.
You have to start out with something constrained and allow only the good bits through the holes.
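A toy sketch of the difference, assuming a crash record is just a dict of named fields. The field names and both functions are invented for illustration (stack trace and crash address being the kind of thing mentioned upthread as non-sensitive):

```python
import re

# Denylist: patterns for secrets we already know how to recognise.
KNOWN_SECRET_PATTERNS = [re.compile(rb"-----BEGIN [A-Z ]*PRIVATE KEY-----")]

# Allowlist: the only fields we have decided are safe to publish.
SAFE_FIELDS = ("stack_trace", "crash_address", "registers")


def redact_denylist(record: dict) -> dict:
    """Start from everything and drop fields matching a known-bad pattern."""
    kept = {}
    for name, value in record.items():
        raw = value if isinstance(value, bytes) else str(value).encode()
        if any(p.search(raw) for p in KNOWN_SECRET_PATTERNS):
            continue
        kept[name] = value
    return kept


def extract_allowlist(record: dict) -> dict:
    """Start from nothing and copy only the fields known to be safe."""
    return {name: record[name] for name in SAFE_FIELDS if name in record}


record = {
    "stack_trace": "main -> sign_token -> abort",
    "crash_address": "0x7ffd1234",
    "pem_blob": b"-----BEGIN PRIVATE KEY-----\nMIIB...",
    "heap_page": b"\x13\x37" * 16,  # raw key bytes with no recognisable header
}
```

Here `redact_denylist(record)` drops `pem_blob` but lets `heap_page` through, because nothing in set B describes it, while `extract_allowlist(record)` never copies it in the first place.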
Having a similar discussion at the moment. Which is the correct solution:
I agree in general, but for "signing key material", there should be enough entropy and they have enough control over the format to make it very easy to detect.
Plus, they are the ones who put the security of their system behind this detection, letting developers access the dump they believed to be redacted. Whether they made a massive mistake at the design or implementation phase doesn't really absolve them.
These guys [1] claim to have "the fastest payment HSM in the world, capable of processing over 20,000 transactions per second." I imagine the peak load for signing authtokens for Microsoft accounts is way higher than that.
So once an hour, each auth server requests a certificate (for a new private key) from the HSM. It caches that for the hour, and issues certificates for the clients signed by its private key - and puts them in a token including the chain with the cert from the HSM and the cert from the auth server. Clients validate no cert in the chain is expired.
That way, the HSM only needs to do one transaction per hour per auth server. If auth tokens need to be valid for 24 hours, then the certificates from the HSM need to be valid for about 25 hours (plus some leeway for refresh delays maybe).
If someone compromises the auth server and gets the private key (or gets in a position to request a cert from the HSM), then it is still quite bad in the sense that they have up to 25 hours to exploit it. But if this is only one of many controls, it still provides significant defence in depth, and cuts off certain types of attacks, especially for APTs who might not have any available TTPs to gain persistence in a highly secure auth server environment and who only briefly manage to gain access or get access to stale information as in this case.
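A sketch of just the expiry side of that scheme, assuming 24-hour tokens and an hourly refresh as described. `Cert` and `chain_is_current` are hypothetical names, and signature verification is deliberately left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=24)
REFRESH_INTERVAL = timedelta(hours=1)
# A token issued just before the hourly refresh must still validate
# 24 hours later, so the HSM-issued cert has to outlive it.
HSM_CERT_LIFETIME = TOKEN_LIFETIME + REFRESH_INTERVAL


@dataclass(frozen=True)
class Cert:
    subject: str
    not_after: datetime


def chain_is_current(chain, now: datetime) -> bool:
    """True only if no certificate in the chain has expired.

    Expiry check only; a real validator would also verify each signature.
    """
    return all(now < cert.not_after for cert in chain)


issued = datetime(2021, 4, 1, tzinfo=timezone.utc)  # arbitrary example date
server_cert = Cert("auth-server", issued + HSM_CERT_LIFETIME)
token_cert = Cert("client-token", issued + REFRESH_INTERVAL + TOKEN_LIFETIME)
```

A key lifted from this server stops being useful once `server_cert` expires, 25 hours after issue plus whatever leeway is configured.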
Is there a reason why they couldn't split the load across multiple HSMs? For something so sensitive I would've expected a design where one or more root/master keys (held in HSM) are periodically used to sign certificates for temporary keys (which are also held in HSM). The HSMs with the temporary keys would handle the production traffic. As long as the verification process can validate a certificate chain, then this design should allow them to scale to as many HSMs as are needed to handle the load...
HSMs are expensive, the performance is bad, and administration is a pain. They're almost certainly running many clusters of their auth servers around the world, and would need significant capacity at all the locations, in case traffic shifts.
It's probably a better idea to pursue short-lived private keys, rather than HSMs. If the timeline is accurate, the key was saved in a crash dump in 2021 and used for evil in 2023. Monthly or quarterly rotation would have made the key useless well within that two-year period.
A certificate chain is a little too long to include in access tokens, IMHO, but I don't know how Microsoft's auth systems work.
As a sibling commenter mentioned - if an HSM dumps its memory where it contains private key material, that’s a spectacularly bad HSM, and MS wouldn’t have been able to fix its race condition themselves.
Reading that MS were able to fix the crashing system’s race condition that included the key, it’s likely to have been a long-lived intermediate key for which the private key was held in memory (with a HSM backed root key for chain of trust validation, assuming MS aren’t completely stupid).
The challenge is the sheer scale these servers operate in terms of crypto-OPS… it would melt most dedicated HSMs.
What this means is that the keys are not stored in non-recoverable hardware, they are available to a regular server process, just some compiled code, running in an elevated-priv environment.
There is no mention that the systems that had access to this key were in anything other than the normal production environment, so we may extrapolate that any production machine could get access to it and therefore anyone with access to that environment could potentially exfil the key material.
Looking at the validation section of https://learn.microsoft.com/en-us/azure/active-directory/dev... - did I miss something or does that still lack any mention of the importance of checking dates or revocation for the issuer? Since the pseudo code doesn’t, I’d bet there are more implementations which trust any key Microsoft has ever published (modulo some kind of cache purge).
This is the core problem. Everybody is discussing the crash dump and the exfil, but the core problem is that Microsoft checked neither the validity of keys (the leaked key was already invalid) nor the context of the key usage (the key wasn't allowed to generate admin tokens). They just checked if the key was signed by the Microsoft CA.
This is something that's incredibly obvious in a code review.
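For what it's worth, the missing checks are small. A minimal sketch, assuming the verifier keeps some metadata per published key; `KeyRecord`, its fields, the key id and the dates are all made up for illustration and are not Microsoft's actual data model. Signature verification is assumed to have already passed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class KeyRecord:
    """Hypothetical metadata a verifier keeps for each published signing key."""
    key_id: str
    not_before: datetime
    not_after: datetime
    revoked: bool
    audiences: frozenset  # token audiences this key is allowed to sign for


def rejection_reasons(key: KeyRecord, audience: str, now: datetime) -> list:
    """Checks to run *after* the signature itself has verified.

    Returns an empty list only if the key is currently valid and is
    allowed to sign for the requested audience.
    """
    reasons = []
    if key.revoked:
        reasons.append("key revoked")
    if not (key.not_before <= now <= key.not_after):
        reasons.append("key outside validity window")
    if audience not in key.audiences:
        reasons.append(f"key not allowed to sign for {audience!r}")
    return reasons


consumer_key = KeyRecord(
    key_id="consumer-example",  # made-up identifier
    not_before=datetime(2016, 1, 1, tzinfo=timezone.utc),  # made-up dates
    not_after=datetime(2021, 1, 1, tzinfo=timezone.utc),
    revoked=False,
    audiences=frozenset({"consumer"}),
)
print(rejection_reasons(consumer_key, "enterprise", datetime(2023, 6, 1, tzinfo=timezone.utc)))
```

Returning the list of reasons rather than a bare boolean makes it easy to log why a token was refused, which is also what you would want to assert on in a test.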
I feel like an issue that really got them was that the keys weren’t rotated. It sounds like quite some time passed between when the key was moved where it didn’t belong and when it got snatched. If keys were rotated frequently, it would not have been possible to use it to forge a token.
I feel this every time one of these articles comes out, but it seems totally bizarre to me that we rely on private enterprises to deal with state-level attacks simply because they are digital and not physical.
If a Chinese fighter jet shot down a FedEx plane flying over the Pacific, that would be considered an attack on US sovereignty and the government would respond appropriately. Certainly we wouldn't expect FedEx to have to own their own private fleet of fighter jets to protect their transport planes. No one would be like, "Well it's FedEx's fault for not having the right anti-aircraft defenses."
But somehow, once it hits the digital domain we're just supposed to accept that Microsoft is required to defend themselves against China and Russia.
> If a Chinese fighter jet shot down a FedEx plane flying over the Pacific, that would be considered an attack on US sovereignty and the government would respond appropriately
But if a bunch of Chinese people robbed a US bank, let's say the federal reserve, causing enormous financial damage but not loss of life, the response would be similar. Especially so if their link to the actual Chinese government was suspected but couldn't reliably be proven.
Governments catch foreign agents somewhat regularly, and those captures don't lead to an all-out war.
Perhaps - but, whether or not people from $ForeignNation are involved, U.S. banks (or other corporations, or ordinary citizens) generally do not need to have their own armed police/security forces to deal with armed robberies. Nor their own DA's, courts, etc.
Vs. any "cyber" crime? All that nice stuff about "...establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare..." falls on the floor, and...YOYO.
It's absolutely the responsibility of financial institutions to secure their premises and systems. Banks have massive security departments, guards, access restrictions, systems to detect fraud, vaults etc. The government only gets involved once a crime is reported, not in securing facilities.
In fact, inadequately protecting their assets from mistakes or attacks can lead to SEC fines on top of losses.
If there is an attack in progress, the police will intervene, of course. But if it leads to a financial institution collapsing because all their money was in the one place and they weren't insured, then that's the fault of the bank.
Are you ready to leave your authentication services, acl, patching procedures, tech stack choice, and network monitoring and management in the hands of the government? Because if you are not, you are asking the government to perform duties without the necessary means.
Value was destroyed in both cases. Users having their private data stolen have been harmed, the company's brand value is harmed, and they may lose users over this.
> or lives lost
Lives can be lost and real people can be harmed if their private information is stolen and used against them. There are dissidents and journalists in repressive countries whose safety depends on information security.
The digital domain is fundamentally lower stakes and harder to protect than the physical one. It is good that we do not respond to cyber attacks like we do physical ones because we would have escalated to nuclear war over a decade ago. The scope and volume of cyberattacks is very high but my understanding is that the US has a correspondingly high volume of outbound attacks as well.
Fundamentally? A power plant exploding or dam collapsing would kill way more people and cost far more in property damage than a single FedEx airplane with two crew being shot down.
Those all (currently) require a lot more than stealing a key from M$. Maybe stuxnet would be a better example for your point? Those uranium centrifuges Iran had were very expensive.
I have no idea how you would get a dam to collapse with only a laptop and a network connection. As for the power plant, the operators would have to be blind and deaf to let a plant get destroyed.
The real threat is a cascading power grid failure due to undersupply, e.g. coordinated forced plant shutdowns. A few days without electricity at a large scale means reduced availability of medical and emergency services, no running water, failing refrigeration, no stoves/ovens for cooking for most of the population, no working gas pumps, no electronic payment, no banking (no way to get cash) etc.
>I have no idea how you would get a dam to collapse with only a laptop and a network connection.
In a world where Stuxnet took out uranium centrifuges, and we've had actual PoC's of exploits that resulted in generators fragging themselves, I find your statement to be of the most shocking form of naivete I've heard in a while.
And in point of fact, the network connection would probably be for disabling alarms and control systems in order to mask work done to weaken the integrity of the structure itself. Physical and digital are inextricably linked.
A decently powerful generator is a massive machine. There is simply no way that it can destroy itself without causing abnormal behavior that will be noticed by on site personnel - noise, vibrations etc.
Key: "It seems they were used to the high levels of vibration" - Diane Vaughan wrote an important book that introduced the term "normalisation of deviance" as a factor in the Challenger Launch Decision, a more famous complex accident.
… and Russian, Chinese, French, whoever private entities have to defend themselves against the NSA, CIA, GCHQ…
Espionage is a dirty game.
What I always find interesting is how the US has taken on a strategy of indicting individual Chinese/Russian hackers for acting in the interests of their countries, whenever they can be identified by DoJ.
This policy is interesting, because, as we all know, turnabout is fair play.
How long before retired NSA operators are advised to never travel outside the US lest they be at risk of being picked up on international arrest warrants from China?
I'd say the NSA has probably had such a policy for decades. And any intelligence agency worth its salt should have done more or less the same.
Very much no. CISA defends the federal executive branch and advises critical infrastructure. They don’t and shouldn’t have a proactive role in defending private companies.
They’re explicitly forbidden from doing things like that. As they should be; do you really want the government to have access to the kind of private corporation data they would need in order to defend them?
None of the reports mention if two stage authentication or any other extra factor authentication that enterprise accounts would be secured with were bypassed too. Am I right to assume that because the attacker had the signing key all of the extra authentication mechanisms that would have been enabled on accounts were bypassed by the attacker (because the attacker could create a token that bypassed all the extra authentication methods)?
And I presume there has been no known dump of e-mails exfiltrated during this attack?
Because it was a signing key that was stolen, the attackers could move straight to the post-authentication phase and forge authorization tokens.
Those email accounts could have had multiple authentication factors enabled, other conditional access policies applied (geo-location, device trust, time of day etc)… all of which were skipped over.
> Am I right to assume that because the attacker had the signing key all of the extra authentication mechanisms that would have been enabled on accounts were bypassed by the attacker...?
Doesn’t seem like an operational issue, seems more like 99.9% design coverage of preventing these issues from arising.
I will say, I’m not very satisfied with just “improved security tooling.” My gut is telling me that there is a better solution out there to guarding against credential leakage, but I feel wrangling memory dumps to have “expected” data is a fool’s errand.
This is not confidence inspiring. These fixes are only surface-level, rather than looking at the underlying systemic failures:
- Not storing key data in an HSM.
- Exporting crashdumps outside of the production account.
Redacting private keys and other sensitive data from these will always be failure-prone. Keys are also only just one problem, there will be personal data, passwords etc. in those dumps depending on what process it was.
- Corp environment infiltration not detected at the time (presumably, since this part is pure guesswork)
- Not enough log retention in the corp environment to track a 2 year old infiltration.
- Not assuring that key validation correctly denied requests with the wrong scope.
Fails-with-a-valid-key-with-wrong-(scope/date/subject etc.) are the kind of cases that always deserve test coverage, especially for a dedicated key-validation endpoint. Also concerning that this wasn't found by manual-testing/red-teaming/pen-testing in the time since it shipped.
- Slow detection, slow response, poor communication
The lack of scope checking seems especially egregious. It sounds like any number of keys would have been incorrectly trusted. If it was an RSA key that signed JWTs or similar, which is what it sounds like, then it's critical to check the issuer/scope: Microsoft has an issuer endpoint for all customers, so any number of things can create a token with a valid signature.
> - Not enough log retention in the corp environment to track a 2 year old infiltration.
It didn't say that Microsoft couldn't identify that infiltration had occurred, just that they didn't retain the logs to prove the exfiltration. That makes a lot of sense: maintaining access logs is one thing, but retaining detailed logging of every file action by every user on a 100k+ user corporate network long-term would take a massive amount of storage for fairly limited value.
Even in this case, it might be nice to have but it wouldn't change any of the major findings you care about if you are Microsoft: that a bug allowed a key to be written to a dump file, that the scanning tools didn't detect the key in the dump file, and that the authentication process didn't properly check the keys.
I was more talking about the problems of moving data across trust zones rather than retaining it at all. Retaining logs and crashdumps for several years is good, but moving them from a locked-down production environment to a less secure corp account (where they were presumably easier to work with because of the lower security requirements) is why this leak happened.
Moving sensitive data to a less secure environment is of course a mistake, but leaks happen even with locked-down production environments.
It's very unlikely anyway a software company will fix an old crashdump - the software probably moved on a lot. So if there's no specific report/problem it's attached to and it's not a sensitive area, we're better off having it deleted. Same thing for old logs.
The list of failures is so long: how much security engineering took place at Microsoft?
It is possible the crash that contained the race condition was planned. But how would anyone know the race condition existed? We know now so it is likely, based upon the successful attack, it was known prior to the key being exfiltrated. How many of our organizations have the same race condition and our keys are/have been vulnerable? No one else has keys that are valid for more than a year, right?
Let's say the race condition was unknown to the attacker. How can the attacker find the key in the dump? Did they have a scanning tool looking for key data? If so, how long was that running on MS network before it found the keys in the dump file? How many of our organizations' keys are in dump files? Are they all expired?
It works this way by design. Most companies will retain logs for exactly as much time as legally required (and/or operationally necessary), then purge them so they don't show up in discovery for some lawsuit years down the line.
It has nothing to do with discovery or legal liability and everything to do with COGS. Log size at cloud provider scale is genuinely something you have to see to believe; recall that these are logs for a company with multiple services that see 9-figure daily active users.
This is the real answer. The amount of logs generated at cloud provider scale is now massive compared to what it was just a few years ago. The last time I was involved in these sorts of systems, circa 2014, logging was one of the core functions at a cloud provider that was /most/ demanding of physical hardware, everything from compute, memory, and storage, all the way to networking. A typical server in the environment in that provider in 2014 would have 2x10GigE connections set up for redundancy; log servers needed a minimum of 2x40GigE connections /for throughput/.
These days I wouldn't be surprised if they are running 100GigE or 400GigE networks just for managing logs throughput at aggregation points.
we’re talking an intrusion to the corp network not to the prod one (getting the keys from the crash dump)
I assume that’s a way smaller scale. However the document doesn’t go into detail which kind of logs exactly they were missing, so maybe these were network logs
For each piece of PI/PII data, generate a mapping in a table of that piece to a secure random number, and store the generated random number in place of the personal data, and use that in the log.
Then, if deletion is required, simply erase the row that holds the mapping.
And finally, be sure to not store that mapping table in the same place as your backups or your logs.
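A minimal sketch of that scheme, assuming an in-memory dict stands in for the mapping table (a real one would live in its own datastore, away from logs and backups). `Pseudonymizer` and its method names are hypothetical.

```python
import secrets


class Pseudonymizer:
    """Maps personal values to random tokens so logs never hold the raw value."""

    def __init__(self) -> None:
        self._forward = {}  # personal value -> token
        self._reverse = {}  # token -> personal value

    def token_for(self, value: str) -> str:
        """Return the stable random token for `value`, creating it if needed."""
        token = self._forward.get(value)
        if token is None:
            token = secrets.token_hex(16)  # 128 random bits, not derived from value
            self._forward[value] = token
            self._reverse[token] = value
        return token

    def resolve(self, token: str):
        """Look a token back up; returns None once the mapping has been erased."""
        return self._reverse.get(token)

    def forget(self, value: str) -> None:
        """Deletion request: erase the mapping row, leave the logs untouched."""
        token = self._forward.pop(value, None)
        if token is not None:
            del self._reverse[token]


p = Pseudonymizer()
token = p.token_for("alice@example.com")
log_line = f"login ok user={token}"
p.forget("alice@example.com")
print(log_line, "->", p.resolve(token))
```

Because the token is random rather than a hash of the value, nobody can recompute it from a guessed email address once the row is gone, which is the point of keeping the table separate.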
There are multiple regulatory reasons why logs in general (outside of specific use cases) are hard to retain indefinitely. You can document a security use case that triggers indefinite retention for logs based on some selector, but then you run into the problem that they say happened here: your selector is inexact and misses stuff.
> The key material’s presence in the crash dump was not detected by our systems (this issue has been corrected).
Now hackers have it even easier to find valuable keys from otherwise opaque core dumps: Microsoft's corrected detection software will tell them as soon as it finds one.
While true that it is easier for malicious actors to find this kind of thing with a tool that goes DONG! after a quick scan, it's not as if the previous security through obscurity of "key hidden in megs or gigs of crashdump" was much of a blocker for a suitably motivated adversary.
Wonder if the actor caused the crash of the system in the first place?
Or it was crashing so often they didn’t have to.
Race condition to scrub the crashdump sounds fishy. When the system is crashing it’s hard to make assumptions or have any guarantees any cleanup and scrubbing is going to happen.
Regarding: “requires detailed knowledge of internal infrastructure”
MSFT decided to move their APAC support center for Office365 to China. If there is an issue in APAC engineers usually respond from there. It makes sense regarding time zone and cost.
However, people change jobs etc. Hence this setup will result in a number of good engineers with very detailed knowledge of those systems who live in China.
On the more tinfoil end of ideas: if the office for this work is in a particular country it brings corresponding risks for physical opsec of the infra.
I like how the fact that Microsoft's normal corporate environment (i.e. the one with an Internet connection) was compromised is casually treated as a secondary issue here.
>Our investigation found that a consumer signing system crash in April of 2021 resulted in a snapshot of the crashed process (“crash dump”). The crash dumps, which redact sensitive information, should not include the signing key. In this case, a race condition allowed the key to be present in the crash dump (this issue has been corrected).
Correction is good, but why can't they go one more step and allow everyone to scan their server minidumps for crash-landed keys?
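The easiest version of such a scan is not hard to write yourself. A rough sketch that only looks for PEM-armored private keys in a blob of bytes; this is my assumption of a bare-minimum check, not Microsoft's tooling, and a key sitting in process memory is more likely to be raw binary, which this would miss entirely.

```python
import re

# Matches PEM blocks such as "-----BEGIN PRIVATE KEY-----" or
# "-----BEGIN RSA PRIVATE KEY-----" through to the matching END line.
PEM_PRIVATE_KEY = re.compile(
    rb"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----"
    rb".*?"
    rb"-----END (?:[A-Z0-9]+ )*PRIVATE KEY-----",
    re.DOTALL,
)


def find_pem_private_keys(blob: bytes) -> list:
    """Return (offset, length) for every PEM private key block found in blob."""
    return [(m.start(), m.end() - m.start()) for m in PEM_PRIVATE_KEY.finditer(blob)]


# Fake dump: padding bytes around a dummy (non-functional) PEM block.
dummy = b"-----BEGIN RSA PRIVATE KEY-----\nAAAA\n-----END RSA PRIVATE KEY-----"
dump = b"\x00" * 64 + dummy + b"\xff" * 64

for offset, length in find_pem_private_keys(dump):
    print(f"possible private key at offset {offset}, {length} bytes")
```

To scan a real file you would read it with `open(path, "rb")` and pass the bytes in; for multi-gigabyte dumps you would want to read in overlapping chunks rather than all at once.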
> Are we still budgeting storage like it’s the 1990s for logs?
Retention policies are not necessarily about storage space; sometimes, they are there to avoid being required to provide that old data during lawsuits.
Retention policies at cloud providers are 100% about storage space (and accompanying cost). At companies like Microsoft saying “reduced COGS” is a very reliable way to get bonuses.
Huh? Logs are how you run the service. Cost is what keeps you from retaining more of them. Since you seem not to be familiar with the subject, retention policy is something of a misnomer; it would be more accurate to call it a deletion/destruction policy. The default is retain everything forever.
There are no legal considerations here. That’s what lawyers are for, and big tech has a ton of lawyers.
If you work somewhere that people are deleting things to keep them from being discoverable, run away as fast as you can.
Yes, logs are important for running the service. You were supposed to realize when you said that, that “is this useful” is a factor in how long you keep logs, alongside “how much does it cost”.
I do work at big tech and see a lot of logging policies, in fact. Having lawyers involved doesn’t make it “no legal considerations”, it makes it properly assessed legal considerations. It is default practice across big tech (and other large professional industries) to delete e.g. email as soon as reasonable in order to reduce discoverable records. It is also required for most services that they be able to delete logs and archived material to comply with the laws about keeping personal data around for no purpose - most places will never see this brought up but if they do, it can be a big deal.
> The crash dumps, which redact sensitive information, should not include the signing key. In this case, a race condition allowed the key to be present in the crash dump (this issue has been corrected).
Can someone explain a bit about what kind of race condition this could be and how it exposes the signing key (in this case seems to be exposed without encryption)?
I feel like there is a lot missing from this writeup, but I can't put my finger on exactly what.
Also it feels strange that Government doesn't have its own signing key and they just use the same as everyone else. Which they didn't address and apparently do not intend to change.
if the government had its own key, you could trace anything they signed. Governments likely want code and other stuff they sign to appear as if another actor signed it
IIUC in general they do. One of the steps of this failure is that a key that had no business signing off on accessing government data was granted that scope by MS's cloud software because they changed the scope-checking API in such a way that their own developers didn't catch the change ("Developers in the mail system incorrectly assumed libraries performed complete validation and did not add the required issuer/scope validation").
So instead of failing safe, lack of new code to address additional scope features "failed open" and granted access to keys that didn't actually have the right scope.
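To make the "failed open" vs "failed safe" difference concrete, a small sketch. The claim names `iss` and `scp` and the `checks` registry are illustrative assumptions, not the actual library API involved.

```python
from typing import Callable, Mapping

# Claims that must be validated before a token is honoured (illustrative names).
REQUIRED_CLAIMS = ("iss", "scp")

Check = Callable[[str], bool]


def authorize_fail_open(claims: Mapping[str, str], checks: Mapping[str, Check]) -> bool:
    """Runs whatever checks exist; a claim nobody wrote a check for is skipped."""
    return all(
        checks[name](claims.get(name, ""))
        for name in REQUIRED_CLAIMS
        if name in checks
    )


def authorize_fail_closed(claims: Mapping[str, str], checks: Mapping[str, Check]) -> bool:
    """Denies unless every required claim is present, has a check, and passes it."""
    for name in REQUIRED_CLAIMS:
        check = checks.get(name)
        if check is None or name not in claims:
            return False  # missing validation is treated as a failure
        if not check(claims[name]):
            return False
    return True


# A consumer-issued token asking for enterprise mail, and a service whose
# developers assumed the library did the validation (so registered no checks).
claims = {"iss": "consumer", "scp": "enterprise.mail"}
no_checks = {}

print("fail open:  ", authorize_fail_open(claims, no_checks))
print("fail closed:", authorize_fail_closed(claims, no_checks))
```

With no checks registered, the fail-open variant lets the token through because `all()` of nothing is true, while the fail-closed variant refuses it until someone actually writes the issuer and scope checks.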
Occam's razor tells me that this exploit route as described in the press release sounds way too complicated and that too much luck must have been involved. Therefore I am sceptical if we will ever hear the real story.
I am very curious why Microsoft is insisting that the key itself was „acquired“ without having anything to show for it. The wording seems a little odd to me, the constant repetition even more so.
as far as I can tell, the only non-bug mistake here was allowing coredumps to leave production ever. if this is your attacker, you are pretty fucked no matter how good you are.
How banal can a software mistake be before we aren't allowed to besmirch the name of the devs involved? Is forgetting a test case a shameable offense? What about ignoring authentication? Rolling your own?
Turns out when you write APIs that access security related things, you have to treat everything coming in as a threat, right? Shouldn't that be table stakes by now?
We need a professional gatekeeping organization because the vast majority of us suck at our jobs and refuse to do anything about it.
I don’t understand the reflex for shaming. Everyone makes mistakes, and we are usually better off to understand & learn from it.
If the first instinct is to punish, people will not be helpful in identifying their own mistake.
Also, this is why companies like Microsoft have processes and systems to avoid such mistakes. They obviously failed here, but can be improved independently of the people involved.
IIRC, airline safety investigations run in that way quite successfully.
Airlines are actually a great point. Pilots get better and learn from past mistakes because they are REQUIRED by licensing agencies to LEARN about these past failures.
Our industry can't even manage the OWASP top ten with new grads. Surely we can try something slightly different?
The issue was caused by a race condition in extremely complicated software. Good luck setting up a gatekeeping organization that can track that level of detail (and understand every dimension of a possible fault like this).
I wouldn't expect the gatekeeper to track these issues but rather to sign the credentials that developers have. Then the individual developers (ostensibly) have a base level of training that set them up to more likely avoid these issues.
I don't know if you appreciate the level of complexity here. We're talking about a core diagnostics system (extremely complex software), that already has guards in place to protect against this stuff (complex again), but there was a race condition (complex again), and this one instance in likely billions and billions of transactions is what led to an issue.
What training do you think could have prevented this? Microsoft deals with complex software at enormous scale. Bugs happen. And in this case, it was a severe one that's already been dealt with.
People will always make mistakes. That's why it's better to focus on processes that should've been built to catch or stop mistakes, especially mistakes by a single person.