A call to reconsider memory address-space isolation in Linux (lwn.net)
218 points by chmaynard on Sept 30, 2022 | 105 comments



Windows has already done this for a number of years.

Drivers that run in kernel space are no longer allowed to access whatever they want, and Windows' own kernel-space data structures are protected against modification by other kernel-mode code (kernel patch protection).


And Windows needed that feature much, much earlier.

One of the open secrets about Windows is that in the 3.1 to 98 era, quite a large percentage of system crashes were actually caused by Creative Labs' audio drivers. Those guys could not produce stable software if their lives depended on it.

But I don’t blame CL for Windows being crash prone. Microsoft made a choice and a compromise to get popular. A permissive driver model made them more popular with customers. If you pander, then the rewards and the consequences are both yours to enjoy.

MS tried to have their cake and eat it too back then. They wanted everyone to think they were the most sophisticated and powerful company because they were the smartest company in the world (which was the internal dialog at the time, according to insiders I interviewed), but at the same time to claim that it was all dumb luck and they couldn't control anything.


Many successful companies and individuals fall into that trap of being unable to differentiate between talent and luck. I'm certainly not saying Microsoft doesn't employ plenty of very talented individuals. But it also takes some fortunate twists of fate to succeed, which were completely out of Microsoft's control. Success is never guaranteed.


Probably didn't help that Microsoft was founded at the end of the Corporate Raider era either. If you ran a company and decided that one of your big successes was being in the right place at the right time, you might keep a bigger war chest to get you through your next bad luck window. But that pile of liquid assets paints a big bullseye on your forehead.

It was 'better' to just assume you were awesome and hope your luck held for years on end. And if it didn't, then you could tell a story about how talent got you big and bad luck took you out. No, it was luck both ways or skill both ways.


The same was true of Windows Vista and Nvidia graphics drivers:

https://www.engadget.com/2008-03-27-nvidia-drivers-responsib...


Audio drivers being the bane of Windows stability didn't truly stop until Microsoft took Creative's toys away by force with Vista's new audio stack. The new GPU drivers then took over that role instead, though at least only temporarily.


Intel Rapid Storage technology is another absolute trash piece of software.


Fortunately, Microsoft is effectively killing off those storage-driver messes by having DirectStorage only support the standard Microsoft NVMe driver.


That's super interesting. Does it use some special CPU feature? The CPU usually lets code running in kernel context do whatever it wants.


In Windows 10/11 the core of the Windows kernel can run in a virtual machine totally separated from the rest of the kernel.

> HyperGuard takes advantage of VBS – Virtualization Based Security

> Having memory that cannot be tampered with even from normal kernel code allows for many new security features

> This is also what allows Microsoft to implement HyperGuard – a feature similar to PatchGuard that can’t be tampered with even by malicious code that managed to elevate itself to run in the kernel.

https://windows-internals.com/hyperguard-secure-kernel-patch...


Very nice. The Windows kernel devs are one of the few good things Microsoft retains.


Not a Windows kernel dev, but my understanding is it's more a tripwire than anything else unless virtualization-based security is turned on. If that is activated, then the kernel has complete isolation from non-MS drivers and can prevent them from accessing critical data structures. MS has a list of known drivers that don't work with this and prevents users from activating it if it would break things.


Ya, you can look at the bugcheck codes and see the mechanism that does this, since PatchGuard will always throw that bugcheck code. I think it's 0x109? It just does random scans and sees if anything matches; it's nothing fancy. Even with VBS (virtualization-based security) it functions the same and will still allow a driver to modify it, then crash. In windbg you can see this with "!analyze -show 0x109", assuming that it's 0x109.


I think VBS's role is ensuring you can no longer patch PatchGuard itself? Because the guard itself is no longer in the kernel, and you can do nothing with it.

But I heard VBS has a ~10% overhead compared to not enabling it. I wonder what accounts for that cost. Enabling Hyper-V by itself didn't really cause an observable difference, though.


VBS's role is to mirror the kernel and wall it off through a hypervisor. So your kernel/usermode can't access the secure version. This basically lets it compare the "secure" kernel and the regular kernel structures. Things like the process list, Driver executable regions, signatures, and such are mirrored. So when a process spawns and it's added to the process/threadlist. Those operations are mirrored in the secure kernel then randomly checked for security.

VBS also secures things like the scan timer/event and some other methods people used to use to disable it. http://uninformed.org/index.cgi?v=8&a=5&p=18 .

The performance impact shouldn't really be noticeable at all. All you have is some memory operations which are "duplicated", but not really, since COW. But I'm not that much of an expert on PatchGuard beyond the really basic functions.
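
Conceptually it's something like this; a made-up sketch, not Microsoft's code, and every name in it is invented:

    /* Made-up sketch of the VBS/PatchGuard concept, not Microsoft's
     * code. The secure kernel (VTL1) holds shadow copies that normal
     * kernel code (VTL0) cannot write, and periodically compares them
     * to the live structures. */
    #include <stddef.h>
    #include <string.h>

    void bugcheck(unsigned code); /* stand-in for the real crash path */

    struct mirror_entry {
        const void *live;   /* structure in the normal kernel (VTL0)  */
        const void *shadow; /* protected copy in the secure kernel    */
        size_t      len;
    };

    void secure_kernel_scan(const struct mirror_entry *tbl, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (memcmp(tbl[i].live, tbl[i].shadow, tbl[i].len) != 0)
                bugcheck(0x109); /* CRITICAL_STRUCTURE_CORRUPTION */
        }
    }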


I'm curious what software is telling the kernel no. What enforces this?


Why the kernel of course (joke attempt, I am wondering too)


Probably firmware/hardware.


It's the type 1 hypervisor it wants to run on top of.


Then would not the type 1 hypervisor become the "kernel", seeing as we've defined kernels as "that chunk of code capable of unrestricted access to machine state"?


The line blurs for sure.

I would say it's 'a' kernel. The idea of there only being one kernel is probably a concept that makes for nice layered diagrams, but doesn't come close to describing reality because of the combinatorial complexity of options for different morphs of layering. Sort of like the OSI network layers model in that way.


The memory mapper. One of the side benefits of relocatable code is the ability to enforce policy at point of access.


Windows actually uses a cpu feature for kernel patch protection right? I remember trying to figure out why Linux doesn't.


On modern Windows, it actually always runs as a guest on Hyper-V, and thus many such protection mechanisms rest on virtualization.

The secure kernel and driver guard are other features with a similar protection level.


Are you talking about VBS/HVCI? Isn't that optional, or is it on by default for kernel stuff?


Optional on Windows 10, compulsory on Windows 11.

One of the reasons for the hardware requirements.


Wow, didn't know. Thanks.


How can code be running in kernel mode if it doesn’t have unrestricted memory access?


The page tables can be set up so kernel-mode code doesn't have access to all of memory. You could get around this by modifying the cr3 register to point to a different page directory, but that could cause problems whenever a context switch happens and cr3 is reverted. Microsoft also has PatchGuard, which could probably detect changes like that.

In theory you could work around all these protections, but it would be difficult and fragile.
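
To make that concrete, here's a hypothetical x86-64 sketch (GCC inline asm, not code from any real driver) of the cr3 trick:

    /* Hypothetical x86-64 sketch, not real driver code. CR3 points at
     * the active page tables; swapping it would expose memory the OS
     * deliberately left unmapped. */
    #include <stdint.h>

    static inline uint64_t read_cr3(void)
    {
        uint64_t val;
        __asm__ volatile("mov %%cr3, %0" : "=r"(val));
        return val;
    }

    static inline void write_cr3(uint64_t val)
    {
        /* A rogue driver could point CR3 at its own page directory,
         * but the scheduler reloads CR3 on the next context switch,
         * and a PatchGuard-style scan may catch the tampering first. */
        __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
    }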


> Drivers which run in kernel space are not allowed anymore to access whatever they want

I don't like this. It's one thing to have memory the kernel doesn't usually need be unmapped by default, but it's another to prevent you from mapping it when you do need it. This reeks of DRM.


Prior to Microsoft changing things with Vista, drivers were free to rummage about and break systems. Blue screens were a common thing back then because drivers weren't careful. It's why Vista had such a bad rap for breaking systems; it exposed the driver authors who didn't care about safety.

Also, why does every kernel-safety measure have to be seen as anti-user or DRM? You're free to disable it[0] if you wish, just like Apple's SIP. It's there to keep users who don't know anything safe. Would you like it if an innocent looking piece of software was able to hide itself from the user (read: you) and the OS?[1] Preventing such attacks doesn't sound anti-user to me.

[0]: https://windowsloop.com/disable-enable-device-guard-windows-...

[1]: https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...


> Also, why does every kernel-safety measure have to be seen as anti-user or DRM? You're free to disable it[0] if you wish,

OP's point was about kernel patch protection.

How do I disable that?


> but it's another to prevent you from mapping it when you do need it

How do you know if it's a user needing to use it, or an attacker pretending to be a user needing to use it?


I'm willing to accept that if a malicious kernel driver gets loaded, I'm completely pwned.


No idea why you're being downvoted. This does sound like an anti-user feature. Stuff like this exists to protect third party software from our analysis and "tampering".


It is anti-driver-freedom.

Most users don't want driver-freedom. They want user-freedom, that is easily achieved by keeping the kernel open source and replaceable.

If you want to break some in-kernel protection, you can just patch the kernel and remove it.


> They want user-freedom, that is easily achieved by keeping the kernel open source and replaceable.

The original comment I replied to was about the Windows kernel.


Can you elaborate how this feels anti-user?


As the owner of the machine, I should have total access to everything. There should be no protected memory I can't read, no execution I can't trace. The number one user of such security features are "rights holders" that consider me hostile and want to protect their software from me, the owner of the machine it is running on. The result is a computer that is factory pwned. It's not really my computer, they're just "generously" allowing me to run software on it as long as it doesn't harm their bottom line.

Virtualization based security, the technology enabling this, also enables DRM. It is currently required by Netflix for 4k resolution streaming.


How do you propose having security in that world? Should any process be able to access the memory of any other process? Should opening a web page allow a security issue there to access your running applications?

I guess my question is, you seem to have a very hardline “everything” take that I don’t think extends to the real world


You're confusing "I" should have access to all the data on my machine, with "anything running on my machine" should have access to all the data on my machine.


Ahh right, I forgot about the "user requested" security flag. Makes sense.


> Should any process be able to access the memory of any other process?

Any process? No. My processes? Yes. That includes "sensitive" memory like cryptographic keys.


What distinguishes your processes?


Still build the security protections but allow the administrator to selectively disable/override them.


If you're the admin of the system you can just load up your own kernel, or kernel modules, etc.


> I should have total access to everything

You do have total access. You chose to install an OS that wants to limit it.


Not a kernel developer. It sounds like a useful feature in some contexts (shared hosts, multi-tenant setups, etc.), but it would be useful to run without these constraints when they don't apply (e.g. HPC doing one thing using all available resources). Couldn't this be implemented as a feature flag in the kernel, so you enable it if you need it (and disable it if you know what you're doing)?

As a side benefit, over time, if smart folks find a way to reduce the overhead from this feature to close-to-negligible levels, the feature flag would become unnecessary.
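
(Existing speculative-execution mitigations already work roughly that way: Linux accepts a mitigations=off boot parameter. Presumably ASI would get a similar knob.)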


> Sometimes, though, there will be a need to access sensitive memory. An important observation here, Weisse said, is that speculative execution will never cause a page fault. So, if the kernel faults while trying to do something with sensitive memory, the access is known to not be speculative; the kernel responds by mapping the sensitive ranges and continuing execution.

So, is the isolation patch a real solution to speculative execution attacks? Or is it just adding another hurdle for the attacker to jump?

I.e. (if I interpret this correctly) the attacker would just be forced to add a "priming" stage, where they will trick the kernel to map the sensitive ranges they need for their speculative attack. So the effectiveness of the patch boils down to (a) the feasibility of this priming stage, and (b) for how long can the attacker keep the mapped memory in place.


I think the idea is that faults on sensitive pages are known to be good, because they can only come from running kernel code. They are unmapped when exiting from kernel code, so in theory there's no way for userspace to speculatively execute on those pages and leak sensitive memory.

It does seem to be a good fix for the root cause of the problem, but I'm skeptical about the details. I haven't looked through the patch set, but narrowing down what's sensitive and what isn't is going to be a monumental task.
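
In pseudo-C, the mechanism the article describes looks roughly like this; every name below is invented, this is not the actual patch set:

    /* Rough pseudo-C of the ASI idea from the article; all names are
     * invented. Key fact: speculative execution never raises a page
     * fault, so a fault on a sensitive range proves the access is an
     * architectural (real) one. */
    int  addr_is_sensitive(unsigned long addr);     /* stub */
    void asi_map_sensitive_ranges(void);            /* stub */
    void asi_unmap_sensitive_ranges(void);          /* stub */
    void handle_normal_page_fault(unsigned long);   /* stub */

    void asi_handle_kernel_fault(unsigned long addr)
    {
        if (addr_is_sensitive(addr)) {
            asi_map_sensitive_ranges(); /* safe: this access is real */
            return;                     /* retry the faulting access */
        }
        handle_normal_page_fault(addr);
    }

    void asi_return_to_userspace(void)
    {
        /* Unmap again on kernel exit, so userspace-triggered
         * speculation finds nothing to leak. */
        asi_unmap_sensitive_ranges();
    }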


Unmapped pages cannot be speculatively executed "into"? (Due to the TLB flush that happens during unmapping?)


My understanding is that this is because the TLB update can't be speculated through. That means that once speculative execution hits that barrier it will stop, and can't then look at data that would be mapped in if speculation kept going. Basically there's a hard fence there that it won't/can't go past, so if you do hit a page fault in kernel space on one of the unmapped pages, you know it came from an active thread and not a speculative one.


"Good" speculative execution shouldn't be able to speculate isolated memory - isolation does work against Spectre.

Meltdown-type exploits (Older Intel and arm) can "see" across this isolation in some cases, which is why they were such an egregious mistake. Spectre is kind of inevitable whereas Meltdown isn't.


A 2-14% performance drop sounds high; I understand why people are sceptical. Maybe the subsystems can be structured in a different way to ease the effects of ASI?

Personally, I would rather see a kernel-side MPU in future CPUs.


It's 2-14% offset potentially by whatever the cost is of the other mitigations that it replaces, for whatever that's worth.


I’m just going to keep turning all mitigations off, thanks.


You do you but I hope you've also turned off JavaScript in your web browser.


Is there any evidence of anyone actually attempting such an attack outside a carefully controlled research scenario? The number of stars that have to align for this theoretical attack to work in practice is so high that I don't think any normal desktop end-user has reason to worry about it. (Shared servers, co-hosted VMs, etc are a different story.)


True there isn't at the moment but do you really want to be the first to find out because you left the back door open?


I'd liken it to refusing to leave your house because you're worried a meteor may strike you down. You're giving up a lot for not much reason.


Not quite. A meteor is an object with no drive or desire. Attackers are adversarial and thoughtful. You can't liken a random event to a purposeful act like this.

It's more like saying "refusing to leave your house because you're worried you might get mugged", which might actually be very reasonable if you live in a bad neighborhood, you're a common target of crime, etc. It may not be reasonable for many others.

But analogies are pretty rough in general.


No, I think the meteor is closer. Mugging is in the realm of possibility; this theoretical JavaScript drive-by attack is wildly implausible. Like, take the time to spell it out: what sequence of events has to occur for this attack to be successfully pulled off?


Mugging also isn't a good analogy because software attacks, once possible, can be automated, distributed widely, and made "wormable"; mugging is harder to do at scale.


Yes I hate analogies.


I was going to say a new virus emerging was a good example.

In both cases: We know they are theoretically possible. We cannot say for sure when they'll emerge. Once they emerge, they can spread by themselves and become common.

I guess the place it doesn't hold up is that in the exploit case they're intentionally engineered, whereas for a virus, well, there's the Wuhan bioweapon conspiracy theory, but no, it's more that random mutations cause it.


Which attack are you referring to? There have already been POCs for many speculative execution attacks.


Whatever one Veliladon was referring to when they asserted one must run either mitigations or no javascript. My point is those POCs are not sufficient evidence that mitigations with real performance impact are justified for the typical desktop end-user. Again:

> The number of stars that have to align for this theoretical attack to work in practice is so high that I don't think any normal desktop end-user has reason to worry about it.


Oh ok. So, no, the stars don't have to align at all. The attack is straightforward, the POCs show that.

The reason we don't see these attacks is because everyone patched the major issues immediately. Further, attackers don't need to really go for these sorts of attacks, there are more reliable, well-worn methods for attacking browsers.


> So, no, the stars don't have to align at all. The attack is straightforward, the POCs show that.

Please spell it out for me. Suppose I'm a typical desktop user, how is important information going to be stolen if I have mitigations turned off and JavaScript enabled? What state does my browser have to be in, and what actions do I have to take (or not take) for the attack to succeed? What likelihood is it that someone has deployed an attack that meets those requirements?

> Further, attackers don't need to really go for these sorts of attacks, there are more reliable, well-worn methods for attacking browsers.

So we agree it's OK to leave mitigations off and browse the web?


> Suppose I'm a typical desktop user, how is important information going to be stolen if I have mitigations turned off and JavaScript enabled?

https://github.com/google/security-research-pocs/tree/master...

I don't imagine I'm going to explain it better than the many others who have already done so.

> What state does my browser have to be in, and what actions do I have to take (or not take) for the attack to succeed?

Your browser would have to be pretty old/outdated, since they've been updated to mitigate these attacks. Otherwise it's just necessary that you visit the attacker-controlled website.

> What likelihood is it that someone has deployed an attack that meets those requirements?

That's not a simple question. Threat landscapes change based on a lot of factors. As I said earlier, we won't see these attacks because people have already patched and attackers have other methods.

> So we agree it's OK to leave mitigations off and browse the web?

You can do whatever you want; idk what you're trying to ask here. What is "OK"? You will be vulnerable but unlikely to be attacked, for the reasons mentioned. If you are OK with that, that's up to you.


Yeah, again, that's a carefully controlled research setup. These attacks are not going to dump your bank passwords straight to badguys.com. They're going to get some random chunks of memory that very probably don't contain anything of value. Browser renderer processes don't contain contiguous memory chunks that say something like "BANK_PASSWORD_IS:asdf1234"; it would take an incredible amount of luck, and further investigation of every single memory chunk retrieved, to possibly gain anything of value. That's not how drive-by attacks work.

It's a really impractical attack outside of extremely targeted scenarios. It's not something real desktop end-users need to worry about. The mitigations slow down your system for zero benefit.

> You can do whatever you want, idk what you're trying to ask here. What is "OK" ?

Maybe re-read the thread from the start? The first guy I responded to was making an assertion that running without spectre/etc mitigations means you should turn off javascript.


> Yeah again, that's a carefully controlled research setup.

The POC runs in visitors' browsers, lol. It's a public demo that runs in your browser, not in a "carefully controlled research setup".

> that very probably don't contain anything of value

Lots of things are valuable other than passwords. Even just leaking addresses can be useful for further exploitation. The main issue is it's a violation of a security boundary.

> The first guy I responded to was making an assertion that running without spectre/etc mitigations means you should turn off javascript.

They said "I hope you <do that>". Presumably because it would also mitigate the issue.


I am going to tell you guys something magical. There are many hundreds of millions of dollars' worth of cryptocurrency sitting on hardware with Spectre/Meltdown mitigations off. Go get them.


Browsers lowered timer resolution to mitigate most speculative-execution attacks. That's probably why there aren't useful exploits. I remember a JS payload when the first Spectre appeared.


I think there might be a herd-immunity effect to that: because mitigations are enabled by default, there isn't much focus on developing attacks against them.


I keep my important info on paper.


I've memorized my private key; it's the only way to be safe in 2022.


Let’s test your memory, what is it?


md5(hunter2)


Why do you want to let driver memory bugs overwrite random kernel memory?


For most of us, driver memory bugs are extremely rare, and therefore a 2-14% performance drop is bonkers.


My AMD GPU driver bugs out pretty regularly. Hardware companies are not known for their bulletproof driver code.


Anything you're not testing regresses, and any driver you let out of jail is going to regress too.

Properly engineered security fixes don't cause performance regressions, either because you find improvements to pay for them, or you get the hardware updated to make them cheaper. (That'd be PCID in this case.)


Both of these cost money.


No. This is unavoidable. Mapping physical pages into a local page table is expensive no matter how you spin it. Depending on the data structures, this means doubling the time in some cases to allocate a physical page, which is typically 4KiB. Large allocations mean multiple page allocations. It adds up.
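
For scale: a 1 MiB allocation covers 1 MiB / 4 KiB = 256 pages, so whatever the per-page mapping cost is, you pay it 256 times.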


Correct me if I’m wrong, but wouldn’t complicated page mapping affect the cost of system calls the most? Because each call causes two alterations to the MMU state.

I wonder how this patch interacts with io_uring. If kernel call overhead keeps going up it will force people to find ways to avoid the overhead.


If I understand correctly the patch at the moment doesn't affect system calls at all, only entering and exiting virtual machines.


Personally I would rather we just lean more into microVMs. From a cloud computing perspective, I don't care what security Linux claims to have, because I just assume it is ineffective and move forward from there. We build systems using loosely-coupled, strongly-authenticated, temporary sessions, and it's very easy to enforce separation of concerns (and thus minimize attack surface).


Your code may be secure, but your cloud neighbor or VM host may not be. If the VM host machine is taken over, then your VMs are at risk too: https://unit42.paloaltonetworks.com/azure-container-instance...


True, but it's pretty rare for hypervisors to be exploited. In that Azure case the node itself maintained a connection to each VM, which kind of defeats the purpose of using VMs for isolation... It's like having strong memory guarantees in Linux and then having an open socket from a root-owned process to every other process. Like, maybe giving them a new attack vector to a SPOF isn't a great idea.


Hypervisor exploits exist, just as container escapes and every other runtime escape and escalation do. That's natural in cloud-native app development.

Remember the shared security model: the lower stack of infra depends on the cloud vendor, and it critically depends on hypervisor/OS/container-runtime security and configuration hardening.


It would be nice to have a more trustworthy stack at every ring, I think.


If you care about security, Qubes OS is more secure, with its hardware virtualization: https://qubes-os.org.


If I understand correctly, OpenBSD has done this for years too. IIRC, around the time the first Spectre vulnerability was announced, OpenBSD patched all those possible variations too.


Somewhere, Andrew Tanenbaum is chewing popcorn. Loudly.


Next in line, a plea for splitting the source tree into different repositories. Followed by a plea for stabilizing the core kernel API...

Then we can replace the core API with a seL4 emulator.


muh MINIX backdoor


That's already running on the Intel PCH.


This is a good opportunity for Rust!


Not arguing with you, but how? Rust is about memory safety, not memory isolation. It keeps me from writing out of bounds, but it does nothing to stop a malicious kernel module from stealing `SECRET_KEY` out of the kernel's local memory.


This is good for Bitcoin.


Rust has absolutely nothing to do with prevention of side-channel attacks.


I'm not smart enough to understand kernels, or how memory works at the operating-system level, but does this have anything to do with Apple's recent announcement that all unallocated memory will be zeroed out?

I've been wondering, albeit naïvely, if TikTok and Facebook have been scraping unallocated memory as a way of listening in on our microphones and cameras "without" listening in. Siri is always listening / the camera sometimes pops up before accessing photos, and the unallocated recordings / visuals could still be there? It could explain the uncanny awareness and specificity of their ads / algos.


Just in case the downvotes didn't answer your question: no. Memory was already zeroed out before it crossed process boundaries. Apple's recent change is about zeroing it out even before it's reused within a process. (Freeing memory doesn't always give it back to the operating system; the userland allocator usually keeps a list of free blocks to be reused quickly.)

Therefore, the attack you are describing doesn't work.
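
As a sketch of the change (an assumed allocator design for illustration, not Apple's actual implementation):

    /* Assumed allocator design for illustration, not Apple's code:
     * scrub a block before it goes back on the process-local free
     * list, so reuse within the same process never sees stale data.
     * Assumes size >= sizeof(struct free_block). */
    #include <stddef.h>
    #include <string.h>

    struct free_block {
        struct free_block *next;
        size_t size;
    };

    static struct free_block *free_list;

    void zeroing_free(void *ptr, size_t size)
    {
        memset(ptr, 0, size);       /* the new part: zero on free    */
        struct free_block *b = ptr; /* header lives in the freed block */
        b->size = size;
        b->next = free_list;
        free_list = b;
    }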


Thanks kind stranger, I appreciate your patience!



