After reading many articles about virtual memory and how kernel space is mapped into every process, I don't understand why it is necessary. Why can't a process have only its user-mode space mapped? Also, it only seems to be the case on Unixes and Windows. I'm not sure exactly how it's done in OS X, but "Mac OS X does not map the kernel into each user address space, and therefore each user/kernel transition (in either direction) requires an address space switch." https://flylib.com/books/en/3.126.1.91/1/
On x86, it was presumably for performance, so that the TLB does not have to be flushed when switching from user to kernel mode. x86 requires some kernel memory to always be mapped, for example the stack for syscall and trap handlers. So by keeping everything mapped, the kernel did not have to worry about which parts were needed to handle syscalls and which were not. These kernel pages were marked as "supervisor only", so only kernel code could actually read and write them.
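To make the "supervisor only" part concrete, here's a minimal sketch of what that looks like at the page-table-entry level on x86. The bit positions are the standard architectural ones; `user_can_access` is just a made-up helper for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 page-table entry flag bits (low bits of a PTE). */
#define PTE_PRESENT  (1ull << 0)  /* translation is valid */
#define PTE_WRITABLE (1ull << 1)  /* writes allowed */
#define PTE_USER     (1ull << 2)  /* accessible from ring 3 (user mode) */

/* Hypothetical helper: a kernel-only page is present (so the kernel can
 * use it during a syscall without any TLB flush) but lacks the user bit,
 * so an access from user mode faults even though the translation stays
 * mapped in the page tables. */
static bool user_can_access(uint64_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_USER);
}
```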
I say all of this in the past tense, since Meltdown makes it possible to read all that kernel memory. Kernels now keep most of the kernel memory unmapped while user mode is executing.
>"x86 requires some kernel memeory to be mapped always, for example the stack for syscall and trap handlers."
Can you elaborate on what you mean by x86 requiring that the kernel stack always be mapped into a process's address space in order for system calls to work?
The kernel always knows where a process's kernel stack is located, as there is a pointer to it in the user process's task_struct. It is only in kernel mode that the kernel switches the CPU's stack pointer to use that process's kernel stack.
You can't unmap the kernel stack as a Meltdown mitigation, because the syscall instruction will want to push to the kernel stack before you, as the kernel, have a chance to map it.
It sounded like the OP's comment wasn't strictly about the post-Meltdown era and that they were commenting on the general case. But maybe I misinterpreted that?
OK, in the context of "why can't you cleanly have the kernel in a different address space from user processes on x86", the same reasons apply. It's a chicken/egg thing: a syscall instruction executes and touches the kernel stack before you have a chance to change the MMU mappings.
There are versions of Darwin for x86 (but no released versions of full OS X, AFAIK) that separate the address spaces, but they reserve an (albeit much smaller) piece of virtual address space at the top for the kernel in all address spaces in order to facilitate the transition to the full kernel address space.
This is it. Most of the RISC chips had an ASID tag in the MMU metadata that allowed you to switch address spaces without flushing the TLBs, but x86 added this super late. It ended up being added in the second round of virtualization extensions on x86 (and it's different between AMD and Intel).
> "It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data."
This isn't strictly true though, is it? It was my understanding that even mmap() with MAP_ANONYMOUS used a file interface, and that the way the kernel creates anonymous maps is by creating an instance of /dev/zero in tmpfs, although I believe the file descriptor argument might be ignored.
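Whatever the kernel does internally to back it, from user space the anonymous case looks like this (a minimal sketch; with MAP_ANONYMOUS the fd argument is ignored and -1 is passed for portability):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;

    /* Anonymous mapping: no file is named and the fd argument (-1 here)
     * is ignored. The kernel hands back zero-filled, private memory. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(p, "backed by anonymous memory");
    printf("%s\n", p);

    munmap(p, len);
    return 0;
}
```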
If the process depicted in the diagram were to start a second thread, where would that second thread's stack go in the diagram? The two threads would share the same heap.
Thanks. So can I check that I'm understanding correctly:
- If a process has many threads, their stacks are all located within a single virtual address space corresponding to the user process?
- If one thread grows down and is about to overwrite the top of another thread's stack, does the OS detect this automatically and do some sort of reallocation procedure?
> If a process has many threads, their stacks are all located within a single virtual address space corresponding to the user process?
Yep!
> If one thread grows down and is about to overwrite the top of another thread's stack, does the OS detect this automatically and do some sort of reallocation procedure?
The kernel reserves an 8MB region for each stack and that's it (even for the initial stack). So you wouldn't get overlapping stacks per se; the regions are preallocated. The kernel does try to detect stack overflow/underflow with guard pages, but that's a best-attempt kind of thing, and if you underflow by more than a page you can end up just corrupting memory.
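A quick way to see those numbers for yourself (a sketch for Linux/glibc: pthread_getattr_np is a GNU extension, and the exact stack size comes from the pthread defaults / RLIMIT_STACK rather than being fixed at 8MB everywhere):

```c
#define _GNU_SOURCE   /* for pthread_getattr_np */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    pthread_attr_t attr;
    size_t stack_size = 0, guard_size = 0;

    /* Ask the runtime how big this thread's stack region is and how big
     * the guard (trap) area protecting it is. */
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstacksize(&attr, &stack_size);
    pthread_attr_getguardsize(&attr, &guard_size);
    printf("stack: %zu bytes, guard: %zu bytes\n", stack_size, guard_size);

    pthread_attr_destroy(&attr);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```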
And all of this is for C's sort of standard model. "Split stacks" is a scheme closer to your second question, but there's a lot of overhead in that model, and not a lot of runtimes use it.
Even more rarely, I've seen a model that allocates stack frames on the heap and links them together in a linked list.
But like I said, these schemes are rare in practice.
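Just to illustrate that last model, here's a toy sketch (not any particular runtime's implementation; all names are made up): each "frame" is a heap allocation that links back to its caller's frame instead of sitting below it in a contiguous stack region.

```c
#include <stdlib.h>

/* Toy activation record: locals live on the heap, and each frame points
 * back to its caller instead of relying on a contiguous stack. */
struct frame {
    struct frame *caller;
    long locals[4];
};

/* "Call": allocate a fresh frame and link it to the current one. */
static struct frame *push_frame(struct frame *caller)
{
    struct frame *f = calloc(1, sizeof *f);
    if (f)
        f->caller = caller;
    return f;
}

/* "Return": free this frame and go back to the caller's frame. */
static struct frame *pop_frame(struct frame *f)
{
    struct frame *caller = f->caller;
    free(f);
    return caller;
}

int main(void)
{
    struct frame *top = push_frame(NULL);  /* outermost call */
    top = push_frame(top);                 /* nested call */
    top->locals[0] = 42;
    top = pop_frame(top);                  /* return from nested call */
    pop_frame(top);
    return 0;
}
```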
You could still make an argument for that if you squint hard enough. The virtual memory is still reserved, and a transition to kernel mode still has user space mapped, as well as the kernel's view of memory.
Thanks for clarifying! Is whatever the kernel does now in transition to user space expensive because it's somehow proportional to the amount of actual memory that the kernel is using or has reserved?
The Meltdown fixes change the model to unmap most of the kernel when switching to user mode. The issue with Meltdown was that it was possible, through timing side channels, to read memory that was technically mapped but that permissions shouldn't have allowed you to touch.
(or click "past" under the title, also helpful to check when you submit a link)