> Conceptually, during a door invocation the client thread that issues the door procedure call migrates to the server process associated with the door, and starts executing the procedure while in the address space of the server. When the service procedure is finished, a door return operation is performed and the thread migrates back to the client's address space with the results, if any, from the procedure call.
Note that Server/Client refer to threads on the same machine.
While I can see the performance benefits of this approach over traditional IPC (sockets, shared memory), it "opens the door" to potentially worse concurrency headaches than the ones you already have with threads you spawn and control yourself.
Has anyone here hands-on experience with these and can comment on how well this worked in practice?
IIUC, what they mean by "migrate" is that the client thread is paused and the server thread is given the remainder of the time slice, similar to how pipe(2) originally worked in Unix and even, I think, early Linux. It's the flow of control that "conceptually" shifts synchronously. This can provide surprising performance benefits in a lot of RPC scenarios, though less so now that TLB flushing and the like as part of a context switch have become more costly. There are no VM shenanigans except for some page-mapping optimizations for passing large chunks of data, which apparently weren't even implemented in the original Solaris implementation.
The kernel can spin up a thread on the server side, but this works just like common thread pool libraries, and I'm not sure the kernel has any special role here except to optimize context switching when there's no spare thread to service an incoming request and a new thread needs to be created. With a purely userspace implementation there may be some context switch bouncing unless an optimized primitive (e.g. some special futex mode, perhaps?) is available.
Other than maybe the file namespace attaching API (not sure of the exact semantics), and presuming I understand properly, I believe Doors, both functionally and the literal API, could be implemented entirely in userspace using Unix domain sockets, SCM_RIGHTS, and mmap. It just wouldn't have the context switching optimization without new kernel work. (See the switchto proposal for Linux from Google, though that was for threads in the same process.)
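For instance, the descriptor-passing piece of such a userspace emulation - handing an open fd to the peer over a Unix domain socket - is standard SCM_RIGHTS usage. A minimal sketch (send_fd() is just an illustrative helper name, not part of any doors API):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass an open file descriptor across an AF_UNIX socket via SCM_RIGHTS. */
    static int send_fd(int sock, int fd)
    {
        char dummy = 'd';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctl;
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof(ctl.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

The context-switch optimization is the part this can't reproduce; the receiving process still has to be scheduled to service the request.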
There isn't a door_recv(2) system call or equivalent.
Doors truly don't transfer messages, they transfer the thread itself. As in the thread that made a door call is now just directly executing in the address space of the callee.
> Doors truly don't transfer messages, they transfer the thread itself. As in the thread that made a door call is now just directly executing in the address space of the callee.
In somewhat anachronistic verbiage (at least in a modern software context) this may be true, but today this statement makes it sound like code from the caller process is executing in the address space of the callee process, such that miraculously the caller code can now directly reference data in the callee. AFAICT that just isn't the case, and wouldn't even make sense--i.e. how would it know the addresses without a ton of complex reflection that's completely absent from example code? (Caller and callee don't need to have been forked from each other.) And according to the Linux implementation, the "argument" (a flat, contiguous block of data) passed from caller to callee is literally copied, either directly or by mapping in the pages. The caller even needs to provide a return buffer for the callee's returned data to be copied into (unless it's too large, in which case it's mapped in and the return argument vector updated to point to the newly mmap'd pages). File descriptors can also be passed, and of course that requires kernel involvement. (A sketch of this calling pattern follows below.)
AFAICT, the trick here pertains to scheduling alone, both wrt the hardware and software systems. I.e. a lighter-weight interface to the hardware task gating mechanism, like you say, reliant on the synchronous semantics of this design to skip involving the system scheduler. But all the other process attributes, including the address space, are switched out, perhaps in an optimized manner as mentioned elsethread, but still preserving typical process isolation semantics.
If I'm wrong, please correct me with pointers to more detailed technical documentation (or code--is this still in illumos?) because I'd love to dig more into it.
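For reference, the calling pattern described above looks roughly like this with the Solaris doors API (a hedged sketch: the door path and buffer sizes are made up, and it would need -ldoor to build on Solaris/illumos):

    #include <door.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        char request[] = "ping";
        char reply[128];                /* return buffer supplied by the caller */
        door_arg_t arg = {
            .data_ptr  = request,
            .data_size = sizeof(request),
            .desc_ptr  = NULL,          /* optional file descriptors to pass */
            .desc_num  = 0,
            .rbuf      = reply,
            .rsize     = sizeof(reply),
        };

        int d = open("/tmp/example_door", O_RDONLY);  /* made-up door path */
        if (d < 0 || door_call(d, &arg) < 0) {
            perror("door_call");
            return 1;
        }
        /* If reply was too small, the results are mapped in instead and
           arg.rbuf/arg.data_ptr are updated to point at the new pages. */
        printf("server replied: %.*s\n", (int)arg.data_size, arg.data_ptr);
        return 0;
    }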
I didn't imply that the code remains and it's only data that is swapped out. The thread jumps to another complete address space.
It's like a system call instruction that instead of jumping into the kernel, jumps into another user process. There's a complete swap out of code and data in most cases.
Just as the kernel doesn't need a thread pool to respond to user requests via system calls, the same applies here. The calling thread is just directly executing in the callee's address space after door_call(2).
> Did you mean door_call or door_return instead of door_recv?
I did not. I said there is no door_recv(2) system call. The 'server' doesn't wait for messages at all.
I think what doors do is rendezvous synchronization: the caller is atomically blocked as the callee is unblocked (and vice versa on return). I don't think there is an efficient way to do that with just plain POSIX primitives or even with Linux specific syscalls (Binder and io_uring possibly might).
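For comparison, a plain-POSIX approximation of that rendezvous with two semaphores (a sketch, assuming both live in memory shared between client and server, e.g. set up with sem_init(..., 1, 0)) needs two separate wakeups, each going through the scheduler - which is exactly the inefficiency in question:

    #include <semaphore.h>

    struct rendezvous {
        sem_t request_ready;   /* posted by client, awaited by server */
        sem_t reply_ready;     /* posted by server, awaited by client */
    };

    /* Client side of a "call": wake the server, then block for the reply.
       Unlike a door call, these are two separate scheduling events. */
    static void call(struct rendezvous *r)
    {
        sem_post(&r->request_ready);
        sem_wait(&r->reply_ready);
    }

    /* Server loop: block for a request, handle it, wake the client. */
    static void serve(struct rendezvous *r)
    {
        for (;;) {
            sem_wait(&r->request_ready);
            /* ... handle the request ... */
            sem_post(&r->reply_ready);
        }
    }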
The thread in this context refers to the kernel scheduler thread[1], essentially the entity used to schedule user processes. By migrating the thread, the calling process is "suspended": its associated kernel thread (and thus scheduled time quanta, run queue position, etc.) saves the state into a Door "shuttle", picks up the server process, and continues execution in the server procedure; when the server process returns from the handler, the kernel thread picks up the Door "shuttle", restores the right client process state from it, and lets it continue - with the result of the IPC call.
This means that when you make a Door IPC call, the service routine is called immediately, not at some indefinite point in the future when the server process gets picked by the scheduler to run and finds an event waiting for it on a select/poll kind of call. If the service handler returns fast enough, it might even return before the client process's scheduler timeslice ends.
The rapid changing of the TLB etc. is mitigated by hardware features in the CPU that permit faster switches, something Sun already had experience with at the time from the Spring Operating System project - from which Doors IPC in fact came to be. Spring IPC calls were often faster than normal x86 syscalls at the time (timings just on the round trip: 20us for a typical syscall on a 486DX2, 11us for Spring IPC on a SPARCstation, >100us for Mach syscall/IPC).
EDIT:
[1] Some might remember references to 1:1 and M:N threading in the past, especially in discussions about threading support in various unices, etc.
The "1:1" originally referred to relationship between "kernel" thread and userspace thread, where kernel thread didn't mean "posix like thread in kernel" and more "the scheduler entity/concept", whether it was called process, thread, or "lightweight process"
Sounds like Android's binder was heavily inspired by this. Works "well" in practice in that I can't recall ever having concurrency problems, but I would not bother trying to benchmark the efficiency of Android's mess of abstraction layers piled over `/dev/binder`. It's hard to tell how much of the overhead is required to use this IPC style safely, and how much of the overhead is just Android being Android.
Not sure which one came first, but Binder is a direct descendant (down to sometimes still-matching symbol names and calls) of the BeOS IPC system. All the low-level components (Binder, Looper, even the serialization model) come from there.
From what I understand, Sun made their Doors concept public in 1993 and shipped a SpringOS beta with it in 1994, before BeOS was released, but it's hard to tell if Sun inspired BeOS, or if this was a natural solution to a common problem that both teams ran into at the same time.
I'd expect convergent evolution - both BeOS team and Spring team were very well aware of issues with Mach (which nearly single-handedly coined the idea that microkernels are slow and bad) and worked to design better IPC mechanisms.
Sharing of the scheduler slice is an even older idea, AFAIK, and technically something already done whenever you call into the kernel (it's not a context switch to a separate process, it's a switch to a different address space but running in the same scheduler thread).
Binder has been in the mainline kernel for years, and some projects ended up using it, if only to emulate an Android environment - both Anbox and its (AFAIK) successor Waydroid use the native kernel Binder to operate.
You can of course build your own use of it (depending on what exactly you want to do, you might end up writing your own userland instead of using Android's).
As far as I understand, it is already mainlined, it's just not built by "desktop" distributions since nobody really cares - all the cool kids want dbusFactorySingletonFactoryPatternSingletons to undo 20 years of hardware performance increases instead.
The Be Book, the Haiku source code, and yes, Android low-level internals docs.
A quick look through the BeOS and Android Binder-related APIs will quickly show how the Android side is derived from it (through OpenBinder, which was for a time going to be used in the next Palm system based on Linux - at least one of them).
Think of it in terms of REST. A door is an endpoint/path provided by a service. The client can make a request to it (call it). The server can/will respond.
The "endpoint" is set up via door_create(); the client connects by opening it (or receiving the open fd in other ways), and make the request by door_call(). The service sends its response by door_return().
Except that the "handover" between client and service is inline and synchronous; "nothing ever sleeps" in the process. The service needn't listen for and accept connections. The operating system "transfers" execution directly - it context switches to the service, runs the door function, and context switches back to the client on return. The "normal" scheduling (where the server/client sleeps, becomes runnable from pending I/O and is eventually selected by the scheduler) is bypassed here, and latency is lower.
Purely functionality-wise, there's nothing you can do with doors that you couldn't do with a (private) protocol across pipes, sockets, HTTP connections. You "simply" use a faster/lower-latency mechanism.
(I actually like the "task gate" comparison another poster made, though doors do not require a hardware-assisted context switch)
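To make that door_create()/door_call()/door_return() flow concrete, here is a minimal server-side sketch assuming the Solaris doors API (the path /tmp/example_door and the echo behaviour are made up; build with -ldoor on Solaris/illumos):

    #include <door.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>   /* fattach() */
    #include <unistd.h>

    /* Service procedure: invoked directly by a client's door_call(). */
    static void service(void *cookie, char *argp, size_t arg_size,
                        door_desc_t *dp, uint_t n_desc)
    {
        /* Echo the request back; door_return() does not return on success. */
        door_return(argp, arg_size, NULL, 0);
    }

    int main(void)
    {
        int d = door_create(service, NULL, 0);
        if (d < 0) { perror("door_create"); return 1; }

        /* Publish the door in the file namespace so clients can open() it. */
        close(open("/tmp/example_door", O_CREAT | O_RDWR, 0644));
        if (fattach(d, "/tmp/example_door") < 0) { perror("fattach"); return 1; }

        pause();   /* no accept()/recv() loop: calls jump straight into service() */
        return 0;
    }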
Well, Doors' speed was derived from hardware-assisted context switching, at least on SPARC. The combination of ASIDs (which allowed task switching with reduced TLB flushing) and the WIM register (which marked which register windows are valid for access by userspace) meant that IPC speed could be greatly increased - in fact that was the basis for the "fast path" IPC in Spring OS, from which Doors were ported into Solaris.
I was (more) of a Solaris/x86 kernel guy on that particular level and know the x86 kernel did not use task gates for doors (or for any context switching other than the double fault handler). Linux did task switching via task gates on x86 until 2.0, IIRC. But then, hardware assist or not, x86 task gates "aren't that fast".
The SPARC context switch code, to me, always was very complex. The hardware had so much "sharing" (the register window set could be split among multiple owners, as could the TSB/TLB, and the "MMU" was elaborate software in sparcv9 anyway). SPARC's achilles heel always was the "spills" - needless register window (and other CPU state) saves to/from memory. I'm kinda still curious from a "historical" point of view - thanks!
The historical point was that for Spring OS "fast path" calls, if you kept the register stack small enough, you could avoid spilling at all.
Switching from task A to task B to service a "fast path" call AFAIK (I have no access to the code) involved using the WIM register to mark the windows used by task A as invalid (so their use would trigger a trap) and changing the ASID value - so if task B was already in the TLB you'd avoid flushes, or reduce them to flushing only when running out of TLB slots.
The "golden" target for fast-path calls was calls that would require as little stack as possible, and for common services they might be even kept hot so they would be already in TLB.
So if I understand it correctly, the IPC advantage is that they preserve registers across the process context switch, thereby avoiding having to do notoriously expensive register saves and restores? In effect, leaking register contents across the context switch becomes a feature instead of a massive security risk. Brilliant!
Why would you care who spawned the thread? If your code is thread-safe, it shouldn't make a difference.
One potential problem with regular IPC I see is that it's nondeterministic in terms of performance/throughput because you can't be sure when the scheduler will decide to run the other side of whatever IPC mechanism you're using. With these "doors", you bypass scheduling altogether, you call straight "into" the server process thread. This may make a big difference for systems under load.