Several reasons why this is a terrible idea, aside from learning how things work:
(1) It only works on one architecture. Even this architecture may change optimal alignment requirements in future, which glibc would transparently handle for you.
(2) It's very complex and error-prone compared to using the pthread_* APIs (and they themselves are not exactly easy to use correctly).
(3) You're missing future bug fixes and optimizations.
(4) The kernel may move to a different API in future and glibc will handle that for you, but this code will still be using the old API.
The kernel has changed its threading APIs, the clone API, and how syscalls are made several times in the past, and you cannot predict how they will change in future.
I thought "don't break userspace" was one of the key principles of Linux design?
The old APIs continue to be supported, but they may be slower or have other limitations. E.g. you can continue to make syscalls through int $0x80 if you want, even on the latest x86, but performance will suck. You can keep using clone instead of clone3, but you won't get to use all the newer features.
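To make that concrete, here's a minimal sketch of the same syscall made through both entry points (assumes x86-64 Linux with IA-32 emulation enabled and GCC/Clang inline asm; note the syscall numbers differ between the two tables):

    #include <stdio.h>

    /* Legacy 32-bit entry point; only works if the kernel was built
       with IA-32 emulation. __NR_getpid is 20 in the i386 table. */
    static long getpid_int80(void) {
        long ret;
        __asm__ volatile ("int $0x80" : "=a"(ret) : "a"(20L) : "memory");
        return ret;
    }

    /* Native 64-bit fast path. __NR_getpid is 39 in the x86-64 table. */
    static long getpid_syscall(void) {
        long ret;
        __asm__ volatile ("syscall" : "=a"(ret) : "a"(39L)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void) {
        printf("int $0x80: %ld, syscall: %ld\n",
               getpid_int80(), getpid_syscall());
        return 0;
    }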
Nothing guarantees that. clone() may still be provided by the kernel but nothing guarantees it will still be fast and not just a compatibility retrofit. Particularly since most users will move on automatically through libc.
> (1) It only works on one architecture. Even this architecture may change optimal alignment requirements in future, which glibc would transparently handle for you.
A portable solution (on any POSIX system, actually) for the stack switching would be to jump to the new stack using sigaltstack and initialize a sigjmp_buf context (i.e., sigsetjmp) from the signal handler. After calling clone, you conditionally invoke longjmp based on the return value. This works when setjmp and longjmp are compiler builtins; if they are not, use __builtin_setjmp and __builtin_longjmp instead. (You can pass the sigjmp_buf context using the same trick mentioned in the writeup. And of course this also assumes the compiler won't emit stack-dependent code between the syscall and the longjmp.)
You still have the problem of implementing the syscalls, though. But maybe if we scrounge around we can find solutions. On some architectures, the vDSO that Linux maps into the process address space has a syscall implementation. And maybe there are similar compiler builtins for other architectures, or macros in the Linux header files with the necessary inline assembly.
As a learning exercise this type of stuff is fun ;)
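In that spirit, here is a rough sketch of the sigaltstack half of the idea (hypothetical and heavily simplified: a fixed 64 KiB stack instead of SIGSTKSZ, and it leans on exactly the longjmp-out-of-a-signal-handler behavior the next reply warns about):

    /* Hypothetical sketch: capture a jump context on a sigaltstack-
       provided stack, then re-enter it from normal code. Not
       production-quality. */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static sigjmp_buf on_new_stack;  /* captured on the alternate stack */
    static sigjmp_buf back_to_main;  /* context on the original stack */

    static void handler(int sig) {
        (void)sig;
        /* We are running on the alternate stack here. */
        if (sigsetjmp(on_new_stack, 0) == 0)
            siglongjmp(back_to_main, 1);   /* first pass: back to main */
        /* Second pass: main jumped us back onto this stack. */
        puts("now executing on the alternate stack");
        exit(0);
    }

    int main(void) {
        static char stack_mem[64 * 1024];  /* the "new" stack */
        stack_t ss = { .ss_sp = stack_mem, .ss_size = sizeof stack_mem };
        sigaltstack(&ss, NULL);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sa.sa_flags = SA_ONSTACK;       /* run handler on the alt stack */
        sigaction(SIGUSR1, &sa, NULL);

        if (sigsetjmp(back_to_main, 1) == 0)
            raise(SIGUSR1);             /* capture context on alt stack */
        else
            siglongjmp(on_new_stack, 1); /* switch stacks, no assembly */
        return 0;
    }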
setjmp and longjmp are not normally compiler builtins, just ordinary library functions. They are also quite different from the __builtins you mention, sometimes in quite surprising ways. Compilers are actually quite buggy around these functions, even if you follow their strict rules for what is allowed. Calling longjmp in a signal handler is technically undefined behavior in POSIX and triggers issues on some platforms that I have had the displeasure to run into. Recently, various hardening options in setjmp implementations have started detecting these sorts of attempts and aborting the program. So I wouldn’t call all of that very portable, but I agree it is fun!
The recently published book Rust Atomics and Locks by Mara Bos has an interesting section on why Rust uses the futex syscall directly.
> ... However, there are a few issues with that, as this pthread type was designed for C, not for Rust.
> In Rust, we move objects around all the time. ... The pthread types we discussed do not guarantee they are movable, which becomes quite a problem in Rust. Even a simple idiomatic Mutex::new() function is a problem: it would return a mutex object, which would move it into a new place in memory.
> A solution to this problem is to wrap the mutex in a Box. By putting the pthread mutex in its own allocation, it stays in the same location in memory, even if its owner is moved around. This is how std::sync::Mutex was implemented on all Unix platforms before Rust 1.62.
> The downside of this approach is the overhead: every mutex now gets its own allocation, adding significant overhead to creating, destroying, and using the mutex. Another downside is that it prevents the new function from being const, which gets in the way of having a static mutex. Even if pthread_mutex_t was movable, a const fn new could only initialize it with default settings, which results in undefined behavior when locking recursively. There is no way to design a safe interface that prevents locking recursively, so this means we’d need to make the lock function unsafe to make the user promise they won’t do that.
> A problem that remains with our Box approach occurs when dropping a locked mutex. ... pthread specifies that calling pthread_mutex_destroy() on a locked mutex is not guaranteed to work and might result in undefined behavior. One work-around is to first attempt to lock (and unlock) the pthread mutex when dropping our Mutex, and panic (or leak the Box) when it is already locked, but that adds even more overhead. These issues don’t just apply to pthread_mutex_t, but to the other types we discussed as well. Overall, the design of the pthread synchronization primitives is fine for C, but just not a great fit for Rust.
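For reference, the futex-based direction the book describes boils down to roughly this (a minimal C sketch of the classic three-state futex mutex, in the spirit of Drepper's "Futexes Are Tricky"; Rust's actual std implementation differs in detail):

    /* Minimal sketch of a one-word futex mutex: 0 = unlocked,
       1 = locked, 2 = locked with (possible) waiters. The whole mutex
       is 4 bytes, so a Rust equivalent is trivially movable and
       const-initializable. */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    typedef struct { atomic_uint state; } mutex_t;

    static long futex(atomic_uint *addr, int op, uint32_t val) {
        /* glibc provides no futex() wrapper; call the syscall directly. */
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    static void mutex_lock(mutex_t *m) {
        unsigned c = 0;
        if (atomic_compare_exchange_strong(&m->state, &c, 1))
            return;                              /* uncontended fast path */
        if (c != 2)
            c = atomic_exchange(&m->state, 2);   /* announce contention */
        while (c != 0) {                         /* c == 0: we acquired it */
            futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
            c = atomic_exchange(&m->state, 2);
        }
    }

    static void mutex_unlock(mutex_t *m) {
        if (atomic_exchange(&m->state, 0) == 2)  /* possible waiters? */
            futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }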
What alternative is there? Linux is a more minimalist OS than old-school UNIX, and libc as a userspace construct allows more evolution in that space (see glibc vs musl vs uclibc).
I could see an argument for small shims injected by a vDSO (AIUI this is what Fuchsia does), but clearly the 80s-style "libc is the OS boundary" is obsolete.
A user-mode library bundled with the kernel (like libc, but doesn't have to be libc) is the alternative to the syscall boundary. You maintain the library as the stable interface for the user, and let the syscalls change as you please.
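Linux already ships a small version of this idea: the vDSO. You can see it from any process (getauxval and AT_SYSINFO_EHDR are the real glibc/kernel interfaces here):

    #include <stdio.h>
    #include <sys/auxv.h>

    int main(void) {
        /* The kernel maps a small shared object (the vDSO) into every
           process; libc resolves symbols like __vdso_clock_gettime
           from it, so those "syscalls" never enter the kernel. */
        unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
        printf("vDSO ELF header mapped at %#lx\n", vdso);
        return 0;
    }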
Yes, I understand the mechanism, but why would you want to structure your OS like that?
You either have to maintain a stable syscall ABI, or a stable userspace function ABI, and it's not like one is materially more difficult than the other. If anything, the Linux approach has proven to be superior -- glibc has had multiple breaking changes over the years and to this day users will resort to Docker to run binaries that depend on "old" glibc, but programs written against the Linux syscall ABI twenty years ago will run fine to this day.
Even if you decide to have a userspace trampoline/shim, it doesn't make sense for that to be libc. The C standard library is huge and requiring userspace programs to link it regardless of implementation language does nothing except add forty years of bad ideas into your process's address space.
Multiple organizations ship Linux kernels which do not support the original x86-64 system call ABI at all. Old applications only keep running because these applications are dynamically linked against glibc, and the system glibc version is new enough to use the new system call ABI. Some Linux subsystems are curiously exempt from the ABI stability, or different ABIs can be provided through compile-time or system configuration. Either way, as an application developer, you cannot be sure which ABI will be available.
On the glibc side, it's challenging for us upstream developers because a lot of people assume that we do not aim to provide backwards-compatibility, so they do not bother reporting compatibility issues. I suspect this perception is there because one of the backwards compatibility mechanisms we use prevents running binaries built on newer systems on older systems. (New programs use a new symbol with a backwards-incompatible change, old programs get the old implementation.) But that is not actually about backwards compatibility, it's requesting forward compatibility.
The goal is to require recompilation only in limited scenarios: static libraries and object files not yet fully linked, deliberate dependencies on undocumented internals (e.g., internal struct offsets), and dependency on behavior that is not standards-conforming (e.g., non-sticky EOF on stdio streams). And in the latter case, it's often possible to add a kludge to maintain backwards compatibility.
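For the curious, the mechanism referred to above is GNU symbol versioning. A toy sketch (hypothetical symbol and version names; the library must be linked with a version script defining LIB_1.0 and LIB_2.0):

    /* lib.c: two implementations behind one symbol name, selected by
       the version the client binary was linked against. */
    #include <stdio.h>

    void foo_old(void) { printf("old, bug-compatible behavior\n"); }
    void foo_new(void) { printf("new, backwards-incompatible behavior\n"); }

    /* Old binaries bound to foo@LIB_1.0 keep the old implementation;
       anything linked against the new library gets foo@@LIB_2.0 (the
       default version, marked by the double @). */
    __asm__(".symver foo_old, foo@LIB_1.0");
    __asm__(".symver foo_new, foo@@LIB_2.0");

    /* Build: gcc -shared -fPIC -Wl,--version-script=ver.map lib.c
       where ver.map declares LIB_1.0 { ... }; LIB_2.0 { ... }; */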
> If anything, the Linux approach has proven to be superior -- glibc has had multiple breaking changes over the years and to this day users will resort to Docker to run binaries that depend on "old" glibc, but programs written against the Linux syscall ABI twenty years ago will run fine to this day.
Windows takes the opposite approach (except that the libraries are gdi32/kernel32/ntdll/user32/etc.), and it's definitely an example of backwards-compatibility done well.
I think the case of Windows is slightly different. Its architecture has the same backwards compatibility challenges of GNU libc, but where GNU took a user-hostile approach ("just recompile"), Microsoft invested thousands of person-years to mitigate and resolve issues as they were discovered.
Most folks reading this are likely familiar with Raymond Chen's blog posts about compatibility shims for older binaries, or stories like the Windows 95 workaround for memory-management bugs in SimCity. That's the kind of thing that happens when you design an OS ABI with problems, but are committed to bearing the cost of those problems yourself rather than passing them along to users.
I'm sure if you asked today's Windows team if they'd have a different approach, they'd rattle off a half-dozen great ideas for backwards-compatible syscall ABIs. There's been lots of research in that area since the foundations of our current OS ecosystem were laid down 30 years ago.
I think competing products define their interface in terms of calling a provided C library, and leave the entirety of the system call mechanics as an implementation detail.
For context, this is how it worked in old-school UNIX. If you look at (for example) SunOS, the kernel ABI is undocumented and everything in userspace depends on libc. The OSes and userspace languages of the day were deeply coupled -- C was the language of UNIX.
You can still see the remnants of this design in macOS, except instead of libc it's libSystem. Apple has decoupled their OS from C (in favor of Objective C, then Swift) but the idea of a userspace entry point to system functionality survived.
Linux is different from traditional UNIX because it's a kernel-only OS. There's no "Linux libc", and third parties have written multiple implementations of the C standard library that run on Linux. The only way this structure can work is for the kernel itself to provide stable ABI, and the standard approach at the time (mid 90s) was numeric syscall codes with positional parameters.
If you think glibc pthreads are not optimal then you could supply a patch and transparently fix literally thousands of applications and tens of millions of users. Otherwise I trust the glibc developers I work with who are extremely smart and focused on performance.
I highly recommend you take a look at https://webkit.org/blog/6161/locking-in-webkit/ then. It’s been known for a long time that pthread mutexes are very expensive and terrible. Unfortunately you can’t fix that AND maintain backward compatibility (and you might lose technical compliance with POSIX). But that doesn’t mean there aren’t drastically better locking primitives, and it’s the epitome of hubris to think glibc is a great codebase. It’s just widely used, and most people really don’t want to dive into writing their own libc. Literally every high-performance application I’ve seen uses its own locking primitives, DNS alternatives, etc. The only exception is when locking isn’t a critical bottleneck.
Examples of the same concept in different contexts:
WTF lock
Folly locks
Abseil locks
Rust mutexes
I have more interesting projects to work on than glibc maintenance.
The reason WebKit rolls their own locks is because they have specific needs that they can optimize around, which they do quite well. But a two-bit lock isn’t actually capable of supporting recursive locking, or fairness, or all the other things that people expect from a pthread mutex. You could say that most people don’t need those things and could get away with a more optimized lock, and you’d be right, but glibc supports the general case and does a fairly good job, so it’s a reasonable choice except for high-performance applications, which end up rolling their own.
You don't need a larger mutex for fairness. For recursive locking, maybe. But most mutexes aren't recursive.
That's one of the big problems with pthread mutexes: they try to pack too much functionality into a single type. A pthread mutex can be recursive, so every mutex needs space to store a recursion count, even though most mutexes aren't recursive. A pthread mutex can have an arbitrary priority ceiling (pthread_mutexattr_setprioceiling), so every mutex needs space to store the priority ceiling, even though most mutexes don't set that attribute. A pthread mutex can be "robust" (pthread_mutexattr_setrobust), which at least under glibc means that every mutex needs previous/next pointers in it so it can be stored in a per-thread linked list, even though most mutexes are not robust. That's an excessive level of generality, when different kinds of mutexes could have just been different types!
The other big problem with pthread mutexes is just that they are old. I could be wrong, but my impression is that much of the interest in smaller mutexes has come about relatively recently, at least compared to the age of these APIs. It might be possible to shrink pthread_mutex_t to some extent despite the aforementioned issues, but it's impossible to do so on any existing operating system (at least on existing architectures) because changing the size of pthread_mutex_t would break ABI.
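To put rough numbers on that (output is what I'd expect on x86-64 glibc; sizes vary by platform and libc):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
        /* On x86-64 glibc this prints 40 for pthread_mutex_t: room for
           the futex word, owner, recursion count, kind, and the
           robust-list links, whether or not you use those features. A
           bare futex-based lock needs only the 4-byte word itself. */
        printf("sizeof(pthread_mutex_t) = %zu\n", sizeof(pthread_mutex_t));
        printf("sizeof(atomic_uint)     = %zu\n", sizeof(atomic_uint));
        return 0;
    }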
Another point about the glibc implementation of pthread_mutex_t (and IIRC even the internal glibc lll_t, which implements a simple mutex as a few lines of assembly around futex(2)) is that on 32/64-bit platforms the structure layout is compatible between the 32-bit and 64-bit ABIs. The reason is that a pthread mutex can have the pshared attribute, and it is also used in the implementation of higher-level POSIX IPC primitives (the sane implementation of all of which involves simply placing the particular synchronization struct into shared memory, rather than somehow translating it into SysV IPC).
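A minimal sketch of that pshared use case (standard pthreads calls; error handling trimmed):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Place a process-shared mutex directly in anonymous shared memory;
       after fork() (or via shm_open + mmap), both processes lock the
       same struct -- no translation to SysV IPC needed. */
    pthread_mutex_t *make_shared_mutex(void) {
        pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return NULL;
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return m;
    }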
You'd need extra space to store the fact that the mutex is fair rather than unfair. (I agree with you that the API tries to cram too much into one interface, but that's really on POSIX rather than glibc.)
Look at the benchmarks. The fairness locks perform worse than if you didn’t have any fairness because you’re wasting a huge amount of CPU cycles trying to create that fairness.
Recursive locks are generally acknowledged as a terrible design pattern.
You don’t get to simultaneously claim that glibc has an optimal lock implementation and that high performance applications should implement their own. That implies the glibc implementation isn’t actually optimal because you’re paying a penalty for features that aren’t actually necessary.
Sounds like a problem with POSIX rather than glibc. Nothing about glibc has stopped WebKit from using their own locking implementation alongside it, or it could be supplied as a library (libraries, after all, provide a vast array of Linux features which operate above or alongside libc). You could even work through the POSIX committee to get lighter-weight locks approved, then have them added to glibc, and then work to modify existing programs to use them. So I don't see how this relates to the supposed inefficiency of glibc's pthreads implementation.
Nothing is stopping glibc from shipping a non-POSIX extension to locks / doing that standardization. Like I said, I have more interesting things to spend my time on. In my eyes POSIX is an ossified, dead standard that hasn’t shipped anything interesting or consequential in a very long time.
Pthreads are fine as a default for most applications, but then most applications shouldn’t be using threading primitives directly anyway; we should be using higher levels of abstraction that use better locking primitives under the hood.
This article is a good overview of why using the futex syscall directly is hard: https://lwn.net/Articles/823513/