Practical Libc-free threading on Linux (nullprogram.com)
125 points by ingve on March 23, 2023 | 51 comments



Several reasons why this is a terrible idea, aside from learning how things work:

(1) It only works on one architecture. Even this architecture may change optimal alignment requirements in future, which glibc would transparently handle for you.

(2) It's very complex and error-prone compared to using the pthread_* APIs (and they themselves are not exactly easy to use correctly).

This article is a good overview of why using the futex syscall directly is hard: https://lwn.net/Articles/823513/

(3) You're missing future bug fixes and optimizations.

(4) The kernel may move to a different API in future and glibc will handle that for you, but this code will still be using the old API.

The kernel has changed its threading APIs, the clone API, and how syscalls are made several times in the past, and you cannot predict how they will change in future.


> The kernel has changed its threading APIs, the clone API, and how syscalls are made several times in the past, and you cannot predict how they will change in future.

I thought "don't break userspace" was one of the key principles of Linux design?


The old APIs continue to be supported, but they may be slower or have other limitations. E.g., you can continue to make syscalls through int $0x80 if you want, even on the latest x86, but performance will suck. You can keep using clone instead of clone3, but you won't get to use all the newer features.
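To make the int $0x80 versus syscall point concrete, here is a minimal sketch of issuing a system call with the newer x86-64 syscall instruction from C. The name raw_syscall0 is a hypothetical helper, not something from the article. The inline assembly assumes GCC or Clang on x86-64; on any other architecture the sketch simply falls back to libc's syscall().

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_gettid and friends (per-architecture numbers) */
#include <unistd.h>        /* syscall(), used for the non-x86-64 fallback */

/* Hypothetical helper: issue a system call that takes no arguments. */
static long raw_syscall0(long nr)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 convention: syscall number in rax, result returned in rax.
       The syscall instruction itself clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Assumption: on other architectures, let libc pick the entry sequence. */
    return syscall(nr);
#endif
}
```

Calling raw_syscall0(SYS_gettid) should return the same thread ID that syscall(SYS_gettid) reports. The legacy int $0x80 path would need the 32-bit syscall numbers and register convention instead, which is part of why it is best left alone.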


So your code will keep on working as well as it ever did. I fail to see the problem with that.


Nothing guarantees that. clone() may still be provided by the kernel but nothing guarantees it will still be fast and not just a compatibility retrofit. Particularly since most users will move on automatically through libc.


> (1) It only works on one architecture. Even this architecture may change optimal alignment requirements in future, which glibc would transparently handle for you.

A portable solution (on any POSIX system, actually) for the stack switching would be to jump to the new stack using sigaltstack and initialize a sigjmp_buf context (i.e. setjmp) from the signal handler. After calling clone you conditionally invoke longjmp based on the return value. This works because setjmp and longjmp are compiler builtins; but if not, use __builtin_setjmp, __builtin_longjmp. (You can pass the sigjmp_buf context using the same trick mentioned in the writeup. And of course this also assumes the compiler won't emit stack-dependent code between the syscall and longjmp.)

You still have the problem of implementing the syscalls, though. But maybe if we scrounge around we can find solutions. On some architectures the vdso Linux maps into the process address space has a syscall implementation. And maybe there are similar compiler builtins for other architectures, or macros in the Linux header files with the necessary inline assembly.

As a learning exercise this type of stuff is fun ;)


setjmp and longjmp are not normally compiler builtins, just library functions. They are also quite different from the __builtins that you mention, sometimes in quite surprising ways. Actually, compilers are very buggy when you use these functions, even if you do follow their quite strict rules for what is allowed. Calling longjmp in a signal handler is technically undefined behavior in POSIX and triggers issues on some platforms that I have had the displeasure to run into. Recently, various hardening options in the definitions of setjmp make it detect these sorts of attempts and abort the program. So I wouldn’t call all of that very portable, but I agree it is fun!


Yup. Additionally, if you need portability (i.e. don't want to depend on the system libc) you may use Gnulib [0].

[0]: https://www.gnu.org/software/gnulib/manual/html_node/Multith...


The recently published book Rust Atomics and Locks by Mara Bos has an interesting section on why Rust uses the futex syscall directly.

> ... However, there are a few issues with that, as this pthread type was designed for C, not for Rust.

> In Rust, we move objects around all the time. ... The pthread types we discussed do not guarantee they are movable, which becomes quite a problem in Rust. Even a simple idiomatic Mutex::new() function is a problem: it would return a mutex object, which would move it into a new place in memory.

> A solution to this problem is to wrap the mutex in a Box. By putting the pthread mutex in its own allocation, it stays in the same location in memory, even if its owner is moved around. This is how std::sync::Mutex was implemented on all Unix platforms before Rust 1.62.

> The downside of this approach is the overhead: every mutex now gets its own allocation, adding significant overhead to creating, destroying, and using the mutex. Another downside is that it prevents the new function from being const, which gets in the way of having a static mutex. Even if pthread_mutex_t was movable, a const fn new could only initialize it with default settings, which results in undefined behavior when locking recursively. There is no way to design a safe interface that prevents locking recursively, so this means we’d need to make the lock function unsafe to make the user promise they won’t do that.

> A problem that remains with our Box approach occurs when dropping a locked mutex. ... pthread specifies that calling pthread_mutex_destroy() on a locked mutex is not guaranteed to work and might result in undefined behavior. One work-around is to first attempt to lock (and unlock) the pthread mutex when dropping our Mutex, and panic (or leak the Box) when it is already locked, but that adds even more overhead. These issues don’t just apply to pthread_mutex_t, but to the other types we discussed as well. Overall, the design of the pthread synchronization primitives is fine for C, but just not a great fit for Rust.


To make it explicit, #4 is basically making the case for why Linux setting the syscall interface as the stable API boundary is a poor approach.


What alternative is there? Linux is a more minimalist OS than old-school UNIX, and libc as a userspace construct allows more evolution in that space (see glibc vs musl vs uclibc).

I could see an argument for small shims injected by a vDSO (AIUI this is what Fuchsia does), but clearly the 80s-style "libc is the OS boundary" is obsolete.


A user-mode library bundled with the kernel (like libc, but doesn't have to be libc) is the alternative to the syscall boundary. You maintain the library as the stable interface for the user, and let the syscalls change as you please.


Yes, I understand the mechanism, but why would you want to structure your OS like that?

You either have to maintain a stable syscall ABI, or a stable userspace function ABI, and it's not like one is materially more difficult than the other. If anything, the Linux approach has proven to be superior -- glibc has had multiple breaking changes over the years and to this day users will resort to Docker to run binaries that depend on "old" glibc, but programs written against the Linux syscall ABI twenty years ago will run fine to this day.

Even if you decide to have a userspace trampoline/shim, it doesn't make sense for that to be libc. The C standard library is huge and requiring userspace programs to link it regardless of implementation language does nothing except add forty years of bad ideas into your process's address space.


Multiple organizations ship Linux kernels which do not support the original x86-64 system call ABI at all. Old applications only keep running because these applications are dynamically linked against glibc, and the system glibc version is new enough to use the new system call ABI. Some Linux subsystems are curiously exempt from the ABI stability, or different ABIs can be provided through compile-time or system configuration. Either way, as an application developer, you cannot be sure which ABI will be available.

On the glibc side, it's challenging for us upstream developers because a lot of people assume that we do not aim to provide backwards-compatibility, so they do not bother reporting compatibility issues. I suspect this perception is there because one of the backwards compatibility mechanisms we use prevents running binaries built on newer systems on older systems. (New programs use a new symbol with a backwards-incompatible change, old programs get the old implementation.) But that is not actually about backwards compatibility, it's requesting forward compatibility.

The goal is to require recompilation only in limited scenarios: static libraries and object files not yet fully linked, deliberate dependencies on undocumented internals (e.g., internal struct offsets), and dependency on behavior that is not standards-conforming (e.g., non-sticky EOF on stdio streams). And in the latter case, it's often possible to add a kludge to maintain backwards compatibility.


> If anything, the Linux approach has proven to be superior -- glibc has had multiple breaking changes over the years and to this day users will resort to Docker to run binaries that depend on "old" glibc, but programs written against the Linux syscall ABI twenty years ago will run fine to this day.

Windows takes the opposite approach (except that the libraries are gdi32/kernel32/ntdll/user32/etc.), and it's definitely an example of backwards-compatibility done well.


I think the case of Windows is slightly different. Its architecture has the same backwards compatibility challenges of GNU libc, but where GNU took a user-hostile approach ("just recompile"), Microsoft invested thousands of person-years to mitigate and resolve issues as they were discovered.

Most folks reading this are likely familiar with Raymond Chen's blog posts about compatibility shims for older binaries, or stories like the Windows 95 work around for memory-management bugs in SimCity. That's the kind of thing that happens when you design an OS ABI with problems, but are committed to bearing the cost of those problems yourself rather than passing them along to users.

I'm sure if you asked today's Windows team if they'd have a different approach, they'd rattle off a half-dozen great ideas for backwards-compatible syscall ABIs. There's been lots of research in that area since the foundations of our current OS ecosystem were laid down 30 years ago.


What approach would be better?


I think competing products define their interface in terms of calling a provided C library, and leave the entirety of the system call mechanics as an implementation detail.


For context, this is how it worked in old-school UNIX. If you look at (for example) SunOS, the kernel ABI is undocumented and everything in userspace depends on libc. The OSes and userspace languages of the day were deeply coupled -- C was the language of UNIX.

You can still see the remnants of this design in macOS, except instead of libc it's libSystem. Apple has decoupled their OS from C (in favor of Objective C, then Swift) but the idea of a userspace entry point to system functionality survived.

Linux is different from traditional UNIX because it's a kernel-only OS. There's no "Linux libc", and third parties have written multiple implementations of the C standard library that run on Linux. The only way this structure can work is for the kernel itself to provide stable ABI, and the standard approach at the time (mid 90s) was numeric syscall codes with positional parameters.


Yeah, and the particular old-school UNIX I had in mind was Win32. :)


OpenBSD does (did?) require all syscalls to originate from libc as well.


> Even this architecture may change optimal alignment requirements in future, which glibc would transparently handle for you.

Funny. “Optimality” is not a word I’ve often heard applied to glibc.


If you think glibc pthreads are not optimal then you could supply a patch and transparently fix literally thousands of applications and tens of millions of users. Otherwise I trust the glibc developers I work with who are extremely smart and focused on performance.


I highly recommend you take a look at https://webkit.org/blog/6161/locking-in-webkit/ then. It’s been known for a long time that pthread mutexes are very expensive and terrible. Unfortunately you can’t fix it AND maintain back compat (and you might lose technical compliance with POSIX). But that doesn’t mean that there aren’t drastically better locking primitives, and it’s the epitome of hubris to think glibc is a great codebase. It’s just widely used and most people really don’t want to dive into writing their own libc. Literally every high performance application I’ve seen uses its own locking primitives, DNS alternatives etc. The only exception is when the locking isn’t a critical bottleneck.

Examples of the same concept in different contexts: WTF lock, Folly locks, Abseil locks, Rust mutexes.

I have more interesting projects to work on than glibc maintenance.


The reason WebKit rolls their own locks is that they have specific needs that they can optimize around, which they do quite well. But a two-bit lock isn’t actually capable of supporting recursive locking, or fairness, or all the other things that people expect from a pthread mutex. You could say that most people don’t need those things and could get away with a more optimized lock, and you’d be right, but glibc supports the general case and does a fairly good job, so it’s a reasonable choice except for high-performance applications which end up rolling their own.


You don't need a larger mutex for fairness. For recursive locking, maybe. But most mutexes aren't recursive.

That's one of the big problems with pthread mutexes: they try to pack too much functionality into a single type. A pthread mutex can be recursive, so every mutex needs space to store a recursion count, even though most mutexes aren't recursive. A pthread mutex can have an arbitrary priority ceiling (pthread_mutexattr_setprioceiling), so every mutex needs space to store the priority ceiling, even though most mutexes don't set that attribute. A pthread mutex can be "robust" (pthread_mutexattr_setrobust), which at least under glibc means that every mutex needs previous/next pointers in it so it can be stored in a per-thread linked list, even though most mutexes are not robust. That's an excessive level of generality, when different kinds of mutexes could have just been different types!

The other big problem with pthread mutexes is just that they are old. I could be wrong, but my impression is that much of the interest in smaller mutexes has come about relatively recently, at least compared to the age of these APIs. It might be possible to shrink pthread_mutex_t to some extent despite the aforementioned issues, but it's impossible to do so on any existing operating system (at least on existing architectures) because changing the size of pthread_mutex_t would break ABI.
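The size argument is easy to check for yourself. This is only a sketch: the two helper names are made up for illustration, and the exact number you get for pthread_mutex_t depends on the libc and architecture (on x86-64 glibc it is typically 40 bytes), while a futex always operates on a single 32-bit word.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes occupied by the general-purpose POSIX mutex. */
static size_t pthread_mutex_bytes(void)
{
    return sizeof(pthread_mutex_t);
}

/* Hypothetical helper: bytes needed for a bare futex word,
   which is the minimum a futex-based lock has to store. */
static size_t futex_word_bytes(void)
{
    return sizeof(uint32_t);
}
```

Printing both values on your own system shows how much room the recursion count, priority ceiling, and robust-list pointers cost every mutex, whether or not it uses them.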


Another point about the glibc implementation of pthread_mutex_t (and IIRC even the glibc-internal lll_t, which implements a simple mutex as a few lines of assembly involving futex(2)) is that on 32/64-bit platforms the layout of the structure is compatible between the 32-bit and 64-bit ABIs. The reason is that a pthread mutex can have the pshared attribute, and it is also used in the implementation of higher-level POSIX IPC primitives (a sane implementation of all of which involves just placing the particular synchronization struct into shared memory rather than somehow translating it into SysV IPC).


You'd need extra space to store the fact that the mutex is fair rather than unfair. (I agree with you that the API tries to cram too much into one interface, but that's really on POSIX rather than glibc.)


Look at the benchmarks. The fairness locks perform worse than if you didn’t have any fairness because you’re wasting a huge amount of CPU cycles trying to create that fairness.

Recursive locks are generally acknowledged as a terrible design pattern.

You don’t get to simultaneously claim that glibc has an optimal lock implementation and that high performance applications should implement their own. That implies the glibc implementation isn’t actually optimal because you’re paying a penalty for features that aren’t actually necessary.


Sounds like a problem with POSIX rather than glibc. Nothing about glibc has stopped WebKit from using their own locking implementation alongside, or it could be supplied as a library (libraries after all provide a vast array of Linux features which operate above or alongside libc). You could even work through the POSIX committee to get lighter-weight locks approved, then have them added to glibc, and then work to modify existing programs to use them. So I don't see how this relates to the supposed inefficiency of glibc's pthreads implementation.


Nothing is stopping glibc from shipping a non-POSIX extension to locks / doing that standardization. Like I said, I have more interesting things to spend my time on. In my eyes POSIX is an ossified, dead standard that hasn’t shipped anything interesting or consequential in a very long time.

Pthreads are fine as a default for most applications, but then most applications shouldn’t be using threading primitives anyway, and we should be using higher levels of abstraction that use better locking primitives under the hood.


If this weren't horribly mislabeled, it'd be a great behind-the-scenes article.

> Practical Libc-free threading on Linux

> This article will demonstrate a simple, practical, and robust approach to spawning and managing threads using only raw system calls.

This is neither practical nor simple nor robust.

It is a good write-up on what happens behind the scenes when you set up threads in your language of choice with your library of choice. Great if you have that understanding.

Bring this into any kind of production code base (that isn't itself a threading library) and it's grounds for dismissal, because it makes very evident a severe lack of insight and judgement on what tools to use for a given problem at hand.


If you're writing applications exclusively for Linux, what's the practical difference between using libc and system calls?

Linux system calls are one of the most stable interfaces out there, maybe even more so than glibc. "WE DO NOT BREAK USERSPACE" and all that jazz.


Syscall numbers are different on a per-architecture basis and some architectures allow using the calling conventions of multiple architectures - something that a library will presumably abstract away.


syscall numbers may be different, but the names are not different.
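The names-versus-numbers point can be sketched like this. The function name getpid_by_name is made up for illustration; SYS_getpid comes from <sys/syscall.h>, which expands it to whatever number the target architecture uses (39 on x86-64, 172 on aarch64, 20 on 32-bit x86), so the source stays the same even though the number does not.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: symbolic name, per-architecture value */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: invoke getpid through its symbolic syscall name.
   The numeric value never appears in the source. */
static long getpid_by_name(void)
{
    return syscall(SYS_getpid);
}
```

This still goes through libc's syscall() wrapper for the actual entry sequence; a fully libc-free program would pair the symbolic number with its own per-architecture assembly.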

BTW, this is a pet peeve of mine. Having different syscall (and errno) numbers on different architectures is an historical mistake from Linux's first non-x86 port, the DEC alpha. My understanding is that Linus used DEC UNIX to bootstrap Linux, and needed to make Linux compatible with DEC's syscalls initially. Rather than make a new ABI for DEC UNIX binaries in advance, or clean up later, everything was left in place leading to a mess of 19 errno.h files in the kernel today.

For comparison, FreeBSD has 5. 4 of which are related to linux compat..


> leading to a mess of 19 errno.h files in the kernel today

Having "dirty" and "band aid" stuff could be the price Linux is paying to move faster, dominate the server market, and extend beyond the server market. One can even say UNIX is extinct; even Microsoft added WSL and WSA, where L is for Linux, not UNIX/POSIX, and A is for Android, which is sort of Linux too.

> For comparison, FreeBSD has 5. 4 of which are related to linux compat..

This sounds to me like FreeBSD is more "lean" or "clean" and maybe even less "agile", and this aligns well with what I've read and heard from others on the internet and in person. I don't have my own opinion here and cannot compare.

What I can compare, though, is FreeBSD implementing Linux compat/Linuxulator and not vice versa (and, as mentioned above, WSL), giving me the impression that Linux's/Linus' tactic (maybe even strategy?) worked better than others.

Would be interesting to hear your opinion.


Custom threads do not mix well with libc. Unless you can avoid the whole libc ecosystem, this won't end well.


It is hard to interface with other libraries if you avoid libc, unless you reimplement the whole libc ABI (thread locals for example).

Still, this stuff is very cool and I would love to have time to play with it.


Interfacing with other libraries is fine as long as those other libraries also avoid libc... though of course then you get the whole problem of everyone reinventing the various wheels libc offers. Hypothetically that would then lead to various libraries implementing all the basics springing up from the primordial soup of code and the community eventually deciding on some libraries that are "good enough" which become the de facto new standard.


Here is a version with support for aarch64 and more: https://github.com/kromych/tls-curious/blob/master/lib.c


But using non-standard C case-ranges: https://stackoverflow.com/questions/7043788/are-triple-dots-...

    case '0' ... '9':
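For reference, the same digit test can be written in standard C without the GNU case-range extension. This is a small sketch with a made-up helper name, not a patch against the linked file; it relies on the ISO C guarantee that the characters '0' through '9' have contiguous, increasing values.

```c
#include <stdbool.h>

/* Hypothetical helper: ISO C equivalent of `case '0' ... '9':`. */
static bool is_ascii_digit(char c)
{
    /* The C standard guarantees '0'..'9' are contiguous and ascending,
       so a simple range comparison is portable. */
    return c >= '0' && c <= '9';
}
```

Inside a switch, the portable spelling is ten explicit case labels, or an if on this helper before the switch.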


Standards are much like religions imo: they promise to give the eternal life but the details are tbd. I chose to be practical.


Cool, that was simpler to understand than I expected. Still unsure of the utility aside from understanding how things work.


The manic abuse by the glibc devs of the GNU symbol/module versioning makes Linux ABI stability extremely attractive (actually even ISO work has a smell of planned obsolescence nowadays).

This is a very good idea... since I decided to do exactly that for many of my personal projects and I started years ago.

That said, the "clone" syscall is the easy part. For newcomers, a gentle introduction to and explanation of the voluntary CPU-yielding "futex" synchronization primitive is much more important, or we'll end up with spinlocks everywhere. Often, you need 2 "futexes" to build higher-level synchronization primitives. One of the good things here is educational: it shows how expensive a CPU-yielding synchronization primitive is, since that cost is usually "hidden" by the POSIX API.
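As a rough illustration of what such an introduction would cover, here is a minimal sketch of a three-state futex mutex in the style of Ulrich Drepper's "Futexes Are Tricky". It is not the article's code. The type and function names are hypothetical, it goes through libc's syscall() wrapper for brevity, and it assumes Linux with C11 atomics. The uncontended path is a single atomic operation; the kernel is only entered when a thread actually has to sleep or wake someone.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall(), NULL */

/* state: 0 = unlocked, 1 = locked, 2 = locked and there may be waiters */
typedef struct { _Atomic uint32_t state; } futex_mutex;

/* Hypothetical thin wrapper around the futex system call. */
static long futex_op(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void futex_mutex_lock(futex_mutex *m)
{
    uint32_t c = 0;
    /* Fast path: 0 -> 1 with no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock as contended, then sleep until it is free. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        /* Sleeps only if state is still 2; otherwise returns immediately. */
        futex_op(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void futex_mutex_unlock(futex_mutex *m)
{
    /* Only pay for a system call if someone may be waiting. */
    if (atomic_exchange(&m->state, 0) == 2)
        futex_op(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```

A zero-initialized futex_mutex starts unlocked. The cost difference the comment mentions is visible in the structure: the fast paths never leave user space, while every contended lock or unlock pays for a trip into the kernel.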


Explain how to do locks without using the lock instruction!


futex will do lock instructions for you in kernel space. It's non optimal to just jump to kernel space every time, but it should still at least be correct.


A system call has even more overhead than the lock prefix!


True, but arguably this article constrains itself to the kernel-user boundary, where atomic instructions could be considered out of scope considering how the Linux syscall ABI is constructed.



Does xchg count?


That's an implicit lock.
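For anyone wondering what that looks like from C: a test-and-set spinlock built on an atomic exchange contains no explicit lock prefix in the source, but on x86 compilers typically emit xchg with a memory operand, which the CPU treats as locked. A minimal sketch with hypothetical names, assuming C11 atomics:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* held: 0 = free, 1 = taken */
typedef struct { atomic_int held; } spinlock;

/* Returns true if the lock was acquired. The exchange is the whole trick:
   it writes 1 and reports what was there before in one atomic step. */
static bool spin_trylock(spinlock *s)
{
    return atomic_exchange_explicit(&s->held, 1, memory_order_acquire) == 0;
}

/* Busy-waits until the lock is acquired; burns CPU instead of yielding. */
static void spin_lock(spinlock *s)
{
    while (!spin_trylock(s)) {
        /* spin */
    }
}

static void spin_unlock(spinlock *s)
{
    atomic_store_explicit(&s->held, 0, memory_order_release);
}
```

So it avoids writing the prefix, not the locked bus or cache operation, and it never yields the CPU the way a futex wait does.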



