Basics of Futexes (thegreenplace.net)
115 points by ingve on July 13, 2018 | 17 comments



An important aspect of futexes, which the article mentions only in passing, is that the kernel also checks a condition atomically with the wait operation; if the condition isn't satisfied, the wait is not performed. This avoids a race where one thread unlocks the mutex right after another thread has invoked the system call but before it has actually gone to sleep.

Effectively, inside the kernel the algorithm is

    lock wait queue
    if (*var == val)
       add myself to wait queue
       unlock wait queue
       wait until queue wakes me up
    else
       unlock wait queue
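For illustration, here's a minimal userspace sketch (my own, not the article's exact code) of why that kernel-side check matters: FUTEX_WAIT(addr, val) sleeps only if *addr still equals val, so a wake that lands between our own load and the syscall is not lost.

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Block until *addr is observed to differ from val.
    static void wait_until_changed(uint32_t *addr, uint32_t val)
    {
        while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) == val) {
            // If another thread changed *addr after our load but before this
            // call, the kernel's atomic re-check makes FUTEX_WAIT return
            // immediately instead of sleeping forever.
            syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
        }
    }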


Thanks for mentioning this, it's indeed one of the interesting subtleties of futex().

In my defense, I explicitly say in the post that it's not a replacement for the man page and suggest the readers at least go through the descriptions of FUTEX_WAIT and FUTEX_WAKE. The former explains this behavior in the very first sentence.

Moreover, the post mentions this behavior twice:

1. When explaining spurious wakeups in the wait_on_futex_value

  [...] FUTEX_WAIT will immediately return if the
  value at the futex address is not equal to val. In our case this can
  happen if the child issued a wait before the parent wrote 0xA, for
  example.
2. In a comment inside Mutex::lock()

   // We're here when either:
   // (a) the mutex was in fact unlocked (by an intervening thread).
But your point on avoiding an insidious deadlock is spot on.


Indeed it is mentioned. I wrote the comment because repeating the check in the kernel on a user-provided value is what distinguishes futexes from a condvar. Of course mutex+condvar+int is not a replacement if you're implementing pthreads, but it's also much less efficient than futexes if you're implementing a synchronization primitive not provided by pthreads.

(The most common use of futexes that I have seen outside glibc is to implement binary semaphores. Binary semaphores can be a lighter-weight alternative to condition variables when only a single thread can ever wait on an event).
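For concreteness, here's a rough sketch of what such a futex-based binary semaphore might look like (my own simplification, not glibc's code; assumes GCC/Clang atomic builtins and the raw syscall):

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    typedef struct { uint32_t avail; } binsem_t;   // 0 = empty, 1 = token available

    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    // Make the token available and wake the (single) waiter, if any.
    void binsem_post(binsem_t *s)
    {
        __atomic_store_n(&s->avail, 1, __ATOMIC_RELEASE);
        futex(&s->avail, FUTEX_WAKE, 1);
    }

    // Consume the token, sleeping in the kernel while it's unavailable.
    void binsem_wait(binsem_t *s)
    {
        for (;;) {
            uint32_t expected = 1;
            if (__atomic_compare_exchange_n(&s->avail, &expected, 0, 0,
                                            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                return;
            futex(&s->avail, FUTEX_WAIT, 0);   // sleep only while avail is still 0
        }
    }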


In my experience futexes are primarily used to confound valgrind. Damn you, libgomp (yeah, I know you can recompile GCC to use pthreads in its OpenMP implementation).


Here's the original implementation of that idea, from the early 1970s.[1] UNIVAC EXEC 8 had this. See section 4.3.4, page 148, "Test and Set Queuing".

This OS had threads, decades before UNIX got them. They were called "activities". These were also multiprocessor systems. So it had to have user code synchronization. The UNIVAC mainframes had a "test and set" instruction. This was an atomic operation which set a flag. If the flag was already set, an interrupt occurred, so the OS could reschedule.

This could be used in a spin mode, where hitting a locked lock just rescheduled the activity for later. Or the user could give their threads IDs and manage them with system calls, like UNIX pthreads. The third option, test and set queuing, was added later. There was a list of activities in user space associated with each lock, and when an activity was blocked, it went on the user-space queue for that lock. The "clear test and set" operation was an assembly language macro, which took the first waiting activity off the list and told the OS to unblock it.

The normal case of taking the lock without delay did not involve the OS at all. Even pthreads works that way in some implementations, as it should.

[1] http://bitsavers.trailing-edge.com/pdf/univac/1100/exec/UP-4...


It might be worth mentioning that MS Windows has a very similar (though somewhat distinct) function, WaitOnAddress https://docs.microsoft.com/en-us/windows/desktop/api/synchap... , introduced in Server 2012 and Windows 8. For details see https://blogs.msdn.microsoft.com/oldnewthing/20170601-00/?p=...
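For comparison, a rough sketch of the analogous pattern with that API (assumes Windows 8+ and linking against Synchronization.lib; names are mine):

    #include <windows.h>

    volatile LONG flag = 0;

    // Block until `flag` changes away from 0 (analogous to FUTEX_WAIT).
    void wait_for_flag(void)
    {
        LONG undesired = 0;
        while (flag == undesired) {
            // Sleeps only if flag still equals `undesired`; may wake spuriously.
            WaitOnAddress(&flag, &undesired, sizeof(flag), INFINITE);
        }
    }

    // Change the value and wake one waiter (analogous to FUTEX_WAKE with val=1).
    void set_flag(void)
    {
        InterlockedExchange(&flag, 1);
        WakeByAddressSingle((PVOID)&flag);
    }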


One important thing to remember about futex is that it is not a lock, despite its name being similar to mutex. Futex is a conditional sleep mechanism: the kernel wakes up the caller when the futex flag's state changes. When you are woken up and return from the futex() call, you don't get the lock. It just means the state of the lock has changed and it's time to check and attempt to acquire the lock again. In fact, multiple threads sleeping on the futex flag can be woken at the same time (depending on the FUTEX_WAKE command's parameter), and all of them need to do a compare-and-set to acquire the lock.


>"When awaken and returns from the futux() call, you don't get the lock.

Can you elaborate on this? That sentence isn't reading well for me.

>"Actually multiple threads sleeping on the futex flag can be awaken at the same time (depending on the FUTEX_WAKE command's parameter). All of them need to do compare-and-set to acquire the lock."

So whatever thread the kernel schedules first will get the lock? I.e., there is no fairness or weighting? Does this potentially cause a "thundering herd" when they're all woken at the same time?


Essentially, to implement a lock in user mode, you do:

    while !compare_and_swap(&flag,0,1) // check if flag is free (0), lock it (1)
        futex_wait(&flag,1)            // sleep while the flag is still locked (1)
When futex_wait() returns, it just means the flag has been set back to 0 and the kernel has woken me up so I can check it again. Another thread can come in and grab the lock in the meantime; that's why I need to loop back and check again. The atomic CAS operation is what actually acquires the lock.

In the old days it was just a busy loop running the CAS check continually, which is expensive.

    while !compare_and_swap(&flag, 0, 1)
        // busy loop running
With futex(), the lock-acquiring threads can go to sleep and wake up with the help of the kernel.

Yes, multiple threads can be woken up; the nr_wake parameter in futex_wake [1] dictates how many threads to wake. Yes, you can get a "thundering herd" problem. For fairness, see the priority usage in my other comment https://news.ycombinator.com/item?id=17526849

[1] https://elixir.bootlin.com/linux/v4.18-rc4/source/kernel/fut...
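To make the above concrete, here's a minimal C sketch of such a lock using the raw futex syscall and GCC/Clang atomic builtins (a simplified illustration, not production code; a real implementation, as in Drepper's "Futexes Are Tricky", uses an extra state to skip the FUTEX_WAKE syscall when there are no waiters):

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static uint32_t flag = 0;   // 0 = unlocked, 1 = locked

    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void lock(void)
    {
        uint32_t expected = 0;
        // Try to flip 0 -> 1; on failure, sleep until the kernel wakes us.
        while (!__atomic_compare_exchange_n(&flag, &expected, 1, 0,
                                            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            // Sleeps only if flag is still 1, so an unlock that races with
            // this call just makes it return immediately.
            futex(&flag, FUTEX_WAIT, 1);
            expected = 0;   // the failed CAS overwrote `expected`
        }
    }

    void unlock(void)
    {
        __atomic_store_n(&flag, 0, __ATOMIC_RELEASE);
        futex(&flag, FUTEX_WAKE, 1);   // wake one of the sleeping waiters
    }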


Thanks for clarification and the examples. This makes sense. Cheers.


Nice! I have fond memories of studying the mechanics of futexes when PREEMPT_RT was under heavy development.

It's been a long time since I thought about the kernel futex implementation, but IIRC, one of the most interesting aspects was that the kernel would actually navigate a linked list that existed entirely in userspace. I believe this was called, or related to, robust futexes [0].

[0]: https://www.kernel.org/doc/Documentation/robust-futexes.txt


Does anyone know how "fair" the current Linux implementation of futexes is? I experimented with them soon after they were introduced, and at that time there was a great deal of favoritism: under heavy contention a thread/core that had just given up a lock was most likely to reclaim it, and other threads/cores would be starved. This isn't always a bad thing (and can sometimes even help performance) but didn't work well for my application.


Most of the check to acquire the lock happens in user mode. It favors the thread in the RUNNING state that still has quantum left. For example, if a thread is running, has the lock, does its work, releases the lock (which wakes up other sleeping threads, but they are only RUNNABLE, not RUNNING yet), and then immediately loops back to acquire the lock again, it will succeed because the lock is free. When that running thread's quantum runs out and it is put to sleep (still holding the lock), the other awakened threads run, but the lock is not free, so they promptly go back to sleep to wait for it. The old running thread wakes up and continues the cycle.

If you want fairer lock acquisition, you need to use locking priority. Just lower the running thread's priority when it finishes its work. The other sleeping threads with higher priority will have a chance to get the lock.
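A hypothetical illustration of that suggestion (my own sketch, not from the thread; assumes a fixed-priority scheduling policy such as SCHED_FIFO, where pthread_setschedprio() applies):

    #include <pthread.h>

    // After finishing its work, the lock holder drops its own priority so
    // that a woken, higher-priority waiter wins the next race for the lock.
    void step_aside(int low_prio)
    {
        pthread_setschedprio(pthread_self(), low_prio);
    }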


> Just lower the running thread's priority when it finishes its work. The other sleeping threads with higher priority will have a chance to get the lock.

What happens if, immediately after the running thread lowers its priority, it is descheduled? Is it still holding the lock at that point? Could it remain descheduled at low priority for a relatively long time while the other threads wake up and fail to acquire the lock (and also while any other unrelated threads go about their unlowered-priority business)?

Why wouldn't the workless thread just give up the lock (and possibly self-deschedule) when it has finished its work? If it is then assigned more work it would have to be scheduled and when running again contend for the lock, which is presumably what one wants, right?


You would release the lock first before lowering the priority.

As a separate issue, the priority inversion problem is addressed with the priority inheritance technique, which futex supports directly. Basically, the priority of the thread holding the lock is boosted to the highest priority among the threads sleeping on the lock, making sure it can run to completion. When the lock is released, the thread's priority is lowered back to its original level. All the xx_pi() APIs are the priority-inheritance versions.
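As a sketch of how that's typically reached from user code (my example, using the pthreads API, which glibc implements on top of the PI futex operations):

    #include <pthread.h>

    // Initialize a mutex that uses priority inheritance; on Linux/glibc this
    // maps to FUTEX_LOCK_PI / FUTEX_UNLOCK_PI under the hood.
    void init_pi_mutex(pthread_mutex_t *m)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }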

There are a number of ways to ensure fairness for lock acquisition; however, performance often suffers, so people just live with the problem and design the workload around it.


At the end of the piece the author states:

>"The Go runtime does not use libc, in most cases ..."

Can someone say in which cases the Go runtime elects to use libc?


It depends on some environment variables and also compile-time options. Basically, it mostly uses libc for some network operations like hostname resolution, and a few other things like user/group resolution. However, it defaults to the pure Go implementations for most things where possible.



