Implementing simple cooperative threads in C (brennan.io)
304 points by brenns10 on May 24, 2020 | 86 comments



FWIW, the original implementation of Java (when it was known as Oak) used this trick to manage user-level threading. There was also some code to do the stack swap; since you were in a virtual machine anyway, that stuff was all there to be adjusted.

Random meta note, brenns10: the idiom for giving up the processor is 'yield' (rather than 'relinquish'); it is common in cooperative systems.

That said, the code in Java that switched threads was called the 'trampoline' code since you bounced in, switched threads and bounced out.

And my last comment is that if you set up your scheduler in a way that it has a 'runnable' and a 'sleeping' queue, you can do clever things like have a task sleep on a mutex or other condition variable and wake up when that variable does what you're waiting for. It makes implementing network services that service multiple clients much cleaner.


> FWIW, the original implementation of Java (when it was known as Oak) used this trick to manage user-level threading. There was also some code to do the stack swap; since you were in a virtual machine anyway, that stuff was all there to be adjusted.

I never knew Java did this. I figured it would be trivial for most "interpreted" languages (via direct interpreters or bytecode) to implement any kind of threads (even pre-emptive), because at any point the interpreter could decide to save the state of one language thread and switch to another, without bothering the interpreter's own stack.

> the idiom for giving up the processor is 'yield' (rather than 'relinquish'); it is common in cooperative systems

Makes sense, I had been using relinquish for my other projects but I almost named it yield this time around.

> And my last comment is that if you set up your scheduler in a way that it has a 'runnable' and a 'sleeping' queue, you can do clever things like have a task sleep on a mutex or other condition variable and wake up when that variable does what you're waiting for. It makes implementing network services that service multiple clients much cleaner.

I'm really excited to tweak large parts of this, scheduler included! I'm really interested in allowing my scheduler to use `poll()` to determine the readiness of blocked tasks, and yeah, unblocking tasks for purposes like coroutines or mutexes or condition variables. This was the absolute minimum scheduler which would form a foundation to write about, and I was really excited to write about the topic and share it.


> I figured it would be trivial for most "interpreted" languages to implement any kind of threads, because at any point the interpreter could decide to save the state of one language thread and switch to another, without bothering the interpreter's own stack.

Some language interpreters keep the "virtual" stack and the C stack separate like this, including current versions of Lua. As you suggest, this allows Lua to support coroutines which each have their own virtual stack.

However other interpreters, such as CPython, keep both stacks synchronized. In this case, a call in the interpreted language corresponds to a call in C - recursively back into the main interpreter loop. This is why Python's generators do not have independent stacks.

The awfully-named "Stackless Python" [0] breaks the synchronization between stacks, to allow Python to support stackful coroutines.

At least, the above was my understanding until recently. The upcoming Lua 5.4 changes to using a recursive call [1], like CPython, but still supports stackful coroutines. I don't know how this is implemented.

[0] https://github.com/stackless-dev/stackless

[1] https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...


I was curious about these changes to Lua 5.4, so I had a look. When you create a Lua coroutine that allocates 200 Lua stack frames and then yields across them, the first longjmp crosses 200ish C stack frames. When you later resume, all those C stack frames are gone, but there wasn't any important state in them! Lua just calls luaV_execute in a loop[0]. (Trampolining?) So at that point, for that coroutine, the lower 200ish Lua stack frames do not have corresponding C stack frames, but any Lua subroutines you then call will have corresponding C stack frames.

[0] https://github.com/lua/lua/blob/9b7987a9d1471ba94764286b28e0...


I think he was referring to this https://en.wikipedia.org/wiki/Green_threads

BTW mid-grey text on yellow makes your great article difficult to read. I've had to open it in eww to read it. Worth it though, thanks!


Yep.. green threads were there, and not every platform had preemptive threading.

J2ME only had green threads iirc.


Two observations about "unsafeness" of your threading system:

1. As a matter of paranoia I would likely write schedule() entirely in assembly -- it's a simple function and if you write it by hand you can be certain the compiler doesn't "cleverly" optimize something that will cause pain across the context switch.

2. Your 16K stacks are allocated straight out of the heap, which means stack under or overflows are going to stomp all over adjacent allocations (possibly other stacks), and be a pain to debug. You'll likely want to look at using mmap() and friends to allocate stacks with guard pages to catch this. You can also just range-check the SP when context switching (maybe as an optional debug build thing) which may catch things before a lot of damage is done.


It's a bit difficult to get guard pages working together with 2MB pages, though.

And in today's age of swap-less systems, you typically don't want to pollute the TLB with loads of 4k pages.


> I never knew Java did this. I figured most "interpreted" languages (via direct interpreters or bytecode) would be trivial to implement any kind of threads (even pre-emptive), because at any point the interpreter could decide to save the state of one language thread and switch to another, without bothering the interpreter's own stack.

I'm more or less sure that the original Oak/Java did cooperative threads only on the VM level, by simply swapping the interpreter state without any C-level context switching. But it was still cooperative, and threads had to yield or block on something. Also, the scheduling was simple round-robin with different static priorities, and the highest-priority thread would always run unless it was blocked.

On the other hand, in the 90's many large portable software packages had their own userspace green-thread implementations with hand-written assembler context switches, because many OSes did not have threads, and those that did had wildly different and incompatible implementations. This was done by both Netscape/Mozilla (e.g. NSPR) and AFS (where IIRC the threading implementation is somewhat deeply integrated with the RPC mechanism used). I would not be surprised if one could find remnants of something similar in the LibreOffice codebase.


Is this why java.lang.Thread still has the yield() method?

All Java SE implementations I've ever seen use (preemptive) OS threads, so I assumed it's been this way from the beginning, and this yield() never made much sense to me.


It still makes sense if you implement spinlocks and other lock-free data structures yourself and want to give other threads a chance to run. But that's more of an expert use case - typical libraries and applications shouldn't use it.


Giving a chance to threads/processes with the same priority to be scheduled without doing any I/O or syscall? Used this once.


IIRC Java didn't support native threads until 1997 or so.


> the idiom for giving up the processor is 'yield' (rather than 'relinquish'); it is common in cooperative systems.

Don't tell the Python programmers...


the idiom for giving up the processor is 'yield'

Back in the old 8-bit days we called it “sleep” since you were telling the scheduler if you needed to come back ASAP (i.e. sleep(0)) or N number of microseconds in the future.


And now they've come full circle, offering coroutines in userland java code again.


I've done this before [1], many years ago. At the time, I had never heard of setjmp or longjmp. I understood task switching as 'push all registers onto the stack, change stacks, pop all registers off the stack', and so that's what I wrote for the task switching routines using gnu assembler. Making it pre-emptive was a matter of calling it periodically. I wasn't sure about timers either so I ended up hooking the timer interrupt handlers directly. You could use either irq0 timer or irq8 timer. I recall getting my dos boxes to 'multitask' inside windows by hooking irq8.

[1] - https://github.com/zenakuten/lwp20


Interesting to compare and contrast with Simon Tatham's method of implementing Co-Routines in C:

https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Submitted here: https://news.ycombinator.com/item?id=23293835

That's decades old, and actually used in PuTTY.


Note that SGT's coroutines are stackless, whereas this article's coroutines have their own stack.

It's also worth noting that you don't need asm to change the stack pointer, if you don't mind being imprecise about its exact value by the size of a function invocation frame or so: you can use alloca() or variable-length arrays to do arithmetic on the stack pointer! https://fanf.livejournal.com/105413.html



That's really cool, and much lighter weight for the purpose of coroutines!


Still looking for something to appropriately handle blocking and resuming without a full RTOS, but I do like both these options.



People just love re-inventing the wheel. I created my own implementation, back in the early 90s in Borland Turbo C and used it to implement a multi-terminal POS system for a busy bar. Did not know what the technique was called at the time though.

Then I joined a company in 96 and discovered that their software all ran on a custom in-house developed multitasking O/S (!). I immediately recognized the task switching mechanism and formally learned about coroutines.

Now here we are in 2020 and the youngins are still rediscovering the same thing.


And it will keep happening, because there are many more people starting programming each year than old people that can teach them.


And it's a Good Thing, since it fosters an inventive way of thinking around a problem rather than just reading/hearing that X exists, downloading that, and being done. It's part of learning and honing a skill.

This doesn't mean that the implementation should then be used in practice, there are likely other solutions that are better suited, more mature, etc.


I agree!


The main problems I've seen with cooperative threading (aka fibers) are threefold:

1. You need to decide how much stack space to allocate to a fiber beforehand. If you go over this limit, you will segfault.

2. Thread-local storage doesn't exist anymore, since you're potentially swapping between different kernel threads (and different fibers using the same thread will share storage).

3. Certain types of locks can now fail because the lock isn't transferred to the new thread upon being woken up. Other locks run into reentrancy problems here.

Does anyone know of good solutions to these problems? Is there a library that takes all these things into account?


1. Easy on a 64-bit system. Just allocate dozens of megabytes of virtual address space, neither readable nor writable. Mark a small proportion as readable and writable. In response to segfaults, grow the stack by marking even more memory as readable and writable. My toy implementation does this. It's very easy to do in about 20 lines of code.

2. You will need to reimplement thread-local storage yourself. A production-quality implementation should handle this for you. Thread-local storage just means having a special section of memory as a read-only template; upon creating the fiber you duplicate it into fresh pages, mark them as writable, and call constructors. And then remember to save the segment registers when switching. Tedious but doable.

3. You need your own locks. This is apparent even in high-level languages like Python. Notice how Python asyncio provides its own mutexes and semaphores? A production quality implementation will handle this for you, but you'll need to use them instead of the OS-provided ones.


I think there's a fourth problem that's even more dire: what do you do about blocking?

A blocking call stalls the whole system. It doesn't in a non-cooperative threading environment. Various languages have "solved" this by not allowing you to use blocking calls and sometimes add syntactic sugar like async/await, but this sucks. The point of having threads is to use them like threads. At least Go gets this right.


> Various languages have "solved" this by not allowing you to use blocking calls

That is the right solution. Do you seriously think in Go when you make an apparently blocking system call, an actual blocking system call is made? No, you are calling the runtime provided version that doesn't actually block. Of course in C/C++ that needs more discipline to do.


> Do you seriously think in Go when you make an apparently blocking system call, an actual blocking system call is made?

Who cares? From the perspective of the programmer, it's a blocking call.


Then how is a syntactic construct like "await blocking_call()" inferior? I fail to see your point. The await keyword even signals to you the fact that this call may be blocking the progress of the current fiber/coroutine.

And back to the original example, it's totally possible for a production implementation to handhold the user like this by, e.g. supplying a modified libc so that all the blocking system calls are linked against the runtime's implementation. That would basically achieve what you liked in Go.


> Then how is a syntactic construct like "await blocking_call()" inferior?

It's inferior because you have to actually specify it whenever you make such a call, rather than having the runtime or library be smart and figure it out.

> And back to the original example, it's totally possible for a production implementation to handhold the user like this by, e.g. supplying a modified libc so that all the blocking system calls are linked against the runtime's implementation.

I know. I didn't say it was intractable. Just like problems 1-3, they all have potential solutions -- they just involve hard work.


I see what you mean now. Omitting the keyword "await" is incredibly easy to do; simply do something different when the returned type is an awaitable. But the fact is that most languages haven't done this, and have you considered that perhaps language designers considered this issue and decided that explicit is better than implicit? And there are also a whole host of other issues too.


Regarding thread local storage - isn't that typically implemented using the segment registers (on x86 at least)? So you would need to vary those between fibers (instead of just between kernel threads) and your scheduler would have to explicitly save and restore them. I would guess that's not possible to do in portable C though.


(2) and (3) are sort of symptoms of trying to use the C runtime in what is definitely not a native C threading environment.

(1), Golang does something like this by probing on every function call and having movable stacks. Historically they used a linked list of stack pages, but gave that up in 1.4 because the performance was bad. The same does not work in C because C compilers do not expect pointers to the stack to become invalid while a stack frame is live.


Another interesting approach to lightweight cooperative threading is Protothreads [1][2]. Those are however stackless, and make use of Duff's device for the implementation rather than setjmp/longjmp.

[1] http://dunkels.com/adam/pt/ [2] https://en.wikipedia.org/wiki/Protothread


On GCC/Clang, pt uses goto and the “labels as values” GNU extension so it can avoid Duff’s device. This simplifies things, and allows yielding inside a switch statement.


A more common method of implementing this that avoids setjmp and longjmp is to code a little assembly to do the stack switch. This is typically faster and easier to port, see:

https://probablydance.com/2013/02/20/handmade-coroutines-for...


Faster I can believe, but how can custom asm be more portable than setjmp/longjmp which are standard since C89?

Although in this particular instance TFA also uses inline assembly to setup the stack, so it's moot.


I'm wrong, I seemed to remember many platforms not supporting them but a quick Google search reveals its just a foggy memory. I was probably thinking about ucontext actually.


Most if not all RTOSes swap threads by saving and restoring the stack and registers. In most of the ones I've seen, you can turn off preemption, and then you have cooperative threads.

It's much easier to deal with shared memory with cooperative threads at the expense of higher latency.


ucontext is cursed with the behavior of setting and restoring signal masks with each context switch, involving a kernel syscall. If there was a ucontext-lite standard API that did not mess with signal masks you could cheaply switch green threads in userspace.


I did something similar when I implemented coroutines for C (both 32- and 64-bit x86).

[1] https://github.com/spc476/C-Coroutines


I implemented something similar last year: https://github.com/kccqzy/green-threads/blob/master/green_th...

Actually implementing it was quite fun and I encourage every enterprising systems engineer to try. My implementation even supports growable stacks (up to a limit of course).


we used something similar when I was writing games in the early to mid 90's, but our yield function was simply called yield().

we were running on windows (3.11), macOS, and amiga (for tooling, the games didn't run on the amiga) with great success.

of course, we just used setjmp and longjmp, no assembly for portability (though there ended up being some assembly in one of the games, while macOS had good sound management, windows did not at the time and the audio mixing code ended up written in hand-rolled x86 assembly in order to perform well enough for the games to work on the hardware at the time).


DIV Games Studio, a language and IDE to develop DOS games in the late 90's, used coroutines in a similar fashion. It even had priority levels, signals, and hierarchies of coroutines. In DIV parlance, these coroutines were called "process", and yield() was called "frame()". I managed to remake this from scratch, using D fibers, for a pet/toy game engine that I'm writing.


I've always really liked the idea of cooperative multitasking: fewer gotchas, possibly safer memory access, and easier to control where/when a thread switches. of course in those days we had to deal with timing issues, couldn't let a cpu hog block another cpu hog :)


Cool! I wrote one of these in the same timeframe to run coroutines on a dos extender in 32bit with flat addressing.


I recall being taught a (fairly rudimentary?) explanation of the memory model that the available space is used from one side as the stack, and from the other side as the heap. Reading the article, I realised that memory management in a concurrent context is way more complex than that, but you don't go into a lot of details here (fair enough, that's not the main focus).

So each task gets 16k of stack space, but what if it runs out of it? These are allocated from the "main" heap, so they won't exactly run into each other, but might trigger UB or crash the program. This is something that a real scheduler (like the one in the OS's kernel) would have to deal with anyway, so how does, say, Linux, solve this? Giving a lot of stack space to each running thread would lead to fragmented and underused memory, or is it not a concern at the stack spaces you'd normally deal with?


> Giving a lot of stack space to each running thread would lead to fragmented and underused memory, or is it not a concern at the stack spaces you'd normally deal with?

That's exactly what happens, and it's not a problem in practice -- each thread gets 8 megabytes of memory allocated. (Note, due to demand paging, physical usage grows in 4kb increments.)


This style of coop multi-tasking was in Crash Team Racing on PS1 for a while. Unfortunately on a machine with only 2meg of ram it turned out to take too much memory and be too fragile. The stacks had to be small, add one printf for debugging and you'd overflow the stack. Eventually it was torn out.


I'm surprised you didn't pile on by making a printf which switches to the main stack, does the actual printf, and switches back.


printf was just an obvious example. Any function had this problem. You run, things work, and then under some condition the function you called calls something else and crashes. Or you try to add some functionality to a function in the system and suddenly the tasks crash because your edit used more stack.

The point isn't to dis this method. It's just to say it was frustrating with tiny stacks.


I see, that does sound tough.


What was it replaced with?


The typical

    for g of gameobject
      g.update()


This is a really great explanation. I believe that libdill [1] [2] [3] works this way but this gave me a much more thorough understanding.

[1] http://libdill.org/ [2] Call to setjmp: https://github.com/sustrik/libdill/blob/de7a917bc39756f61237... [3] Call to longjmp: https://github.com/sustrik/libdill/blob/de7a917bc39756f61237...


Many C implementations allow this without any assembly: https://en.m.wikipedia.org/wiki/Setjmp.h#Cooperative_multita...

(2007)


An alternative to implementing your own fibers is to use user-mode scheduled threads. This has advantages: these are real threads, so supporting things like thread-local storage is much easier, and it is much more forgiving when code uses blocking system calls. Unfortunately the only mainstream operating system that implements it is Windows: https://docs.microsoft.com/en-us/windows/win32/procthread/us...


Here's the best prior art I know of in this space: https://swtch.com/libtask/

I would not recommend taking this approach in C.


I wonder how hard it would be to try to handle the segfault and allocate stack space on demand for these "threads". Almost certainly stupid, but a fun experiment to see if it's possible...

EDIT: found a nice example of this here: https://rethinkdb.com/blog/handling-stack-overflow-on-custom...


The kernel does this for you, generally. You could use a segfault handler, stack probing, and a guard page to figure out when you're out of stack, but it's not clear how you grow the C stack if you're out of contiguous virtual address space.


This kind of context switching can be more easily achieved by using the old SysV APIs when working under *nix: http://man7.org/linux/man-pages/man3/makecontext.3.html


Before diving into something as complex as, e.g., FreeRTOS, can anyone recommend production-ready code (preferably in C) that shows off preemptive task scheduling using an external timer, including features like priorities and task delays?

The post is a good explanation regarding stack handling.


For a recently published comparison of various stack implementations, see here (pdf): https://kavon.farvard.in/papers/pldi20-stacks.pdf


The code and explanation of scheduler_relinquish seemed at odds to me.


I find the contrast between the text color and background color to be too low for comfortable reading.


Hey, I hear you and I don't want reading my site to be uncomfortable! My site uses the solarized light colorscheme [1], which I chose because I use the solarized schemes for my text editors. I think they are lower contrast to lend themselves to long editing sessions without being jarring on the eyes, but maybe that's not ideal for webpages?

In any case, if I added a small JS powered button to switch into a higher contrast style, would that make the site more usable to you?

[1]: https://ethanschoonover.com/solarized/


Hi Stephen, I appreciate your willingness to accept feedback. I'm flattered that you would add a button to make my reading experience better, but I think a website should be readable by default, so the button would just be a workaround from my point of view.

I would also like to respectfully challenge your assumption that lower contrast text is less jarring for the eyes for long reading sessions. If it were so, they'd be printing books in light gray color, wouldn't they?

Looking at the Solarized theme, my guess is it's the yellow-ish background that removes the strain on the eyes, as #fff is typically very bright. Since most people aren't focusing on the background when reading, it doesn't cause the eyes to focus more than usual. However, changing the color of the text is a different matter because that's what the eyes are focusing on, and making the text lighter causes it to be harder to read.

That's easy to verify: on the opacity scale of 0.0 to 1.0, where 1.0 is fully opaque, #000 (full black) at 0.1 opacity (almost transparent) is harder to focus on than 0.9 opacity (almost fully opaque). The only reason why some (most?) designers wouldn't be caught dead using #000 at 1.0 opacity for text, despite it being the most readable text color, is because very few things in the real world are at #000 or #fff, so the general advice is to use some shade that's close to it to make the webpage look more natural.

They're essentially trading off readability for aesthetics, and my original comment was that you traded off too much readability for aesthetics, in my opinion.

That said, if I were you, I wouldn't trust my opinion or anyone else's really. Seeing as you made it to HN, you could run an A/B test to see if serving a higher contrast page causes people to spend more time on that page on average, presumably because they're spending more time reading the content, instead of just skimming it like me. That way you'd be making informed changes based on data, not people's subjective experiences (including mine and yours).

Just for fun, I fiddled with the text color on your webpage and I'd say anything lighter than #555 for the body and #666 for the h2 seems too washed out (with the same background color).

Finally, I do want to note that the content is excellent. Happy Memorial Day!


Totally agree, this makes for horrible reading - why would you think light grey on white was a good idea?!


This looks like more work and more opportunities for bugs than protothreads.


I highly recommend thread analyzer tool https://fastthread.io/


Yes, this has been done ad nauseam. Do you know why no one uses it? Reasoning about cooperative threads in C is impossible. Reasoning about async C code is impossible.


> Yes, this has been done ad nauseam. Do you know why no one uses it?

Yes.

> Reasoning about cooperative threads in C is impossible.

Well, sorta. The problem with these approaches is mostly that C runtimes expect you to use the native operating system thread construct, not your own handrolled thing. If you were writing a libc for your own operating system you could use something vaguely like these libraries.

Another obvious problem is the lack of utilization of multiple CPU cores. And once you try and mix the two (M:N threading), the system becomes difficult to understand (and often has worse performance than just 1:1 threading).

I wouldn't say it's because reasoning about them is impossible, but that in practice, they are extremely difficult to use.

> Reasoning about async C code is impossible.

No, that part isn't true.


> It’s like a goto statement, but it can even be used to jump outside of a function. It’s also a lot more difficult to read than a goto, since it looks like a regular function call.

Ah! Blood pressure rising!

> Like with goto, the common advice is to avoid

Whew. And then it goes on to make a really good case for performing a complicated task with simple error handling.

It’s a good application: not exactly general-purpose cooperative routines, but a specific use case of performing multiple procedural steps.

I think where it falls short, and where an RTOS/OS will be more useful, is when you have to wait on one of those steps.

I think that’s the harder case. With the example of an HTTP request, waiting for the response and keeping that context ready while blocking in a separate thread ends up being easy to read and maintainable. But I suppose there is no reason you can’t do both.


Does your blood pressure also rise from use of exceptions?

Because the pattern that emerges with setjmp/longjmp is pretty similar, and that's the most common use: to simulate what other languages offer with exceptions. The most prominent libraries I can think of that force you to do this are image related, libpng and libjpeg both use longjmp to handle errors [though IIRC it's optional in the latter].


Are there exceptions in C?

If I’m interviewing someone for a programming job and I see goto in their C code... they better have an amazing reason or they won’t be getting the job. Harsh, but it’s the reality of how few people are suited for embedded programming.


Standard C has no exceptions.

Goto is the typical way to handle errors without creating a hell mess of "hm, gotta close this file, free that buffer, blah blah" at every step where you might want to bail early. There are a few alternatives but none of them are better than goto and a few can be said to be worse.


I spend a lot of time dicking with embedded C code. I think using goto is reasonable in some cases. The classic case is to short-circuit a block of code when there is an error of some sort.

    int function(void){
    int oops_err = 0; // set by the steps below

    // early returns here.

    // stuff
    // more stuff

    if(oops_err){
      goto some_error;
    }

    // yet more stuff

    // normal exit
    return 0;

    // bad exit
    some_error:

    // clean up

    return oops_err;
    }
The above is fine: you avoid a convoluted execution path, or cleanup and a return in the middle of a function. Looking at the top of the function you see early returns. Looking at the end you see the normal and error exits.


This is specifically allowed in the Linux kernel coding style guidelines as well FWIW:

https://www.kernel.org/doc/html/v4.10/process/coding-style.h...


Not only is it fine, I think it's easier to reason about the control flow with this code than with languages that support exceptions like C++.


While not a prescribed way of doing things (deservedly), goto, as well as some other "obsolete" tricks, can sometimes actually make sense in low-level C code, especially on microcontrollers.

Denying a potential job to a person on the basis of a generic "goto is anathema" mantra is not a good idea; one should look at the code and understand what the "offensive" part is actually doing, instead of using some code analyzer or text search blindly.


Not sure what presence or absence of the goto has to do with embedded programming.


Using goto is standard for handling errors in C. It's used pervasively in the Linux kernel.



