However, I'm not sure of the benefit of this approach compared to
writing a few lines of stack-switching assembly code per architecture.
In particular, how portable are the implementations of longjmp and setjmp
across the multitude of C libraries and operating systems?
But more importantly, it doesn't seem to solve the two main issues
of coroutine libraries written in C that were already solved in Go:
- How can the stack grow or shrink? (specifying a worst-case stack
size doesn't scale well)
- How can a coroutine be moved to another operating system thread
and back? (useful when blocking or when parallelism is needed)
This is really dang cool. I started a similar project[0] but have tabled it for now, since I'm not sold on the usefulness of making C easier to use this way.
The way I see it, there are two ways to use a library like this that are worth considering: adapting existing projects, and starting new ones. In my opinion, the gains to be had from rearchitecting an existing project to use such a library are minimal, and for the vast majority of new projects, C is not the best choice these days. If you want the performance of C with the programming style of Go, you can probably get away with just using Go!
That said, it's nice to see a polished, working implementation of this idea. Definitely something worth being aware of.
and it is one of the few developed-in-house things that became a productivity booster in my toolbox.
The interesting thing about it is that thanks to the expressive power of Tcl I was able to port not only the semantics but also almost completely mimic Go syntax.
> You can do concurrency on a single thread. This is the message I find it hard to get across when explaining CSP concurrency libraries.
Heck, there's an entire language built around single-threaded concurrency! Compiles to C, coincidentally - it targets embedded platforms. It's not quite CSP style though, it takes after Esterel and uses synchronous reactive concurrency:
However, as soon as you have multiple Céu programs communicating (for example, multiple Arduinos) the program-to-program communication is fairly CSP-like.
There's more to report than simply an awesome library. Two days ago, someone allegedly unfamiliar with the project (Paul Banks) appeared from the ether, applied a few "simple" patches, and improved the performance of libmill sevenfold (for OSX at least). That just doesn't happen, ever, in open source.
The speedup was caused by replacing setjmp with _setjmp, though.
I know this may sound mysterious, but for BSD/OSX... "yes". It is "well known" (if you're the sort of person who writes coroutine runtimes [1]) that _setjmp is far faster than setjmp. It's morally equivalent to 'sigsetjmp', but more portable, if less standard.
> If I understand correctly, this is a framework for non-preemptive cooperative multitasking whereby provided API functions serve as yield points. Pretty much what Windows API was before NT/2000. Not exactly a co-routine library, at least not in a conventional definition of co-routines.
> Yes, that's exactly what it is. But it also matches Wikipedia's definition of a coroutine: "Coroutines are computer program components that generalize subroutines for nonpreemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations."
All coroutine libraries are non-preemptive and cooperative. E.g. in Haskell, threads will switch only on GC allocations. In Go, they only switch on I/O (and maybe function returns). In theory, the compiler could automatically add `yield` statements into non-allocating, non-I/O loops. But nevertheless, all user-space threading is cooperative.
> But nevertheless, all user-space threading is cooperative.
Not necessarily. You can do user-space pre-emptive threading on any system that supports pre-emptive signal handlers and lets you mess around with the stack, particularly if you can generate signals from timers and/or IO operations (both of which you can with POSIX timers).
I disagree. If it's the kernel that's interrupting threads, it's not user-space threading. The scheduler just decides which thread gets run next, but the kernel decides when it gets run.
Technically it is not the kernel, it is the hardware clock.
At some point any user space scheduler will switch to the kernel in an event demultiplexing syscall and potentially block there if no user space thread is runnable. By your definition any threading system which doesn't handle all IO purely in user-space can't claim to be user-space threads.
But that is when the userspace must (and wants to) communicate with the kernel. Ultimately, that's the programmer's choice. You could easily write a program in Haskell or Go with two lightweight threads that would keep sending ping/pong messages to each other, and all the waiting (on locks, channels, etc.) and switching would be implemented in userspace.
I don't know enough about userspace/kernelspace communication, maybe I/O (refilling buffers or calling select) and timers are much more lightweight than full thread context switches, but I'm guessing it's still much slower than pure userspace scheduling.
How is a clock signal different? It is just another external event userspace wants to be notified about.
Note that on Unix you can use signals to get IO readiness notifications, and in fact for some time on Linux (before epoll) realtime signals were the preferred method of getting notifications over a large number of fds.
Anyway, as a general guideline, user/kernel transition is cheaper than a (kernel) thread/thread transition which is in turn cheaper than a process/process transition.
Someone needs to make a tiny C lib that contains just the stack switching stuff (basically whatever is used as a substitute for ucontext), that the various coroutine libs can share. I like all these coroutine libs (libtask, lthread, libmill, etc), but it makes me uneasy to have non-portable assembly code lying about that might go unmaintained.
And if you don't like any of those, my next choice would be to use the implementation from Luajit 1.x and 2.x, called Coco but not packaged into a separate library:
Ah, I forgot libmill uses setjmp/longjmp. I'm not sure how that really works for coroutines though since you still need to deal with stack allocations. So replace "non-portable assembly" with "possibly non-portable C hacks". I haven't gotten around to digging deep in the code yet though. If libmill's approach is future-proof then that'd be pretty awesome.
Is there any big difference between this and https://swtch.com/libtask/, which is a port of much of the same concept from Plan 9 - which again ties back to Go?
Language features very rarely get locked down to specific languages. Go doesn't own green threads... after all the term "green threads" predates Go. Nor does it own CSP, and it wasn't the first implementation. Also witness the number of languages that got hooked up to the various kernel-poll event loops over the past 10 years. Nowadays labeling a language as "supports event-based polling" is right up there with saying it supports closures... of course it does, it's the bare minimum of what anybody will look at anymore.
The things that distinguish languages are typically in other dimensions.
Not sure what your point is. That there's some argument about performance? There is no way that Golang can compete with the performance of straight C code.
Can anyone familiar with this say if it runs one coroutine per thread? How does a coroutine get its own stack? If coroutines share threads, is it possible to switch between coroutines, say when one is blocked on IO, like it is in Go?
I'm not an experienced Go developer but my understanding is that it's possible in Go to have millions of goroutines each blocked on IO. I don't see how a C function could mimic that. I thought stack moving was an essential requirement of any runtime that wants to mimic Go in this way.
I've looked at some of the special macros like coroutine and go and it looks like libmill runs the entire function fn when you call go(fn) so if fn is blocked on IO then your entire thread is blocked on IO and you can soon grind to a halt if you have even a few blocked coroutines.
It creates a stack per coroutine and if the coroutine would block, it switches to another coroutine instead.
As for threads, it's single threaded (yet concurrent). If you want to take advantage of multicore machines, there's a step in the tutorial about that: http://libmill.org/tutorial.html#step7
One of the reasons goroutines are considered 'lightweight' is that they allocate a small stack to begin with which is reallocated and moved if more space is needed. Of course, stack-moving isn't possible in C for various reasons.
What does libmill do here? Does it allocate a large stack right off the bat? What happens in a stack overflow, does it segfault or will it just trample over whatever is beneath it?
Edit: an ounce of investigation is worth a pound of answered questions, or something like that. It appears stack sizes are configurable, but default to 256 kB [1] (for scale, Go defaults to 2 kB), and are retained in a cache of size 64 [#L72] (by default). Also the bottom page is used as a stack guard if POSIX and mprotect are available [#L90].
From a cursory glance there appear to be some data races in the stack allocation code, though I could be mistaken. E.g. stack.c#L137
Note that large stacks only use address space, not memory, so it's not such a problem (especially on 64-bit).
I've never used libmill but from a glance at the docs it appears they suggest using multiple processes for parallelism, not multiple threads, so there is no such thing as a data race.
I don't think there's a data race, given that the whole thing is single-threaded. Multiple processes should be used for parallelism (see step 7 of the tutorial).
As for the default stack size, it's hard (but doable) to work with a smaller stack size in C. For example, the humble printf() can allocate a buffer several kB long on the stack.
select(), poll(), etc. are available in C, which enables non-blocking I/O. I don't know how libmill is implemented, but I assume it is using one of those functions for non-blocking I/O.
So it looks like there are facilities for doing some IO, listening on a socket, etc. But I don't see any facilities for waiting on locks, semaphores, etc. Maybe they could build that in a future release or maybe they consider it antithetical to the idea of "share via communicating as opposed to communicate via sharing".
> But I don't see any facilities for waiting on locks, semaphores, etc
There's a very good reason for this. First of all, you can use file descriptors as locks, semaphores, condition variables, etc using pipe(2) and other functions. On Windows you can use WaitForMultipleObjects, etc on Mutex objects.
But select/epoll/kqueue/WaitForMultipleObjects is a system call that goes into the kernel. That is very inefficient if you want e.g. mutual exclusion using a mutex. In a modern OS, a mutex (or CriticalSection in Windows) is implemented using a futex ("fast userspace mutex"), which is just an atomic compare-and-swap in userspace that only calls into the kernel if the mutex is contended. An uncontended futex can be locked and unlocked in less than 20 nanoseconds. A system call can be 100x slower than that.
Some languages, e.g. Haskell have a green threading system that implements mutexes and i/o in a consistent way with i/o using a userspace scheduler. But this depends on language and runtime implementation.
Sort of. They enable non-blocking network IO. Disk IO in unix always blocks (aio excepted). Disk IO in Windows can be async, but not via select or poll.
My understanding is that libdispatch is not a replacement for go's concurrency. With libdispatch each block (closure) that you dispatch runs on a given thread, if that block stalls for any reason then your thread is blocked. So even with a few blocked blocks (heh) your system could be stalled which is not the case with go where you can have millions of goroutines that are blocked.
I've built systems with libdispatch and blocks. I found it much nicer than traditional C style non-blocking IO or typical pthreads, both of which I have also done.
• You don't need to use blocks, there are function callback versions of the calls available, but blocks are nicer.
• There are asynchronous IO routines, that covers most of your stalls.
• Go probably stalls threads on page faults, as does libdispatch.
It looks like the Linux port of libdispatch uses OS threads through pthreads, so it's not a candidate for the user space light-weight threads needed for efficient CSP.
libdispatch is more than just programs submitting blocks to queues – dispatch sources and event handlers can handle channels of asynchronous input as well.
I forgot about that part of libdispatch. You still can't do synchronous IO but you can use libdispatch semaphores, wait on a queue, etc. and that should not block the thread, right?
Sorry for the ignorance of a newbie, but I couldn't find any info on whether libmill does anything similar to the M:N multiplexing of light-weight threads upon OS threads?
But only if it is using the approved I/O socket operations from the library. What if it uses an operation from a regular socket library, or if it runs a for loop or some CPU-bound computation?
When using native I/O, take care to use it in a nonblocking manner. If it would block, use libmill's fdwait() to wait until the file descriptor is ready. In the case of long computations, call yield() once in a while to give other coroutines a chance to run.
My primary use case was writing network protocols (the stuff I had to deal with extensively when implementing ZeroMQ and nanomsg). On one hand you want ultimate performance, i.e. no context switches, manual memory allocation etc. On the other hand, you want the convenience of writing linear code without caring about what to do when a function blocks.
What specifically are you talking about? Go can do syscalls perfectly fine, you can mess in your memory perfectly fine (even if you have to resort to the unsafe package), so what exactly is it where you think Go lacks flexibility compared to C? Because so far, your posting sounds rather hand-wavy.
One problem is that it's very hard to call go libraries from other languages (it requires a runtime). C libraries can be called from almost any language.
There are many go libraries I would love to use, but can't because of this. I expect many others are in a similar boat.
Exactly. I'd argue that the majority of people complaining about Go being unusable for them because of the GC would actually be able to deal with the GC easily. I'm working on a soft realtime application in Go, and the GC has never gotten in the way.
Right back at you though: I'd argue that the majority of people complaining about OCaml (or even the JVM) being unusable for them (and using Go instead) would actually be able to deal with it easily.
It is relevant though. You're talking about the large number of people who are using C and don't need a GC as though they're people who should be using Go - but if they don't need a GC they probably don't need a language as low-level as Go full stop and would be better served by a more expressive alternative. Ultimately, a large proportion of "people who need Go or C" actually need C, making this library a good idea.
You don't even need to use Go, the gccgo implementation of the runtime (including garbage collection) is available from C: https://github.com/stefantalpalaru/golib
Of course, it uses the longjmp/setjmp trick, as suggested by Dmitry Vyukov.
http://www.1024cores.net/home/lock-free-algorithms/tricks/fi...