
Do you have anything specific in mind? The only code I found similar to this is uC++ [1], which has way more features and a more sophisticated scheduler. I am using this as part of my research and wanted to have something very simple.

For all N:1 mappings, since there is only a single kernel thread, there is no need for synchronization, and no scheduler is needed either, as a simple queue suffices. But as soon as multiple kernel threads are introduced, there is a need for orchestration among threads, and that changes all other aspects of the code. Of course, I could develop on top of an existing codebase, but I suspect I would have to change so much that it is better to start from scratch anyway.
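
To illustrate the N:1 case, a toy sketch (names are made up, not the library's API); with a single kernel thread the run queue needs no locking at all:

    // N:1 run queue sketch: one kernel thread is the only accessor,
    // so a plain FIFO with no synchronization is enough.
    #include <deque>

    struct UserThread;  // user-level thread control block (context, stack, ...)

    class RunQueue {
        std::deque<UserThread*> q;  // no mutex: never touched concurrently
    public:
        void push(UserThread* t) { q.push_back(t); }  // make runnable
        UserThread* pop() {                           // next to run, or null
            if (q.empty()) return nullptr;
            UserThread* t = q.front();
            q.pop_front();
            return t;
        }
    };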

------------------------------------

[1] https://plg.uwaterloo.ca/usystem/uC++.html



> For all N:1 mappings, since there is only a single kernel thread, there is no need for synchronization, and no scheduler is needed either, as a simple queue suffices.

Wouldn't an M:N model simply amount to work stealing among N kernel-thread-local queues? This seems like it should be a pretty straightforward extension to one of the user-level C thread packages.
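
Something like this, roughly (made-up names; a mutex stands in for the lock-free deque a real implementation would want):

    // Work-stealing sketch: each kernel thread owns a local queue and
    // steals from a random victim's tail when its own queue runs dry.
    #include <cstddef>
    #include <cstdlib>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct UserThread;

    struct LocalQueue {
        std::mutex m;                  // placeholder for lock-free machinery
        std::deque<UserThread*> q;
    };

    UserThread* next_to_run(std::vector<LocalQueue>& qs, std::size_t self) {
        {   // fast path: pop from our own queue's head
            std::lock_guard<std::mutex> g(qs[self].m);
            if (!qs[self].q.empty()) {
                UserThread* t = qs[self].q.front();
                qs[self].q.pop_front();
                return t;
            }
        }
        // steal path: take from the tail of a random victim
        std::size_t victim = std::rand() % qs.size();
        if (victim == self) return nullptr;  // just fail this round
        std::lock_guard<std::mutex> g(qs[victim].m);
        if (qs[victim].q.empty()) return nullptr;
        UserThread* t = qs[victim].q.back();
        qs[victim].q.pop_back();
        return t;
    }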

Or are you doing something more elaborate for your research?


Well, on the surface, yes! When I started, I expected the same. For now there is no work stealing among kThreads. But let me dig into the problem a bit deeper so you get an idea of why it might not be very straightforward (it also very much depends on the use case).

Usually these libraries either provide their own queue or rely on an underlying event system, e.g., epoll/kqueue, to manage the state of the lightweight threads. The other main part is I/O multiplexing, which can be done using select/poll/epoll/kqueue, etc.

Let's say they use some sort of queue. Since there is no other thread in the system, no synchronization is required, and it is as simple as pushing to the tail of the queue and pulling from the head. Now, if I add more kernel threads, I have to think about how to synchronize among them. The straightforward way is to keep the same queue, multiplex it among the kernel threads, and guard it with a mutex and condition variable. However, a mutex has high overhead and is not very scalable (pthread_mutex does not scale well under contention on Linux). What about N kernel-thread-local queues, each with its own mutex? If you look at the documentation, this approach does better but still has high overhead.

What is next? Replace it with some sort of lock-free queue, either MPMC, MPSC, or SPSC. This requires some work to determine which one has lower overhead; in my case I have not tested SPSC and for now settled on MPSC. So the queue part is gone entirely, since the original libraries probably did not care about any of this and used a simple queue.
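
For reference, a minimal sketch of the MPSC shape, in the style of Dmitry Vyukov's intrusive non-blocking queue (illustrative names, not the library's API); any kernel thread may push, only the owner pops:

    #include <atomic>

    struct Node { std::atomic<Node*> next{nullptr}; };

    class MPSCQueue {
        std::atomic<Node*> head;  // producers exchange this
        Node* tail;               // consumer-private
        Node stub;                // dummy node so the list is never empty
    public:
        MPSCQueue() : head(&stub), tail(&stub) {}

        void push(Node* n) {      // safe from any thread
            n->next.store(nullptr, std::memory_order_relaxed);
            Node* prev = head.exchange(n, std::memory_order_acq_rel);
            prev->next.store(n, std::memory_order_release);  // link in
        }

        Node* pop() {             // single consumer only; nullptr == "try later"
            Node* t = tail;
            Node* next = t->next.load(std::memory_order_acquire);
            if (t == &stub) {     // skip the dummy
                if (next == nullptr) return nullptr;
                tail = next; t = next;
                next = next->next.load(std::memory_order_acquire);
            }
            if (next != nullptr) { tail = next; return t; }
            if (t != head.load(std::memory_order_acquire))
                return nullptr;   // a push is mid-flight; retry later
            push(&stub);          // re-insert dummy so t can be unlinked
            next = t->next.load(std::memory_order_acquire);
            if (next != nullptr) { tail = next; return t; }
            return nullptr;
        }
    };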

Next comes the I/O multiplexing. Relying on an epoll instance per kernel thread is absolutely fine, especially for cases where uThreads stick to the same kernel thread. But as soon as I introduce the notion of migration, moving from one kernel thread to another means issuing at least two system calls (in the case of epoll: deregister from the current epoll instance and register with the new one). This has high overhead, so I need to provide some sort of common poller that multiplexes over multiple kernel threads. But with more than one kernel thread in play, connections have to be properly synchronized, as more than one thread might try to access a connection at the same time. It also has to be done with low overhead, since migrations in my research need to be very cheap. Thus, the I/O multiplexing part had to be replaced as well.
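
A sketch of that migration cost (function and parameter names are made up; the two epoll_ctl calls are the point):

    #include <sys/epoll.h>

    int migrate_fd(int src_epfd, int dst_epfd, int fd, void* conn) {
        // 1) deregister from the current kernel thread's epoll instance
        if (epoll_ctl(src_epfd, EPOLL_CTL_DEL, fd, nullptr) == -1)
            return -1;
        // 2) register with the destination kernel thread's instance
        struct epoll_event ev {};
        ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
        ev.data.ptr = conn;  // per-connection state, recovered on wakeup
        return epoll_ctl(dst_epfd, EPOLL_CTL_ADD, fd, &ev);
    }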

So the main parts would need to change, and I believe the effort required to make such fundamental changes to an existing system might exceed the effort of writing it from scratch. Also, for each part I implemented, I did performance optimisations and built on top of that, which helped keep performance at an acceptable level. It would be hard to do the same with an existing system, as it would require isolating the various parts and optimising each one, which takes additional effort.

I hope this makes it clearer :)


I agree there are additional complexities with M:N. I was assuming a lock-free queue, since you'll need this for scalable work-stealing.

I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when the requested I/O completes. This accomplishes the migration I believe you were describing.

When I/O needs to be done, you enqueue the uthread on the I/O queue and write to a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.
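
On Linux, for example, an eventfd could serve as that reserved descriptor (a rough sketch with made-up names):

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>

    static int wake_fd = eventfd(0, EFD_NONBLOCK);

    void notify_poller() {               // worker side: kick the I/O thread
        uint64_t one = 1;
        write(wake_fd, &one, sizeof one);
    }

    void poller_loop(int epfd) {         // dedicated I/O kernel thread
        struct epoll_event ev {};
        ev.events = EPOLLIN;
        ev.data.ptr = nullptr;           // sentinel marking the wakeup fd
        epoll_ctl(epfd, EPOLL_CTL_ADD, wake_fd, &ev);

        struct epoll_event evs[64];
        for (;;) {
            int n = epoll_wait(epfd, evs, 64, -1);
            for (int i = 0; i < n; ++i) {
                if (evs[i].data.ptr == nullptr) {     // the reserved fd fired
                    uint64_t cnt;
                    read(wake_fd, &cnt, sizeof cnt);  // drain the counter
                    /* drain the I/O queue and (de)register fds here */
                } else {
                    /* data.ptr identifies the connection: wake its uthread */
                }
            }
        }
    }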

I'm not sure whether this would scale as well as what you're doing, since you mentioned an epoll instance per kernel thread, but I wouldn't be surprised if it got close since it's so I/O-bound.


> I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when the requested I/O completes. This accomplishes the migration I believe you were describing.

> When I/O needs to be done, you enqueue the uthread on the I/O queue and write to a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.

If I remember correctly, what you described is similar to how Go performs I/O polling, since they use work stealing, except there is no dedicated poller thread and threads take turns doing the polling whenever they become idle.

Also, with epoll/kqueue there is no need to enqueue uThreads: you can simply store a pointer (to a struct that has reader/writer pointers to uThreads) with the epoll event that you create, and recover it later from the notification. This way, you just set the pointer when a uThread needs to do I/O.
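
Roughly like this (a sketch with made-up struct and field names; ev.data.ptr carries the per-connection struct):

    #include <sys/epoll.h>

    struct uThread;
    struct PollData {
        int fd;
        uThread* reader;  // uThread blocked on read-readiness, if any
        uThread* writer;  // uThread blocked on write-readiness, if any
    };

    void watch(int epfd, PollData* pd) {
        struct epoll_event ev {};
        ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
        ev.data.ptr = pd;                       // recovered on wakeup
        epoll_ctl(epfd, EPOLL_CTL_ADD, pd->fd, &ev);
    }

    void on_ready(const struct epoll_event& ev) {
        PollData* pd = static_cast<PollData*>(ev.data.ptr);
        if ((ev.events & EPOLLIN) && pd->reader)  { /* enqueue pd->reader */ }
        if ((ev.events & EPOLLOUT) && pd->writer) { /* enqueue pd->writer */ }
    }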

I agree that a single poller thread does not scale as well as a per-kThread epoll instance, and it requires some careful thought to provide a low-overhead solution that does not sacrifice performance for scalability.



