Show HN: uThreads – Concurrent User Threads in C and C++

krat0sprakhar · on Nov 7, 2016

This looks super interesting! I've been working on a Raytracer in C++ and I was recently looking into a threading library which I can use to parallelize the rendering. Surely going to try this out in the coming weekend.

Unsolicited suggestion - while benchmarks and the motivation are important for a threading library, a code snippet of a simple parallel program on the home page would be something that I'd love to see.

Great job, though!

rsp1984 · on Nov 7, 2016

It sounds to me like what you'd need for ray tracing is a fork/join threading model with a master and several worker threads. OpenMP provides exactly that and is a widely supported standard. I'm curious: Why would you use anything else?

CyberDildonics · on Nov 7, 2016

Fork and join is one type of concurrency technique, but to think you wouldn't need anything else is silly. Non shared memory concurrency and pipelined concurrency are two more techniques that can be used.

rsp1984 · on Nov 7, 2016

I was stating this in the context of ray tracing. Why not just use OpenMP and be done with it?

CyberDildonics · on Nov 9, 2016

I answered that question. OpenMP fork-join concurrency can be useful for certain parts of a ray tracer depending on the overall architecture. It is far from the only way and can have many draw backs when it comes to overhead, synchronization and memory locality.

zokier · on Nov 7, 2016

Isn't raytracing almost completely CPU-bound? Seems like a odd use-case for green threads, which afaik are more commonly used for IO-bound tasks.

krat0sprakhar · on Nov 7, 2016

You're absolutely right - raytracing is CPU-bound. However, one can still parallelize per-pixel rendering computations on separate cores. Here's an example of a path-tracer written in Go, spawning a new go-routine for this very use-case https://github.com/fogleman/pt/blob/master/pt/renderer.go#L6...

comex · on Nov 7, 2016

If you just want to start num-CPUs threads like that code does, there is no advantage of green threads over regular ones, std::thread in C++. If you want to start a thread per pixel or something, green threads could work, but it's probably more efficient to keep track of state manually.

CyberDildonics · on Nov 7, 2016

Yes, using green threads for ray tracing is basically nonsense, since keeping actual threads busy is no overly difficult.

tomjakubowski · on Nov 7, 2016

It's not nonsense if, to render each pixel, you issue an HTTP request containing the scene/eye data and requested pixel coordinate, and wait for a response to come back from your magic render farm that lives in "the cloud". Now your "ray tracer" is totally I/O bound! :-)

edvinbesic · on Nov 8, 2016

Not sure if /s, but, to rasterize a 1920x1080 image you would make 2073600 HTTP requests? Sound reasonable.

Someone · on Nov 8, 2016

Double that if you want to PUT the pixels on a screen.

saman_b · on Nov 7, 2016

Thanks! good point; now that I look at the page, there is not a single sample code in there. I'll update it soon.

krat0sprakhar · on Nov 7, 2016

Awesome! As an example, Rayon[0] does a very good job (IMHO) at this.

[0] - https://github.com/nikomatsakis/rayon

saman_b · on Nov 7, 2016

Thanks, oh all those fancy functions. I need to improve the interface a bit, as for now everything is only based on using uThreads as the unit of concurrency. e.g., this is a recursive Fibonacci: https://github.com/samanbarghi/uThreads/blob/master/test/Fib... and a fork-and-join Fibonacci: https://github.com/samanbarghi/uThreads/blob/master/test/Fib...

There is no fork-and-join in uThreads yet, and I created it using create and join. The interface will improve in the future :)

jcbeard · on Nov 8, 2016

How would something like this differ from something like Sandia National Lab's Qthreads (http://www.cs.sandia.gov/qthreads/)? Seems it's a tried and true solution in C that also works with C++11 (committed a test case for C++11 myself)...It is also an optional underpinning for some relatively big-name frameworks like Kokkos, Chapel, RaftLib, etc.

saman_b · on Nov 8, 2016

Thanks for mentioning this, it is indeed very related and very interesting. I am not sure why not me or people around me were aware of Qthread, it has very good support for various architectures and provides many interesting features. It has many similarities with I have in mind for a concurrent library, and even some research goals seems to be very close to mine. Specially the notion of affinity and locality is what I am focusing on in uThreads. I am going through the papers and the source code at the moment to see what are the similarities and differences.

uThreads is still a work in progress and I have specific plans for it in the future that might differ of what Qthreads is trying to accomplish. For now my focus is more on providing auto tuning of Clusters based on the workload. I also will try to explore the pros and cons of uThread migration based on Cluster placement (NUMA and cache locality), and from the 2008 paper it seems that it is what you are trying to study as well. I am open to collaboration if there is an ongoing study around this topic.

jcbeard · on Nov 8, 2016

ooh, ok. you might also check out openshmem (http://openshmem.org/site/) and openucx (http://www.openucx.org). I'd been planning on integrating both in RaftLib just not enough cycles to get it done yet. These combined would make it much easier to maintain a relatively portable yet performant back end. My thesis research was all about locality, memory placement, and throughput for big data systems. Current work is similar but I'm not neck deep in the hardware dev world. There are current research efforts on my part outside of work, most center around the raftlib.io platform. Before I forget, you might also want to check out the graph partitioning frameworks like metis and scotch...both are used in some MPI frameworks for more optimal partitioning. To get topology data you might want to look at the hwloc framework, it's cross platform and provides input for things like NUMA/cache/PCIe topology for optimization. I haven't had a chance to integrate this hook into RaftLib, however it's just a few lines away once I find the time. If you're wondering...I started out writing my own fiber library for RaftLib. Had ports for both IBM Power and Intel, but it gets a bit tiring maintaining/optimizing for every new architecture. Qthreads and the like have been used in HPC circles for quite awhile, so it made sense. There was no way I was beating them for dev time, so might as well join them.

Based on your auto-tuning discussion...RaftLib aims to do something similar, but for big-data stream processing workloads. Here's my 2016 IJHPCA paper: http://hpc.sagepub.com/content/early/2016/10/18/109434201667...

It looks like it's behind a paywall so if you don't have access I'll update my website with the "author archive" copy sometime today...will be at ( http://jonathanbeard.io/media ) once I update it. Bottom line if there's intersected interest, definitely open to collaboration :).

saman_b · on Nov 9, 2016

Thanks, so much information to absorb, let me go through all this and get back to you. I shoot you an email when I processed all this :)

jtsylve · on Nov 7, 2016

Is there any reason you chose GPL3 or would you consider a less restrictive license like Apache or BSD?

saman_b · on Nov 7, 2016

Yes, there is a specific reason behind it, but in the future I might consider a less restrictive license.

jtsylve · on Nov 7, 2016

Thanks for the reply and your future consideration. I'm glad you seem to have taken the question as it was intended. I certainly didn't mean to imply that one license is superior to another.

protomikron · on Nov 7, 2016

Can we please stop calling out nearly every project that uses GPL (3) and call it restrictive? I know there are different opinions and arguing which is "better" nearly always leads to a philosophical debate on principles that gets us nowhere.

Sorry if this sounds snarky, and I do not blame you in particular, but it is just a theme I have encountered here the last years that I find toxic as it portrays the GPL (and FSF) as some evil organization that would restrict "developer's rights".

jtsylve · on Nov 7, 2016

I don't think that the GPL or the FSF is bad or evil. I've released and maintain several projects that are GPL licensed.

GPL is by definition more restrictive than Apache and BSD, since there are more requirements to use the code, so asking if the developer is willing to consider a less restrictive (not "better") license shouldn't be met with hostility.

My motivation for asking is simply because the project looks interesting and potentially useful; however, the proprietary nature of my current work means that GPL licensed code isn't really an option for me. I absolutely wouldn't hold it against the author if they chose GPL for this or any other reason.

protomikron · on Nov 7, 2016

Fair enough, I know the situation well, it just came up a lot in recent times.

module0000 · on Nov 7, 2016

Not sure what the OP's motivation are for asking "why GPL?", but my motivation when I ask that can translate to something like this:

"Hi, I'd like to use your library, but your license is incompatible with the other <N> licenses in my codebase. Consider BSD please?"

llukas · on Nov 7, 2016

It's not like GPL v3 prevents guy from contacting the author and offering him $ for license.

protomikron · on Nov 7, 2016

"Don't ask the author of a library why it is incompatible with proprietary licenses. Ask why your codebase is incompatible with open licenses."

JFK

jimktrains2 · on Nov 7, 2016

Is there a reason you'd choose Apache or BSD instead of a freedom preserving license like GPL3?

striking · on Nov 7, 2016

Licensing GPL software can be a nightmare. Using Apache, BSD, MIT, LGPL means I don't have to give licensing a second thought.

And speaking as someone who writes software for a non-profit, we figure we should just give away the code we write rather than put our own political license on it, especially considering we don't pay taxes in order to make it easier for us to benefit "the people".

aseipp · on Nov 7, 2016

They probably want to use it, make money off it, and give nothing back to anyone. The same as most people who complain about it, IMO. Or, just as likely, they'll integrate it into their own BSD software, which will then be used by someone else to make money, and give nothing back, which might as well be the same thing. (Countdown until 1 person shows up and retorts "Well, my company supports the OSS we use..." as if it's meaningful at all, vs the massive trend of corporations ripping off FOSS and returning nothing)

I don't GPL license a lot of my software (I used BSD/MIT/Apache most of the time, and my company revolves around and writes BSD-licensed code), but IMO, having seen this tired argument over and over again, I'm pretty sure 90% of all GPL complaints come down to this, even if the people don't come out and say it: "I can't make money off of your free work as easily, and that isn't fair to me. Please reconsider." Given this is Hacker News where half of everyone is in a rat-race to make money, I speculate this is a strong part of it.

People just like to wrap it in words like "restrictive" and "viral" to make themselves sound more palatable and reasonable.

And of course, there are also many reasonable alternatives to this library, many which are BSD or permissively licensed, which these people could also use instead -- but that won't stop them from complaining that GPL is unfair or "bad", of course, even though they could pick from a dozen alternatives...

striking · on Nov 7, 2016

It's harder for me to agree with a library being licensed under GPL than if it were an entire application being licensed under GPL.

If someone's licensing an entire application that's usable in its own right, like an SQL server or emulator or game or kernel, then it totally makes sense that anything you build around it should have its code be accessible.

But if it's some smaller part of your entire codebase, I feel it's a little less reasonable for people who aren't working on something that's already *GPL licensed. Especially when you have middlewares or other proprietary pieces of code linked with yours. That, plainly and simply, will prevent you from using the library. (Except if it's the LGPL, which I've never had problems with. I personally like it the most, and I'd use it if I really cared about people upstreaming their changes.)

I'm not going to say "this library is bad because it's GPL'd" but it does mean I would avoid baking it into a programming language runtime, for example.

And it really bothers me that you have such a low opinion of people who comment here, and of people who disagree with you. Maybe it's just an exaggeration for argument's sake, but I feel like there are plenty of reasonable arguments against use of the GPL.

near · on Nov 7, 2016

> They probably want to use it, make money off it, and give nothing back to anyone.

They do the same shit with the GPLv3 as well. Piko Interactive recently backed out of a licensing agreement with me, and just released my GPLv3 emulator in their Steam application without even telling me. When someone called them on it, they said to e-mail them for a link to the code (not enough to take my work for free, they have to play games with their obligations under the GPL.) Which by itself is useless, as it's just a UI modification. The value is the ROM image they don't include in their source, which gets you into a nasty gray area of the GPL.

It's part of the Faustian bargain all open source devs have to make: if you add a non-commercial clause, FOSS proponents will label your work "non-free" and "not open source", and you'll be banished to obscure disabled-by-default nonfree repositories on Linux distros. Which I was until I caved and moved to GPL to avoid punishing my users.

If you don't add that clause, you'll get taken advantage of. We have to rely on people being fair and sharing their profits off our work, and very often, they don't.

CJefferson · on Nov 8, 2016

Because I work on a few GPLv2 codebases, so we can't use GPLv3 code.

michaelsbradley · on Nov 7, 2016

For what kinds of applications would one favor this over, say, the coroutines approach of libdill?

http://libdill.org/tutorial.html

See, in particular, Step 6 of the tutorial.

saman_b · on Nov 7, 2016

In libdill approach, you are probably limited to only multiplex connections over multiple kernel threads. And when a connection is accepted over a kernel thread it has to perform all further instructions over that kernel thread. So it gives you a bit of control over which cores to be utilized but after that you do not have control over what part of the code should be executed on each core.

Using uThreads you can decide what part of the code should be executed over each kernel thread and, if taskset is used, which core to execute your code. You can do this by creating Clusters and using migration or uThread creating in runtime. Thus you can decide which thread is used to multiplex connections and which one is used to run CPU bound code in addition to having a thread pool for example to do the disk IO asynchronously. Ultimately, one can create a SEDA[1] like architecture using uTrheads. Also you can always use uThread as a single threaded application.

------------------------------------------

[1] https://en.wikipedia.org/wiki/Staged_event-driven_architectu...

michaelsbradley · on Nov 7, 2016

First time I've heard of SEDA, though I've been aware of the concerns/concepts it addresses, in various forms, for some time.

Any thoughts on Welsh's Retrospective on SEDA?

http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda...

saman_b · on Nov 8, 2016

I have seen this before. Those are very good points, and I am trying to move this library towards supporting a SEDA type architecture with dynamic control and auto tuning runtime parameters.

anonymousDan · on Nov 7, 2016

Interesting, I've recently been looking for a user level threading package in C/C++. I ended up settling on lthreads: http://lthread.readthedocs.io/en/latest/intro.html

Does anyone know how it compares?

saman_b · on Nov 7, 2016

Hi, I developed uThreads. I looked at lthreads quickly, and it seems lthreads only maps multiple coroutines onto a single pthread (N:1). Although, it adds the possibility of running multiple pthreads, but each pthread can only run their local lthreads (using M threads that do N:1 mapping). However, in uThreads, uThreads can be multiplexed over multiple pthreads (thus M:N mapping). Also lthreads scheduler is based on epoll/kqueue per pthread, and uThreads is using run Queues to manage uThreads which has less overhead. Per pthread epoll/kqueue can mean better scalability for large number of threads in comparison with uThreads that is relying on a single poller thread. But since the poller thread and synchronization is very low overhead in uThreads, the scalability is not an issue (Experiments to up to 16 threads show that uThreads scale very well). Although lthreads provide compute boundaries and async IO to move lthreads over other pthreads, but this process seems to be very expensive. uThreads does not provide these features, but it provides more flexibility and control to the developer by providing migrations. Developers can use migration at any point to move the uThread to another set of kThreads to execute tasks asynchronously (By defining Clusters of kThreads, e.g., IO cluster or Compute Cluster).

scott_s · on Nov 7, 2016

I recommend trying to get access to larger machines with more hardware parallelism. I have seen techniques that scale just fine to using 16 threads, but hit serious limitations when you get to over 100 threads.

saman_b · on Nov 7, 2016

You are right, I have access to machines with higher number of cores, but they have multiple sockets and at some point I need to address the cross NUMA cost which adds a whole new level of complexity and design decisions.

For sure at some point the poller thread will be saturated and the program will not scale past a certain number of threads. I used to have a poller thread per cluster for better scalability, but that would add overhead for migrations between clusters, thus I had to remove it for now until I can somehow find a low overhead solution. uThreads is a work in progress and all these need to be carefully considered in the future :) Thanks for your feedback

scott_s · on Nov 7, 2016

Sometimes, the techniques you use to scale to 100s of threads solve some NUMA issues by virtue of the fact that in order to scale that high, you need to avoid touching as much non-local data as possible. I think it's better to just deal with the pain now and start running your experiments on as large of a machine you can get access to. You can still put off explicitly designing for NUMA, but you want to avoid spending too much time and effort designing for the lower end of the scalability spectrum.

ndesaulniers · on Nov 8, 2016

> uThreads is using run Queues

This sounds like the Linux kernel. I'm curious to understand why copying this logic into user space is worth while?

saman_b · on Nov 8, 2016

I am not sure if you are referring to the runQueue being used in Linux or the whole approach. I try to answer both:

Run queues can be part of any scheduler, they are queues with runnable tasks.

But as why the approach is worth while in user space, it has to do with low cost of operations (context switches) in user space, and also using cooperative scheduling instead of preemptive. Cooperative scheduling provides more control to the user over the tasks, and also has lower overhead since there is no need to manage a quantum for each taks (thread).

RotsiserMho · on Nov 7, 2016

I haven't tried either but I've been meaning to take a look at https://github.com/Amanieu/asyncplusplus.

purple-dragon · on Nov 8, 2016

I've had success with Intel's Threading Building Blocks. Is there a reason I might prefer something like this instead?

saman_b · on Nov 8, 2016

Seems like a very mature library, I can't answer your question before I go through their documentation and code. Also, I find your question a bit abstract, since I am not aware of the details of the problem you are trying to solve, I cannot reason why any library or tool can be better over other libraries. In your case, TBB might fit your requirements and it might be hard to give a reason to switch to another library.

medi11 · on Nov 8, 2016

How does it compare to boost.fiber (http://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/ind...)?

valarauca1 · on Nov 7, 2016

Impressive.

8KiB stacks are a bit on the small side though for production usage. Go gets away with this because they're stacks act more like Vectors then flat arrays.

Why did you decided to roll your own stack swapping software instead of using say using `boost::context`?

saman_b · on Nov 7, 2016

Right, however segmented stacks are have high overhead and stack copying is not very easy in C/C++. Thus, for now uThreads only support fixed size stacks, I know it makes it harder to be used in production, and in the future I might provide optional segmented stacks. As for why not using `boost::context`, I am implementing uThreads as part of my research in uwaterloo, I wanted to have full control over the code and be able to optimize for performance as much as I can. Thus, tried to avoid relying on any third party code when I started :)

naasking · on Nov 7, 2016

Why not just use StateThreads? A comparison against other options would be illuminating.

saman_b · on Nov 7, 2016

Good idea, I probably write a blog post on this later. If you take a look at [1], I explain the difference between N:1 and M:N mappings. StateThreads uses a N:1 mapping which means you can multiplex many fibers over a single thread and to take advantage of multi-processors you can have M processes that do N:1 mapping, which is a common practice (libmill, libdill, ...). But with uThreads you can multiplex many fibers over many kernel threads.

---------------------------------

[1]https://github.com/samanbarghi/uThreads#what-are-uthreads

naasking · on Nov 7, 2016

Sure, I get the M:N argument, but that wasn't something you could extend existing solutions to support?

saman_b · on Nov 8, 2016

Do you have anything specific in mind? The only code I found similar to this is uC++ [1], which has way more features and more sophisticated scheduler. I am using this as part of my research and wanted to have sth very simple.

For all N:1 mappings, since there is only a single process, there is no need for synchronization, also there is no need for any scheduler as a simple queue suffice. But as soon as multiple threads are introduced, there is a need for orchestration among threads and it also changes all other aspects of the code. Of course, I could develop on top of an existing codebase, but I suspect I had to change so much that it is better to start from scratch anyway.

------------------------------------

[1]https://plg.uwaterloo.ca/usystem/uC++.html

naasking · on Nov 8, 2016

> For all N:1 mappings, since there is only a single process, there is no need for synchronization, also there is no need for any scheduler as a simple queue suffice.

Wouldn't an M:N model simply amount to work stealing among N kernel-thread-local queues? This seems like it should be a pretty straightforward extension to one of the user-level C thread packages.

Or are you doing something more elaborate for your research?

saman_b · on Nov 8, 2016

Well on the surface, yes! when I started I expected the same. For now there is no work stealing among kThreads. But let me dig into the problem a bit deeper so you get an idea why it might not be very straight forward (also it very much depends on the use case).

Usually these libraries either provide their own queue or rely on underlying event system e.g. epoll/kqueue to manage the state of the light weight threads. Also, the other main part is IO multiplexing, which can be done using select/poll/epoll/kqueue ...

Lets say if they are using some sort of queue, since there is no other thread in the system, thus no synchronization required and its as simple as pushing to the tail of queue and pulling from the head. Now, if I add more threads, now I have to think how I want to synchronize among threads. The straight forward way of doing this is using the same queue and multiplex among kernel threads and use Mutex and CV. However, Mutex has high overhead and is not very scalable (pthread_mutex does not scale well under contention in Linux). What about N-kernel-thread-local queues along with Mutex? Well, if you see the documentation this approach does better but has high overhead as well. What is next? remove the queue and add some sort of lock-free queue, either MPMC, MPSC, or SPSC. This requires some work to determine which one has lower overhead, in my case I have not tested SPSC, and for now settled with MPSC. So the queue part is totally gone, since they probably did not care about all this and used a simple queue.

Next, comes the IO multiplexing. Relying on an epoll instance per kernel-thread is absolutely fine specially for cases where uThreads stick to the same kernel thread, but as soon as I introduce the notion of migration then moving from on kernel-thread to another kenrel-thread means issuing at least two system calls (in case of epoll, deregister from current epoll instance and register to the new one). This has high overhead, so I need to provide some sort of common poller that do multiplexing over multiple kernel threads, but due to having more than one kernel thread, it means connections should be properly synchronized as more than one thread might try to access a connection. Also, it has to be done in a way with low overhead as migrations for my research require to have very low overhead. Thus, the IO multiplexing part should be replaced.

So the main parts are required to be changed, and I believe the effort required to make fundamental changes to an existing system might be more than the efforts required for writing it from scratch. Also for each part implemented, I did performance optimisations and build on top of that which helped to keep the performance at an acceptable level, it would be hard to do the same with an existing system as it requires to isolate various parts and optimise each part, which requires additional effort.

I hope it makes it more clear :)

naasking · on Nov 8, 2016

I agree there are additional complexities with M:N. I was assuming a lock-free queue, since you'll need this for scalable work-stealing.

I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when requested I/O are complete. This accomplishes the migration I believe you were describing.

When I/O needs to be done, you enqueue the uthread on the I/O queue and invoke a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.

I'm not sure whether this would scale as well as what you're doing since you mentioned an epoll-per-kernel thread, but I wouldn't be surprised if it got close since it's so I/O-bound.

saman_b · on Nov 10, 2016

> I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc. and either has its own queue from which other threads steal, or simply pushes results onto a random queue when requested I/O are complete. This accomplishes the migration I believe you were describing.

> When I/O needs to be done, you enqueue the uthread on the I/O queue and invoke a reserved file descriptor to notify the I/O thread, which then reshuffles its file descriptors and again calls epoll/kqueue.

If I remember correctly, what you described is similar to how golang perform I/O polling since they are using work stealing, except there is no dedicated poller thread, and threads take turn to do the polling whenever they become idle.

Also, with epoll/kqueue there is no need to enqueue uThreads and you can simply store a pointer(a struct that has reader/writer pointers to uThreads) with the epoll event that you create and recover it later from the notification, this way you can set the pointer when need to do I/O.

I agree a single poller thread does not scale as well as per kthread epoll instance, and that requires some meditation to provide a low overhead solution that does not sacrifice performance over scalability.

avdicius · on Nov 8, 2016

I did something similar in plain C:

https://github.com/ademakov/MainMemory