Implementing, Abstracting and Benchmarking Lightweight Threads on the JVM (paralleluniverse.co)
39 points by dafnap on Feb 6, 2014 | 17 comments



I noticed that you send the "native threading" case through your library as well. Have you compared to just using "naive" Java, i.e. Threads and a BlockingQueue?

Also: if the Google patches for the user-mode threading are adopted, will Quasar have any advantages over a JVM that uses the same syscalls? Can you explain where this would come from?

I think what you've done is genuinely cool, I'm just trying to better understand what the 10x advantage actually comes from.


The channels used are just like BlockingQueue. They're a queue with a synchronized condition variable.
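A "naive" Java baseline along those lines might be a two-thread ping-pong over blocking queues. This is an illustrative sketch, not the actual benchmark code; the class and constant names are made up here:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThreadPingPong {
    static final int ROUNDS = 100_000;

    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> ping = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> pong = new ArrayBlockingQueue<>(1);

        Thread echo = new Thread(() -> {
            try {
                for (int i = 0; i < ROUNDS; i++)
                    pong.put(ping.take());        // bounce each message straight back
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        echo.start();

        long start = System.nanoTime();
        for (int i = 0; i < ROUNDS; i++) {
            ping.put(i);                          // send
            pong.take();                          // block until the echo arrives
        }
        long elapsed = System.nanoTime() - start;
        echo.join();
        System.out.println(elapsed / ROUNDS + " ns per round trip");
    }
}
```

Each round trip pays the thread park/unpark cost twice, which is where a fiber-based channel can win.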

If user-mode threading is adopted, Quasar could work without instrumentation, but instrumentation is a very small part of the Quasar code base.

Quasar gives you the scheduler, channels, actors, etc.

The 10x performance boost comes from the fact that there can be non-negligible latency from the time you unpark a thread to the time it starts running, while for fibers that latency is much shorter.
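That unpark-to-running latency can be observed directly with an illustrative microbenchmark (not from the Quasar suite; absolute numbers vary wildly by OS, hardware, and load):

```java
import java.util.concurrent.locks.LockSupport;

public class UnparkLatency {
    static volatile long wakeupTime;

    static long measureOnce() throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            LockSupport.park();              // block until unparked
            wakeupTime = System.nanoTime();  // record when we actually start running
        });
        sleeper.start();
        Thread.sleep(100);                   // give it time to reach park()
        long unparkTime = System.nanoTime();
        LockSupport.unpark(sleeper);         // wake it
        sleeper.join();
        return wakeupTime - unparkTime;      // unpark-to-running latency
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unpark-to-run latency: " + measureOnce() + " ns");
    }
}
```

For a fiber, the analogous "wakeup" is just the user-mode scheduler running the continuation, with no kernel transition on the wakeup path.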


I looked and your wrapper did look similar (and not obviously incorrect), but if you plan on publishing this benchmark, I would suggest including a comparison against raw Java code, for credibility's sake.

I guess I'm really wondering: let's say Java 9 mutexes use Linux 4's user-mode threading syscall; in that case, do we need Quasar's scheduler, channels, actors, etc.? Or can we just use "good old" threads and mutexes? Where would Quasar's benefits come from?

It sounds to me like the subtext to Google's patches is that rather than accepting the conventional wisdom that "threads don't scale", they've instead just fixed threads.


That's not how those kernel modifications work. You can't just use them with a mutex. The idea is that a thread will be able to say, "I'm yielding the CPU to this other thread." When you unlock a mutex you don't necessarily want to park yourself. These changes require either an app-level scheduler, or the use of synchronization mechanisms that can better specify what you want in terms of scheduling. An example of such a mechanism would be an API that says: "I'm sending a message to this other actor, but I'm going to wait for it to reply." In this case, the implementation would tell the OS: switch me out and switch that other guy in instead.
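With today's kernels the closest you can get to that "send and wait for the reply" shape in plain Java is a synchronous handoff; the scheduler still doesn't know to switch directly to the other thread, which is exactly what the proposed syscalls would add. A sketch with ordinary threads (names are illustrative):

```java
import java.util.concurrent.SynchronousQueue;

public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> request = new SynchronousQueue<>();
        SynchronousQueue<String> reply = new SynchronousQueue<>();

        Thread actor = new Thread(() -> {
            try {
                String msg = request.take();   // block until a message arrives
                reply.put("re: " + msg);       // answer; blocks until the caller takes it
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        actor.start();

        request.put("ping");                   // hand the message off...
        System.out.println(reply.take());      // ...and block for the reply
        actor.join();
    }
}
```

The API expresses "I'm waiting for that specific thread", but the kernel only sees a generic park/unpark; a switch-to primitive would let the implementation donate the rest of the time slice to the actor directly.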


OK, I was being sloppy in my phrasing (and probably thinking also)!

Trying again: Taking your example benchmark, you aren't really calling any special methods that provide any hints for cooperative threading (to my untrained eye). That's great: you've got a great abstraction. But then, what opportunities for optimization does Quasar have that are not also available to a JVM using the magic syscall?

I'm sure there's something here, but I'd appreciate a hint!


In theory? Absolutely none. But Quasar is here today (and also has an excellent actor system, a nice Clojure API and more).


Well, I appreciate the honesty!

I'm excited by the idea that threads are going to be "the right way", once these improvements make it out of the 'plex.

I also like that I can get a similar API today with Quasar :-)


Just to clarify: it's not that easy. The syscalls are the first step, and then you'll need a scheduler. Once you have those two, you still need new synchronization mechanisms and APIs.

Quasar doesn't just provide lightweight threads. It has rich libraries that help you make the best of them.


Yes, I'm thinking of the big picture. Might be more like Java 12 than Java 9... Or Quasar today!


"because it uses macros, the suspendable constructs are limited to the scope of a single code block, i.e. a function running in a suspendable block cannot call another blocking function; all blocking must be performed at the topmost function. It’s because of the second limitation that these constructs aren’t true lightweight threads, as threads must be able to block at a any call-stack depth"

Can you elaborate on this a bit? Let's say I have a function called 'fetch-url' which takes a core.async channel as an argument and makes a non-blocking HTTP request (say, using http-kit), and in the callback handler I put the result onto the channel. If I'm in some other function, in whose body I open a core.async go block and call fetch-url from within that go block, everything is still asynchronous, is it not?


If you're using callbacks at all, then you're not blocking. The main advantage threads (and lightweight threads) have is that they can block.

What you can't do is this:

  (defn foo [ch]
     (go 
       (bar ch)))

  (defn bar [ch]
     (<! ch))

foo starts a go block which calls bar, which then blocks on the channel. For threads that's ok:

  (defn foo [ch]
     (thread ; not sure about syntax here
       (bar ch)))

  (defn bar [ch]
     (<!! ch))

So a function running in a thread can call another function that blocks. A go block can't; that's why go blocks aren't lightweight threads.

BTW, in Pulsar's implementation of core.async, the first example is ok, too.
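The same distinction shows up in plain Java: a thread can block arbitrarily deep in its call stack. A minimal sketch, with a BlockingQueue standing in for the channel:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockAnywhere {
    // bar blocks two frames down the stack -- fine for a real thread
    static int bar(BlockingQueue<Integer> ch) throws InterruptedException {
        return ch.take();                 // blocks here, deep in the call stack
    }

    static int foo(BlockingQueue<Integer> ch) throws InterruptedException {
        return bar(ch);                   // foo just delegates; no special hints
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> ch = new ArrayBlockingQueue<>(1);
        Thread t = new Thread(() -> {
            try { System.out.println(foo(ch)); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        t.start();
        ch.put(42);                       // unblocks bar deep inside t's stack
        t.join();
    }
}
```

This is the property a macro-based go block can't give you, and the one Quasar's instrumented fibers preserve.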


Any chance of someone putting together a benchmark for http://www.techempower.com/benchmarks/ for quasar? It would be nice to see how it compares to other techniques.


Wouldn't this kind of development target be better served by optimizing small C/C++ programs instead of trying to optimize on top of an abstract virtual machine implemented on top of the hardware? I mean, if speed really is your goal, why not do it correctly instead of hitting yourself in the face with an extra tree before starting?


Why do you assume that the JVM adds overhead? While in some cases a program is better served by C/C++ manual memory management and fine-tuned memory alignment, this is not usually the case.

You can think of the JVM as a very good optimizing compiler that compiles your program when you load it in a way that's tailored to your environment.

Also, when it comes to concurrency support, the JVM is usually years ahead of C++ (lock-free data structures, etc.). If you're doing concurrency, the JVM is usually a better target than C++.
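For instance, java.util.concurrent ships a lock-free (CAS-based) queue out of the box; a quick sketch of two producers racing without any locks:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class LockFreeQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // Michael-Scott style lock-free queue from the standard library
        ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<>();
        Runnable producer = () -> {
            for (int i = 0; i < 10_000; i++)
                q.offer(i);               // CAS on the tail; no mutex anywhere
        };
        Thread a = new Thread(producer), b = new Thread(producer);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(q.size());     // all 20000 elements survive the race
    }
}
```

Getting an equivalent, portable lock-free queue in C++ means pulling in a third-party library or writing (and proving) one yourself.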


Not to mention the kinds of programs that would most benefit from lightweight threads are high connection count servers. Precisely the kinds of applications where the JVM weaknesses are most hidden (startup time, base level latency, etc).


It's possible to perform compare-and-swaps in Java just like it is in C. They compile right down to the same primitives that flip bits in metal: you'll get CMPXCHG instructions (using x86_64 as an example) from the JVM just as you will from gcc.
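Concretely, AtomicInteger.compareAndSet is the Java-level spelling of that instruction (HotSpot intrinsifies it to a LOCK CMPXCHG on x86):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger x = new AtomicInteger(5);
        // Succeeds: the current value matches the expected value 5
        boolean ok = x.compareAndSet(5, 6);
        // Fails: the value is now 6, not 5, so nothing is written
        boolean stale = x.compareAndSet(5, 7);
        System.out.println(ok + " " + stale + " " + x.get()); // true false 6
    }
}
```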


But those instructions are embedded within another construct.



