How Erlang does scheduling (jlouisramblings.blogspot.dk)
130 points by davidw on Jan 14, 2013 | 22 comments



A very interesting read. I'd be interested to hear from people who use languages like Haskell, which I'm not very familiar with, about their take on this.

Amongst a lot of positive things one can say about this scheduling system, I think there are a couple of downsides:

* The scheduler needs to be able to interrupt things, so my guess is that it's harder to compile code down to native code, otherwise you might not be able to interrupt a tight loop.

* The bit about regular expressions being implemented in Erlang so as to be able to keep track of how much time they've taken up. This has a cost: slower regular expressions because you're doing them in Erlang rather than C, and not sharing nice C implementations. The one in Tcl is pretty good and very liberally licensed, for instance. Turtles all the way down means having to reimplement stuff instead of just borrowing it.

That said, for certain things, Erlang is superb.


(author here)

Haskell and Go just take a different approach. For some problems that approach is way better, so there is no single best way to do this. In fact, the code I have written in Haskell does not seem to have suffered at all, even though it was highly concurrent. Note, though, that what I wrote was throughput-oriented.

It is not harder to do this with a compiled language. You just need to insert check points regularly in your generated code, which means making sure that loops check regularly. Sometimes you can compute the trip count and decide before entering the loop whether you have enough reductions left for it; if not, you can jump to a slower version. Depending on the loop and architecture, decrementing a counter on each iteration tends to be cheap.

As for the regular expressions: it is PCRE, written in C, but the library has been instrumented with points at which the computation can be stopped and preempted, to keep it from hogging a core. The slowdown is very small, but long matches will be split up and interleaved with other jobs.


Perhaps I'm thinking of an older version or something. It's listed as being quite slow here:

http://benchmarksgame.alioth.debian.org/u32/performance.php?...

And the Erlang code even has the comment "re module is damn slow.", but I admit to not following Erlang super closely and maybe that's no longer accurate.

It'd be interesting to see if there are languages out there that compile to native code and do what you say. I know Erlang has "hipe", but I'm not sure how that works under the hood - maybe another blog post? :-) HiPE seems to make Erlang fast, but not quite in the same ballpark as JIT'ed languages:

http://benchmarksgame.alioth.debian.org/u64/which-programs-a...


It should be pretty obvious that something is amiss here.

Erlang is usually not the fastest language for (purely) CPU-bound tasks, and this is one of those. But being 34 times slower than a JavaScript engine tells you that something is not done right. In fact, there is an interesting alternative program which is only a factor of 2 slower. I would expect Erlang to be slower here, since regular expression handling is something you rarely do on critical paths in Erlang, but 34 times slower is way above target - and that tells me something is going wrong in a deep way in that code. Also, luckily, in the real world you are allowed to cheat on heavyweight computation: write it in C, OCaml, OpenCL, CUDA, GHC+DPH+REPA, or what have you, then use Erlang to manage the data as payload and move it around.

I have no knowledge of HiPE. At least not enough to write a blog post on how it works right now. I would have to study it a bit before I can do that. There are plans for an alternative JIT'ed backend for Erlang though. That said, I don't think Erlang will ever be "fast" since the language isn't statically typed, and there is currently no provision for type annotations.

The main reason for having a JIT for Erlang would be that more of the runtime can be written in Erlang, which makes the system more portable. This is almost always a win.


So JavaScript actually has the fastest implementation of regular expressions; I'm not sure why something's "obviously amiss."


the factor of 34 is. (and the fact that there is a program that is only a factor of 2 slower suggests that as well.)


How does "the factor of 34" tell us that something's "obviously amiss"?

Maybe that's more than you expected, but incredulity is a weak argument.

Why does there being "a program with only factor 2" suggest that something's "obviously amiss"?

Maybe that program does not use regular expressions for all the substitutions.

https://alioth.debian.org/forum/forum.php?thread_id=14805...


>> It should be pretty obvious that something is amiss here. ... 34 times slower is way above target - and that tells me something is going wrong in a deep way in that code. <<

Please tell us what is "going wrong in a deep way".


Intellectual dishonesty is certainly wrong in a deep way.

Several of us here obviously have our jaws dropping thinking: "What!? 34 times slower when typically perfs are quit ok, what is going on here?!"

Where do you put the limit? You realize that your "counter argument" makes no sense.

918 times slower? Would that be enough?

What about 8734 times slower?

Where do you draw the line?

Why do you keep insisting that "34 times slower" is "nothing to see here, move along"?

What about 1923408130294870918234 times slower? Would that be enough for you to consider something may be wrong?


>>"What!? 34 times slower when typically perfs are quit ok, what is going on here?!"<<

"perfs" to do what exactly are "quit ok"?

Please show us your regex processing Erlang program and your time measurements.

---

fyi exactly the same C program can take 3,600 times longer using a C interpreter than when using GCC.

http://benchmarksgame.alioth.debian.org/u32/program.php?test...


JavaScript engines compile regexps down to machine code using the same JIT infrastructure they use for JS code. It is not surprising that the result is the fastest regexp matching.


I worry about and benchmark this sort of stuff on a daily basis (in the context of Manticore, a parallel dialect of Standard ML), so I have a few data points to throw out:

- Pinning to cores is not a good idea unless you have also pinned every other thread in the system to a separate core. Even with Very Smart schedulers, ill-timed interrupts from OS threads or just intermittent services can end up stealing a core that a thread is pinned to, requiring the system to notice and migrate. We have seen identical median execution times, while removing all of the really bad outliers, by pinning to a package rather than a core and relying on the OS scheduler to try not to blow out L1 cache by moving you around, on both Intel and AMD architectures.

- I am not as familiar with Haskell, but in our implementation of ML we have interruption points where the scheduler can hop in, both at allocation sites and at a variety of other points: any loop, a C library call, etc. By doing it at known points, and introducing extra ones, we keep it down to 10 or so instructions before a check, with a check that is a single in-memory pointer comparison (the allocation pointer gets zero'd out, for the curious), _and_ we then produce all of the information that is also needed by the garbage collector if some series of actions leads to a GC at that point. Which can happen even if you weren't allocating, if, for example, the other threads in the system determine there's a need for a synchronized global collection.

- Not mentioned in this article is anything about the Erlang GC's interaction with the scheduler and NUMA memory. In systems with lots of threads, if you don't have a fairly concurrent GC at the per-thread level combined with global collections that maintain locality (remember each CPU package has a separate chunk of memory!) _and_ your scheduler isn't aware of where the code is running, it's easy to have a poor GC and scheduler decision lead to really bad performance outliers.

Erlang is far more mature than Manticore, so I'm sure they've dealt with this, but it's a key aspect of engineering such a runtime system on modern multicore machines, so I was surprised not to see anything about it.


"a fairly concurrent GC at the per-thread level combined with global collections that maintain locality"

Each Erlang process gets its own independent little memory space. GC is trivial to run per-Erlang-process as a result. The downside is, no sharing between processes, even if you want it for some reason. There is some sharing of binaries with reference counting, though, with the usual problems of potentially accidentally holding on to megabytes of memory because one process is holding on to a four-byte slice. (Though you can make copies in advance if you know that's going to happen, and usually you have or can get a pretty good idea.)
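The sub-binary pitfall can be sketched as a refcounted buffer plus an offset/length view (everything here is invented for illustration, not BEAM's actual representation): the tiny view keeps the whole parent buffer alive until it is released, and copying the slice out is the escape hatch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long refcount;
    size_t size;
    unsigned char data[];  /* the actual (possibly huge) payload */
} big_bin;

typedef struct {
    big_bin *parent;       /* the view pins the whole parent... */
    size_t off, len;       /* ...even if it covers only 4 bytes of it */
} sub_bin;

big_bin *bin_new(size_t size) {
    big_bin *b = malloc(sizeof *b + size);
    b->refcount = 1;
    b->size = size;
    return b;
}

void bin_release(big_bin *b) {
    if (--b->refcount == 0)
        free(b);
}

sub_bin slice(big_bin *b, size_t off, size_t len) {
    b->refcount++;                       /* cheap view, heavy liability */
    return (sub_bin){ b, off, len };
}

/* The "copy in advance" fix: own just the bytes you need, then drop
   the reference so a megabyte parent buffer can actually die. */
unsigned char *slice_copy(sub_bin *s) {
    unsigned char *p = malloc(s->len);
    memcpy(p, s->parent->data + s->off, s->len);
    bin_release(s->parent);
    s->parent = NULL;
    return p;
}
```

Here a 4-byte `sub_bin` over a 1 MB `big_bin` keeps all 1 MB allocated until either the slice is released or `slice_copy` detaches it.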

I don't know anything about specific NUMA support. A quick Google turns up little to nothing. It is possible that, just by the way it is structured, it tends to work out reasonably well; less well than it perhaps could with deep integration, but better than a program written in most languages with no thought given to it. Processes and the RAM they use tend to end up tightly bound together in small contiguous memory arenas just by the way Erlang works. (It is not uncommon to have a system with thousands and thousands of processes that each may only be holding on to a few hundred bytes of data that actually belongs to them.)

I've been learning Go lately. I so badly want a decent Erlang replacement whose syntax and basic libraries are so much more friendly. And I respect Go and think it has a future, perhaps even a bright one. But I have to admit it has only deepened my respect for Erlang. Erlang is the shy librarian of computer languages; the outside may be superficially unappealing, socially awkward, not like the other glitzier girls, but at the end of the movie she takes her glasses off, lets her hair down, and dazzles the audience by being the best dancer there is... if you stuck with her long enough to get to the end of the movie.


> Not mentioned in this article is anything about the Erlang GC's interaction with the scheduler and NUMA memory.

Great point. Erlang has one of the best GC implementations. Not because it is fancy or has sophisticated algorithms or clever trickery in it; it is one of the best because of how the rest of the system was thought out and engineered. The main idea is this: since processes are isolated _and_ there is single assignment, and each process has its own private heap, GC becomes elegant, obvious, and something that doesn't break the system under load. No need to coordinate a stop-the-world pause among 100K running processes serving requests.


> I think there are a couple of downsides

It's a trade-off; there is no free lunch. Erlang, besides high fault tolerance, also has soft real-time capabilities as a goal. Some people don't like that term, but it basically means trying to minimize the latency of the system (even under load).

It is one thing to analyze the system in a sequential manner, compile, optimize the hell out of it down to assembler, bring up a client, measure, and find that everything looks great. Then it comes time to deploy the system in the field, where hundreds of thousands of clients are supposed to connect, and we find that many of them are dropping because of timeouts and the system is spewing socket errors.

So what happened? The system that was optimized for sequential throughput will not necessarily perform well under concurrent load. The two are often at odds with each other.

So the Erlang authors, who slowed down the sequential code by inserting those pesky interruptions into what could be tight C loops, just optimized for a different use case.


what the article fails to mention is that BIFs can mess with your timing -- a BIF call costs one reduction, no matter how long it takes. I'd say this is comparable in seriousness to the array-operation example, since BIF calls are probably more common.

edit: nonetheless, I like erlang; it is a brilliant tool for a small but important class of problems. When comparing erlang to akka recently, I found that erlang indeed does quite a good job of delivering predictable performance (all on a very limited benchmark), while akka is less reliable in that regard. However, I did construct a very simple network topology that was as similar in both cases as possible, which excluded features like routers in akka.

Akka's performance on average tends to be better, though (roughly the same for pure message passing; the more work you add, the more Akka tends to outrun erlang).

So, it's, as always, a tradeoff.

edit #2: I regard erlang's scheduler as one of the advantages erlang has: allowing all processes to make progress at the same time is amazing. If you compare this with task-based parallelism backed by a thread pool of size N, only N tasks are able to 'make progress', while in erlang, every process gets a fair share. That's really, really cool.


BIFs are written in a better way than you think. They all manipulate the reduction counter internally if they run complex patterns with loops. This in effect means you avoid the problem of a BIF hogging the CPU.


From the article:

  [n2] This section is also why one must beware 
  of long-running NIFs. They do not per default 
  preempt, nor do they bump the reduction counter. 
  So they can introduce latency in your system.
The OP calls them NIFs, but it is essentially what you are saying.


A NIF isn't the same thing as a BIF. NIFs were introduced in R13B03 and are, in short, user code written in C. BIFs, on the other hand, are built-ins implemented in a completely different manner.


the article mentions bifs. (and i don't think benchmarks are very clear between akka and erlang. erlang is a proven technology though.)


I just want to point out that the comments in the article are almost as interesting as the article itself.

Don't miss them.


Just wanted to point out that OS X has a thread affinity API, https://developer.apple.com/library/mac/#releasenotes/Perfor... in case anyone actually needed it.



