Ruby 2.1: Out-of-Band GC

gary4gar · on Jan 30, 2014

It would be nice to get these patches in core as part of ruby 2.1.1 so others don't need to patch & recompile ruby from source.

Other that, anything that improves performance is a welcome change.

gnufied · on Jan 30, 2014

I think Aman already proposed those changes to ruby-core - http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/...

He is a member of ruby-core too, so I guess if enough people in core like these patches they should go right in. However I don't see OOBGC patch there in that list.

gnufied · on Jan 30, 2014

I have not been reading carefully, just noticed that OOBGC stuff is part of - http://rubygems.org/gems/gctools

exDM69 · on Jan 30, 2014

Very nice to see improvement in Ruby's GC. Back in Ruby 1.8 days, I read the source of the GC implementation and I was less than impressed. I should take the time and revisit that code to see what kind of improvements have been done.

steveklabnik · on Jan 30, 2014

In that time frame, the interpreter has been entirely re-written, and we've gone through three or four different garbage collectors.

So yeah, it should be quite different.

jmcgough · on Jan 30, 2014

This is a bit old now, but this blog post has a great description of the GC changes that came with MRI 2.0: http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...

rurounijones · on Jan 30, 2014

MRI Ruby seems to be making a lot of progress in the implementation details recently. Good to see.

munificent · on Jan 30, 2014

Anecdote time!

I've been hacking on a little scripting language[1] lately. To see how its performance compares, I run a few benchmarks[2] against other similar (dynamically typed, bytecode compiled) languages: Lua, LuaJIT (interpreted), Python, and Ruby.

Like others, I had internalized "Ruby is slow" through osmosis. But the version of Ruby I happen to have on my machine is 2.0.0.

In my little benchmarks, it turns out Ruby is one of the fastest. I haven't compared to 1.8.7, but I'm guessing that was much slower.

[1] https://github.com/munificent/wren [2] https://github.com/munificent/wren/tree/master/benchmark

If you want the gory details, here's the results of running them against Lua 5.3.2, LuaJIT 2.0.2, Python 2.7.5, and Ruby. (Yes, I should try against Python 3. I will.):

                          score time   wren score relative
    binary_trees - wren   3193  0.31s
    binary_trees - lua    1366  0.73s  233.72%
    binary_trees - luajit 6256  0.16s   51.04%
    binary_trees - python 1376  0.73s  231.96%
    binary_trees - ruby   3003  0.33s  106.34%

    fib - wren            3078  0.32s
    fib - lua             2785  0.36s  110.52%
    fib - luajit          6900  0.14s   44.62%
    fib - python          1331  0.75s  231.37%
    fib - ruby            3548  0.28s   86.77%

    for - wren            6080  0.16s
    for - lua             1990  0.08s   50.71%
    for - luajit          5914  0.02s   13.24%
    for - python          2825  0.35s  215.25%
    for - ruby            6595  0.15s   92.19%

    method_call - wren    4707  0.21s
    method_call - lua     1674  0.60s  281.17%
    method_call - luajit  4221  0.24s  111.53%
    method_call - python   767  1.30s  613.69%
    method_call - ruby    3061  0.33s  153.77%

As you can see, Ruby fares quite well.

alberth · on Jan 30, 2014

>> "To see how its performance compares, I run a few benchmarks against ... LuaJIT (interpreted)"

Why would you use LuaJIT in interpreted mode?

That defeats the main reason for using LuaJIT, and you'll see massive improvement in speed if you run it not interpreted [1]

It should also be noted that, even in the slower interpreted mode of LuaJIT - it was still the fastest in completing your benchmarks compared to all other languages/implementations. It will run even faster if you don't run LuaJIT in interpreted mode.

[1] http://luajit.org/performance_x86.html

munificent · on Jan 30, 2014

Running LuaJIT interpreted is a deliberate choice.

Some platforms, in particular iOS and game consoles, do not allow generated machine code. I want my language to be usable for games, and that's one reason I'm using bytecode. (Simplicity is the main reason.)

Given that, I think it's most helpful for me to compare Wren's performance to other bytecode language implementations. Comparing to JIT compiled languages is a bit apples/oranges. They tend to be something like an order of magnitude faster (and an order of magnitude more complex!).

Meanwhile, LuaJIT's interpreted mode is very fast as you note. I think it's one of the fastest bytecode interpreters around, so it's a great target for Wren to aim for.

If you want a lot more detail on my thoughts about this, see here:

https://github.com/munificent/wren/blob/master/doc/site/perf...

the8472 · on Feb 2, 2014

What makes ruby (MRI particularly) slow are metaprogramming things that slow it down when done at runtime (dynamically definding methods and classes), its global interpreter lock (preventing proper parallelism) and the single-threaded super-slow GC.

Integer math is fast because they're using tagged pointers and thus don't need to allocate real objects for them. Method calls also are reasonably optimized, although they don't have inline caches afaik.

You're basically testing a few tiny bits of the runtime that happen to be reasonably fast. The big frameworks like rails do a lot of stuff that is horribly inefficient. Combined the aforementioned single-threadedness and slow GC leads to ruby being slow in real-world applications even if a few microbenchmarks are ok for an interpreted language.

mcguire · on Jan 30, 2014

"dynamically typed, bytecode compiled"

I haven't followed Ruby development, but the last time I heard, Ruby 1.8 and prior was a tree-walking interpreter---simple to implement but not fast at all.

srd · on Jan 30, 2014

This changed in 1.9. Ruby is now compiled into a stack machine based bytecode. Ruby 1.8 used the AST internally, you're right there. This made e.g. method calls very slow compared to the new byte code representation.

steveklabnik · on Jan 30, 2014

1.8 -> 1.9 was an entire re-write, it's now a bytecode compiled language: http://www.ruby-doc.org/core-2.1.0/RubyVM/InstructionSequenc...

munificent · on Jan 30, 2014

That's correct. I still think Ruby 1.8 would be a useful data point just because it's a widely used language that many people are familiar with, but in this case I'm comparing it to Ruby 2.0, so it is apples/apples.

For what it's worth, I look at these benchmarks as validating feasibility more so than really trying to win some performance fight between languages. In order to argue that my language is suitable for real world use, I need to show it's performance is comparable to other languages that are widely used. From that angle, Ruby is a fine comparison, regardless of how it's implemented.

FooBarWidget · on Jan 31, 2014

We've now added support for the Ruby 2.1 Out-of-Band GC in Phusion Passenger: http://blog.phusion.nl/2014/01/31/phusion-passenger-now-supp...

kcorbitt · on Jan 30, 2014

As someone woefully uninformed about these things, why can't GC be implemented as a separate thread, maybe with a lower priority than the primary interpreter? Would a separate thread not be able to count references to objects or something?

judofyr · on Jan 30, 2014

Quoting ko1 (https://bugs.ruby-lang.org/issues/8339#note-11):

    > Parallel tracing needs an assumption that "do not move (free) memory
    > area except sweeping timing". Current CRuby does.
    > For example: "ary << obj". Yes, the CRuby's memory management strategy
    > (assumption) is different from normal interpreters.

m0th87 · on Jan 30, 2014

Isn't memory only shuffled around during compaction? Why not mark/sweep in parallel, then stop-the-world at compaction time?

Skinney · on Jan 30, 2014

In a nutshell, generational GCs works by scanning live data. Thus, anything not scanned is garbage. While the application is running, it also allocates new data.

If you allocate an object somewhere --after-- that area has been scanned, the GC will treat that area as garbage, and then you will have memory corruption, leading to segfaults and other nice things.

While parallel mark/sweep is certainly doable (Java does this) it's not easy to get right.

riffraff · on Jan 30, 2014

notice that refers to parallel (multiple GC threads) but doesn't rule out concurrent (one gc thread concurrent to the application thread)

Skinney · on Jan 30, 2014

Wouldn't two threads possibly run in parallel in a multi-core system?

Concurrency, as I understand it, is dividing things into tasks that can run (to some degree) independantly of eachother.

Parallelism is two or more concurrent tasks running simultaniously.

This if you have one gc thread and one application thread, you have the possibility of them running in parallel.

riffraff · on Jan 30, 2014

yes, they can run in parallel.

But in the context of GC (in my understanding) the distinction goes like

* concurrent GC is when (some phases of) the garbage collection happen in a separate thread from the main program.

* parallel GC is when (some phases of) the collection are performed by multiple threads.

So e.g. in a mark&sweep collector you can have a background thread that performs the mark phase concurrently or you can have a parallel collector that uses multiple threads to perform the mark phase.

In MRI you can't do the latter but it could be possible to do the former, according to the thread referenced earlier.

wingo · on Jan 30, 2014

GC needs to know about all references in the program. If the mutator (the program) is running concurrently with the collector, it's difficult (though not impossible) for the collector to construct this consistent view.

Here is a good introductory talk: http://www.infoq.com/presentations/Understanding-Java-Garbag...

mcguire · on Jan 30, 2014

Many styles are implemented that way. Check out Java for (many) examples.[1]

Naively, with Ruby's mark-sweep collector[2], you might want to do the sweep phase in parallel, adding garbage blocks to the free list. You could do the mark phase as well, with I believe a pause to make the marks a consistent snapshot.

[1] http://www.oracle.com/technetwork/java/javase/tech/g1-intro-..., for one.

[2] http://tmm1.net/ruby21-rgengc/

eternalban · on Jan 30, 2014

One issue is shared mutable state. Typically a GC requires some sort of snapshot of the world state to determine liveness of memory objects.

riffraff · on Jan 30, 2014

still, some of the work can be done in parallel I.e. the jvm has had UseConcMarkSweepGC for many many years.

EDIT: what I mean is that there is still a need for locks, but a lot of work can be done concurrently.