Making Ruby Faster

kazinator · on Nov 14, 2014

Though optimizations in an interpreter are valiant efforts, the way to optimize the language is to develop a compiler.

There are good reasons not to start doing hairy optimizations in the interpreter, like that it establishes the semantics which is helpful when compiled and interpreted semantics differ, and that interpreter optimizations will become moot once you have a compiler.

About this Ruby optimization: one thing that stands out is that there doesn't appear to be any way to turn it off. If such a hard-coded interpreter optimization breaks, the only way to try something without the optimization is to revert to a build of the interpreter which didn't have it. That may not be possible, so then you may have to build the current interpreter, but with that change reverted.

Compilers usually have switches for selecting various optimizations. Of course, a similar switch in an interpreter has a run-time impact: a "do this optimization" flag has to be checked each time there is an opportunity to do that optimization.

jeffreyrogers · on Nov 14, 2014

Mike Pall, the writer of the LuaJIT interpreter/compiler, suggests that a good interpreter can get a large portion of the gains you'd get from a compiler. At one point the LuaJIT interpreter (not the regular Lua interpreter) was within about a factor 2 of the JIT. I think the JIT has since improved, but you can still get significant benefits from simply improving the interpreter. (I don't have a reference for this and can't look it up now, so maybe someone else can help me on this... I do, however, remember reading something by Mike Pall that said this. I think it was on Lambda The Ultimate.)

Of course, the best thing to do is to optimize the interpreter and then JIT compile the CPU intensive stuff.

gioele · on Nov 14, 2014

> a good interpreter can get a large portion of the gains you'd get from a compiler.

But there is a fundamental thing that a pure interpreter cannot avoid: compiling the source every time your application runs. This leads to long loading times.

For long lived applications this is not a big problem, but for short lived programs this can be killing. Currently `ls` in a directory with 10k files takes "0m0.008s", `git status` on the same directory takes "0m0.020s". You can barely warm up your interpreter in that time.

riffraff · on Nov 14, 2014

"LuaJIT's interpreter (!) beats V8's JIT compiler in 6 out of 8 benchmarks and is not too far off in another one"

http://lambda-the-ultimate.org/node/3851?a=1#comment-57761

mraleph · on Nov 14, 2014

This was back in 2010 when V8 did not really have an optimizing compiler - V8's "compiler" was a baseline one essentially gluing together individual interpretation patterns.

Also any cross-language comparison should be done very accurately - because we are talking about different language semantics and different benchmark implementations.

riffraff · on Nov 14, 2014

well, the point wasn't "an interpreter is always faster than a jit", but "a good interpreter can get a large portion of the gains you'd get from a compiler".

If you prefer apples to apples, quoting Mike Pall again[0]

"the LJ1 JIT compiler is not much faster than the LJ2 interpreter, sometimes it's worse".

[0]: http://lambda-the-ultimate.org/node/3851#comment-57646

kazinator · on Nov 14, 2014

If improvements in A gives you gains, whereas the use of some other B also yields more gains, then you can always say that any of the A gains are "a portion of" the B gains.

A compiler that only beats interpretation by 2:1 is either a poor compiler or something else is going on, like most of the work actually being done by subroutines that cannot be whose performance is not being affected by the compilation (Because, for instance, they are written in C and intrinsic in the language run-time).

There do not have to be explicit calls to such functions. For instance, compiled arithmetic that is heavy on large bignums will probably not be much faster than its interpreted version, because cycles are actually spent in processing the arrays of bignum digits (or "limbs"), which is done in some bignum library code. The code being compiled looks innocuous; it just has formulas like (a+b)*c, but these turn into bignum library calls. Since the bignum library is written in C and compiled, the calls run equally fast whether called by interpreted or compiled code. That's where most of the time is spent, and so compiling the interpreted code makes no difference overall, even if 90% or more of the time spent there is knocked out by the compiler. (Amdahl's Law.)

mraleph · on Nov 15, 2014

LuaJIT1 compiler was also pretty basic in the way it worked - it did not do any major speculative optimizations afaik, so comparing LJ2 interpreter against it is not of much interest.

Now check out performance graphs of LuaJIT2 compiler+interpreter vs interpreter modes[1].

Anything remotely computationally expensive is 2x faster with compiler and you can go up to 28x for integer number crunching.

I do believe that original point "a good interpreter can get a large portion of the gains you'd get from a compiler" can't be correct simply because it is too broad and ill-defined. What are the gains you expect from the compiler? How can "large portion" be defined? All of these really depend on many things: from the language itself to concrete design decisions in compiler/interpreter.

[1] http://luajit.org/performance_x86.html

strmpnk · on Nov 14, 2014

Twiddling around with the radio buttons on this page show some interesting performance ratios: http://luajit.org/performance_x86.html

munificent · on Nov 13, 2014

Really neat article, though it makes me wonder why they didn't just remove `sym_equal` entirely and replace all usages of it with `rb_obj_equal`. At the very least, that `#define` needs a comment saying it must have the same identity as `rb_obj_equal` or a perf loss will happen.

timr · on Nov 13, 2014

Well, among other things, you can override the object equality check in Ruby -- so you can't simply rely on pointer identity as a check of equality.

That said, yeah, it'd be nice if you could just make the assumption that if two objects point to the same memory, they're de facto identical...

munificent · on Nov 13, 2014

> so you can't simply rely on pointer identity as a check of equality.

Sure, but we aren't talking Ruby semantics here, we're talking C. I don't see why it's useful to still have a sym_equal identifier that's just an alias for rb_obj_equal.

jrochkind1 · on Nov 13, 2014

You can make that assumption for symbols. And that is in fact what the code was doing -- both before and after optimization. The optimization was about something else, weirder, that was still making string comparison (where you can't just check the pointer) faster even though it ought to be slower. Did you read the article?

timr · on Nov 13, 2014

"The optimization was about something else, weirder, that was still making string comparison (where you can't just check the pointer) faster even though it ought to be slower. Did you read the article?"

I'm the author of the article.

The optimization here is that the check for plain-old symbol equality is happening without the need for a full ruby method dispatch. You can't make that assumption in the general case, because it's possible to override equality in Ruby, which then requires more work.

If you look at the full source code for the method in question, you'll see that it does special checks for Fixnums and Floats and Strings, then a check for the default object equality (i.e. does the comparison use rb_obj_equal?), then, finally, it falls back to a full method call.

jenrzzz · on Nov 13, 2014

Actually, he wrote it!

mtdewcmu · on Nov 13, 2014

>> remove `sym_equal` entirely and replace all usages of it with `rb_obj_equal`. ...it must have the same identity as `rb_obj_equal`

That's exactly what #define does. It replaces all usages and gives the token the same identity as the other token. As a C programmer first and foremost, it looks reasonable to me.

munificent · on Nov 13, 2014

Yes, I understand the semantics of #define. My point is that someone seeing a reference to sym_equal may not realize that (1) it's identical (at the C level) to rb_obj_equal and that (2) that fact has important performance ramifications.

First class functions are super useful in C, but I think it's pretty rare to rely on their identity at runtime. If you are doing that, also aliasing the same function by another name seems like a recipe for confusion.

jenrzzz · on Nov 13, 2014

It's an optimization hack. By nature, they aren't that nice to look at.

`sym_equal` only makes 2 appearances in the whole codebase (to be bound to `Symbol#==` and `Symbol#===` in string.c), so there's not a whole lot of room for confusion.

Also, this is Ruby core! Smart people do what works and makes sense to them and don't spend a lot of time worrying about how clear it is to us unwashed masses.

wmt · on Nov 14, 2014

Come, that's just plain wrong to say that the code quality cannot be questioned because it's an optimization and because it's made by "smart people" and it "makes sense to them". No code is sacred.

It's perfectly valid to question using a macro to rename a function instead of just renaming calls, especially if the renamed function is only called twice. It also would've been more pragmatic to mention in the code that rb_obj_equal is being used due to an optimization in opt_eq_func, as it's not obvious.

There might be reasons for using a macro in this case, like wanting to underline the fact that symbols are being compared, but I strongly suspect the author was just focused on changing the implementation of sym_equal to be better, and didn't think of the option of throwing it away.

jenrzzz · on Nov 14, 2014

I didn't say it couldn't be questioned. I was just explaining some possible reasons why the patch was so esoteric, and pointing out that things that seem straightforward to MRI hackers often aren't to people who don't dig through Ruby internals every day.

mtdewcmu · on Nov 14, 2014

>> My point is that someone seeing a reference to sym_equal may not realize that (1) it's identical (at the C level) to rb_obj_equal and that (2) that fact has important performance ramifications.

Exuberant ctags will find the #define that goes with sym_equal and make it quick to locate. Upon seeing that sym_equal is #define'd to rb_obj_equal, it will be clear that they're identical at the C level. It may not be obvious why it's a #define, but a cautious programmer would be on alert that changing it back to a separate function would likely have some kind of implications.

Perhaps the ideal solution would be a large refactor to make the whole system as self-explanatory and surprise-free as possible. While waiting for that, this #define looks like a reasonable and low-risk bug-fix.

ramigb · on Nov 14, 2014

backward compatibility?

jaredcwhite · on Nov 13, 2014

Every tiny little thing like this that gets further optimized in the MRI will result in major performance gains in advanced frameworks/codebases down the road. So this is great news. I'm looking forward to a near-future time when the claim Ruby=Slow! is much harder to make.

_qc3o · on Nov 13, 2014

There is an inherent overhead to all dynamically typed languages like Ruby and Python, method dispatch. Overcoming this problem is not easy and even JRuby with all the muscle of the JVM JIT is still pretty slow compared to the statically typed equivalent in Java or any other language that doesn't have to pay for method look-ups during run-time.

The situation is not hopeless though and projects like Truffle/Graal (https://wiki.openjdk.java.net/display/Graal/Truffle+FAQ+and+...) are pushing the boundary of what is possible in terms of performance for dynamically typed languages. There is also PyPy and RPython (http://tratt.net/laurie/blog/entries/fast_enough_vms_in_fast...) which again leverages some neat JIT techniques to make things fast.

chrisseaton · on Nov 13, 2014

This is how Truffle and Graal make Ruby method dispatch fast (linking to my own blog post) http://www.chrisseaton.com/rubytruffle/how-method-dispatch-w....

nly · on Nov 14, 2014

Good article, thanks. It's amazing how such relatively simple techniques get some fancy names.

gsnedders · on Nov 13, 2014

To be fair, JVMs are typically pretty bad at method dispatch as they typically haven't needed to worry about dynamic dispatch (because, after all, typically the main concern has been Java).

JS engines have done a fair bit around dispatch, but ultimately much of that is similar to what was done in Self over twenty years ago.

shellac · on Nov 14, 2014

That's not really true. Java linking happens very late, and so invoking methods at polymorphic call sites does have more in common with 'dynamic' languages than you might think. Hotspot has smalltalk roots, and features like inline caching at call sites together with optimistic fast paths, guards, deoptimisation, and other things that would be familiar from, say, a ruby implementation.

gsnedders · on Nov 16, 2014

Yes — HotSpot does much of this, but the inability until recently to encode "call this function which expects some number of arguments" in the instruction set has historically been the limiting factor. invokedynamic helps a lot, but it adds enough levels of indirection that AIUI often enough isn't inlined for it to really solve everything. I wonder if a tracing JVM would handle it better? Probably.

vanderZwan · on Nov 13, 2014

> There is an inherent overhead to all dynamically typed languages like Ruby and Python, method dispatch

I can't claim to fully understand how they pull it off (not being that familiar with compiler internals), but I thought Julia didn't suffer from this problem?

ihnorton · on Nov 13, 2014

Julia is really designed to feel dynamic while minimizing the run-time dynamicity the compiler must account for. Type inference and method specialization play a big role, and you can read the Julia papers (on julialang.org) for a discussion of many of the design choices.

Many modern JIT compilers use multiple techniques for aggressive specialization of code sections that meet optimization heuristics (whether tracing-based or otherwise), thus carefully written JavaScript (under V8 and others), Lua (under LuaJIT) and Python (under PyPy and others) can be quite fast. On these platforms there are usually implicit or explicit language subsets and coding styles required to get maximum performance out of the compiler (as with Julia: it is possible to write slow code in any language). For example, ASM.js is an optimization-friendly, explicit subset of JavaScript.

chrisseaton · on Nov 13, 2014

Most of the problems with these languages were pretty much solved several decades ago - it's just that most language implementations don't use the techniques as they need a JIT that supports dynamic deoptimization [1], which is non-trivial to implement.

[1] U. Hölzle, C. Chambers, and D. Ungar, “Debugging optimized code with dynamic deoptimization,” presented at the PLDI '92: Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, New York, New York, USA, 1992, pp. 32–43.

jrochkind1 · on Nov 13, 2014

Doesn't ObjC do similar method dispatch? If so, is it considered a slow language also?

pjmlp · on Nov 13, 2014

Yes.

It is quite common to cache methods (selectors) or just re-write them in the C subset, when optimizing code.

orionblastar · on Nov 13, 2014

I used to program in Visual BASIC 6.0 which was considered slow at the time. It is just a matter of using the proper algorithms to speed it up. Almost any programming language targeted at beginners or newbies is going to be considered slow because beginners and newbies don't know the proper algorithms and code tweeks to speed things up yet.

When you deal with sloppy inefficient code, almost any language is going to be considered slow.

I remember trying to make a paint program with Turbo Pascal 3.0 back in the DOS days, it was slow until I learned better algorithms in drawing things and how to write directly to video memory instead of using built in commands like plot.

perishabledave · on Nov 13, 2014

I don't think Ruby is targeted at beginners or newbies if thats what you're implying. The language seems simple at first, but once you start delving into the features that make it a dynamic language it is quite deep.

Rapzid · on Nov 14, 2014

This stuff is always fun : https://github.com/opscode/chef/blob/master/lib/chef/node/im... .

pgz · on Nov 13, 2014

Great article! What I don't understand is why can't they make the compiler inline the function instead inlining manually?

timr · on Nov 13, 2014

This is more than just C function inlining -- the Ruby interpreter, when it does a method call, ends up doing a lot of expensive bookkeeping at runtime to maintain its own internal call stack.

This optimization allows the interpreter to completely bypass a Ruby method call, which is a big win.

pgz · on Nov 13, 2014

I meant having something like:

if (check_cfunc(ci->me, rb_obj_equal)) { return rb_obj_equal(recv, obj); }

and trusting the compiler to inline rb_obj_equal (or using inline), although I reckon you don't probably want rb_obj_equal to always be inlined.

timr · on Nov 13, 2014

Oh, sorry...wasn't clear.

The problem with just doing the rb_obj_equal call for everything is that if people override eql? in their Ruby code, you need to call that overridden method, instead.

The consequences of this decision mean that the comparison operator needs to be more complex.

masklinn · on Nov 13, 2014

Well inlining this way is an optimisation which the compiler may or may not opt to apply, even after adding inlining hints. By manually inlining it they're introducing a little bit of trivial redundancy but I guess they can be certain there will be no indirect call.

Either that, or they never considered the compiler would most likely inline it.

Oddly enough, they do it for string comparisons: https://www.omniref.com/ruby/2.1.4/files/vm_insnhelper.c#lin...

saraid216 · on Nov 13, 2014

FYI, putting two spaces ahead of each line in a code block will wrap it in a pre-formatted tag. In case you didn't know.

gsg · on Nov 13, 2014

They aren't inlining manually. The optimisation works by making more cases fall into an already existing fast path.

edit: I might have misunderstood - I should say that the patch doesn't do any inlining. The fast path is indeed the manually inlined body of the function.

mrinterweb · on Nov 13, 2014

My impression was that ruby symbols were defined once and mapped to the same memory space. If I am correct in my assumption, wouldn't it be more efficient to pass the arguments in as pointers and compare the equality of the pointers?

djur · on Nov 13, 2014

That's actually what they're doing. VALUE is a typedef for uintptr_t, so rb_obj_equal is just testing whether the pointers are the same.

mrinterweb · on Nov 13, 2014

Thanks for looking into that, Matt. It has been a long time since I did any real C programming, and I could not believe that they would be passing by value for this kind of equality check. The word "VALUE" is a little deceiving considering that it is actually a uintptr_t.

stormbrew · on Nov 14, 2014

It is uintptr_t in size, but it's actually a bit more complicated than that. If the 'value' is a fixnum, the low bit is set to 1 and the bottom 31 or 63 bits of the integer are shifted to the left and stored directly in the VALUE [1].

And then there's a few other bit patterns (all with the low bit set to zero) that also mean embedded values [2], including symbols, where they're actually an index into a string table rather than an actual object pointer.

So a lot of the time, VALUE is anything but a misnomer. Also, the name for this pattern is tagged pointer. It's one of the most notable things ruby borrowed from emacs lisp [3].

Also, the entire point of symbols is that they don't need to be dereferenced (or strcmp'd) to compare them. That's not the slow part of symbol comparison.

[1] https://github.com/ruby/ruby/blob/trunk/include/ruby/ruby.h#...

[2] https://github.com/ruby/ruby/blob/trunk/include/ruby/ruby.h#...

[3] https://www.gnu.org/software/emacs/manual/html_node/elisp/Ob...

riffraff · on Nov 14, 2014

it's not, really, VALUE may alternatively contain an immediate value or a pointer, i.e. a couple bits are used for type tagging and the rest is used to hold an int/float/nil/boolean. and avoid an extra struct.

cliffordheath · on Nov 19, 2014

It's not the comparison that's slow, it's the method lookup. String equality bypasses method lookup, so the slightly slower comparison is irrelevant. Also, very small strings are inlined into the object and will have been loaded along with the cache line that accessed the object, so the string bytes themselves are already in cache - and the slow comparison doesn't matter much.

pmontra · on Nov 13, 2014

Almost offtopic because not directly related to symbols performances in Ruby but, coincidence, I was reading a 2011 post from Rubinius site http://rubini.us/2011/02/17/rubinius-what-s-next/ a couple of minutes before seeing this on HN.

At almost 50% of the post there is a code snippet that explains why Ruby method calls must be slower than the ones of statically typed languages. It also explains how Rubinius was addressing that 3 years ago.

Hopefully they'll optimize that in all language implementations.

donpdonp · on Nov 13, 2014

also almost off topic but the lead Rubinius dev just posted a great writeup on rubinius 3.0 which could be a possible future of ruby in general.

http://rubini.us/2014/11/12/rubinius-3-0-part-3-the-instruct...

pmontra · on Nov 13, 2014

Interesting read. Thank you.

skratlo · on Nov 13, 2014

Makes me wonder why the C compiler (gcc?) isn't clever enough here and is not inlining all that non-sense. At least I would expect SufficientlySmartCompiler™ to do so. Doing the inline by hand just proves how dark is the age we're living in.

masklinn · on Nov 13, 2014

> Makes me wonder why the C compiler (gcc?) isn't clever enough here and is not inlining all that non-sense.

There's an indirection through a function pointer (ci->me). The manual optimisation is there to see whether the function pointer matches a baseline pointer, and apply that baseline directly instead of invoking the "generic" dispatch machinery (which involves setting new callframes & al)

If you look at the whole function, there's a few special case before that which statically dispatch the case of a comparison between numbers or strings, so here we're in the "general" case and one last optimisation available is to see whether equality has been overridden at all.

chrisseaton · on Nov 13, 2014

People are often confused about how a JIT can be faster than a static compiler - this is a great example of why that can be the case. A dynamic compiler would be able to speculatively inline through the function pointer, where in a static compiler that is not tractable with what we currently know about compilers.

gsg · on Nov 13, 2014

That's not really true - it's just (possibly partial) defunctionalisation. The problem isn't that we don't know how to do it, but that the necessary whole program architecture has various drawbacks.

See Stalin and MLton for examples of a static compiler performing such analyses.

chrisseaton · on Nov 13, 2014

Consider the case of a program which applies a processing function to pixels in an image. Which processing function to run depends a command line parameter. How would whole program analysis help you know which function you are going to use? But a JIT will see you keep calling the same function and inline it. Not even profile directed feedback will help you if each time you run the program you use a different function.

I know Stalin and MLton but not the research you mention - can you point me at any papers?

gsg · on Nov 14, 2014

It's true that whole program compilation doesn't cover speculation (and many other cases of dynamicism, like running code that you download or construct at runtime). But it does allow inlining through a function pointer as in the OP, which you suggested is impossible for a static compiler.

The classic paper on defunctionalisation is Reynolds' "Definitional Interpreters for Higher-Order Programming Languages". There's also a huge whack of papers at http://mlton.org/References, some of which go into MLton's compilation strategy (I don't remember which ones to point you at, though).

syncsynchalt · on Nov 13, 2014

There's no inline being done in this fix. The equality method has a check for a particular function pointer, and if found executes a fast path.

By using a #define to alias another name to that same function pointer, it hits the same fast path.

The equality method can't be inlined because ruby is a dynamic language and eql? can be redefined.

cliffordheath · on Nov 19, 2014

"Symbols are this unique, quasi-string construct" - not even a little bit unique. Interned atoms were a core feature in the very first Lisp interpreters, and the idea has been copied in many languages since.