Implementing a fast interpreter without resorting to assembly

fexl · on May 27, 2011

Brilliant. Fascinating how he slides from switch-dispatching to function pointers, deals with the problems there, then does that thing where the function pointer returns the next function pointer. But that's just skimming the surface of the brilliance there. The intricacies of his observations have a sparkling beauty.

kabdib · on May 28, 2011

I've worked with him. He's a smart guy. :-)

haberman · on May 28, 2011

> and thus debunk one of the greatest virtual machine myths of all - the blind faith in the use of jitting as a performance accelerator.

It's not blind faith, it's demonstrated in pretty much every situation where an interpreter and a JIT are both available for the same language. All of the fastest VMs out there today (V8, LuaJIT, JVM, etc.) are based on a JIT.

> The performance level made possible by the Nostradamus Distributor also means that [interpreter-based] Bochs could purely interpreted pretty much match the performance of [JIT-based] QEMU, although I admit that is still left to actually implement and try out.

This sounds like blind faith to me.

sb · on May 28, 2011

Just for the record, while I agree at least in principle regarding your assessment of interpreters vs. JIT compilers, the situation seems to be far from clear though, and I think the last word is not yet spoken on that topic.

As far as I am concerned, most of the benchmark suites out there give an unfair advantage to JIT compilers. For example, all numeric JavaScript and Python benchmarks can be heavily optimized by JIT compilers (essentially removing all of their interpreters' weaknesses: (un-)boxing, dynamic typing, and in the case of Python, reference counting; plus removing the interpreter overhead [i.e., instruction dispatching]). Many of the benchmarks are numerical in nature, too, even if the actual workload is usually non-numerical. So it might very well be that your actual workload does not use any of the fancy numerical operations that a JIT can optimize heavily. In such a case, the additional memory consumption of the code caches and the additional memory requirements of a generational garbage collector may in fact not give you any practical speedups in comparison to a sophisticated interpreter using just reference counting (which is known to be very space efficient).

Aside of this unfair skewing of benchmarks towards numerical computations, there are other points to consider in the discussion of JIT vs. interpreters, such as energy consumption. Does a programming language implementation using a JIT subsystem require more or less memory than an interpreter? (I am positive that some companies have already measured this, but there are AFAIK no publications concerning this important question.)

Summing up, I think--as is so often the case in computer science--which of the two techniques give the best results depends heavily on a) what trade-offs you/your customers are willing to make [space vs. time] and b) the actual performance is of your workload [numerical vs. non-numerical].

sb · on May 28, 2011

EDIT: Of course, the question in the third paragraph should read:

Does a programming language implementation using a JIT subsystem require more or less energy than an interpreter?

sb · on May 28, 2011

I have just read the post and have to say that I would appreciate the post containing a more accurate description of the testing methodology. Some of the techniques described are known in interpreter optimization and his description of "threaded code" intepreter is actually not consistent with what it usually means (viz. threaded code interpreters have nothing to do with "threading" the decode for the successor instruction into the operation implementation of the current instruction, but just moving instruction decode and dispatch into the operation implementation.)

Aside of that, there have been papers detailing the software pipelining approach (cf. Hoogerbrugge et al.: "A code compression system based on pipelined interpreters", 1999), but I cannot for the love of god imagine that the loop shown in there is faster than a JIT compiler for a very simple reason: each of the interpreter operations is invoked via a function pointer, which means that the compiler emits an indirect branch instruction for implementing this. Now, these indirect branch instructions are very costly, and a simple JIT compiler could just take the actual values (callee target addresses) of the input program and generate direct call instructions instead. (And I am not even talking about inlining the function bodies of the operation implementation.)

dfox · on May 28, 2011

Indirect branch is not too costly when it is correctly predicted and due to proliferation of C++ code, modern CPUs are very good at predicting indirect branches when they mostly lead to same target.

radarsat1 · on May 28, 2011

I'm not sure I understand the attraction of writing a fast interpreter, as opposed to writing a JIT or ahead-of-time compiler. If you're going to the trouble of optimizing the hell out of your interpreter, yet compiling will always result in faster execution time, why not go directly from an easy, one-off slow implementation to writing a compiler, instead of spending your time optimizing the execution loop?

Retric · on May 28, 2011

Writing an interpreter can be a lot simpler than coding an equivalent compiler. Suppose you have an old game you want to play that was written in hand crafted 8080 ASM. You can emulate an old CPU at reasonable speeds and get a complete working solution. But, if uses say self modifying code there really is no way to directly map 8080 to x86(64) instructions.

But, a more common case is it let's you debug systems. Even if you can't run an iPhone app in full speed on your desktop you can inspect what's going on in detain with an emulator in a way you can't really do with compiled code.

beagle3 · on May 28, 2011

A fast interpreter is not an easy task, but it is still much easier than a jit or aot compiler that provides similar performance.

Case in point - LuaJIT2's interpreter is faster than LuaJIT1 compiler (LuaJIT2's compiler is significantly better than either) - and the interpreter is much simpler, although neither is in no way simple.

Writing a good compiler for dynamic language is extremely hard - PyPy took 6 years to reach speed parity with the Python interpreter (and has been improving since), Python2C and UnladenSwallow didn't manage to improve on it in any significant way. V8 / JaegerMonkey are super-complex JITs that seem - at this point in time - to be comparable with the performance of LuaJIT2's interpreter.

Both LuaJITs were written by Mike Pall, single handedly as far as I can tell.

If you read Mike Pall (mikemike on reddit, and whatever he puts up on the lua-list and luajit.org), you'll appreciate interpreters more.

DarkShikari · on May 28, 2011

Writing a JIT is often very difficult, and requires making many assumptions (particularly regarding self-modifying code, in the case of emulators). Some languages and architectures are inherently difficult or inefficient to JIT.

sb · on May 28, 2011

At the time of writing, some of the other commentors have already provided answers regarding the ease of implementation aspect of interpreters. Another important advantage of implementing an interpreter is that it usually gives you portability for free (or with modest extra works).

The nice point of figuring out optimization of interpreters is that you get the speedup on all architectures, while in a JIT you usually have to change the backend (or worse if you have to create a new backend from scratch.)

edit: I am sorry, another commenter already hinted at the portability aspect.

Zaak · on May 28, 2011

No need to write a separate back end for each target platform.

dexen · on May 27, 2011

The content of the website is beyond awesome.

http://www.emulators.com/docs/nx33_qemu_0125.htm first caugth my eye.

then http://www.emulators.com/docs/nx30_bochs245.htm

wolfgangK · on June 4, 2011

Too bad I'm coming late to the party, but those interested by this post might be interested by the test implementations that I made a few months ago to bench various design parameters of interpreters (bytecode size, gCC computed goto extension...) in high level (I.e boost::variant<>) C++. I don't have the numbers at hand but I remember it was very CPU dependant. You can try for yourself, it is only a few self contained small programs : https://github.com/scientific-coder/Computer-Languages/tree/...