As the author of [1]: thanks for the shout-out! I appreciate it.
And if you’re more interested in the Bytecode/VM than the tree-walking part: I wrote and published a direct sequel to the mentioned book, called Writing A Compiler In Go [0], which — just like the tutorial we’re looking at here — shows how to turn the tree-walking interpreter built in the first book into a bytecode compiler and stack-based virtual machine.
I've always wanted to know the answer from someone who actually implemented a VM: Why do people prefer stack-based VMs over register-based? They are obviously a simpler design, but register-based designs are closer to the actual machine, and I've read somewhere that register-based VMs generally perform better.
In addition to the other answers, "closer to the machine" isn't necessarily true either.
The obvious implementation of a register file for an interpreter is using an array of registers, something like:
uintptr_t registers[N];  /* one slot per virtual register */
You can then access the i-th register as registers[i], like the machine does. Except that these registers live in memory, not in CPU registers, so you aren't really that close after all! And the compiler cannot, in general, map these array entries to CPU registers, because they are accessed indirectly, by index. Also, you typically have a different number of virtual registers than the CPU has physical ones.
So the advantages of register machines do not come from this theoretical closeness to the machine. They can still be better, though, because you eliminate a lot of bytecode for shuffling data around. I think the canonical source for the possible speedups is "Virtual Machine Showdown: Stack vs. Registers": http://www.usenix.org/events/vee05/full_papers/p153-yunhe.pd...
This paper also gives a very nice copy propagation algorithm that eliminates much of the need for a real register allocator, if you start from stack-based input code.
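To see how much shuffling a stack machine does, consider the statement a = b + c in each style. This is a toy comparison of my own, with made-up opcode names, not an example from the paper:

    # Stack machine: four instructions, three of which only shuffle data.
    stack_code = [
        ("LOAD", "b"),    # push b
        ("LOAD", "c"),    # push c
        ("ADD",),         # pop two operands, push the sum
        ("STORE", "a"),   # pop the result into a
    ]

    # Register machine: one three-address instruction does the same work.
    register_code = [
        ("ADD", "r_a", "r_b", "r_c"),   # r_a = r_b + r_c
    ]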
I wrote a longer justification for this in the introduction of the second book, but the short version is exactly what you wrote: they're easier to build. Since the goal of the book(s) is not to show how to build production-ready, industrial-grade interpreters, but rather to learn and to understand, that was a good enough reason for me.
And thinking about it some more, a register-based bytecode is less flexible and less future-proof. With stack-based bytecode, a JIT can do register allocation according to the register set of the current (runtime!) target, while a register-based bytecode fixes its register set in advance, requiring the JIT to remap or reallocate registers to better match the actual hardware.
If a JIT compiler can do register allocation, then it can also do this "reallocation", which is the same thing: mapping symbolic (virtual register) names to CPU registers.
> And thinking about it some more, a register-based bytecode is less flexible and future proof.
The flow analysis that modern JITs use to extract an SSA form from stack-based bytecode is just as easy to perform on register-based bytecode. This SSA form is then optimized, instructions selected, and CPU registers allocated.
From one perspective, you have a continuum from (essentially) infinite register machines with SSA (LLVM bitcode, SafeTSA) or memory machines (such as Inferno's Dis VM) to 256-register VMs (LuaJIT, Lua, others) to 16-register VMs (Dalvik, maybe others) to stack machines (JVM, p-code, FCode, etc.). From that perspective, stack machine bytecode is just a compact representation of code for a register machine with 2 registers. Register machines aren't any less flexible than stack machines.
I just got myself Writing Interpreters and Compilers for the Raspberry Pi Using Python for Christmas (mainly to learn more about Raspberry Pi assembly). It's very accessible. First half of the book is a breeze, then the difficulty goes up a bit.
Relatively light on theory, which is a conscious choice by the author. The whole approach starts more from the applied angle than the theoretical, with the latter serving the former. It does discuss CFGs, for example.
Welcome! Stumbled across it when someone else here (or perhaps on a programming subreddit) referred to this professor as writing really accessible books.
(BTW, based on other comments I've seen by you before I doubt it will contain a lot of challenging content for you. But perhaps the writing style itself is interesting)
This is super neat!
I'd encourage anyone who liked this to check out Dybvig's thesis [1]; it's super readable. It's neat to see how you can cleave the runtime system from the compile-time system. The first two chapters, at least, are pretty cool.
I haven't gotten through the whole thing yet but it looks great. The textbook I have now kinda stops at AST generation, so I was assuming the rest would be way more difficult than this. This makes it look manageable!
Since this is the sort of post I like, I've taken to writing technical posts myself. The thing I have found is just how time-consuming it is. Maybe I'm just slow at this.
I wrote it over the course of three or four sittings, let it sit for a couple of days, then came back and revised it after sending the draft to several people. So in total a week or so?
Great write-up, thank you! My one suggestion is that you drop the enum meta-programming; it is a distraction. Instead do what you yourself propose: use a simple class with int attributes.
Yeah, that is probably the way to go. But you know how it is... I got sucked down a rabbit-hole while trying to figure out the best enum representation in Python.
I went to the same class, and I just wrote a bytecode interpreter for work. This is a nice follow-up. I might contact you when I start work again (still on PTO).
This is an amazingly comprehensive writeup! Very well done. Do you intend to implement a JIT or an optimizing compiler pass? I'd be interested in reading those posts if you eventually write them.
If/when I get around to it, I would quite like to write posts like that. Also posts that address the ideas at the bottom of this post. Also posts that address some of the comments in this thread, like name resolution.
It's extremely primitive. Far better tutorials and bytecode designs exist, much better than these Perl-, PHP-, and Python-like bytecodes. Unfortunately, only such primitive designs get written up and used.
SICP, e.g., starts with this and then goes into its improvements, whilst better Lisps or Lua start with tighter opcode designs and do proper closures and lambda lifting. Python cannot even do proper lexical blocks/functions, the most important part of an interpreter/compiler design.
Since this is intended to be educational, I find it a bit unfortunate to implement function calls using the host (Python) stack, since call handling is an important part of an interpreter.
It means that you can't really add things like stack traces, coroutines, exceptions, single stepping, etc., and understand how they work under the hood.
Not sure why you say that: I know Chakra uses the native stack for both its interpreter and JIT, and it obviously has stack traces and debugging. It's possible to walk the stack pointer chain and mark explicit frames as host or guest frames.
This code attempts to create unboundedly many Python stack frames. If those frames were allocated on the heap, you would expect it to either throw an exception or else display symptoms of memory exhaustion. Instead, it segfaults.
The recursion limit exists simply to prevent the user-level code from exhausting the native C stack.
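Something along these lines reproduces the effect (my own sketch, not the exact code from the parent comment):

    import sys

    # Lift the recursion limit far beyond what the C stack can hold,
    # then recurse without bound.
    sys.setrecursionlimit(1_000_000)

    def recurse(n):
        return recurse(n + 1)

    recurse(0)  # on CPython this typically dies with a segfault, not MemoryError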
On the other hand, CPython in fact does extra work, because it also creates heap-allocated frame objects (which essentially mirror the C stack) for debugging purposes.
Yes, but the point is that that extra work is still happening on the stack, not on the heap. Overflowing the C stack results in a segfault; exhausting heap memory doesn't.
When the interpreted program performs a Python-level call, CPython performs a corresponding recursive call of the interpreter function. It uses both the C stack to manage interpreter state and heap-allocated stack frame objects to manage the interpreted program's state.
Your experiment shows that the C stack overflows faster than the heap. It does not show that no heap space is consumed at all.
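Those heap-allocated frame objects are easy to observe from Python itself (my own illustration, assuming CPython):

    import sys

    # Walk the chain of heap-allocated frame objects, which mirrors
    # the C stack of recursive interpreter calls.
    def inner():
        frame = sys._getframe()
        while frame is not None:
            print(frame.f_code.co_name)
            frame = frame.f_back

    def outer():
        inner()

    outer()  # prints: inner, outer, <module>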
Perhaps a topic for a future blog post. I considered having virtual frames but thought it would obscure the point of the post, which was the relationship between compiler and interpreter.
Having naming be a bytecode-level instruction kinda misses out on a whole level of what makes compiling interesting, in my opinion. One of the challenges in compilation is converting names to the offsets that represent them. This is an important optimization.
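A toy sketch of that conversion (my own, with a hypothetical helper, not code from the post):

    # Resolve each name to a frame offset once, at compile time,
    # so the VM never has to look names up at runtime.
    def resolve(slots, name):
        if name not in slots:
            slots.append(name)      # allocate a new slot for a new name
        return slots.index(name)    # name -> integer offset

    slots = []
    print(resolve(slots, "x"))  # 0
    print(resolve(slots, "y"))  # 1
    print(resolve(slots, "x"))  # 0  (same offset on every later use)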
Depends on what you find interesting. Depending on what you are targeting, the names can be part of the bytecode, and the interesting part becomes conforming to naming standards, preventing clashes, mangling names for the temps needed for synthetics/spilling/etc., and name resolution even in recursive situations.
> Bytecode interpreters don’t work on the AST directly. They work on preprocessed ASTs. This simplifies the execution model and can produce impressive speed wins.
In comparing a bytecode interpreter with an AST tree-walker, and assuming parsing is not counted (i.e. we already have the AST), what makes the bytecode interpreter faster?
Walking a tree doesn't seem like that much of a performance penalty... perhaps the "execution model" includes something else?
An AST will usually include original/runtime names and needs recursion, where the bytecode is mostly flat.
For example, for an assignment you'll have an AST like (assign (foo 5)). This will have to look up "assign", run it, look up "foo" in the frame, and assign 5 to it. All of that will operate on pointers to different parts of the heap and kill your CPU cache too.
With bytecode, a lot of it will be already reduced to (for example) a machine with stack, so the assignment will be closer to: read the op (jump address in an array, so already in cache), dispatch via a small lookup table, put 5 in slot 2 (names already rewritten to stack indices) in the current frame.
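Concretely, that dispatch loop might look like this (a minimal sketch of my own, with hypothetical opcode names, not the post's code):

    # 'foo' has already been rewritten to slot index 2 at compile time.
    def run(code, consts, slots):
        stack, pc = [], 0
        while pc < len(code):
            op, arg = code[pc]           # ops sit densely in one array
            pc += 1
            if op == "LOAD_CONST":
                stack.append(consts[arg])
            elif op == "STORE_SLOT":
                slots[arg] = stack.pop() # write into the current frame's slot
        return slots

    print(run([("LOAD_CONST", 0), ("STORE_SLOT", 2)],
              consts=[5], slots=[None, None, None]))
    # [None, None, 5]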
Preprocessing the lookup of operators/keywords etc. away doesn't feel like cheating to me, though (even if, strictly speaking, the result is something more than an AST).
Probably. One thing with an AST walker is that the "walker" part comes for free. Your control flow needs minimal to no conversion and can leverage the host language's.
With bytecode, some things look easier and the performance boost is nice, but you need to commit fully to becoming an "abstract machine". That's cool if the host language doesn't match the control flow you want, but it's extra work otherwise.
A bytecode interpreter has a much more basic execution environment than an uncompiled AST-driven interpreter.
With a naive AST interpreter you create an object for each language construct, with references to other AST objects in a big tree. Execution involves calling eval() on the root and recursively letting sub-expressions bubble up their evaluated results. This approach is nice because execution and code look similar, but it is also bloated, since you create an object for each construct that is thrown away after calling eval(), and it is susceptible to stack overflow.
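Concretely, that style looks something like this (a minimal sketch with hypothetical node classes, not the article's code):

    # Object-per-construct style: each node eval()s itself recursively.
    class Const:
        def __init__(self, value):
            self.value = value
        def eval(self, env):
            return self.value

    class Var:
        def __init__(self, name):
            self.name = name
        def eval(self, env):
            return env[self.name]   # name looked up at runtime, every time

    class Add:
        def __init__(self, left, right):
            self.left, self.right = left, right
        def eval(self, env):
            return self.left.eval(env) + self.right.eval(env)

    # (1 + x): one heap object per construct, walked recursively.
    print(Add(Const(1), Var("x")).eval({"x": 41}))  # 42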
With a bytecode interpreter you simplify the execution environment to a list of instruction codes, an env, and a self-maintained call stack. Each op is a simple manipulation of the stack or env, with little wasted effort.
Overall, compiling to bytecode is equivalent to memoizing the AST to remove recursion from evaluation. It is much more likely to be cache-coherent as well.
> Walking a tree doesn't seem like that much of a performance penalty...
It really is. At the level of a language interpreter, cycles count and things like cache misses become important. Bytecode keeps the instructions densely packed in memory, and the dispatching code fairly densely packed in memory too.
With a tree-walker, you are dereferencing pointers, skipping around in memory, and blowing the cache frequently. If you're using the Visitor or Interpreter pattern, you have the overhead of looking up the vtable, finding the method pointer, and invoking it. There's the overhead of the function call itself too -- saving and restoring registers, etc. It all adds up quickly.
Oh nice. I have been writing a BASIC bytecode interpreter (and a compiler for WebAssembly)[1] in Rust.
It's a bit slower to get rolling than Python, especially function value representation and dynamic dispatch: unlike in Python, I could not just make a callable object on the fly. Handling function/procedure calls in a stack VM is definitely a fun exercise, and there seem to be many ways to skin the cat: shared vs. dedicated data stacks across functions, adding registers (an unlimited number if required), etc.
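For instance, one of those ways (a toy sketch of my own, not the linked project's design): a shared data stack plus a separate stack of return addresses:

    # CALL/RET in a stack VM with a shared data stack.
    def run(funcs, main):
        stack = []                # shared data stack: arguments and results
        frames = []               # return addresses: (code, pc) of the caller
        code, pc = main, 0
        while True:
            op, arg = code[pc]
            pc += 1
            if op == "PUSH":
                stack.append(arg)
            elif op == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "CALL":
                frames.append((code, pc))   # remember where to resume
                code, pc = funcs[arg], 0
            elif op == "RET":
                if not frames:
                    return stack.pop()      # returning from main: done
                code, pc = frames.pop()

    # add1(x) = x + 1; the argument 41 is passed on the shared data stack.
    funcs = {"add1": [("PUSH", 1), ("ADD", None), ("RET", None)]}
    print(run(funcs, [("PUSH", 41), ("CALL", "add1"), ("RET", None)]))  # 42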
I've followed along with the craftinginterpreters book lately, and it's been a blast. My version of clox [1] has multi-threading with a GVL, managed GC heaps, and various bytecode optimization passes. I've looked into the Ruby source code and even understand much of it now :) Future plans are to add a generational copying GC and remove the GVL ;)
> Bytecode interpreters don’t work on the AST directly. They work on preprocessed ASTs. This simplifies the execution model and can produce impressive speed wins.
Just executing off the parser is also a strategy. That's what some people[1] do, since you can get even more impressive speed wins by simply making small/dense programs.
[1] https://interpreterbook.com/
[2] http://craftinginterpreters.com/