MiniVM: A zero-dependency cross-language runtime on par with LuaJIT and C (github.com/fastvm)
240 points by isaacimagine on Jan 8, 2022 | 57 comments



This looks like a fun project, but claiming that it “beats luajit” is very premature based on the benchmarks listed.

The README lists only an allocation stress test (fast allocs are good, to be sure) and recursive Fibonacci (always good fun to try but doesn’t resemble real usage).

Looks like it doesn’t have a JIT compiler, and without that I’d put a lot of money against it ever being as fast as LuaJIT, V8, the JVM, etc etc in the general case.

Edit to add: on reading more closely, there’s more benchmark-gaming going on here than I realised. On the second benchmark, it says:

> As you can see, minivm (no JIT) is a hair slower than Node JS (JIT) but beats luajit --joff by a fair margin (no JIT).

They already noted that Node has a longer startup time, so that obviously dominates in a benchmark that’s only a few milliseconds long; and disabling the JIT in LuaJIT doesn’t tell us much.


Personally, I find it pretty impressive that it performs as well as these runtimes despite not having a JIT compiler. I'm pretty sure Shaw's written more benchmarks, but as the README explains, it's really hard to tell what the performance characteristics of a language are without writing a larger application. So far the largest applications written with MiniVM are Paka[0], a self-hosted language similar to Lua that targets MiniVM; os49[1], an operating system built on Paka/MiniVM in the spirit of lisp machines; and xori[2], an online playground for the language.

Edit: to address your edit, the Tree benchmark starts small, but grows exponentially (the graph is logarithmic). The final runs of the benchmark take about 15 seconds and are run 10 times per language, at which point startup time is no longer a large factor. The fib test includes LuaJIT with the JIT both on and off; as MiniVM is not JIT'd, I think this is a fair comparison to make.

[0]: https://github.com/FastVM/paka

[1]: https://github.com/ShawSumma/os49

[2]: https://fastvm.github.io/xori


Sure, it’s a nice piece of work!

I just don’t think you should claim it’s very fast unless you can show that it actually is very fast. The current benchmarks look like toy benchmarks. (I don’t mean that in a derogatory way -- they’re exactly the kind of tiny programs you need to build and test while bringing up a new VM.)


The disclaimer at the top of the benchmark section - https://github.com/FastVM/minivm#benchmarks - seemed pretty reasonable to me.

It's very clear that microbenchmarks don't tell the whole story, and that whole-application benchmarks, which give more realistic results, will be desirable later.

I mean, I agree these benchmarks don't mean that much, except as a useful way to verify that the core code is working and reasonably performant - but they're up front about that fact, so calling it 'benchmark-gaming' seems a trifle unfair.


MiniVM will JIT eventually. Right now the biggest barrier is the VM snapshots. Currently MiniVM manages its own stack and heap and can snapshot them from any point in the program.

One thing to note about MiniVM is that it has very strong types. Because of weak typing, V8 has many steps to go through when it needs to perform something like addition; MiniVM supports only 4 basic types.

MiniVM's reference frontend Paka is fully self-hosted; it is currently the only large MiniVM program.
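Roughly, the typing difference looks like this (illustrative C, all names invented - not MiniVM's or V8's actual code):

    #include <stdint.h>

    /* Hypothetical tagged value, for illustration only. */
    typedef struct { uint8_t tag; double num; } value_t;
    enum { TAG_NUM, TAG_STR };

    value_t add_slow_path(value_t a, value_t b); /* coercion, concat, ... */

    /* Weak typing: every add inspects tags and may fall into a coercion path. */
    value_t add_weak(value_t a, value_t b) {
        if (a.tag == TAG_NUM && b.tag == TAG_NUM)
            return (value_t){ TAG_NUM, a.num + b.num };
        return add_slow_path(a, b);
    }

    /* Strong typing: the bytecode already guarantees numeric operands,
       so the opcode body is a single machine add, no checks. */
    double add_strong(double a, double b) { return a + b; }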


That sounds like a solid basis for adding asymmetric coroutines as well, is that something you're thinking about adding?

I like the philosophy and gumption shown by this project but coroutines aren't a feature I'd lightly give up.

Digging what I see so far!


MiniVM uses coroutines already: every 1000 branches the VM returns to the scheduler. The build system accepts CFLAGS+=-DVM_BRANCH_DEFER, which lets MiniVM run a slice of bytecode up to N instructions long.

It is not exposed at the instruction level yet, though. It would be quite easy to add.
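Roughly, the mechanism is this (a sketch only - the loop shape and names are invented, not MiniVM's actual source):

    /* Hypothetical types and helpers. */
    typedef struct vm vm_t;
    enum { VM_YIELDED, VM_DONE };
    enum { OP_JUMP, OP_HALT /* ... */ };
    int  fetch_opcode(vm_t *vm);
    void do_jump(vm_t *vm);

    /* Run at most `fuel` branches, then hand control back to the
       scheduler, which can resume this coroutine later. */
    int vm_run_slice(vm_t *vm, int fuel) {
        for (;;) {
            if (fuel <= 0)
                return VM_YIELDED;       /* back to the scheduler */
            switch (fetch_opcode(vm)) {
            case OP_JUMP:
                do_jump(vm);
                fuel--;                  /* branches pay the fuel cost */
                break;
            case OP_HALT:
                return VM_DONE;
            /* ... other opcodes don't touch the counter ... */
            }
        }
    }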


Ah, thank you, that's very promising! It was the kind of thing I wanted to hear; I write Lua every day, and four data types didn't appear to support 'thread'.

I might just kick the tires on this over the weekend.


That’s cool. Have you played around with that number? I’m curious how you picked that constant and what the trade-offs are.


It might be interesting to be able to tell the VM to resume into code for exactly N instructions or branches. I can see instrumentation/debugging tooling using this with N=1 for example.


Question on VM snapshotting: what’s the purpose/point in even having such an ability? What does it allow you to do?

I only know of snapshotting perhaps being necessary to support coroutine based context switching.

Thanks and very cool project!


A snapshot is the entire state of a program at a single moment in time. Continuations are basically exposed snapshots: take a snapshot, store it in a variable, do some work, then 'call' the snapshot to return to an earlier point. Snapshots thus let you implement a naive version of single-shot delimited continuations - coroutines! This can be very useful for modeling concurrency.

Aside from coroutines and continuations, snapshots are neat for distributed computing: spin up a VM, take a snapshot, and replicate it over the network. You could also send snapshots of different tasks to other computers to execute. In the context of edge computing, you could snapshot the program once it's 'warm' to cut back on VM startup time.

Snapshots also allow you to peek into your program. Imagine a debugger that takes snapshots on breakpoints, lets you inspect the stack and heap, and replays the program forward from a given point in a deterministic manner. You could also send a snapshot to a friend so they can run an application from a given point on their machine. If you combine snapshots with live reloading, there are tons of other things you can do (e.g. live patching and replaying of functions while debugging).
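To make the continuation idea concrete, a hypothetical embedding API might look like this (names invented; MiniVM's real interface may differ):

    /* Hypothetical snapshot API, for illustration only. */
    typedef struct vm vm_t;
    typedef struct vm_snapshot vm_snapshot_t;
    vm_snapshot_t *vm_snapshot(vm_t *vm);        /* capture stack + heap */
    void vm_restore(vm_t *vm, vm_snapshot_t *s); /* jump back to the capture point */
    void do_work(vm_t *vm);

    void example(vm_t *vm) {
        vm_snapshot_t *snap = vm_snapshot(vm);   /* mark a point in time */
        do_work(vm);
        vm_restore(vm, snap);                    /* 'call' the continuation */
    }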


To do these kinds of things on Linux, check out https://www.criu.org/Main_Page

Checkpoint/Restore In Userspace


One nifty thing you can do with snapshots in games is use them for save/restore/undo.

See for example the Z-machine used for interactive fiction, and its younger cousin Glulx: https://www.eblong.com/zarf/glulx/Glulx-Spec.html#saveformat

(Apart from that, I’d note that Glulx and the Z-machine are both terrific examples of detailed VM specifications. Glulx is impressive because it was built by one person, the Z-machine possibly even more impressive as it was reverse-engineered by insanely dedicated Infocom fans.)


Out of curiosity, are you planning on (progressively, slowly) rolling your own JIT, or using something like DynASM (https://luajit.org/dynasm_features.html), libFirm (https://pp.ipd.kit.edu/firm/Features.html), or some other preexisting thing (eg https://github.com/wdv4758h/awesome-jit) in the space?

FWIW, I understand that LuaJIT gets some of its insane real-world performance from a JIT and VM design that's effectively shrink-wrapped around Lua semantics/intrinsics - it's not general-purpose. I've read (though I unfortunately don't remember exactly where) that people have tried to run off with the JIT and VM and use them in other projects, but never succeeded because of the tight coupling.

In the same way, while a bespoke 64/32-bit x86+ARM JIT would be a reasonable undertaking, it could make for a pretty interesting target with a potentially wide-ranging set of uses.

For example, it could be the VM+JIT combination that all those people trying to dissect LuaJIT were looking for :).

I could see something like this becoming an attractive option for games that want an exceptionally simple runtime. Sort of like a scaled-down NekoVM (a la Haxe).

Broadly speaking, I get the (potentially incorrect) impression (from a very naive/inexperienced POV) that MiniVM+JIT would be looking to close a more widely-scoped, higher-level loop out of the box than something like libFirm would be (despite the very close correlation in functionality when looking at libFirm). So it'd be closer to Cling (https://root.cern/cling/) than raw LLVM, perhaps (albeit with 0.01% of the code size :D). It is for this reason that I kind of pause for a minute and ponder that a fully integrated JIT could be a pretty good idea. It would absolutely make the irreducible complexity of the project balloon, with reasonable motivation likely necessary to maintain cohesion.

On a different note, if I were to backseat-drive for a minute :) the first thing I'd rant about is how modern JITs, to be attractive, need trivial ways to verify code correctness, both online (how exactly was *this* specific generated code constructed - so, straightforward logging) but also (and particularly) offline (humans staring at the JIT source code and mentally stepping through its behavior - and succeeding (miracles!!) because the code is small and comprehensibly architected). If the JIT were implemented in such a straightforward manner, end users wanting to run potentially malicious user-supplied code with high performance in security-sensitive settings might be attracted to the project. (Mike Pall made bank while CloudFlare was using LuaJIT for its WAF*... ahem...)

I came across this reference on how to break out of LuaJIT 2.1 (2015) a while back: https://www.corsix.org/content/malicious-luajit-bytecode - and every time I take a look at the code I switch away from the tab :) (and sometimes step away from the computer for a minute :D). It's solely a demonstration of "this is how it would work", and it clarifies that LuaJIT makes no sandbox guarantees about the code it executes, but reading through it, the amount of Stuff™ going on represents a surface area that (naively) suggests to me that LuaJIT as a whole is too large to easily reason about from a security standpoint (oh yeah, besides being written in assembly language...). This might be inexperience speaking, but I can't help but wonder whether a smaller, simpler implementation might be able to implement a secure JIT. For all I know this might be an impossible P=NP pipe dream I haven't fully grasped yet. I guess what I'm trying to figure out is whether "small enough for non-100x programmers to mentally reason through" and "large enough to do a few practical things quickly" have practical overlap?

[* CloudFlare only ran internally-generated code through LuaJIT, for completeness. A VM+JIT that could safely run untrusted code would thus be even more interesting than LuaJIT was to CloudFlare, from a general perspective. ]

---

Something completely different that I randomly discovered recently, and which I thought I'd idly mention, is that JITs seem to have a bit of a hard time on Android in certain obscure circumstances. I can't (yet) tell if this is LuaJIT-specific or "anything that's a JIT"-specific: KOReader (an Android eBook reader implemented entirely using LuaJIT - https://github.com/koreader/android-luajit-launcher) has a bunch of very scary magic Things™ it seems to need to do to make LuaJIT work at all on Android (https://github.com/koreader/android-luajit-launcher/blob/mas...), due to an apparently-current issue causing problems across different domains (https://github.com/LuaJIT/LuaJIT/issues/285), which has apparently been cropping up for years (https://www.freelists.org/post/luajit/Android-performance-dr... (2013)). KOReader has even nontrivially patched LuaJIT's C code in places (https://github.com/koreader/android-luajit-launcher/blob/bb0...) for purposes I have yet to fully understand (it might just be for debugging). I happened to be considering idly playing around with Lua on Android (I'm currently fascinated with interpreted/JITed runtimes), and after stumbling on this I'm debating whether to use plain Lua instead, ha. I've been meaning to ask around on the LuaJIT list and wherever KOReader discusses stuff to learn more, after spending the time actually getting LuaJIT linked into an Android project and poking around. Haven't got to it yet. This could be an absolutely massive red herring that I'm going deer-in-headlights about because it just looks off-in-the-weeds, or it could be significant. It might also be absolutely irrelevant to non-LuaJIT pursuits, but as I noted I'm not (yet) experienced enough to tell.


Fibonacci is the poster child of hotspots. A JIT should be crushing an interpreter at Fibonacci, as it is simple to generate good code for. Yes, it's just a microbenchmark, so you shouldn't draw SPEC-type conclusions from it. But you shouldn't dismiss it either.


There is a roadmap, but I couldn't find info on the strategic goals of this project or where it's supposed to fit best. Can you please share your views on this?

Also, is there an API to connect to for emitting VM bytecode? Or an instruction set reference?

I'm asking out of pure fan-like interest in this field, and because I see a lack of simple-enough (for me) tools for language building. Don't take me seriously or anything like that, I'm not a pro. I tried to dig into LLVM a few years ago, but there was so much C++ boilerplate, meh. I know I can just transpile to Lua or JS with sourcemaps, but that's zero fun. Projects like this remind me of 8080 asm times.


There is no roadmap. There are no written long-term goals. This project came out of a need for a faster virtual machine for my Paka language. There is an API to emit VM bytecode, but it's not in C; it's from Paka.

A roadmap could go like this, though: more core opcodes (bit ops, closures), splitting overloaded opcodes, more frontends, an optimizer, a dynamic compiler, and a JIT.


This is a nice Forth-style bytecode interpreter that is easy to generate code for. Close to Wasm, but much simpler.

The thing that attracts me most is that the bytecode format can be easily translated to a more compact format, allowing it to serve as an intermediate format for _many_ homegrown instruction sets.

So a GCC or LLVM backend needs to be on the roadmap. Code generation is a blast. Only the number of registers should be user-configurable to make this useful.


In what way is this forth-like? The opcodes take immediates and registers as arguments, whereas forth-like would be much more stack-based than this, right?


The bytecodes aren't, but the interpreter is; look up threaded code.



The Readme mentions that they've experimented with a Lua frontend. I can't find any reference to this in the repo. Is this a private experiment, or is it available somewhere?


How does this compare to the techniques used in wasm3 (name, not description)?

https://github.com/wasm3/wasm3

The architecture of M3, the interpreter core of wasm3: https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md

https://github.com/wasm3/wasm3/blob/main/docs/Performance.md


I have looked into Meta Machines. They are not what I want.

Meta Machines can be almost as fast, but I just prefer the cache locality of local functions.

Wasm3 is slower than MiniVM when you don't count the GC. Wasm3 is very cool!


How do the architectures of the two interpreters differ?

Not trying to challenge your system, just trying to learn.

Can Wasm be translated to run on MiniVM? If MiniVM is pure C, it can run on Wasm.


Anything beats something on some benchmarks. That's not a very captivating pitch point.

Good stuff otherwise. Very nice project.


Thanks for pointing this out, I'll adjust the title to make it more accurate. I think the benchmarks show that the runtime is at least on par with existing popular runtimes.

Some other cool pitch points I've learned while chatting with Shaw:

- MiniVM supports saving and loading compressed snapshots/images, like Smalltalk

- A JIT is in the works (it had a JIT in the past, which was removed in a refactor)

- Zero dependencies; compiles to Wasm with cosmo libc

- The VM supports multiple languages and sharing libraries between languages

- The main language that targets MiniVM is self-hosted on MiniVM; there's a minimal OS built on MiniVM

- Fast startup times make it good for embedded scripting, like Lua

- The design of the ISA itself is also really interesting; I'd recommend reading the C if that's your thing


Thank you for having a look.

The reason MiniVM is so benchmark-oriented in its presentation is that MiniVM is very benchmark-oriented in my workflow.

I have tried to find a faster interpreter without a JIT. It is hard to find benchmarks that don't just show MiniVM being 2-10x faster than the other interpreted languages.


Yeah, it’s a bit disappointing how slow most non-JIT interpreters are -- I think JIT understandably took the wind out of their sails.

The state of the art as I understand it (but maybe you’ve figured out something faster?) is threaded interpreters. I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.


> I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.

Fortunately the Forth community write about their techniques, so you don't have to just eyeball gforth's source-code. A handful of links for those interested:

https://www.complang.tuwien.ac.at/papers/

https://github.com/ForthPapersMirror/Papers

https://www.complang.tuwien.ac.at/projects/forth.html

https://www.complang.tuwien.ac.at/anton/euroforth/

http://soton.mpeforth.com/flag/jfar/vol7.html

http://soton.mpeforth.com/flag/fd/

http://www.forth.org/literature.html

http://www.forth.org/fd/contents.html


So I was talking about threaded code with Shaw the other day, and he gave me a long explanation of the various tradeoffs being made and the resulting techniques used in MiniVM. As I understand it, MiniVM is basically threaded code with cached opcodes. Essentially, it uses precomputed gotos:

    // classical computed goto: y[z] fetches the opcode, x maps opcodes to labels
    goto *x[y[z]]
    // MiniVM precomputed goto: a[z] is already the label address, resolved up front
    goto *a[z]
This optimization makes MiniVM a tad faster than raw computed gotos, because the opcode-to-label lookup is paid once, ahead of time - it's essentially direct-threaded code with some extra optimizations (there's a minimal sketch below). He can probably expand on this; if you want to hear the full, really interesting explanation yourself, he's pretty active in the Paka Discord[0].

[0]: https://discord.gg/UyvxuC5W5q
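For the curious, here's a minimal self-contained sketch of the precomputed variant (GNU C computed gotos; toy code, not MiniVM's actual interpreter):

    #include <stddef.h>

    enum { OP_NOP, OP_HALT };

    /* `code` must end with OP_HALT; toy limit of 256 instructions. */
    void run(const unsigned char *code, size_t n) {
        static void *labels[] = { &&op_nop, &&op_halt };
        void *threaded[256];
        size_t pc;

        /* Pay the opcode-to-label lookup (x[y[z]]) once, up front... */
        for (pc = 0; pc < n; pc++)
            threaded[pc] = labels[code[pc]];

        /* ...so the hot path dispatches straight through a[z]. */
        pc = 0;
        goto *threaded[pc];
    op_nop:
        goto *threaded[++pc];
    op_halt:
        return;
    }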


That feels like the sort of optimisation that wins in small cases and loses in larger ones. The jump array is going to be 4 or 8 times the size of the opcode array, so it will put a lot more pressure on the caches as your programs get larger.


Years ago I wrote a small toy interpreter based on pre-dereferenced computed gotos, with surprisingly good speed. For very small programs without dynamic data allocation churn, it could run as little as 3 times slower than compiled code, instead of the expected 10x. Good things happen speed-wise when all the bytecode and data fit into the L1 cache. Also, branch prediction is near perfect in tight loops.


Sounds like a great test case for using a tensorflow-lite model to switch between both techniques.


Binary trees is a classic GC benchmark - ideally a GC'd language should be able to do better than a malloc/free implementation in C (tree.c), since it can release the whole tree in bulk (as ptree.c does).

Also note that tree.c speeds up substantially if you link with a better malloc like jemalloc or tcmalloc.
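The bulk-release trick in miniature - a toy arena allocator in the spirit of what's described, not ptree.c's actual code:

    #include <stdlib.h>

    typedef struct { char *base, *next, *end; } arena_t;

    arena_t arena_new(size_t cap) {              /* toy: no error handling */
        char *p = malloc(cap);
        return (arena_t){ p, p, p + cap };
    }

    /* Bump-pointer allocation: every tree node comes from one block. */
    void *arena_alloc(arena_t *a, size_t sz) {
        if (a->next + sz > a->end) return NULL;  /* toy: no growth */
        void *p = a->next;
        a->next += sz;
        return p;
    }

    /* Release the entire tree with a single free(); no traversal. */
    void arena_free_all(arena_t *a) { free(a->base); }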


Is there any documentation on the bytecode? I would be interested to see their design choices.


Check the code, there's a big list of opcodes with pretty descriptive names. The code is effectively the documentation for now.


This example seems a bit pessimised:

  head_of_loop:
    load_var x
    push_int 1000
    less_than
    jump_if_false :end_of_loop
    load_var x
    increment
    store_var x
  end_of_loop:
I'd expect a stack engine to be more like:

  load_var x
  head_of_loop:
    jump_lt_keep 1000 :end_of_loop
    increment
  end_of_loop:
  store_var x
in the sense that anything that can figure out how to populate and use r0 would likely also be able to figure out it can wrap the load/store calls around the outside of the loop.

(I'm far from an expert on stack versus register VMs though so I may be missing several somethings here)


That would be a nice optimization. Here's what Python does:

              0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 (1000)
              4 COMPARE_OP               0 (<)
              6 POP_JUMP_IF_FALSE       18

              8 LOAD_FAST                0 (x)
             10 LOAD_CONST               2 (1)
             12 BINARY_ADD
             14 STORE_FAST               0 (x)
             16 JUMP_ABSOLUTE            0
             18 LOAD_CONST               0 (None)
Python's VM is very slow, but note there's only one arg per op, and the compare + jump are not fused (see the sketch after the assembly below).

BTW, if we're considering "advanced" optimizations, we might as well elide the whole loop :) e.g. https://godbolt.org/z/n3GznqKsM

    test(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        jmp     .L2
    .L3:
        add     DWORD PTR [rbp-4], 1
    .L2:
        cmp     DWORD PTR [rbp-4], 999
        jle     .L3
        nop
        pop     rbp
        ret
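On the fusion point above: a fused opcode (hypothetical names, just to illustrate) folds COMPARE_OP and POP_JUMP_IF_FALSE into one dispatch:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { int64_t regs[16]; size_t pc; } frame_t;

    /* One opcode compares and branches, halving the dispatch overhead
       for this very common loop-condition pattern. */
    void op_jump_if_ge(frame_t *f, int reg, int64_t k, size_t target) {
        f->pc = (f->regs[reg] >= k) ? target : f->pc + 1;
    }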


Well, sure, but I was thinking along the lines of "reserving a spot on the stack and reserving a register seem like about the same level of clever", especially thinking about how mixed register/stack argument passing works in even relatively stupid C compilers.

Eliding the entire loop seems similarly clever for either architecture, but it would've counted as cheating in the frame I was pondering from.

Still neat to see even if x86 asm continues to make my eyes cross ;)


(Why is there a "nop" near the end? What does it do?)


The main interpreter loop looks a lot like Python’s. It uses the same preprocessor tricks and jump tables to help both the reader and the compiler make sense of the flow. It’s also neat that integers are unboxed, at the expense of their top two bits being reserved as a type hint.
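As I read it, the scheme is roughly this (constants invented; the real code may shift and mask differently):

    #include <stdint.h>

    /* Top two bits of a 32-bit word carry the type hint, leaving
       30 bits of unboxed integer payload. */
    #define TAG_SHIFT    30
    #define TAG_INT      0u              /* assumed tag value */
    #define PAYLOAD_MASK ((1u << TAG_SHIFT) - 1)

    static inline uint32_t make_int(uint32_t n) {
        return (TAG_INT << TAG_SHIFT) | (n & PAYLOAD_MASK);
    }
    static inline int is_int(uint32_t v) {
        return (v >> TAG_SHIFT) == TAG_INT;
    }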


It helps more than the reader and the compiler: that kind of threaded code really helps performance, since it leverages the CPU's branch target buffer to get great branch prediction on interpreted code - one of the killer aspects of getting current CPUs to perform well.


Hard to tell whether it supports proper tail calls or not. The readme mentions Scheme, so it's not impossible I suppose! One of the nice things about wasm is the formal semantics; a project like this could benefit from one of those, too.


Author here. Tail calls can be implemented in MiniVM bytecode. There was once an instruction that performed a tail call, but it bit-rotted due to not being used.

MiniVM is moving very quickly at the moment. A lot still needs to be done, like the JIT and optimizers.

Eventually a formal specification of MiniVM should happen, just not yet, as things are changing too fast.


Speaking of wasm, any idea how feasible it would be to convert wasm to minivm bytecode and use it as a wasm runner?


> Built on a register-based ISA that beats luajit—with the JIT on—in some benchmarks

This is interesting, considering that LuaJIT in interpreter mode is famous for being faster than simple JITs. Worth looking into what the differences are.


From a discussion with Shaw:

- A simpler, more consistent type system

- MiniVM has more, larger instructions to take advantage of

- MiniVM precomputes and caches jumps where possible

- Calling functions is more efficient, with less overhead

- It has a custom GC/allocator that allows for some optimizations

- Fixed-size arrays vs Lua's dynamic tables

- No JIT, so faster startup times (no tracing)


Looks like an interesting project, but I can't find any documentation on how to actually use it (or on the paka language either). Is it just still in early stages and documentation hasn't been written yet?


I've written some documentation for it in the past, but had to scrap it because things tend to move so quickly that it's better to read the code and ask Shaw than use incorrect documentation. As things stabilize, these issues will be resolved, but it's very bleeding-edge right now.


How are strings represented in the VM? Is there a benchmark for string operations?


Strings appear to be represented as arrays whose elements are 4 bytes in size, with 30 numeric bits for each character:

https://github.com/FastVM/minivm/blob/4906256b2dc16e227cad7e...

https://github.com/FastVM/minivm/blob/4906256b2dc16e227cad7e...


Wouldn’t a template interpreter (e.g. what the JVM uses) perhaps be faster?


Does it support any kind of debugging?


How does this compare to MIR?


MIR has a better compiler, and a JIT.

This has a better op and data layout, and a fast interpreter.

Apples and oranges.



