This looks like a fun project, but claiming that it “beats luajit” is very premature based on the benchmarks listed.
The README lists only an allocation stress test (fast allocs are good, to be sure) and recursive Fibonacci (always good fun to try but doesn’t resemble real usage).
Looks like it doesn’t have a JIT compiler, and without that I’d put a lot of money against it ever being as fast as LuaJIT, V8, the JVM, etc etc in the general case.
Edit to add: on reading more closely, there’s more benchmark-gaming going on here than I realised. On the second benchmark, it says:
As you can see, minivm (no JIT) is a hair slower than Node JS (JIT) but beats luajit --joff by a fair margin (no JIT).
They already noted that Node has a longer startup time, so that obviously dominates in a benchmark that’s only a few milliseconds long; and disabling the JIT in LuaJIT doesn’t tell us much.
Personally, I find it pretty impressive that it performs as well as these runtimes despite not having a JIT compiler. I'm pretty sure Shaw's written more benchmarks, but as the README explains, it's really hard to tell what the performance characteristics of a language are without writing a larger application. So far the largest applications written with MiniVM are Paka[0], a self-hosted language similar to Lua that targets MiniVM; os49[1], an operating system built on Paka/MiniVM in the spirit of lisp machines; and xori[2], an online playground for the language.
Edit: to address your edit, the Tree benchmark starts small, but grows exponentially (the graph is logarithmic). The final runs of the benchmark take about 15 seconds each, and are run 10 times per language, at which point startup time is no longer a large factor. The fib test compares against LuaJIT with the JIT both on and off: as MiniVM is not JIT'd, I think that this is a fair comparison to make.
I just don’t think you should claim it’s very fast unless you can show that it actually is very fast. The current benchmarks look like toy benchmarks. (I don’t mean that in a derogatory way -- they’re exactly the kind of tiny programs you need to build and test while bringing up a new VM.)
It's very clear that microbenchmarks don't tell the whole story, and that whole-application benchmarks giving more realistic results will be desirable later.
I mean, I agree these benchmarks don't mean that much except for being a useful way to verify that the core code is working and reasonably performant - but they're up front about that fact so calling it 'benchmark-gaming' seems a trifle unfair.
MiniVM will JIT eventually. Right now the biggest barrier is the VM snapshots. Currently MiniVM has its own stack and heap and can snapshot them from any point in the program.
One thing to note about MiniVM is that it has very strong types. V8 has many steps to go through when it needs to perform something like addition (weak typing). MiniVM supports only 4 basic types.
MiniVM's reference frontend, Paka, is fully self-hosted. It is currently the only large MiniVM program.
MiniVM uses coroutines already: every 1000 branches the VM will return to the scheduler. The build system accepts CFLAGS+=-DVM_BRANCH_DEFER, which lets MiniVM run a slice of bytecode up to N instructions long.
It is not exposed at the instruction level yet, though. Would be quite easy to add.
Ah, thank you, that's very promising! That's exactly the kind of thing I wanted to hear; I write Lua every day, and four data types didn't appear to leave room for 'thread'.
I might just kick the tires on this over the weekend.
It might be interesting to be able to tell the VM to resume into code for exactly N instructions or branches. I can see instrumentation/debugging tooling using this with N=1 for example.
A snapshot is the entire state of a program at a single moment in time. Continuations are basically exposed snapshots, i.e. taking a snapshot, storing it in a variable, doing some work, and then 'calling' the snapshot to return to an earlier point. This lets you implement a naive version of single-shot delimited continuations -- coroutines! This can be very useful for modeling concurrency.
Aside from coroutines and continuations, snapshots are neat for distributed computing: spin up a vm, take a snapshot, and replicate it over the network. You could also send snapshots of different tasks to other computers to execute. In the context of edge computing, you could snapshot the program once it's 'warm' to cut back on VM startup time.
Snapshots allow you to peek into your program. Imagine a debugger that takes snapshots on breakpoints, lets you to inspect the stack and heap, and replay the program forward from a given point in a deterministic manner. You could also send a snapshot to a friend so they can run an application from a given point on their machine. If you do snapshots + live reloading there are tons of other things you can do (e.g. live patching and replaying of functions while debugging).
(Apart from that, I’d note that Glulx and the Z-machine are both terrific examples of detailed VM specifications. Glulx is impressive because it was built by one person, the Z-machine possibly even more impressive as it was reverse-engineered by insanely dedicated Infocom fans.)
FWIW, I understand that LuaJIT gets some of its insane real-world performance from a JIT and VM design that's effectively shrink-wrapped around Lua semantics/intrinsics - it's not general-purpose. I've read (unfortunately don't remember exactly where) that people have tried to run off with the JIT and VM and use it in other projects, but never succeeded because of the tight coupling.
In the same way, while a bespoke 64/32-bit x86+ARM JIT would be a reasonable undertaking, it could make for a pretty interesting target with a potentially wide-ranging set of uses.
For example, it could be the VM+JIT combination that all those people trying to dissect LuaJIT were looking for :).
I could see something like this becoming an attractive option for games that want an exceptionally simple runtime. Sort of like a scaled-down NekoVM (a la Haxe).
Broadly speaking, I get the (potentially incorrect) impression (from a very naive/inexperienced POV) that MiniVM+JIT would be looking to close a more widely-scoped, higher-level loop out of the box than something like libFirm would (despite the very close overlap in functionality with libFirm). So it'd be closer to Cling (https://root.cern/cling/) than raw LLVM, perhaps (albeit with 0.01% of the code size :D). It's for this reason that I pause for a minute and ponder that a fully integrated JIT could be a pretty good idea. It would absolutely balloon the irreducible complexity of the project, and sustained motivation would likely be necessary to maintain cohesion.
On a different note, if I were to backseat-drive for a minute :) the first thing I'd rant about is how modern JITs need trivial ways to verify code correctness, both online (how exactly was *this* specific generated code constructed -- so, straightforward logging) but also (and particularly) offline (humans staring at the JIT source code and mentally stepping through its behavior -- and succeeding (miracles!!) because the code is small and comprehensibly architected). If the JIT were implemented in such a straightforward manner, end users wanting to run potentially malicious user-supplied code with high performance in security-sensitive settings might be attracted to the project. (Mike Pall made bank for a while while CloudFlare was using LuaJIT for its WAF*... ahem...)
I came across this reference on how to break out of LuaJIT 2.1 (2015) a while back: https://www.corsix.org/content/malicious-luajit-bytecode - and every time I take a look at the code I switch away from the tab :) (and sometimes step away from the computer for a minute :D). It's solely a demonstration of "this is how it would work", and clarifies that LuaJIT makes no sandbox guarantees about the code it executes, but reading through it, the amount of Stuff™ going on represents a surface area that to me (naively) seems just... like LuaJIT as a whole is generally too large to easily reason about from a security standpoint (oh yeah, besides being written in assembly language...). This might be inexperience speaking, but I can't help but wonder whether a smaller, simpler implementation might be able to pull off a secure JIT; for all I know this might be an impossible P=NP pipe dream I haven't fully grasped yet. I guess what I'm trying to figure out is whether "small enough for non-100x programmers to mentally reason through" and "large enough to do a few practical things quickly" have practical overlap?
[* CloudFlare only ran internally-generated code through LuaJIT, for completeness. A VM+JIT that could safely run untrusted code would thus be even more interesting than LuaJIT was to CloudFlare, from a general perspective. ]
---
Something completely different that I randomly discovered recently, and which I thought I'd idly mention, is that JITs might seem to have a bit of a hard time on Android in certain obscure circumstances. I can't (yet) tell if this is LuaJIT-specific or "anything that's a JIT"-specific: KOReader (Android eBook reader, implemented entirely using LuaJIT - https://github.com/koreader/android-luajit-launcher) has a bunch of very scary magic Things™ it seems to need to do to make LuaJIT even work at all on Android (https://github.com/koreader/android-luajit-launcher/blob/mas...), due to an apparently-current issue causing problems across different domains (https://github.com/LuaJIT/LuaJIT/issues/285), which has apparently been cropping up for years (https://www.freelists.org/post/luajit/Android-performance-dr... (2013)).

KOReader has even nontrivially patched LuaJIT's C code in places (https://github.com/koreader/android-luajit-launcher/blob/bb0...) with purposes I am yet to fully understand (it might just be for debugging). I happened to be considering idly playing around with Lua on Android (currently fascinated with interpreted/JITed runtimes), and after stumbling on this I'm debating whether to use plain Lua instead, ha.

I've been meaning to ask around on the LuaJIT list, and wherever KOReader discusses stuff, to learn more -- after spending the time actually getting LuaJIT linked into an Android project and poking around. Haven't got to it yet. This could be an absolutely massive red herring that I'm going deer-in-headlights about because it just looks off-in-the-weeds, or potentially significant. It might also be absolutely irrelevant to not-LuaJIT pursuits, but as I noted, I'm not (yet) experienced enough to tell.
Fibonacci is the poster child of hotspots. A JIT should be crushing an interpreter at Fibonacci as it is simple to generate good code for. Yes, it's just a microbenchmark; so you shouldn't make SPEC type conclusions from it. But you shouldn't dismiss it either.
4984, there is a roadmap, but I couldn't find info on the strategic goals of this project, or where it's supposed to be the best fit. Can you please share your views on this?
Also, is there an API to hook into to emit VM bytecode? Or an instruction set reference?
I'm asking out of pure fan-like interest in this field, and because I see a lack of simple-enough (for me) tools for language building. Don't take me seriously or anything like that, I'm not a pro. I tried to dig into LLVM a few years ago, but so much C++ boilerplate, meh. I know I can just transpile to Lua or JS with sourcemaps, but it's zero fun. Projects like this remind me of 8080 asm times.
There is no roadmap. There are no written long-term goals. This project came out of a need for a faster virtual machine for my Paka language. There is an API to emit VM bytecode, but it's not in C; it's from Paka.
A roadmap could go like this, though: more core opcodes (bit ops, closures), splitting overloaded opcodes, more frontends, an optimizer, a dynamic compiler, and a JIT.
This is a nice Forth-style bytecode interpreter, that is easy to generate code for. Close to WASM, but much simpler.
The thing that attracts me most is that the bytecode format can be easily translated to a more compact format, allowing it to serve as an intermediate format for _many_ homegrown instruction sets.
So a GCC or LLVM backend needs to be on the roadmap. Code generation is a blast. Only the number of registers would need to be user-configurable to make this useful.
In what way is this forth-like? The opcodes take immediates and registers as arguments, whereas forth-like would be much more stack-based than this, right?
The Readme mentions that they've experimented with a Lua frontend. I can't find any reference to this in the repo. Is this a private experiment, or is it available somewhere?
Thanks for pointing this out, I'll adjust the title to make it more accurate. I think the benchmarks show that the runtime is at least on par with existing popular runtimes.
Some other cool pitch points I've learned while chatting with Shaw:
- MiniVM supports saving and loading compressed snapshots/images, like smalltalk
- JIT is in the works (it had a JIT in the past, which was removed in a refactor)
- Zero dependencies, compiles to Wasm with cosmo libc.
- VM supports multiple languages and sharing libraries between languages
- The main language that targets MiniVM is self-hosted on MiniVM; there's a minimal os built on MiniVM.
- Fast startup times make it good for embedded scripting, like lua.
- The ISA itself is also really interesting; I'd recommend reading the C if that's your thing.
The reason MiniVM is so benchmark-oriented in its presentation is that MiniVM is very benchmark-oriented in my workflow.
I have tried to find a faster interpreter without a JIT. It is hard to find benchmarks that don't just show MiniVM being 2-10x faster than the interpreted languages.
Yeah, it’s a bit disappointing how slow most non-JIT interpreters are -- I think JIT understandably took the wind out of their sails.
The state of the art as I understand it (but maybe you’ve figured out something faster?) is threaded interpreters. I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.
> I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.
Fortunately the Forth community write about their techniques, so you don't have to just eyeball gforth's source-code. A handful of links for those interested:
So I was talking about threaded code with Shaw the other day, and he gave me a long explanation about the various tradeoffs being made and the resulting techniques used in MiniVM. As I understand it, MiniVM is basically threaded code, with cached opcodes. Essentially, it uses precomputed gotos:
This optimization makes MiniVM a tad faster than raw computed gotos, because it's essentially threaded code with some optimizations. He can probably expand on this; if you want to hear the full really interesting explanation yourself, he's pretty active in the Paka Discord[0].
That feels like the sort of optimisation that wins in small cases and loses in larger ones. The jump array is going to be 4 or 8 times the size of the opcode array so will put a lot more pressure on the caches as your programs get larger.
Years ago I wrote a small toy interpreter based on predereferenced computed gotos with surprisingly good speed. For very small programs without dynamic data allocation churn it could run as fast as 3 times slower than compiled code instead of the expected 10X slower. Good things happen speed-wise when all the byte code and data fits into the L1 cache. Also, branch prediction is near perfect in tight loops.
Binary trees is a classic GC benchmark - ideally a GC'd language should be able to do better than a malloc/free implementation in C (tree.c) since it can release the whole tree in bulk (like what the ptree.c does).
Also note that tree.c speeds up substantially if you link with a better malloc like jemalloc or tcmalloc.
head_of_loop:
load_var x
push_int 1000
less_than
jump_if_false :end_of_loop
load_var x
increment
store_var x
jump :head_of_loop
end_of_loop:
I'd more expect a stack engine to be something like -
load_var x
head_of_loop:
jump_ge_keep 1000 :end_of_loop
increment
jump :head_of_loop
end_of_loop:
store_var x
in the sense that anything that can figure out how to populate and use r0 would likely also be able to figure out it can wrap the load/store calls around the outside of the loop.
(I'm far from an expert on stack versus register VMs though so I may be missing several somethings here)
Well, sure, but I was thinking along the lines of "reserving a spot on the stack and reserving a register seem like about the same level of clever", especially thinking about how mixed register/stack argument passing works in even relatively stupid C compilers.
Eliding the entire loop seems relatively similarly clever for either architecture, but would've counted as cheating in the frame I was pondering from.
Still neat to see even if x86 asm continues to make my eyes cross ;)
The main interpreter loop looks a lot like Python's. It uses the same preprocessor tricks and jump tables to help both the reader and the compiler make sense of the flow. It's also neat that integers are unboxed, at the expense of their top two bits being reserved as a type hint.
Not only the reader and the compiler, that kind of threaded code really helps performance thanks to leveraging the CPU's branch target buffer to obtain great branch prediction of interpreted code, and that is one of the killer aspects to get current CPUs to perform well.
Hard to tell whether it supports proper tail calls or not. The readme mentions Scheme, so it's not impossible I suppose! One of the nice things about wasm is the formal semantics; a project like this could benefit from one of those, too.
Author here. Tail calls can be implemented in MiniVM bytecode. There was once an instruction that performed a tail call; it bit-rotted due to not being used.
MiniVM is moving very quickly currently. A lot still needs to be done, like JIT and optimizers.
Eventually a formal specification of MiniVM should happen, just not yet as things are changing too fast.
>Built on a register-based ISA that beats luajit—with the JIT on—in some benchmarks
This is interesting, considering that LuaJIT in interpreter mode is famous for being faster than simple JITs. Worth looking into what the differences are.
Looks like an interesting project, but I can't find any documentation on how to actually use it (or on the paka language either). Is it just still in early stages and documentation hasn't been written yet?
I've written some documentation for it in the past, but had to scrap it because things tend to move so quickly that it's better to read the code and ask Shaw than use incorrect documentation. As things stabilize, these issues will be resolved, but it's very bleeding-edge right now.