
The benchmarks were run on macOS, and actually execute an interrupt for debugging; macOS then checks whether the process is being debugged. Wasm3 just calls exit(1) and prints a message.

And as to why the rest are faster: I spent a lot of time optimizing the interpreter and learning what the best way to write interpreters is. It's mostly jump threading and Mixed Data.


I found that most Wasm interpreters are not particularly good at calls. Wizard is not as fast as wasm3 or wamr in raw speed, but is much faster on calls, particularly because it does not copy arguments (value stacks can be overlapped). But Wizard's primary motivation is to be memory efficient, so it interprets in-place. It also supports instrumentation.

Nice work!


I made Web49 because there are not many good tools for WebAssembly out there. WABT is close, but the interpreter is too slow and the tools are megabytes in size each. Wasm3 is a bit faster but only contains an interpreter, nothing else.

Tooling for WebAssembly is held mostly by the browser vendors. It is such a nice format to work with when one removes all the fluff. WebAssembly tooling should not take seconds to do what should take milliseconds, and it should be able to be used as a library, not just a command line program.

I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly. Web49 started as a way to compile WebAssembly to MiniVM, but soon pivoted into its own interpreter and tooling. I could not be happier with it in its current form and am excited to see what else it can do, with more work.


> I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly.

I'd be very interested to read more about this. It looks like you are using "one big function" with computed goto (https://github.com/FastVM/Web49/blob/main/src/interp/interp....). My experience working on this problem led me to the same conclusion as Mike Pall, which is that compilers do not do well with this pattern (particularly when it comes to register allocation): http://lua-users.org/lists/lua-l/2011-02/msg00742.html

I'm curious how you worked around the problem of poor register allocation in the compiler. I've come to the conclusion that tail calls are the best solution to this problem: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
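
For onlookers, a minimal sketch of what that tail-call style looks like, assuming clang's musttail attribute (the opcode set, handler names, and calling convention here are made up for illustration, not taken from any real interpreter):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct vm vm_t;
    typedef void (*op_fn)(vm_t *vm, const uint8_t *ip, int64_t *sp);
    struct vm { op_fn dispatch[256]; };

    /* Every handler shares one signature, so the hot state (ip/sp) rides in
       argument registers, and the jump to the next handler is a guaranteed
       tail call rather than a real call. */
    #define DISPATCH() __attribute__((musttail)) return vm->dispatch[*ip](vm, ip, sp)

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void op_push(vm_t *vm, const uint8_t *ip, int64_t *sp) {
        *sp++ = ip[1]; ip += 2; DISPATCH();
    }
    static void op_add(vm_t *vm, const uint8_t *ip, int64_t *sp) {
        sp--; sp[-1] += sp[0]; ip += 1; DISPATCH();
    }
    static void op_print(vm_t *vm, const uint8_t *ip, int64_t *sp) {
        printf("%lld\n", (long long)sp[-1]); ip += 1; DISPATCH();
    }
    static void op_halt(vm_t *vm, const uint8_t *ip, int64_t *sp) {
        (void)vm; (void)ip; (void)sp;
    }

    int main(void) {
        vm_t vm = { .dispatch = { [OP_PUSH] = op_push, [OP_ADD] = op_add,
                                  [OP_PRINT] = op_print, [OP_HALT] = op_halt } };
        const uint8_t code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        int64_t stack[16];
        vm.dispatch[code[0]](&vm, code, stack); /* prints 5 */
        return 0;
    }

Each handler is small enough for the register allocator to do a good job, and musttail guarantees the dispatch compiles to a jump rather than a call, so the stack never grows.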


> that compilers do not do well with this pattern

As compared to hand-written assembly or the tailcall technique you describe. But (for the benefit of onlookers) a threaded switch, especially using (switch-like) computed gotos, is still more performant than a traditional function dispatch table.

Has there been any movement in GCC wrt the tailcalls feature?

One of the limitations with computed gotos is the inability to derive the address of a label from outside the function. You always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table. Several years ago when exploring this space I discovered a hack, albeit one that only works with GCC (IIRC), at least as of ~10 years ago. GCC supports inline function definitions, inline functions have visibility to goto labels (notwithstanding that you're not supposed to make use of them), and most surprisingly GCC also supports attaching __attribute__((constructor)) to inline function definitions. This means you can export a map of goto labels that can be used to initialize VM data structures, permitting (in theory) more efficient direct threading.
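
For reference, the conventional baseline being described (before the constructor hack) looks roughly like this: a sketch using the GCC/Clang computed-goto extension with made-up names, where the only way to get the label table out of the function is an extra init-mode call, i.e. the superfluous conditional above:

    #include <stddef.h>

    /* A direct-threaded toy: "code" is an array of label addresses. Call with
       code == NULL once at startup to fetch the label table; that check is the
       "superfluous conditional" mentioned above. */
    void **interp(void **code) {
        static void *labels[] = { &&op_nop, &&op_halt };
        if (code == NULL)
            return labels;              /* init mode: export the label table */

        void **ip = code;
        goto **ip++;
    op_nop:
        goto **ip++;
    op_halt:
        return NULL;
    }

    /* Usage sketch:
         void **labels = interp(NULL);
         void *prog[] = { labels[0], labels[0], labels[1] };   // nop, nop, halt
         interp(prog);                                         // run it
    */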

The tailcall technique is a much more sane and profitable approach, of course.


The goto labels can be exported much more directly using inline asm. Further, inline asm can now represent control flow, so you can define the labels in inline asm and the computed jump at the end of an opcode. That's pretty robust to compiler transforms. Just looked up an interpreter in that style:

#define LABEL_START(TAG) ns_##TAG : __asm__(".p2align 3\n.Lstart_" #TAG ":" :::)

#define LABEL_END(TAG) __asm__(".Lend_" #TAG ":\n")

#define PROLOGUE(TAG) LABEL_START(TAG); ip++

#define EPILOGUE(TAG) __asm__ goto("\tjmpq %0\n" "\t.Lend_" #TAG ":\n"::"r"((void *)decode(ip))::ALL_LABELS())

Followed by opcodes implemented in this fashion:

  {
    PROLOGUE(add);
    {
      apply_opcode_ADD(&s->data_stack);
    }
    EPILOGUE(add);
  }
Because the labels are defined in assembly, not in C, accessing them from outside the function is straightforward. I wrote a whole load of these at some point, there's probably a version of those macros somewhere that compiles to jumps through a C computed goto as well.


> My experience working on this problem led me to the same conclusion as Mike Pall, which is that compilers do not do well with this pattern

Note that that message is from twelve years ago. A lot's changed since then, not just in compilers but in CPUs. Branch prediction is a lot better now.


Mike's primary complaint is bad register allocation. It is very important to keep the most important state consistently in registers. In my experience, compilers still struggle to do good register allocation in big and branchy functions.

Even perfect branch prediction cannot solve the problem of unnecessary spills.


Very true. I imagine that grouping instructions that use the same registers into their own functions would help with that (arithmetic expressions tend to generate sequences like this). Then you loop within this function while the next instruction is in the same group, and only return to the outer global instruction loop otherwise. If you design the bytecode carefully, you can probably do group checks with a simple bitmask.
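
A rough sketch of what that grouping could look like, with made-up opcode numbering so the group test is a single mask (this is just to illustrate the shape of the idea, not any particular VM):

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2,         /* group 0: arithmetic      */
           OP_PUSH = 8, OP_PRINT = 9, OP_HALT = 10 };  /* group 1: everything else */

    /* Opcodes 0..7 form the arithmetic group, so the group check is one mask. */
    #define IS_ARITH(op) (((op) & ~7) == 0)

    /* Tight inner loop: only arithmetic state, good odds it stays in registers. */
    static const uint8_t *run_arith(const uint8_t *ip, int64_t **spp) {
        int64_t *sp = *spp;
        do {
            switch (*ip++) {
            case OP_ADD: sp--; sp[-1] += sp[0]; break;
            case OP_SUB: sp--; sp[-1] -= sp[0]; break;
            case OP_MUL: sp--; sp[-1] *= sp[0]; break;
            }
        } while (IS_ARITH(*ip));        /* stay in the group as long as we can */
        *spp = sp;
        return ip;
    }

    static void run(const uint8_t *ip) {
        int64_t stack[16], *sp = stack;
        for (;;) {
            uint8_t op = *ip;
            if (IS_ARITH(op)) { ip = run_arith(ip, &sp); continue; }
            ip++;
            switch (op) {
            case OP_PUSH:  *sp++ = *ip++; break;
            case OP_PRINT: printf("%lld\n", (long long)sp[-1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);  /* prints 5 */
        return 0;
    }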


Does providing a hint to the compiler using the register keyword address the issue sufficiently?


No, most compilers ignore the register keyword, see: https://stackoverflow.com/a/10675111


Nearly. You need register and to also pass them into (potentially no-op) inline asm. `register int v("eax")` iirc, but it's been years since I did this.

The 'register' is indeed largely ignored, but it has the additional somewhat documented meaning of 'when this variable goes into inline asm, it needs to be in that register'. In between asm blocks it can be elsewhere - stack or whatever - but it still gives the regalloc a really clear guide to work from.
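
Roughly, the pattern looks like this (GCC/Clang extension, x86-64 register names, and a made-up two-opcode "VM" purely for illustration):

    #include <stdint.h>

    /* Pin the interpreter's hot state to specific registers and "touch" it with
       an empty asm each iteration; between asm statements the compiler may move
       things around, but the asm operands must be in rbx/r12 at that point. */
    static long run(const uint8_t *code) {
        register const uint8_t *ip asm("rbx") = code;
        register long acc asm("r12") = 0;

        for (;;) {
            asm volatile("" : "+r"(ip), "+r"(acc));  /* no-op, but constrains regalloc */
            switch (*ip++) {
            case 0: acc += *ip++; break;   /* made-up "add immediate" */
            case 1: return acc;            /* made-up "halt" */
            default: return -1;
            }
        }
    }

    int main(void) {
        const uint8_t prog[] = { 0, 2, 0, 3, 1 };  /* add 2, add 3, halt */
        return (int)run(prog);                     /* exit status 5 */
    }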


It's `register int v asm("eax")`. However they are very easily elided, especially after higher optimization levels; compilers are very open about this [1].

[1] https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables....


I read a research paper showing that the branch prediction issues are a non-issue with modern predictors (e.g. ITTAGE). It is of course true that register spills happen, but it's not bad enough to make hand-written assembly worth it. Especially when you simulate AOT-compiled code (e.g. RISC-V and WASM), you will already be 3-10x faster than Lua. For my purposes of using this kind of emulator for scripting, it is already fine.

Throw instruction counting into the mix, and you can even be faster than LuaJIT, although I'm not sure how it manages to screw up the counting so badly. I wrote a little bit about it here: https://medium.com/@fwsgonzo/time-to-first-instruction-53a04...


The interview in question, for those interested: https://odin-lang.org/news/newsletter-2022-12/#interview-wit...


I have looked into Meta Machines. They are not what I want.

Meta Machines can be almost as fast, but I just prefer the cache locality of local functions.

Wasm3 is slower than MiniVM when you don't count the GC. Wasm3 is very cool!


How do the architectures of the two interpreters differ?

Not trying to challenge your system but learn.

Can Wasm be translated to run on MiniVM? If MiniVM is pure C it can run on Wasm.


There is no roadmap. There are no written long-term goals. This project came out of a need for a faster virtual machine for my Paka language. There is an API to emit VM bytecode, but it's not in C; it is from Paka.

The roadmap could go like this though: more core opcodes (bit ops, closures), splitting overloaded opcodes, more frontends, an optimizer, a dynamic compiler, and a JIT.


Author here. Tail calls can be implemented in MiniVM bytecode. There was once an instruction that performed a tail call; it bit-rotted due to not being used.

MiniVM is moving very quickly currently. A lot still needs to be done, like JIT and optimizers.

Eventually a formal specification of MiniVM should happen, just not yet as things are changing too fast.


Speaking of wasm, any idea how feasible it would be to convert wasm to minivm bytecode and use it as a wasm runner?


Thank you for having a look.

The reason MiniVM is so benchmark-oriented in its presentation is that MiniVM is very benchmark-oriented in my workflow.

I have tried to find a faster interpreter without JIT. It is hard to find benchmarks that don't just say that MiniVM is 2-10x faster than the interpreted languages.


Yeah, it’s a bit disappointing how slow most non-JIT interpreters are -- I think JIT understandably took the wind out of their sails.

The state of the art as I understand it (but maybe you’ve figured out something faster?) is threaded interpreters. I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.


> I suggest looking at Forth implementations, which is where I think most interpreter innovation has happened.

Fortunately the Forth community write about their techniques, so you don't have to just eyeball gforth's source-code. A handful of links for those interested:

https://www.complang.tuwien.ac.at/papers/

https://github.com/ForthPapersMirror/Papers

https://www.complang.tuwien.ac.at/projects/forth.html

https://www.complang.tuwien.ac.at/anton/euroforth/

http://soton.mpeforth.com/flag/jfar/vol7.html

http://soton.mpeforth.com/flag/fd/

http://www.forth.org/literature.html

http://www.forth.org/fd/contents.html


So I was talking about threaded code with Shaw the other day, and he gave me a long explanation about the various tradeoffs being made and the resulting techniques used in MiniVM. As I understand it, MiniVM is basically threaded code, with cached opcodes. Essentially, it uses precomputed gotos:

    // classical computed goto: z = pc, y = bytecode, x = table of label addresses
    goto *x[y[z]];
    // MiniVM precomputed goto: a = bytecode with opcodes replaced by label addresses
    goto *a[z];
This optimization makes MiniVM a tad faster than raw computed gotos, because it's essentially threaded code with some optimizations. He can probably expand on this; if you want to hear the full really interesting explanation yourself, he's pretty active in the Paka Discord[0].
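
For onlookers, a minimal self-contained sketch of the precomputed-goto idea in C (GCC/Clang computed-goto extension; this is an illustration of the technique, not MiniVM's actual code or instruction format):

    #include <stdint.h>
    #include <stdio.h>

    enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    typedef struct {
        void   *label;   /* filled in by the pre-translation pass below */
        enum op op;
        int64_t imm;
    } insn_t;

    static void run(insn_t *code) {
        static void *labels[] = { &&do_push, &&do_add, &&do_print, &&do_halt };

        /* Pre-translate once: classical dispatch would be
               goto *labels[ip->op];    (two dependent loads)
           after this pass it is just
               goto *ip->label;         (one load) */
        for (insn_t *i = code;; i++) {
            i->label = labels[i->op];
            if (i->op == OP_HALT) break;
        }

        int64_t stack[16], *sp = stack;
        insn_t *ip = code;
    #define DISPATCH() goto *ip->label
        DISPATCH();
    do_push:  *sp++ = ip->imm;                      ip++; DISPATCH();
    do_add:   sp--; sp[-1] += sp[0];                ip++; DISPATCH();
    do_print: printf("%lld\n", (long long)sp[-1]);  ip++; DISPATCH();
    do_halt:  return;
    }

    int main(void) {
        insn_t prog[] = {
            { 0, OP_PUSH, 2 }, { 0, OP_PUSH, 3 },
            { 0, OP_ADD, 0 },  { 0, OP_PRINT, 0 }, { 0, OP_HALT, 0 },
        };
        run(prog);  /* prints 5 */
        return 0;
    }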

[0]: https://discord.gg/UyvxuC5W5q


That feels like the sort of optimisation that wins in small cases and loses in larger ones. The jump array is going to be 4 or 8 times the size of the opcode array so will put a lot more pressure on the caches as your programs get larger.


Years ago I wrote a small toy interpreter based on pre-dereferenced computed gotos, with surprisingly good speed. For very small programs without dynamic data allocation churn it could run as little as 3x slower than compiled code, instead of the expected 10x. Good things happen speed-wise when all the bytecode and data fits into the L1 cache. Also, branch prediction is near perfect in tight loops.


Sounds like a great test case for using a tensorflow-lite model to switch between both techniques.


MiniVM will JIT eventually. Right now the biggest barrier is the VM snapshots. Currently MiniVM has its own stack and heap and can snapshot them from any point in the program.

One thing to note about MiniVM is that it has very strong types. V8 has many steps to go through when it needs to perform something like addition (weak typing). MiniVM supports only 4 basic types.

MiniVM's reference frontend, Paka, is fully self-hosted; it is currently the only large MiniVM program.


That sounds like a solid basis for adding asymmetric coroutines as well; is that something you're thinking about adding?

I like the philosophy and gumption shown by this project but coroutines aren't a feature I'd lightly give up.

Digging what I see so far!


MiniVM uses coroutines already; every 1000 branches the VM will return to the scheduler. The build system can accept CFLAGS+=-DVM_BRANCH_DEFER, which makes MiniVM able to run a piece of bytecode for up to N instructions.

It is not exposed at the instruction level yet, though. It would be quite easy to add.
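
A rough sketch of the idea, assuming a hypothetical switch-based loop (made-up opcodes, not MiniVM's actual implementation): the interpreter carries a branch budget and returns to its caller when the budget runs out, so the caller can act as the scheduler:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_DEC, OP_JNZ, OP_HALT };              /* made-up opcodes */
    typedef enum { YIELDED, HALTED } status_t;
    typedef struct { const uint8_t *ip; int64_t acc; } task_t;

    /* Run until the task halts or the branch budget is used up. */
    static status_t run(task_t *t, int branch_budget) {
        for (;;) {
            switch (*t->ip) {
            case OP_DEC:
                t->acc--; t->ip++; break;
            case OP_JNZ: {                         /* jump back ip[1] bytes if acc != 0 */
                uint8_t off = t->ip[1];
                if (t->acc != 0) {
                    t->ip -= off;
                    if (--branch_budget == 0) return YIELDED;
                } else {
                    t->ip += 2;
                }
                break;
            }
            case OP_HALT:
                return HALTED;
            }
        }
    }

    int main(void) {
        const uint8_t code[] = { OP_DEC, OP_JNZ, 1, OP_HALT };
        task_t t = { code, 5000 };
        int yields = 0;
        while (run(&t, 1000) == YIELDED)           /* a stand-in "scheduler" */
            yields++;
        printf("done after %d yields\n", yields);  /* 4 yields of 1000 branches */
        return 0;
    }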


Ah thank you, that's very promising! It was the kind of thing I wanted to hear; I write Lua every day, and four data types didn't appear to support 'thread'.

I might just kick the tires on this over the weekend.


That’s cool. Have you played around with that number? I’m curious how you picked that constant and what the trade-offs are.


It might be interesting to be able to tell the VM to resume into code for exactly N instructions or branches. I can see instrumentation/debugging tooling using this with N=1 for example.


Question on VM snapshotting: what’s the purpose/point in even having such an ability? What does it allow you to do?

I only know of snapshotting perhaps being necessary to support coroutine based context switching.

Thanks and very cool project!


A snapshot is the entire state of a program at a single moment in time. Continuations are basically exposed snapshots, i.e. taking a snapshot, storing it in a variable, doing some work, and then 'calling' the snapshot to return to an earlier point. Continuations like these let you implement a naive version of single-shot delimited continuations - coroutines! This can be very useful for modeling concurrency.

Aside from coroutines and continuations, snapshots are neat for distributed computing: spin up a vm, take a snapshot, and replicate it over the network. You could also send snapshots of different tasks to other computers to execute. In the context of edge computing, you could snapshot the program once it's 'warm' to cut back on VM startup time.

Snapshots allow you to peek into your program. Imagine a debugger that takes snapshots on breakpoints, lets you inspect the stack and heap, and replays the program forward from a given point in a deterministic manner. You could also send a snapshot to a friend so they can run an application from a given point on their machine. If you do snapshots + live reloading there are tons of other things you can do (e.g. live patching and replaying of functions while debugging).


To do these kinds of things on Linux, check out https://www.criu.org/Main_Page

Checkpoint/Restore In Userspace


One nifty thing you can do with snapshots in games is use it for save/restore/undo.

See for example the Z-machine used for interactive fiction, and its younger cousin Glulx: https://www.eblong.com/zarf/glulx/Glulx-Spec.html#saveformat

(Apart from that, I’d note that Glulx and the Z-machine are both terrific examples of detailed VM specifications. Glulx is impressive because it was built by one person, the Z-machine possibly even more impressive as it was reverse-engineered by insanely dedicated Infocom fans.)


Out of curiosity, are you planning on (progressively, slowly) rolling your own JIT, or using something like DynASM (https://luajit.org/dynasm_features.html), libFirm (https://pp.ipd.kit.edu/firm/Features.html), or some other preexisting thing (eg https://github.com/wdv4758h/awesome-jit) in the space?

FWIW, I understand that LuaJIT gets some of its insane real-world performance from a JIT and VM design that's effectively shrink-wrapped around Lua semantics/intrinsics - it's not general-purpose. I've read (unfortunately don't remember exactly where) that people have tried to run off with the JIT and VM and use it in other projects, but never succeeded because of the tight coupling.

In the same way, while a bespoke 64/32-bit x86+ARM JIT would be a reasonable undertaking, it could make for a pretty interesting target with a potentially wide-ranging set of uses.

For example, it could be the VM+JIT combination that all those people trying to dissect LuaJIT were looking for :).

I could see something like this becoming an attractive option for games that want an exceptionally simple runtime. Sort of like a scaled-down NekoVM (a la Haxe).

Broadly speaking, I get the (potentially incorrect) impression (from a very naive/inexperienced POV) that MiniVM+JIT would be looking to close a more widely-scoped, higher-level loop out of the box than something like libFirm would be (despite the very close correlation in functionality when looking at libFirm). So it'd be closer to Cling (https://root.cern/cling/) than raw LLVM, perhaps (albeit with 0.01% of the code size :D). It is for this reason that I kind of pause for a minute and ponder that a fully integrated JIT could be a pretty good idea. It would absolutely make the irreducible complexity of the project balloon, with reasonable motivation likely necessary to maintain cohesion.

On a different note, if I were to backseat-drive for a minute :) the first thing I'd rant about is how much modern JITs need trivial ways to verify code correctness, both online (how exactly was *this* specific generated code constructed - so, straightforward logging) but also (and particularly) offline (humans staring at the JIT source code and mentally stepping through its behavior - and succeeding (miracles!!) because the code is small and comprehensibly architected). If the JIT was implemented in such a straightforward manner, end users wanting to run potentially malicious user-supplied code with high performance in potentially security-sensitive settings might be attracted to the project. (Mike Pall made bank for a while while CloudFlare was using LuaJIT for its WAF*... ahem...)

I came across this reference of how to break out of LuaJIT 2.1 (2015) a while back: https://www.corsix.org/content/malicious-luajit-bytecode - and every time I take a look at the code I switch away from the tab :) (and sometimes step away from the computer for a minute :D). It's solely a demonstration of "this is how it would work", and clarifies that LuaJIT makes no sandbox guarantees about the code it executes, but reading through it, the amount of Stuff™ going on represents a surface area that to me (naively) seems just... like LuaJIT as a whole is generally too large to easily reason about from a security standpoint (oh yeah, besides being written in assembly language...). This might be inexperience speaking, but I can't help but wonder whether a smaller, simpler implementation might be able to implement a secure JIT; for all I know this might be an impossible P=NP pipe dream I haven't fully grasped yet. I guess what I'm trying to figure out is whether "small enough for non-100x programmers to mentally reason through" and "large enough to do a few practical things quickly" have practical overlap?

[* CloudFlare only ran internally-generated code through LuaJIT, for completeness. A VM+JIT that could safely run untrusted code would thus be even more interesting than LuaJIT was to CloudFlare, from a general perspective. ]

---

Something completely different that I randomly discovered recently and which I thought I'd idly mention is that JITs might seem to have a bit of a hard time on Android in certain obscure circumstances. I can't (yet) tell if this is LuaJIT-specific or "anything that's a JIT"-specific: KOReader (Android eBook reader, implemented entirely using LuaJIT - https://github.com/koreader/android-luajit-launcher) has a bunch of very scary magic Things™ it seems to need to do to make LuaJIT even work at all on Android (https://github.com/koreader/android-luajit-launcher/blob/mas...), due to an apparently-current issue causing problems across different domains (https://github.com/LuaJIT/LuaJIT/issues/285), which has apparently been cropping up going back years (https://www.freelists.org/post/luajit/Android-performance-dr... (2013)). KOReader has even nontrivially patched LuaJIT's C code in places (https://github.com/koreader/android-luajit-launcher/blob/bb0...) with purposes I am yet to fully understand (it might just be for debugging). I happened to be considering idly playing around with Lua on Android (currently fascinated with interpreted/JITed runtimes) and after stumbling on this I'm debating whether to use plain Lua instead, ha. I've been meaning to ask around on the LuaJIT list and wherever KOReader discusses stuff to learn more, after spending the time actually getting LuaJIT linked into an Android project and poking around. Haven't got to it yet. This could be an absolutely massive red herring that I'm going deer-in-headlights about because it just looks off-in-the-weeds, or potentially significant. It might also be absolutely irrelevant to not-LuaJIT pursuits, but as I noted I'm not (yet) experienced enough to tell.

