The thing is, if the register machine can have a single getvar instruction which takes an immediate operand and a register, why can't the stack machine also have a single instruction which takes an immediate operand and pushes the variable's value onto the stack:
pushvar "a"
pushvar "b"
add
popvar "c"
Now the instruction stream looks denser: just four instructions, and none of them reference a register, so they can have a smaller encoding, perhaps.
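To make the dispatch concrete, here's a minimal sketch in Python of an interpreter for exactly those four instructions (the opcode names follow the example above; the tuple encoding is just an illustration, not a real VM format):

```python
# Minimal stack-VM sketch: pushvar/popvar carry an immediate variable
# name; add is a pure stack op with no operands at all.
def run(program, env):
    stack = []
    for op, *args in program:
        if op == "pushvar":      # immediate operand: push the var's value
            stack.append(env[args[0]])
        elif op == "popvar":     # immediate operand: pop into the var
            env[args[0]] = stack.pop()
        elif op == "add":        # operand-free: works on the stack top
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
    return env

env = run([("pushvar", "a"), ("pushvar", "b"), ("add",), ("popvar", "c")],
          {"a": 2, "b": 3})
print(env["c"])  # 5
```

Note that only the two var instructions need operand decoding; the add case is a single dispatch with no operand fetch.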
Vice versa, the register machine's instruction stream density can be puffed up if we don't have a getvar instruction that takes an immediate operand, but only a register pair.
It's like the classic RISC vs CISC debate. The example could turn into a single instruction:
addvar "a", "b", "c"
I think a hybrid stack/register architecture gives the best code density - something like the x87, where you can choose to access the registers like a stack or by index directly. Look at some of the sub-256-byte demos for a good example of x87 code density.
> instruction stream density can be puffed up if we don't have a getvar instruction that takes an immediate operand, but only a register pair
It's hard to imagine this being a concern. If you are in a position to choose whether to use a register or stack VM, you are certainly in a position to add such an instruction (assuming the encoding allows it, etc).
You make a very astute point. I'd like to expand on this a bit.
The key difference between traditional stack ops and register ops is the opcode space. Stack machines are typically designed with a small number of primitive ops that operate on the top of the stack. For example, adding two local vars and storing the result into a local var might look like:
In source code:
x = x + y;
In stack ops:
getlocal 0
getlocal 1
add
setlocal 0
In register ops:
add r0, r1, r0
The key thing to note is that the "add" instruction in the register machine encodes a lot more information. Given a register pool of 32, for example, it encodes 15 bits of extra information (three 5-bit operands) compared with the "add" instruction in the stack example. The corresponding "getlocal" and "setlocal" instructions are collapsed into the add itself.
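The density trade-off can be made concrete with a back-of-the-envelope count. Assuming a 1-byte opcode and 1-byte operands (a common compact encoding; the exact sizes are an assumption, not from either VM):

```python
# Hypothetical byte counts for x = x + y under a 1-byte-opcode,
# 1-byte-operand encoding. Each pair is (mnemonic, operand_count).
stack_ops = [("getlocal", 1), ("getlocal", 1), ("add", 0), ("setlocal", 1)]
register_ops = [("add", 3)]  # add r0, r1, r0: three register operands

stack_bytes = sum(1 + n for _, n in stack_ops)       # 4 dispatches
register_bytes = sum(1 + n for _, n in register_ops)  # 1 dispatch
print(stack_bytes, register_bytes)  # 7 4
```

So under these assumptions the register form is both smaller and needs a quarter of the dispatches, which is exactly the collapse described above.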
As you point out, there's NOTHING stopping a stack machine designer from doing the same thing. It's just traditionally not done, perhaps due to some inclination to keep things "clean and orthogonal".
One could imagine a stack opcode space that provided the following add instructions:
add_ppp - traditional "stack machine" add. Pop, pop, push.
add_lll <R> <L> <X> - extended "locals" add. Load rhs and lhs from locals R and L, store to local X.
add_ppl <X> - hybrid. pop rhs and lhs from stack, store result in local X.
add_ssl <R> <L> <X> - hybrid. treat the stack as a register file indexed from the top. Load rhs and lhs from stack locations R and L, store result into local variable X.
and so on and so on.
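A dispatch loop for these hybrid opcodes is straightforward; here is a sketch in Python using the variant names from the list above (the argument encoding is hypothetical, and locals are modeled as a flat array):

```python
# Hybrid "fat" stack opcodes: one add per addressing-mode combination.
def step(op, args, stack, locals_):
    if op == "add_ppp":           # traditional: pop, pop, push
        rhs, lhs = stack.pop(), stack.pop()
        stack.append(lhs + rhs)
    elif op == "add_lll":         # locals R and L -> local X, stack untouched
        r, l, x = args
        locals_[x] = locals_[r] + locals_[l]
    elif op == "add_ppl":         # pop both operands, store to local X
        (x,) = args
        rhs, lhs = stack.pop(), stack.pop()
        locals_[x] = lhs + rhs
    elif op == "add_ssl":         # peek stack at depths R and L -> local X
        r, l, x = args
        locals_[x] = stack[-1 - r] + stack[-1 - l]

locals_ = [10, 20, 0]
step("add_lll", (0, 1, 2), [], locals_)
print(locals_[2])  # 30

stack = [5, 7]
step("add_ssl", (0, 1, 2), stack, locals_)
print(locals_[2])  # 12
```

Notice that add_lll and add_ssl never touch the stack pointer, so the "stack" machine here is doing register-style three-address work.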
One could imagine for every binary operation, there would be a number of variants:
op_IIR
where I could be any of {p, s, l, a, c} for {pop stack, peek stack, load localvar, getarg, constant pool}, and R could be any of {p, l, a, v} for {push stack, store localvar, setarg, and void (discard)}.
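It's worth seeing how fast that scheme eats opcode space. Enumerating the combinations from the sets above:

```python
# 5 addressing modes per input, 4 per result: every binary op
# needs 5 * 5 * 4 opcode variants under this scheme.
inputs = ["p", "s", "l", "a", "c"]   # pop, peek, local, arg, const pool
results = ["p", "l", "a", "v"]       # push, local, arg, void
variants = [i1 + i2 + r for i1 in inputs for i2 in inputs for r in results]
print(len(variants))  # 100
```

So a single binary operator already consumes 100 opcodes, which is why a real design would likely pick only the profitable subset rather than the full cross product.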
I've run measurements of how a hypothetical opcode space like this would influence the size and number of ops for expressions, and basically the result is that it's strictly better than both: it produces smaller code than typical stack machines, and fewer instructions than typical register machines.
Basically by inflating the opcode space in a stack machine, and using it to encode operand information, you can achieve the best of both worlds. You get to use the stack as an implicit scratchspace when computing intermediates in expressions, and you get to use your constant pool and locals and arguments directly when computing expression leaves and when storing expression results.
I think where you win with the register machine is in compiling expressions where you have lots of common subexpressions.
Once you have the value of a repeated subexpression in a register, you can just refer to that register.
On a stack machine, your "register" is the top of the stack, and each time you refer to it with a pop, you lose it.
a = expr1 + expr2;
b = expr1 - expr2;
c = f(expr1, expr2);
Register code:
# code to evaluate expr1 to R1
# code to evaluate expr2 to R2
add R5, R1, R2
sub R6, R1, R2
call f # R1-R4 are also used for param passing
store R5, a
store R6, b
store R1, c
Stack:
# code to evaluate expr1 to top of stack
# code to evaluate expr2 to top of stack
# now what? we want to keep expr1 and expr2 in the
# stack and use these values, too, while
# also leaving intermediate
# results in the stack. "dup, rot, swap hell".
The obvious thing is to create compiler-generated temporary local variables, and shunt out of the stack into these. But that's just registers in disguise!
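That spill-to-temps approach looks like this in practice; a sketch in Python, with e1/e2 standing in for the already-evaluated subexpressions and t0/t1 as the compiler-generated temps (all names hypothetical):

```python
# Stack code for a = e1 + e2; b = e1 - e2, compiled by spilling the
# shared subexpressions into temps t0/t1 -- registers in disguise.
def run(program, env):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(env[args[0]])
        elif op == "pop":
            env[args[0]] = stack.pop()
        elif op == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "sub":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs - rhs)
    return env

prog = [
    ("push", "e1"), ("pop", "t0"),   # evaluate expr1 once, spill
    ("push", "e2"), ("pop", "t1"),   # evaluate expr2 once, spill
    ("push", "t0"), ("push", "t1"), ("add",), ("pop", "a"),
    ("push", "t0"), ("push", "t1"), ("sub",), ("pop", "b"),
]
env = run(prog, {"e1": 7, "e2": 3})
print(env["a"], env["b"])  # 10 4
```

Every reuse of a subexpression costs a push from its temp, which is exactly the getlocal traffic the register machine avoids.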
Basically, register files have the undeniable advantage of random access, which works favorably when expressions are DAGs rather than trees.
Stack architectures (at least Forths) typically have multiple stacks, which tend to make both locals and stack-shuffling hell unnecessary most of the time. In gforth, for example, you can use the return stack to retain temporaries:
: expr1 1 ;
: expr2 2 ;
: f * ;
: test
expr1 expr2 2>r \ retain both results temporarily
2r@ + \ copy both to data stack as needed
2r@ -
2r@ f
2rdrop \ dispose of temporaries
;
test .s \ 3 -1 2
While we're at it, maybe we might relate this to consuming values by using them, and creating explicit copies when we need to consume them more than once. This avoids having to deal with any kind of garbage collection or reference counting, but you have to do the copying of the whole data structure instead of copying the (sometimes/often) smaller pointers to the data itself (random access).
There was some work on this by Henry Baker, called Linear Lisp. It looked like Forth with parentheses.
I'm not an expert on interpreters, but from the description it sounds like register machines are very CPU cache unfriendly, whereas stacks are by their nature very cache friendly. (It's a complicated subject, but the basic idea is: contiguous memory is cache friendly, random access is not). Given how catastrophic cache misses can be for performance (LOTS of CPU stalling), I imagine the instruction density really doesn't mean much compared to keeping the CPU fed with instructions.
The part of all this that is hard is that, when you're working at this abstraction level, you're not writing for an abstract machine, you're writing for real computers that are highly pipelined and heavily dependent on relatively small caches.
Register machines have no particular problems with cache due to random access. The typical arrangement is that there are instructions that allocate and deallocate a block of N contiguous slots from a stack, and "registers" refer to storage within the currently allocated block. This isn't any worse than a stack machine.
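That frame arrangement is easy to sketch: "registers" are just offsets into a block carved out of one contiguous slot array per call (the class names and API here are illustrative, not any particular VM's):

```python
# Register frames as contiguous slots: Lua-style register windows,
# sketched. Frame allocation is a bump of one shared slot array.
class VM:
    def __init__(self):
        self.slots = []

class Frame:
    def __init__(self, vm, n):
        self.vm, self.base = vm, len(vm.slots)
        vm.slots.extend([0] * n)          # allocate N contiguous slots

    def __getitem__(self, r):             # register r -> slot base + r
        return self.vm.slots[self.base + r]

    def __setitem__(self, r, v):
        self.vm.slots[self.base + r] = v

    def free(self):                       # deallocate on return
        del self.vm.slots[self.base:]

vm = VM()
f = Frame(vm, 4)
f[0], f[1] = 2, 3
f[2] = f[0] + f[1]   # "add r2, r0, r1" touches only contiguous memory
result = f[2]
print(result)  # 5
f.free()
```

Successive frames are adjacent in memory, so register access has the same locality as a data stack.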
There is a real issue with instruction density, however, in that the interpreter will touch many cache lines containing instructions in an unpredictable order. An inefficient encoding will result in overhead due to cache lines being taken away from program data to serve the interpreter.
- a stack machine that JITs to registers (JVM, C#)
- a register machine whose registers match that of the target architecture (I think Lua is or was an example here)
- a register machine with infinitely many registers that JITs to one whose registers match that of the target architecture (LLVM). I think the jury is still out on whether it is possible to make a truly fast JITter here, but I expect that is just a matter of lack of effort.
Problem with the middle approach is that a portable JITter that wants to use the full power of a CPU must handle both the case where the hardware has fewer registers than the VM and the case where it has more. The other two only have to handle one of these cases. That's why they are more popular for multi-platform solutions.
And of course, there is the case of JITting to stack CPUs. For practical purposes, one can mostly ignore that case nowadays.
I'm not sure it makes much difference for JIT: by the time the code has gone into the JIT's IR and had multiple optimisation passes applied to it, I would imagine it'll be in either SSA or PDG form, and what format the bytecodes came in will be long forgotten.
In a VM, be careful about how you define "stack" and "register", because it's really easy to get definitions where most machines are both. Or, in other words, that's one area where the details are much more important than the overall picture.
Anyway, the entire problem comes from the fact that we are currently JITing the same code again and again. That's the best issue to attack. Save your JITed code for improving it later, and the entire discussion goes away. (And yes, that'll probably require some help from the OS)