The "everyone gets a stack" implementation feels awkward and limiting.
One of the first non-trivial simulators that I wrote was an implementation of Icon's control structures using heap-allocated frames.
While heap-allocating a frame is more expensive than incrementing/decrementing a stack pointer, managing lots of heap-allocated execution contexts is faster than managing lots of execution stacks, especially in a 32-bit world where address space for all those stacks is scarce.
The simulator still had an ordinary stack for each thread, which was used mainly for running libc stuff.
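
To make the idea concrete, here is a minimal sketch in Python. It is not the simulator's code: `ToFrame` and `round_robin` are hypothetical names, and the frame models only something like Icon's `lo to hi` generator. The point it illustrates is that when a frame's state lives in a heap object, the frame survives after producing a value and can be resumed later, so many suspended contexts can coexist without each one owning a stack.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToFrame:
    """Heap-allocated frame for a hypothetical `lo to hi` generator."""
    lo: int
    hi: int
    next_value: Optional[int] = None  # saved resume state; None = not started

    def resume(self) -> Optional[int]:
        """Return the next value, or None to signal failure (exhausted)."""
        if self.next_value is None:
            self.next_value = self.lo
        if self.next_value > self.hi:
            return None
        value = self.next_value
        self.next_value += 1
        return value


def round_robin(frames: List[ToFrame]) -> List[int]:
    """Interleave many suspended frames; none of them needs its own stack."""
    out: List[int] = []
    live = list(frames)
    while live:
        still_live = []
        for frame in live:
            value = frame.resume()
            if value is not None:
                out.append(value)
                still_live.append(frame)  # keep the frame for the next pass
        live = still_live
    return out


print(round_robin([ToFrame(1, 3), ToFrame(10, 11)]))  # [1, 10, 2, 11, 3]
```

Each suspended context here costs one small object, whereas the "everyone gets a stack" approach would have to reserve a whole stack region per context up front.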