
VLIW works _really_ well on a deterministic ISA. As in, when you know at compile time how long a load will take, because it's N clock cycles from that type of memory.
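A toy sketch of that static-scheduling idea in Python (everything here is made up for illustration: the 3-cycle load latency, the instruction mix, and the greedy list scheduler). With a known latency, the compiler can fill the load's delay slots with independent work instead of emitting stalls:

```python
# Toy static scheduler: with a known 3-cycle load latency, the compiler
# can place independent instructions in the delay slots instead of stalling.
LOAD_LATENCY = 3  # assumption: a load from this memory takes exactly 3 cycles

# (op, dest, deps) — a load's result becomes ready LOAD_LATENCY cycles later
program = [
    ("load", "r1", []),        # r1 = mem[a]
    ("add",  "r2", ["r1"]),    # depends on the load
    ("mul",  "r3", []),        # independent work
    ("sub",  "r4", []),        # independent work
]

def schedule(prog):
    ready_at = {}              # register -> cycle its value is available
    cycle, order, pending = 0, [], list(prog)
    while pending:
        # pick the first instruction whose operands are ready this cycle
        for i, (op, dest, deps) in enumerate(pending):
            if all(ready_at.get(d, 0) <= cycle for d in deps):
                pending.pop(i)
                order.append((cycle, op, dest))
                ready_at[dest] = cycle + (LOAD_LATENCY if op == "load" else 1)
                break
        else:
            order.append((cycle, "nop", None))   # nothing ready: stall
        cycle += 1
    return order

for cycle, op, dest in schedule(program):
    print(cycle, op, dest)
```

Here the `mul` and `sub` slot into the cycles the load is in flight, so the schedule has no nops. If the load latency were variable, the compiler would have to schedule for the worst case (or stall), which is why determinism matters so much.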

Otherwise I really like the GPU-derived form where N instructions are ready to go and the hardware picks one whose prerequisite memory access has completed, despite that latency being variable.




> VLIW works _really_ well on a deterministic ISA

that's a very good point.

> Otherwise I really like the GPU-derived form where N instructions are ready to go and the hardware picks one whose prerequisite memory access has completed, despite that latency being variable

Hmm, isn't that OOO (out of order) execution? And isn't that something GPUs explicitly don't do? (AFAIUI)


I'm probably describing it poorly. A GPU (at least the AMD ones) has N execution units (of a few different types) and M threads (wavefronts) waiting to be scheduled onto them. Each cycle (ish), hardware picks one of the M threads, runs it for an instruction, and suspends it again. Memory stalls / waits are handled by not scheduling the corresponding thread until it's ready. The operation might be a CAS over PCIe, so the latency is really variable.

Provided you have enough (ideally independent) work items / tasks / coroutines available to enqueue, and you're designing for throughput instead of latency, that keeps the execution units or memory bandwidth fully occupied. Thus, win.
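That latency-hiding effect can be sketched as a toy Python model (all numbers here are invented: the 10–40 cycle "memory" latency, the instruction mix, the wavefront counts). One wavefront leaves the issue slot idle during every stall; many wavefronts cover each other's stalls:

```python
# Toy model of GPU-style latency hiding: M wavefronts, and each cycle the
# scheduler issues one instruction from any wavefront not waiting on memory.
import random

MEM_LATENCY_RANGE = (10, 40)   # variable latency, e.g. a CAS over PCIe
INSTRS_PER_WAVE = 8            # every other instruction is a load

def simulate(num_waves, seed=0):
    rng = random.Random(seed)
    wait_until = [0] * num_waves            # cycle each wave is next runnable
    remaining = [INSTRS_PER_WAVE] * num_waves
    cycle = issued = 0
    while any(r > 0 for r in remaining):
        runnable = [w for w in range(num_waves)
                    if remaining[w] and wait_until[w] <= cycle]
        if runnable:
            w = runnable[0]                 # hardware picks any ready wave
            remaining[w] -= 1
            issued += 1
            if remaining[w] % 2 == 0:       # this instruction was a load
                wait_until[w] = cycle + rng.randint(*MEM_LATENCY_RANGE)
        cycle += 1
    return issued / cycle                   # issue-slot utilization

print(f"1 wavefront:   {simulate(1):.0%} utilization")
print(f"16 wavefronts: {simulate(16):.0%} utilization")
```

With a single wavefront, most cycles are spent waiting on memory; with sixteen, there is almost always some wavefront ready to issue, so utilization climbs toward 100% even though individual latencies stay unpredictable.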

The x64-style model is a deep pipeline plus branch prediction / speculation, and it all goes slow when the guess fails. In contrast, the GPU doesn't have to worry about guessing a branch destination incorrectly and unwinding the wasted work. It does have to worry about having enough work queued up, and about distributing tasks across the compute units (AMD) / streaming multiprocessors (NVIDIA), and the hardware-picks-the-order behavior leaks into the concurrency model a bit.


Nice clear answer, thanks.


Wouldn’t this whole problem be solved by a graph-based architecture, like the dataflow graphs some compilers already use as an intermediate representation? Then the CPU would see where and when everything is needed and could schedule the loading of registers as optimally as is feasible, at runtime.



