
Most of the crazy parts of CPUs (out-of-order execution, speculative execution, branch predictors, the cache hierarchy) are ways of trying to work around slow memory. Improving the execution logic lets the core speculate farther ahead and thus prefetch sooner. Compressing the instruction encoding lowers the need to fetch instructions.



Most of those, except the cache, are attempts to work around the bottleneck of a single flow of control and thus the limited available parallelism.


Unfortunately all experience shows that both programmers and compilers are much worse at parallelizing code than CPUs are. There have been many attempts at architectures that allowed compilers or programmers to express code in more parallel ways (VLIW architectures, Intel's Itanium, IBM's Cell used in the PS3) and they all categorically failed.

Successful processors offer a sequential execution model, and handle the parallelism internally. Even CUDA is designed somewhat like this: you express your code in a largely linear fashion, and rely on the NVIDIA-created compiler and on the GPU itself to run it in parallel.


CUDA is quite different. In CUDA you do not express parallel code in a linear (i.e. sequential) fashion and hope that CUDA will determine the dependence chains, extract them from the sequential code, and execute them in parallel.

The main feature of CUDA is that in order to describe an algorithm that is applied to an array, you just write the code that applies the algorithm to an element of that array, like writing only the body of a "for" loop.

Then the CUDA run-time will take care of creating an appropriate number of execution threads, taking into account the physical configuration of the available GPU, e.g. how many array elements can be processed by a single instruction, how many execution threads can share a single processing core, how many processing cores exist in the GPU, and so on. When there are more array elements than the GPU resources can process at once, CUDA will add some appropriate looping construct, to ensure the processing of the entire array.

The CUDA programmer writes the code that processes a single array element, thereby informing the CUDA run-time that this code is independent of its replicas that process any other array element, except where the programmer explicitly references other array elements, which is normally avoided.
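
A minimal sketch of what that looks like, with made-up names and a simple element-wise scale-and-add standing in for the "body of the for loop":

    // Hypothetical element-wise kernel: each thread runs the "loop body" for one element.
    __global__ void scale_add(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
        if (i < n)                                      // guard threads past the end of the array
            y[i] = a * x[i] + y[i];
    }

    // Launch enough blocks to cover the array; the runtime maps them onto the GPU's cores.
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // scale_add<<<blocks, threads>>>(d_x, d_y, 2.0f, n);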

The task of a CPU able to do non-sequential instruction execution (a.k.a. out-of-order execution) and simultaneous initiation of multiple instructions (a.k.a. superscalar instruction execution) is quite different.

The main problem for such a CPU is the determination of the dependence relationships between instructions, based on examining the register numbers encoded in the instructions. Based on the detected dependencies and on the availability of the operands, the CPU can schedule the execution of the instructions, in parallel over multiple execution units.
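
A toy illustration of that scheduling freedom (a plain C-level sketch with hypothetical function names): both loops below compute the same sum, up to floating-point rounding, but the first is one long dependence chain, while the second exposes two independent chains that an out-of-order, superscalar core can issue in parallel across its execution units.

    // One long chain: each add depends on the previous result, so the adds serialize.
    float sum_serial(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];                   // depends on the previous iteration's s
        return s;
    }

    // Two independent chains: the adds into s0 and s1 have no dependence on each
    // other, so the hardware can keep two adders busy at once.
    float sum_two_chains(const float *x, int n) {
        float s0 = 0.0f, s1 = 0.0f;
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += x[i];                  // chain 0
            s1 += x[i + 1];              // chain 1
        }
        if (n & 1) s0 += x[n - 1];       // odd leftover element
        return s0 + s1;
    }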

It remains an open research problem whether there is any better way to pass information about the dependencies between instructions from the compiler, which already knows them, to the CPU that runs the machine code, other than by using fake register numbers whose only purpose is to express the dependencies, and which the CPU's renaming mechanism must replace for execution with the numbers of the real physical registers.
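
For what renaming buys, a minimal C-level sketch (hypothetical variables; a compiler may well assign two registers here anyway, but the pattern is real in register-starved code): reusing one name for two unrelated values creates write-after-read / write-after-write "name" dependences that carry no data, and the renamer breaks them by mapping each write to a fresh physical register.

    // One name, two unrelated values: only a false (name) dependence links them,
    // and register renaming lets both halves execute in parallel.
    void reuse_name(int a, int b, int c, int d, int *x, int *y) {
        int t;
        t = a + b;      // value 1
        *x = t * 2;
        t = c + d;      // value 2: same architectural name, no data dependence on value 1
        *y = t * 3;
    }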


Agreed, CUDA is very different. But still, it's another example where programmers just write sequential single flow of execution code, and it gets executed in parallel according to "external" rules.


Out-of-order execution doesn't imply parallelism; it was designed from the beginning to work around the input-data-availability problem.

With speculative execution and branch prediction, prefetching may seem like just a nice bonus, but given that CPU performance nowadays is largely bottlenecked by memory access, the prefetching that results from these techniques often turns out to be the dominant performance factor.


It's still a form of parallelism that could in principle be written into the program instead of being implemented automatically by the processor.

For example, in hand-crafted assembly programs it's common to know how long a fetch operation takes and to manually schedule other operations so that they execute in parallel with that fetch.
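
A rough C-level sketch of the same overlap, using a software-prefetch hint rather than hand-scheduled assembly (the lookahead distance of 8 is a made-up tuning parameter; __builtin_prefetch is a GCC/Clang hint): start the fetch for a future element early, then do useful work on the current element while that fetch is in flight.

    void square_plus_one(float *a, int n) {
        const int DIST = 8;                        // made-up lookahead distance
        for (int i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST]);  // kick off the fetch early (GCC/Clang builtin)
            a[i] = a[i] * a[i] + 1.0f;             // compute overlaps with the outstanding fetch
        }
    }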

Theoretically, a high-level language could also be designed to expose this kind of logic to the programmer. A program in such a language would be expressed as a set of very short threads that the compiler can interleave given precise knowledge of instruction timings.


OoO came about after multi-issue architectures were starved for instructions to execute due to on-chip blockers like execution-unit availability, data hazards, pipeline bubbles, register availability, branching, and cache ports. You can call those input-data-availability problems, but it's not availability from off-chip memory. So in actual history, yes, it was for parallelism (keeping multiple execution units busy).

OoO did have the side benefit of possibly executing past a few memory stalls, but those were secondary. OoO reordering resources were sized for addressing stalls from on-chip-timescale things. Today the resources are bigger, but even bigger is the relative memory latency (how many instructions you could execute in the time it takes to service a main-memory fetch that your pipeline depends on).
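
For a rough sense of scale (illustrative numbers only): with ~100 ns to service a main-memory miss on a 4 GHz core that can issue 4 instructions per cycle, that's 100 ns × 4 GHz × 4 ≈ 1,600 instruction slots idled behind a single dependent fetch, well beyond what any current reorder buffer can cover.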


Would it make sense to compress code and data with something like zstd and let the CPU decompress?

Of course this means a large change to how computers work, but perhaps it is possible to make this opt-in (i.e. backwards compatible) for software?


This would make memory read performance much, much more unpredictable, so it is a no-go from the start. And beyond that, the problem is not one of bandwidth, it is one of latency. This would increase bandwidth sometimes, but it would increase latency always, which is a terrible trade-off. Missed branch predictions would cost even more than they do today, for example.


This idea isn't about compressing in-flight, but in the instruction cache, so that more instructions will fit into the cache, and you don't need to fetch as often (and thus incur latency) from main memory / L2.

Zstd is impractical, but I can imagine some sort of storage efficient microcode? (current Intel CPUs store original x86 instructions in the L1 instruction cache).


You then need extra memory to store the decompressed instructions, since you can't predict, before running the decompression, how many instructions you'll get. And you still have the problem of an unpredictably sized instruction cache.

Plus, the larger the instruction cache is, the worse every branch mis-prediction is. As far as I know, the size of the instruction cache is not really limited because of space issues, it's limited for precisely this reason.


Some instruction sets are variable length; there are also things like Thumb and the 'compressed' RISC-V extension.

If you compress each instruction one at a time into a variable number of bits, the ability to jump to any instruction is preserved, but compression is hurt by not being able to use cross-instruction conditional probabilities and by having to pack things into integer numbers of bits.

One could imagine a compression format that had 'jump points' -- compiler-selected locations that it is possible to jump to and start decoding from, so that you'd only take the above losses at potential jump targets.

You could go further and imagine the instruction set having some kind of constrained short jump forward over a limited distance (by simply letting the decoder decode that far ahead), or one that can go back a limited distance without using a jump point, so long as there is no way for control flow to change out from under it.

I wonder what percentage of instructions would need to be jump points in such a system?

But I think this is mostly of academic interest: except in very small embedded devices, code size is not a huge performance driver, and like anything else you get diminishing returns from further optimization. Variable-length, multiple-of-bytes instruction sets like x86 probably capture a significant fraction of the potential compression gains, and do so without a lot of complexity.


x86 is a variable-length encoding where more frequently used instructions tend to have shorter encodings. Thumb doesn't go as far, with only 2- and 4-byte instructions. You can find old papers on Huffman-encoding instructions.

More effective block compression schemes are harder to pull off because of branches.


The problem here is branches: since you can jump to pretty much any instruction, every instruction needs to be a valid place to start decoding, which makes the idea untenable.





