
Indeed.

My personal thought is that we should design a CPU where these kinds of pipelines / execution units are more explicit, and then write magic compilers that can pull parallelism out of our programs into the more explicit form that this new CPU would prefer. You'd still be tied to an architecture, but moving to a new architecture (e.g., 2x the SIMD pipelines in the future) would be as easy as recompiling, in theory.

Then I realized that I've reinvented VLIW / Intel Itanium. And that's a silly, silly place and we probably shouldn't go there again :-p

--------

The MIMD (multiple-instruction, multiple-data) abilities of modern CPUs are quite amazing in any case, and it's always fun to take advantage of them. Even with a single instruction stream, as in this example, it's obvious that modern CPUs have gross parallelism at the instruction level.

It's a bit of a shame that these high-performance toys we write are kind of unsustainable... requiring in-depth assembly knowledge and microarchitecture-specific concepts to optimize (concepts that often become obsolete as these designs inevitably change every 5 years or so). Then again, it's probably a good idea to practice writing code at this level, to remind us that the modern CPU is in fact a machine with defined performance characteristics that we can take advantage of...




A sufficiently smart compiler can obviously do any optimization that we can think of, but in practice the kind of optimization described in this article is very difficult to actually implement in an optimizing compiler. The CRC algorithm literally operates one bit at a time, in sequential order. As the article mentions, there are some properties of how the CRC math works that allow you to calculate the CRC of multiple chunks of a stream and then combine them in a specific way. This is the kind of thing that an optimizing compiler just isn't going to be able to figure out on its own given a naive CRC implementation, because it's a math insight much more so than a code generation trick.

In principle you could probably use some kind of symbolic math solving library to try to detect optimization opportunities like this in general code. In practice it just wouldn't be worth it, even at -O3, because it would add a ton of CPU and memory overhead to compilation and the optimization would be very rarely applicable in the first place.
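
For the curious, here's a minimal sketch of that "CRC the chunks separately, then combine" property, using zlib's crc32() and crc32_combine() (my own toy example, not the article's code):

    #include <stdio.h>
    #include <zlib.h>

    int main(void) {
        const unsigned char data[] = "some longer buffer that we want to checksum in two pieces";
        uInt len  = (uInt)(sizeof(data) - 1);
        uInt half = len / 2;

        /* Reference: CRC over the whole buffer in one go. */
        uLong whole = crc32(0L, data, len);

        /* CRC each half independently -- these two calls have no data
           dependency on each other, so they could run interleaved or on
           separate cores. */
        uLong c1 = crc32(0L, data, half);
        uLong c2 = crc32(0L, data + half, len - half);

        /* Stitch the two partial CRCs together. */
        uLong combined = crc32_combine(c1, c2, (z_off_t)(len - half));

        printf("whole=%08lx combined=%08lx\n", whole, combined);
        return (whole == combined) ? 0 : 1;
    }

crc32_combine() does the GF(2) polynomial math to advance the first CRC past the length of the second chunk -- exactly the sort of insight a compiler won't derive on its own from a naive bit-by-bit loop.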


This "run two loops in parallel" pattern is an _incredibly_ common pattern for these "high speed benchmarks". A whole lot of different programs in the high-speed space (this CRC32, the lookup3 Jenkins Hash, My own AES random number generator, etc. etc.).

Instead of the programmer manually thinking of the "two independent pipelines", it's quite possible to imagine a language where the two pipelines are programmed separately, and then a compiler merges them together knowing about the pipeline details (e.g., Skylake is 1-per-clock throughput / 3-clock latency, AMD might be different, like 1-per-clock throughput / 4-clock latency or something).

The programmer's job would be to "separate out the two independent threads of execution", while the compiler's job is to "merge them back together into one instruction stream".

Much like how SIMD code is written today: the programmer is responsible for finding the parallelism, and the compiler / machine is responsible for efficient execution.

--------

As it is, today's high-speed programmers have to do both tasks simultaneously. We have a mental model of the internal registers, pipelines, throughputs, and latencies of the different execution units, and we manually schedule our code to match that mental model. (But that mental model changes as new CPUs come out every few years.)

The hard part is figuring out the parallelism. I don't think we have a language that describes this fine-grained parallelism, though. Just a "what if we lived in a magically perfect world" kind of hypothetical here...

EDIT: Alternatively, you can "cut dependencies" and hope that the compiler discovers your (intended) short dependency chains. I.e., manually unroll loops and whatnot, which works better in practice than you might think (and such manual unrolling often seems to trigger the AVX autovectorizer in my experience, if you're lucky).
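
A minimal sketch of that "cut the dependencies" trick (again my own toy example, assuming plain C and floats): a manually unrolled sum with four independent accumulators, which compilers will often happily vectorize:

    #include <stddef.h>

    /* Four independent accumulators = four short dependency chains instead
       of one long one. Compilers will often vectorize this at -O2/-O3,
       whereas the single-accumulator float loop is harder for them because
       floating-point addition isn't associative. */
    float sum4(const float *x, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += x[i + 0];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)   /* leftover elements */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }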


> a CPU where these kinds of pipelines / executions are more explicit

A VLIW-style architecture is found on signal processors and other dedicated number-crunchers, where people are actually going to write custom assembly routines fully exploiting the hardware. For example, the original NEC uPD7720 from 1980 has separate fields in its instruction word. You get two ALU operations (where the operands don't depend on each other) and an address operation per instruction, all performed in parallel in one clock cycle.

Intel later took that idea for the i860 (long before Itanium): the i860 could run in a mode where it always fetched two instructions per cycle, one integer and one floating-point. This turned out to be very difficult for compiler code generation to exploit.

Another ten years later, compiler technology has improved a lot, right? Maybe it'll work now. Cue Itanium. Ah. Maybe not. The idea is probably going to be revisited every decade or so until compiler technology has actually advanced enough to pull it off :P


Wasn't this the Gentoo philosophy?

You downloaded the system source code and recompiled everything targeted to your exact CPU specs.


Well, it's a bit different when you design a CPU around the Gentoo philosophy as opposed to an OS around the Gentoo philosophy.

When a CPU is designed around that, everyone needs to recompile every generation for maximum performance. The theory is that bytecode like Java's recompiles efficiently, though (or NVidia's PTX, a SIMD bytecode that gets recompiled for each NVidia GPU generation; a higher-performance example with more support from the hardware).

------

Intel Itanium was a 2000s-era design that was supposed to be like this, but AMD's x86-64 ended up being faster in practice.

NVidia's PTX really changed things, as did GPU programmers' habit of writing very, very small "programs" (called kernels) that are managed by a separate chip (i.e., the CPU manages the kernels, compiling / calling them as appropriate). It works out in GPU land, but maybe it will never work in CPU land (unless CPU-land picks up on this kernel-invoke abstraction? Intel ispc has the model, for example, as does OpenMP target offload... and thread pools and Go arguably have it as well).
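
For reference, the kernel-invoke shape with OpenMP target offload looks roughly like this (an illustrative sketch only; saxpy and the buffer names are made up, and whether it actually offloads to a GPU or falls back to the host depends on the toolchain):

    #include <stddef.h>

    /* A "kernel" handed off to whatever device the OpenMP runtime targets
       (a GPU if one is available, otherwise the host). The pragma marks the
       loop as the kernel body; the map() clauses describe the data movement. */
    void saxpy(float a, const float *x, float *y, size_t n) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }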


A CPU built around the Gentoo philosophy would look like https://github.com/SpinalHDL/VexRiscv ;). Don't want an MMU? Fine. Need a larger RAM interface? You got it. Barrel ALU for DSP? Sure.

Interpreted languages work by consolidating all of the optimization effort in the interpreter. This is similar to how CPUs work now: instead of extremely specific optimizations that are hard to create, distributed among all of our code, we use very general optimizations that push the limits of the mathematics, centralized in the CPU.

-----

Itanium had a lot of contemporary issues that made it not work, and I would certainly blame Intel's business practices and reputation for a large part of it. There are likely niches for such processors; VLIW is useful for DSP or graphics. Indeed, the only extant VLIW processor (that I know of) is the Russian Elbrus. I think the VLIW is only included to let them reuse a lot of the core logic of the CPU to drive a DSP engine, useful for radar and scientific simulation, though the scientific simulations would probably use commercial hardware, which would be faster.

It works on GPUs because they're doing DSP, basically. We could have weirder topologies for GPUs, however, like a massive string of ALUs driven off an embedded core, so you try to kachunk all your data through in a single clock domain after configuring the ALU string.


If you designed the instruction set around making parallelism and pipelines fast for the CPU to parse, in a way that encouraged and accelerated out-of-order scheduling, you'd have something with interesting potential that is very different from Itanium.


There are some very interesting flags at the lowest level of NVidia SASS (machine code) that the PTX compiler adds.

https://arxiv.org/pdf/1804.06826.pdf

PDF page 14 (physical page 12) shows that every instruction in NVidia Volta SASS has 4 bits of "Reuse flags", 6 bits of "Wait barrier mask", 3 bits of "Read Barrier Index", 3 bits of "Write Barrier Index", a 1-bit "Yield Flag", and 4 bits of "Stall Cycles".

> Wait barrier mask; Read/Write barrier index. While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. Volta, Pascal and Maxwell use dependency barriers to track the completion of variable-latency instructions and resolve data hazards. When a variable-latency instruction writes to a register, the assembler associates it to one of the 6 available barriers by setting the corresponding write barrier number field. When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that barrier in the wait barrier mask. The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the wait barrier mask is a bitmask, not an index.

--------

It seems like NVidia has invented something very interesting here. I don't quite understand it myself, but it's quite possibly what you're talking about.

__EVERY__ instruction gets these control fields on NVidia machines Volta and newer (with one control word per 3 instructions on Pascal, and per 7 on older Kepler systems). So these "bundles" could be seen as a "better Itanium" back in Kepler/Pascal, but perhaps the modern NVidia GPU core is fast enough to have such pre-compiled dynamic behavior / barrier interpretation on every instruction these days.

It seems like a lot of instruction bandwidth to eat up, though, since each instruction on Volta is 128 bits / 16 bytes long because there's so much control information.



