Hacker News
How close is RISC-V to RISC-I? (2017) (archive.org)
73 points by cpeterso on Nov 5, 2022 | 29 comments



Much closer than I would have expected. But what's this about RISC-V not having a jump with a register and offset address? Doesn't JALR have a 12-bit offset that's added to the base register rs1? How is that different from RISC-I's JMP COND, S2(Rx)? Does COND mean the jump is conditional, like JMPR EQ?

Interestingly since this article most of the "In RISC-V, but not RISC-I" instructions have been moved to extensions: Zicsr for the CSR instructions and Zifencei for the FENCE.I instruction (which turns out not to be useful for user processes running on a multicore processor with a multitasking OS like FreeBSD or Linux). That leaves only AUIPC, SBREAK (now called EBREAK), and FENCE. So it's even closer now than it was in 02017.


> Does COND mean the jump is conditional, like JMPR EQ?

Yes. A better generic description would have been "branch with register + offset", but the RISC-I-specific language uses "Jump" for everything.


So what's missing is a conditional branch to a runtime-computed address.
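A toy Rust model of the two RISC-V semantics makes the gap concrete (a sketch only: JALR's write of the return address to rd is left out, and the names and types are illustrative, not the real encodings). JALR is always taken but gets a register + offset target; the conditional branches are taken or not taken, but only to a PC-relative immediate, so a conditional jump to a computed address costs two instructions (a branch around a JALR).

    // Unconditional jump to rs1 + imm with bit 0 cleared, as in RISC-V JALR
    // (the link-register write is omitted in this sketch).
    fn jalr(pc: &mut u64, rs1: u64, imm: i64) {
        *pc = rs1.wrapping_add(imm as u64) & !1;
    }

    // Conditional, as in BEQ: taken or not, but the target is PC-relative only.
    fn beq(pc: &mut u64, rs1: u64, rs2: u64, imm: i64) {
        *pc = if rs1 == rs2 {
            (*pc).wrapping_add(imm as u64)
        } else {
            (*pc).wrapping_add(4)
        };
    }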


Has anyone tried to find (or derive) an ideal set of instructions for a processor architecture, under some assumptions about arithmetic etc.? For example, you could try to minimise the number of instructions it takes to encode some set of nontrivial programs. (Of course, how this would translate to silicon is another question.)


To minimize the number of instructions for some finite set of nontrivial programs, you define the instruction set to have one instruction that performs each of the programs. Then each of those programs is one instruction long. (Other programs may or may not be possible to encode at all.)

If the set of nontrivial programs is infinite, the situation is more complex, because an instruction set as normally defined is finite. Also, you can't equally weight every program from an infinite set, so you need some kind of weighting function among the programs. Which instruction set is ideal for encoding this infinite set of programs depends almost entirely on what that weighting function is. In fact, you can usually derive the optimal instruction set directly from the weighting function. Which is to say, when you design the weighting function, you are more or less designing the instruction set.

As retrac points out, it comes down to a question of tradeoffs. There's no objectively optimal answer to which tradeoffs you should prefer.

I took some notes some years ago on the subject of dense instruction sets, optimized for the number of bits to encode programs rather than the number of instructions; perhaps you will find them of interest: https://dercuano.github.io/notes/tiny-interpreters-for-micro...


> (Other programs may or may not be possible to encode at all.)

Well, you would want your instruction set to be Turing complete, but given that the stated goal was to optimize only for those programs for which you already added the super-specialized instructions, you can trivially add whatever "boring" instruction set you like and have uselessly checked that box as well.

Bonus points if you use an instruction from a One Instruction Set Computer (https://en.wikipedia.org/wiki/One-instruction_set_computer ) just for style, and to claim that you have further minimized the instruction set for any program outside the stated domain!


> Well you would want your instruction set to be Turing complete

Okay, add one more instruction TURX, which executes a Turing machine.


Hah, even better.


Classical Turing machines have terrible runtime complexity.


Nobody claimed otherwise!


Thanks for the notes. Just to clarify (especially since other respondents misinterpreted it as a less interesting question): I was thinking about the number of operations in the binary, not the number of distinct operations in the instruction set. This is probably equivalent to your case of number of bits, as long as you don't have some crazily large instruction set.


Ideal in what regard? Easy for humans to code or for compilers to generate? Code size or decoding simplicity? Parallelism or cache friendliness? Energy efficiency or low CPI? Flexibility? Future-proofing or backward compatibility? In theory you only need one instruction (subleq); everything else is a highly narrow optimization attempt.
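If anyone wants to see how little that one instruction is, here is a minimal subleq interpreter sketch in Rust (the flat i64 memory and the "halt on a negative branch target" convention are assumptions of the sketch, not part of any standard):

    // subleq a, b, c:  mem[b] -= mem[a]; if the result is <= 0, jump to c,
    // otherwise fall through to the next three-word instruction.
    fn run_subleq(mem: &mut [i64]) {
        let mut pc: i64 = 0;
        while pc >= 0 {
            let (a, b, c) = (mem[pc as usize], mem[pc as usize + 1], mem[pc as usize + 2]);
            let diff = mem[b as usize] - mem[a as usize]; // the single operation
            mem[b as usize] = diff;
            pc = if diff <= 0 { c } else { pc + 3 };      // branch or fall through
        }
    }

For example, run_subleq(&mut [3, 3, -1, 5]) zeroes the data cell at address 3 and then halts. Everything else (addition, copying, unconditional jumps) has to be synthesized from that one operation, which is why it's a lower bound on instruction-set size rather than a practical target.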


I remember a lot of work in this area in the late 80s before RISC really took off.

If you search for "frequency analysis of instruction sets" in Google Scholar you should find something.


CPUs are general-purpose computers. "An ideal set of instructions" depends on what you're designing for. If you already know what you're computing, you design an ASIC.


Many have tried to derive an optimal instruction set, and RISC-V is such an attempt. But so far every attempt has fallen short, and one can only choose the way one will fail. The only possible exception is the Mill architecture, but that architecture requires a grand scale.

One common bottleneck shaping instruction set design is the need to pull in constants (indeed, the instructions themselves are constants). One could theoretically share them between functions and think of smart ways to reduce the amount of code. But in practice this is futile: the modern CPU is a race car on a highway, if not a train on rails, so it can't be bothered to stop and deal with that kind of smartness. There's just too much inertia behind you... once you approach the speed of light. The road itself should carry all the information needed to let the CPU drive the correct track. And no special cases.

In addition, almost all CPUs are designed to execute a C-style machine abstraction, because that's the paradigm. Though I root for the Mill, my guess is we'll only see a genuinely different instruction set once the paradigm shifts towards another execution model, likely one that is less stack-bound.


Continuing...

For all the performance a modern CPU has, and all the visual magic it can do, for most of us our PCs are still just as fast as 20 years ago. That is something that needs mending first. 100x faster, yet where did all those cycles go? Seems pathological to me.

Tech has been operating like politics: ever more promises. The vector that realizes itself from those promises is an increase of dependency, an increase of control and surveillance, and an erosion of self-reliance. They become stronger, not us. The paradigm forces you to think that the heroes of the paradigm, like big tech, are the solution, while they equally increase the problem. When a "Jedi instruction set" comes, it is likely OoO, like a Christmas child.


> for most of us our PCs are still just as fast as 20 years ago. That is something that needs mending first. 100x faster, yet where did all those cycles go? Seems pathological to me.

Nobody would buy a PC that’s 100 times as fast but doesn’t have new abilities. 20 years ago, Blu-ray didn’t exist yet; when it arrived, it featured unprecedented full HD video. Today, 4K is more or less the norm. The original iPhone (about 15 years old) had a 480 × 320 pixel screen. The latest ones have 1000+ × 2500+, at ¿quadruple? the screen refresh rate and ¿double? the number of bits per pixel.

Progress is even more enormous with cameras. Resolution has gone up relatively moderately (for the iPhone, from 2 megapixels ⇒ 48 megapixels), but the software behind it couldn’t realistically run on 20-year-old hardware.


I wish someone more competent than myself had taken you up on this. Now that we have big die sizes, is it possible to use some of that real estate to create a fast core with a different instruction set (with much less to worry about) that we can slip into with maybe just one app at a time and let 'er just zoom?

Even if we had to reserve some RAM exclusively for that core when it's not in use, I'd still want it for the times I have a very demanding task I'm waiting on.

This doesn't address all the bloat, of course, but I'd like it.

OoO? Overlooked? Overwritten?


In terms of length and density at least, the optimal solution is something like compression with Huffman coding, as it is fundamentally a compression question, where we want to represent the most frequent operations and operands with the fewest bits, on average. As with any question of compression, the optimal format will depend on the nature of the compressed data. So at least in that domain, there is no truly ideal solution, only approximately ideal solutions. (The approximately ideal solution is quite good though -- in practice, Huffman-encoded instructions for stack machines with a rich operation set, based on an average instruction mix, come out around 4 bits per instruction on average, IIRC. I've seen the approach used with a VM interpreter on a memory-constrained embedded system to pack in more code.) I suspect the same is true of other criteria you might select for what makes an ideal instruction set; it's always going to involve compromising trade-offs, some of which depend on runtime properties like the actual instruction mix.
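To make that averaging concrete, here is a toy Huffman construction in Rust; the opcode mix below is invented purely for illustration (a real stack ISA has far more operations), not measured from any workload:

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    fn main() {
        // Hypothetical opcode counts per 1000 executed instructions.
        let mix: [(&str, u64); 7] = [("load", 250), ("add", 200), ("branch", 200),
                                     ("other", 150), ("store", 100), ("call", 50),
                                     ("dup", 50)];
        let total: u64 = mix.iter().map(|&(_, c)| c).sum();

        // Huffman: repeatedly merge the two lightest subtrees.  Each merge adds
        // one bit to every symbol inside it, so summing the merged weights gives
        // the total encoded length.
        let mut heap: BinaryHeap<Reverse<u64>> =
            mix.iter().map(|&(_, c)| Reverse(c)).collect();
        let mut total_bits: u64 = 0;
        while heap.len() > 1 {
            let (Reverse(a), Reverse(b)) = (heap.pop().unwrap(), heap.pop().unwrap());
            total_bits += a + b;
            heap.push(Reverse(a + b));
        }
        println!("average opcode length: {:.2} bits", total_bits as f64 / total as f64);
    }

The same machinery immediately shows the dependence on the data: change the frequency table and the optimal code lengths change with it.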


Would it be possible to have fast stack-machine CPUs nowadays (let's say, competing with x64 CPUs), or is there some inherent benefit to register machines?


I think that one of the main issues with stack machines is that the order of instructions in the original instruction stream is constrained by the instruction format - you can't as easily software-schedule instructions (it's harder to express "I have all the information required to load that value from memory around now, so I'll start it now because it might take a while").

Now, any high-performance stack machine is likely to decompose stack code into register renaming and micro-ops, so some of that scheduling is going to happen anyway (as it would in a register machine), and some stack ops (like dup) are going to disappear into the renamer (but then a compiler would do that as CSE on a register machine).


I'm pretty sure a GA144 can do more additions or ANDs per second than any comparably-sized register-machine microcontroller: IIRC it's about a million transistors (comparable to an 80486, seven AVRs, or nine ARM9TDMIs) and can do about a hundred billion 18-bit additions per second on a few milliwatts. It definitely doesn't decompose stack code to register renaming and micro-ops.

Unfortunately you can't program it either in C or in Verilog, and it doesn't have multipliers, so it hasn't seen significant adoption. But its problem is that it's hard to program, not that its performance is poor.


That seems closer to the architecture of a GPU than a CPU.


They definitely have some things in common, but the GA144 doesn't have multipliers, doesn't do SIMT, doesn't do SIMD, doesn't have texture mapping units, and doesn't have any globally accessible RAM or routing, so I don't think it's very close.


Would, e.g., transpiling C to Forth address this adequately?


No, the problem is that each of the 144 processors in the GA144 has 64 18-bit words of RAM, which are not byte-addressable. Each word holds 3-4 5-bit instructions, so that's at most 256 machine instructions if you don't use any RAM for data. Even if you had C-friendly features like a stack pointer and stack-pointer-relative addressing modes, 64 words is just not very much memory for a C program.

You might be able to compile a C program to a floorplan of processors passing messages to one another, but it would be similar to writing a SystemC compiler which compiles C to, ultimately, a circuit netlist. The impedance mismatch is severe.


A related question is that of code density, but that depends on more factors than just which raw arithmetic or bitwise ops you have available. For instance, having a greater number of registers can be a benefit for some algorithms. Then there's the question of how many bits are used to encode each instruction.

There have been studies comparing code-density of common algorithms compiled to or hand-optimised for different instruction sets. For instance: <https://web.eece.maine.edu/~vweaver/papers/iccd09/ll_documen...>

In general, the densest have been 2-address CISC ISAs such as the Motorola 680x0 and x86 variants, followed by 2-address RISC ISAs for embedded processors such as SuperH, ARM Thumb, and RISC-V's C extension.


There's the issue of integer overflow: in Rust it's only detected in debug mode, not in release mode (by default).

MIPS had 'trap on overflow' integer arithmetic instructions, but sadly RISC-V doesn't have those.
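For concreteness, here is a small Rust sketch of what you get in place of a trapping add (the behaviour described is for default Cargo settings; release builds can turn checks back on with overflow-checks = true):

    // Plain `a + b` panics on overflow in debug builds and wraps in release
    // builds; the explicit methods make the chosen behaviour visible in the
    // source instead of depending on the build profile.
    fn main() {
        let a: i32 = i32::MAX;
        let b: i32 = 1;

        println!("{:?}", a.checked_add(b));     // None: overflow reported as a value
        println!("{}", a.wrapping_add(b));      // -2147483648: explicit wraparound
        println!("{:?}", a.overflowing_add(b)); // (-2147483648, true): result plus a flag
        println!("{}", a.saturating_add(b));    // 2147483647: clamps at the type's bounds
    }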


Why is this an archive.org link? The page is still up: https://aspire.eecs.berkeley.edu/2017/06/how-close-is-risc-v...



