Much closer than I would have expected. But what's this about RISC-V not having a jump with a register and offset address? Doesn't JALR have a 12-bit offset that's added to the base register rs1? How is that different from RISC-I's JMP COND, S2(Rx)? Does COND mean the jump is conditional, like JMPR EQ?
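For reference, my reading of JALR's behaviour from the unprivileged spec is roughly this (a Python sketch; `regs` is a hypothetical 32-entry register-file list, and I'm ignoring XLEN wrap-around):

```python
def jalr(pc, regs, rd, rs1, imm12):
    """Rough JALR semantics: jump to rs1 plus a sign-extended 12-bit offset,
    writing the return address into rd (unless rd is x0)."""
    imm = imm12 - (1 << 12) if imm12 & 0x800 else imm12  # sign-extend the 12-bit immediate
    target = (regs[rs1] + imm) & ~1                      # base register plus offset, bit 0 cleared
    if rd != 0:
        regs[rd] = pc + 4                                # link register gets the next instruction's address
    return target                                        # the new pc
```

So to me that looks exactly like a register-plus-offset jump.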
Interestingly, since this article was written, most of the "In RISC-V, but not RISC-I" instructions have been moved to extensions: Zicsr for the CSR instructions and Zifencei for the FENCE.I instruction (which turns out not to be useful for user processes running on a multicore processor under a multitasking OS like FreeBSD or Linux). That leaves only AUIPC, SBREAK (now called EBREAK), and FENCE. So it's even closer now than it was in 02017.
Has anyone tried to find (or derive) an ideal set of instructions for a processor architecture, under some assumptions about arithmetic etc.? For example, you could try to minimise the number of instructions it takes to encode some set of nontrivial programs. (Of course, how this would translate to silicon is another question.)
To minimize the number of instructions for some finite set of nontrivial programs, you define the instruction set to have one instruction that performs each of the programs. Then each of those programs is one instruction long. (Other programs may or may not be possible to encode at all.)
If the set of nontrivial programs is infinite, the situation is more complex, because an instruction set as normally defined is finite. Also, you can't equally weight every program from an infinite set, so you need some kind of weighting function among the programs. Which instruction set is ideal for encoding this infinite set of programs depends almost entirely on what that weighting function is. In fact, you can usually derive the optimal instruction set directly from the weighting function. Which is to say, when you design the weighting function, you are more or less designing the instruction set.
As retrac points out, it comes down to a question of tradeoffs. There's no objectively optimal answer to which tradeoffs you should prefer.
I took some notes some years ago on the subject of dense instruction sets, optimized for the number of bits to encode programs rather than the number of instructions; perhaps you will find them of interest: https://dercuano.github.io/notes/tiny-interpreters-for-micro...
> (Other programs may or may not be possible to encode at all.)
Well, you would want your instruction set to be Turing complete, but given that the stated goal was to optimize only for those programs for which you already added the super-specialized instructions, you can just trivially add whatever "boring" instruction set you like and uselessly check that box as well.
Bonus points if you use an instruction from a One Instruction Set Computer (https://en.wikipedia.org/wiki/One-instruction_set_computer) just for style, and to claim that you have further minimized the instruction set for any program outside the stated domain!
Thanks for the notes. Just to clarify (especially since other respondents interpreted it as a less interesting question), I was thinking about the number of operations in the binary, not the number of distinct operations in the instruction set. This is probably equivalent to your case of number of bits, as long as you don't have some crazily large instruction set.
Ideal in what regard? Easy to code for humans or to generate for compilers? Code size or decoding simplicity? Parallelism or cache friendliness? Energy efficiency or low CPI? Flexibility? Future-proofing or backward compatibility? In theory you only need one instruction (subleq); everything else is a highly narrow optimization attempt.
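To make the subleq aside concrete, the whole machine fits in a few lines; a toy Python interpreter (the memory layout and halt convention here are just common ones, details vary between dialects):

```python
def run_subleq(mem, pc=0):
    """Toy SUBLEQ interpreter: the entire ISA is one instruction.
    Each instruction is three cells a, b, c:
        mem[b] -= mem[a]; if the result is <= 0, jump to c, otherwise fall through.
    A negative jump target halts the machine (one common convention)."""
    while pc >= 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# e.g. run_subleq([3, 3, -1, 42]) zeroes cell 3 and then halts.
```

Copies, adds, and conditionals all have to be synthesized from chains of that one instruction, which is why anything richer is really an optimization for some particular workload.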
CPUs are general-purpose computers. "An ideal set of instructions" depends on what you're designing for. If you already know what you're computing, you design an ASIC.
Many have tried to derive an optimal instruction set, and RISC-V is such an attempt. But for now, trying is failing, and one can only choose the way one will fail.
The possible exception is the Mill architecture, but that architecture requires a grand scale.
One common bottleneck, orienting instruction set design, is the need to pull in constants (indeed, the instructions themselves are constants). One could theoretically share them between functions and think of smart ways to reduce the amount of code. But in practice this is futile: the modern CPU is a race car on a highway, if not a train on rails, so it can't be bothered to stop and deal with that kind of smartness. There's just too much inertia behind you... once you approach the speed of light. The road itself should have all the information needed to let the CPU drive the correct track. And no special cases.
In addition, almost all CPUs are designed to execute a C-style machine abstraction, because that's the paradigm. Though I root for the Mill, my guess is we'll only see a different instruction set once the paradigm shifts towards another execution model, likely one that is less stack-bound.
For all the performance a modern CPU has, and all the visual magic it can do, for most of us our PCs are still just as fast as 20 years ago. That is something that needs mending first. 100x faster, yet where did all those cycles go? Seems pathological to me.
Tech has been operating like politics: ever more promises. What those promises actually deliver is more dependency, more control and surveillance, and an erosion of self-reliance. They become stronger, not us. The paradigm forces you to think that the heroes of the paradigm, like big tech, are the solution, while they equally increase the problem. When a "Jedi instruction set" comes, it will likely be OoO, like a Christmas child.
> for most of us our PCs are still just as fast as 20 years ago. That is something that needs mending first. 100x faster, yet where did all those cycles go? Seems pathological to me.
Nobody would buy a PC that’s 100 times as fast but doesn’t have new abilities. 20 years ago, Blu-ray didn’t exist yet, and it featured unprecedented full HD video. Today, 4K more or less is the norm. The original iPhone (about 15 years old) had a 480 × 320 pixel screen. The latest ones have 1000+ × 2500+, at ¿quadruple? the screen refresh rate and ¿double? the number of bits per pixel.
Progress is even more enormous with cameras. Resolution has gone up relatively moderately (for the iPhone, from 2 megapixels to 48 megapixels), but the software behind it couldn't realistically run on 20-year-old hardware.
I wish someone more competent than myself had taken you up on this. Now that we have big die sizes, is it possible to use some of that real estate to create a fast core with a different instruction set (with much less to worry about) that we can slip into with maybe just one app at a time and let 'er just zoom?
Even if we had to reserve some RAM exclusively for that core when it's not in use, I'd still want it for the times I have a very demanding task I'm waiting on.
This doesn't address all the bloat, of course, but I'd like it.
OoO? Overlooked? Overwritten?
In terms of length and density at least, the optimal solution is something like compression with Huffman coding, as it is fundamentally a compression question: we want to represent the most frequent operations and operands with the fewest bits, on average. As with any question of compression, the optimal format will depend on the nature of the compressed data. So at least in that domain, there is no truly ideal solution, only approximately ideal solutions.

(The approximately ideal solution is quite good, though -- in practice, Huffman-encoded instructions for stack machines with a rich operation set, based on an average instruction mix, come to around 4 bits per instruction on average, IIRC. I've seen the approach used with a VM interpreter on a memory-constrained embedded system to pack in more code.)

I suspect the same is true of other criteria you might select for what makes an ideal instruction set; it's always going to involve compromising trade-offs, some of which depend on runtime properties like the actual instruction mix.
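To make the Huffman point concrete, here's a back-of-the-envelope Python sketch; the opcode mix below is made up purely for illustration, not measured from real code:

```python
import heapq

def huffman_lengths(freqs):
    """Given an opcode -> frequency map, return opcode -> Huffman code length in bits."""
    heap = [(f, i, [op]) for i, (op, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, ops1 = heapq.heappop(heap)
        f2, _, ops2 = heapq.heappop(heap)
        for op in ops1 + ops2:       # every opcode under the merged node sinks one level deeper
            lengths[op] += 1
        tiebreak += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, ops1 + ops2))
    return lengths

# Hypothetical dynamic instruction mix (relative frequencies, invented for the example):
mix = {'load': 26, 'store': 12, 'add': 20, 'branch': 18,
       'call': 8, 'cmp': 7, 'shift': 5, 'other': 4}
lens = huffman_lengths(mix)
avg = sum(mix[op] * lens[op] for op in mix) / sum(mix.values())
print(lens, round(avg, 2))          # average opcode length in bits, before any operand bits
```

With this particular mix the common opcodes come out at 2-3 bits and the rare ones at 4, for about 2.8 bits per opcode on average; the skew in the mix is what buys you the density.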
Would it be possible to have fast stack-machine CPUs nowadays (let's say, competing with x64 CPUs), or is there some inherent benefit to register machines?
I think that one of the main issues with stack machines is that the order of instructions in the original instruction stream is constrained by the instruction format - you can't as easily software-schedule instructions (it's harder to express "I have all the information required to load that value from memory around now, so I'll start it now because it might take a while").
Now, any high-performance stack machine is likely to decompose stack code into micro-ops with register renaming, so some of that stuff is going to happen anyway (as it would in a register machine), and some stack ops (like dup) are going to disappear into the renamer (but then a compiler would do that as CSE on a register machine).
I'm pretty sure a GA144 can do more additions or ANDs per second than any comparably-sized register-machine microcontroller: IIRC it's about a million transistors (comparable to an 80486, seven AVRs, or nine ARM9TDMIs) and can do about a hundred billion 18-bit additions per second on a few milliwatts. It definitely doesn't decompose stack code to register renaming and micro-ops.
Unfortunately you can't program it either in C or in Verilog, and it doesn't have multipliers, so it hasn't seen significant adoption. But its problem is that it's hard to program, not that its performance is poor.
They definitely have some things in common, but the GA144 doesn't have multipliers, doesn't do SIMT, doesn't do SIMD, doesn't have texture mapping units, and doesn't have any globally accessible RAM or routing, so I don't think it's very close.
No, the problem is that each of the 144 processors in the GA144 has 64 18-bit words of RAM, which are not byte-addressable. Each word holds 3-4 5-bit instructions, so that's 256 machine instructions if you don't use any RAM for data. Even if you had C-friendly features like a stack pointer and stack-pointer-relative addressing modes, 64 words is just not very much memory for a C program.
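Spelled out (assuming you pack the maximum of four slots into every word, which real code won't quite manage):

```python
words_per_node = 64      # words of RAM in each of the 144 nodes
slots_per_word = 4       # 3-4 packed instruction slots per 18-bit word
max_instructions = words_per_node * slots_per_word
print(max_instructions)  # 256 instructions, with zero words left over for data
```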
You might be able to compile a C program to a floorplan of processors passing messages to one another, but it would be similar to writing a SystemC compiler which compiles C to, ultimately, a circuit netlist. The impedance mismatch is severe.
A related question is that of code density, but that depends on more factors than just which raw arithmetic or bitwise ops you have available. For instance, having a greater number of registers can be a benefit for some algorithms. Then there's how many bits are used to encode each instruction.
In general, the densest have been 2-address CISC ISAs such as the Motorola 680x0 and x86 variants, followed by 2-address RISC encodings for embedded processors such as Super-H, ARM Thumb, and RISC-V's C extension.