An MMU is somewhere on the TODO list. I'm still focusing on user space, and I have not yet added support for interrupts or exceptions.
Exceptions in particular are something that I want to research more before I dive into implementation. So far I am of the opinion that memory faults are the only kind of exception that you actually expect to see during normal program execution (IEEE-754 exceptions don't count; they are archaic and not really necessary). All other exceptions are truly exceptional and usually result in abnormal program termination (i.e. there's really no need to provide low-latency precise exception handling for them).
With that in mind, I want to investigate how to best handle memory faults, rather than just implement exceptions (although memory faults are usually handled by exceptions).
There's also the issue that I want to support gather-scatter vector memory operations, which can be tricky w.r.t. memory faults (a single instruction can potentially produce one memory fault for every vector element).
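To illustrate (using the RISC-V vector extension's indexed load as an analogue here - MRISC32's gather syntax would look different), a single gather instruction dereferences one address per element, so every element is a potential fault site:
// a0 = base pointer, a1 = pointer to 32-bit byte offsets, a2 = element count
vsetvli t0, a2, e32, m1, ta, ma // configure for up to t0 32-bit elements
vle32.v v4, (a1) // load the byte-offset vector
vluxei32.v v8, (a0), v4 // gather: element i loads from a0 + v4[i], so each
// of the t0 loads can fault on its own page, independently of the others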
>(understood as pointed at a popular RISC architecture) lack of addressing modes
Politely disagree. He should provide hard evidence (hard numbers), as the authors of that ISA already did the research and came to a different conclusion, documented in the spec.
>You don't want to argue with Seymour Cray. Or with John Mashey, for that matter.
Sure, yet they definitely want to argue against the third one, as unlike the other two, it isn't a historically significant ISA that has since been replaced by something else.
Particularly, if they're implying there's something wrong with it and that they can do better, they should own up to that. Document what's wrong. Prove their way actually is better.
As it is right now, I have to conclude it is just another instance of "the addressing modes I like having aren't there, so I made an ISA with them". A common occurrence, unfortunately.
I'm not alone. Many others with far more experience and know-how in both ISA design and microarchitecture design share the same opinion.
It's not that the RISC-V solution is "wrong" or "bad", it's just that it was designed to cater to all needs, including the super-simple very low end (the kind that you can implement from scratch in a weekend). I think that RISC-V does that very well, but it comes at a cost. Being good at everything pretty much precludes being the best in every segment.
MRISC32 tries to be one step up from RISC-V in terms of performance targets (well, realistically MRISC64 would be that ISA), but one step down from AArch64 in terms of complexity.
I disagree: a lot of CPU ISA design decisions are a result of the state of the technology available to implement those CPUs.
As the technology changes, design decisions should be revisited: see https://queue.acm.org/detail.cfm?id=3639445 for someone saying the same thing.
POWER is an example of a modern "RISC" architecture with lots of addressing modes.
The RISC-V authors rely heavily on the argument and assumption that any high-performance implementation will do instruction fusion, and working around the lack of indexed addressing is precisely one of the things that is expected to be handled by fusion (see page 6 in https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130...).
Any implementation that does not do instruction fusion (most implementations, especially lower-end ones) is out of luck, and has to pay the penalty of executing extra instructions.
All modern high-performance ISAs include indexed addressing modes (most also have scaled indexed addressing). Some RISC-V implementations add these as non-standard extensions.
RISC-V was not designed primarily for high performance. It targets the extremely low end (education, microcontrollers) and does that really well, and it has this clever trick (compression + instruction fusion) that makes the situation OK-ish for high-end implementations. But it's not ideal.
Also worth noting is that instruction fusion is far from free. It may be negligible in a really big implementation that already does things like instruction translation (uOP caches etc.), but it sure feels like an unnecessary burden to bring to an ISA that hasn't even had the time to build up legacy baggage yet.
Furthermore, fused instructions often "leak" data to architectural registers (e.g. temporary registers used for address calculation), which both increases register pressure and requires that fused internal instructions must produce extra outputs (often unnecessarily).
void store(int* array, int index, int value) {
array[index] = value;
}
MRISC32:
store(int*, int, int):
stw r3, [r1, r2*4]
ret
RISC-V:
store(int*, int, int):
slli a1,a1,2 // writes to a1
add a0,a0,a1 // writes to a0
sw a2,0(a0)
ret
Note how the RISC-V version clobbers both the a1 and the a0 registers. Thus, a properly fused instruction would have to read from three registers and write to two registers (that's bad!), whereas the MRISC32 solution only has to read three registers - no registers need to be written to (same thing for ARM and others). (You could teach the compiler to emit "add a1,a0,a1" instead so that only one register is clobbered, but that adds complexity to the compiler, and the microarchitecture cannot rely on such compiler optimizations - it would most likely have to support the case with two clobbered registers anyway.)
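For reference, the single-clobber variant mentioned above would be (hand-written - this is not what GCC currently emits):
store(int*, int, int):
slli a1,a1,2 // writes to a1
add a1,a0,a1 // writes to a1 again - a0 stays intact
sw a2,0(a1)
ret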
Edit 2: Another main reason for not including indexed addressing modes in RISC-V is that indexed stores require three source registers (i.e. at least three register file read ports are needed) - which is a big no-no in the RISC-V philosophy. For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
Minor note - RISC-V's Zba (included in RVA22, which AFAIK is set to be the baseline requirement for Android) has sh2add for a combined x + y*4, getting that store down to two instructions (testable with -march=rv64gc_zba); it doesn't affect the overall point about needless clobbering of registers from the instruction fusion approach, though. (And shouldn't indexed loads be even worse, having three potential output registers in baseline RV64GC, and still two with Zba?)
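For reference, the Zba version of the store example looks like this (sh2add computes rd = rs2 + (rs1 << 2), so only a0 is written):
store(int*, int, int):
sh2add a0,a1,a0 // a0 = a0 + (a1 << 2) - a single clobbered register
sw a2,0(a0)
ret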
That's the same approach that gets you from 2 to 1 clobbered registers for stores, no? I.e. usually possible, but it needs compiler participation; if we assume the compiler will optimize load register usage properly, it shouldn't have any problem reducing the store clobbers from 2 to 1 either, which I understood to be the difficult part. (Of course 1 clobber is still worse than 0, but at least it means no fused instruction with 2 writes, at least here.)
> store(int*, int, int):
> slli a1,a1,2 // writes to a1
> add a0,a0,a1 // writes to a0
> sw a2,0(a0)
> ret
Three points:
1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
2) since late 2021 RISC-V has the `sh2add` instruction (and shifts of 1 and 3, and also versions that widen and shift a 32 bit unsigned value on a 64 bit CPU) that replaces the `slli` and `add`, thus naturally clobbering only one register. These instructions are already present in the popular RISC-V boards that use the JH7110 SoC (VisionFive 2, Star64, Mars, PineTab-V, Roma)
3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code. If you care about the performance it's going to be because it's in a loop, and the function will be inlined, and strength-reduced, and the actual code will look like:
sw a2,(a0)
addi a0,a0,4
I have in the past proposed adding simple reg+reg indexing to loads (only), as that fits within the 2r1w framework and loads are more numerous than stores. Similarly, stores could write the effective address back to the base register, which would also fit in the 2r1w framework.
This would allow something like ...
void arrAdd(int *dst, int *src1, int *src2, int n) {
for (int i=0; i<n; ++i)
dst[i] = src1[i] + src2[i];
}
... to be compiled as ...
arrAdd:
addi a0,a0,-4 // bias for incr by 4 in the store
sh2add a3,a3,a0 // loop bound: a0 reaches this right after the last store
sub a1,a1,a0 // bias for indexed addressing
sub a2,a2,a0
1:
lw a4,(a1,a0)
lw a5,(a2,a0)
add a4,a4,a5
swwb a4,4(a0) // store AND a0 = a0+4
blt a0,a3,1b
ret
This is the same number of instructions as now (no code size saving) but three fewer instructions in the loop.
> For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
It gives optimal price-performance. Having three read ports is rather expensive, and nowhere near justified by the number of integer instructions that would ever actually make use of the 3rd port, even if you added indexed stores, conditional selects, double-width shifts with dynamic shift count, and multiply-add. Three read ports are justified in FP code, as multiply-add is the fundamental and by far the most popular instruction in FP code, and besides, IEEE rules require a single rounding in a fused multiply-add.
> Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code
The issue with FMADD is that it requires 4 registers, which takes up way too much encoding space. It could have been specified in the usual 3-register format as Fd ← Fd + (Fs1 * Fs2), which is how the instruction is typically implemented in DSPs. If you really need a separate Fs3 register, that's just a two-insn sequence Fd ← Fs3; Fd ← Fd + (Fs1 * Fs2), which ought to be highly amenable to insn fusion. I do wonder why it wasn't done that way originally, and why the choice was made to use so much space with a new R4 format. (Mind you, this could be fixed by standardizing the new form, adding a new Z-subset of the F extension without the FMADD encoding blocks, and soft-deprecating the original F. But still.)
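To make that concrete, the proposed fusable pair would be something like the following (the 3-operand fmadd form is hypothetical - the ratified F extension only has the 4-operand R4 encoding):
fmv.d fa0,fa3 // Fd <- Fs3 (a standard instruction)
fmadd.d fa0,fa1,fa2 // hypothetical 3-operand form: fa0 <- fa0 + fa1*fa2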
Using up a lot of encoding space for four registers is indeed a problem, but it's a different problem. Even if you restrict the instruction to Fd ← Fd + (Fs1 * Fs2) and/or Fd ← Fs1 + (Fd * Fs2) (the RISC-V Vector extension provides both) you still have to read three registers.
> 1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
Yes, as I said.
Please note that the code was generated by GCC (trunk), which has fairly mature RISC-V support (it has had RISC-V support since at least 2017, and the fusion optimization was suggested even earlier than that). If fusion of load/store instructions is indeed an important part of the RISC-V design, why has this not been fixed in the compiler during the last 6+ years? I take it that there is no canonical way that register allocation should be done in these situations, which means that HW implementations must assume that such a sequence may clobber two registers.
> 2) since late 2021 RISC-V has the `sh2add` instruction
Which is great. As you say, not really a win in terms of code density, but definitely fewer degrees of freedom and only one clobbered register. Still not as good as dedicated indexed load/store instructions, though.
BTW, while I agree that an indexed load without indexed store would be better than nothing (I actually considered that for MRISC32), I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
> 3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code
True, GCC happily and often transforms indexing inside loops into address increments rather than index increments (even for architectures that have support for indexed addressing).
The point of the example, though, was to highlight that when the compiler needs to do indexed addressing, it must clobber unrelated registers, which is a problem for instruction fusion (i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects).
> It gives optimal price-performance.
Probably. But this circles back to the question of what segments you target with your design. I am fully convinced that RISC-V is the best ISA choice for many segments, especially the cost-sensitive ones. However, I'm less convinced that RISC-V (as it is) is the optimal choice for less cost-sensitive, high-performance segments (e.g. premium products like those from Apple, gaming rigs, compute servers, and so on). It can probably do a good job there (just like x86 can do a good job, despite all its ISA bloat and shortcomings), but if the integer ISA had been designed with support for three source operands from the start, it would be an even better fit for those segments (at least that is my belief).
> If fusion of load/store instructions is indeed an important part of the RISC-V design
I don't believe that it is. Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
I have heard that some of the companies making very high performance cores (Apple M or AMD Zen class) are incorporating some fusion. Perhaps they will provide compiler patches if required.
> I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
I think they would manage. There is certainly precedent, e.g. MSP430 has 4 addressing modes for src operands but only 2 for dst (register, and register indirect with 16-bit offset). Plus, of course, every ISA where it's an addressing mode rather than a different instruction allows "immediate" on a load but not on a store.
> i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects
That can never happen. Fusion will simply not be done in that case.
> but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments
And I don't think the difference would be enough to measure or care about: a single-digit percentage in the number of clock cycles needed. And meanwhile the simpler core may allow at least the same percentage to be gained as higher clock speed, or more cores on a die, or other compensating factors.
> Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
Most of the fusion targets require the pair to write the same destination register, so implementations won't be able to fuse current codegen. I suppose there is still some time before compilers need to be ready, but it's not that much.
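E.g. the shift+add pair from earlier in the thread only matches the proposed fusion patterns when both instructions target the same register, roughly:
// fusable: the pair has a single architectural output (a1)
slli a1,a1,2
add a1,a1,a0
// not fusable: two architectural outputs (a1 and a0)
slli a1,a1,2
add a0,a0,a1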
> Perhaps they will provide compiler patches if required.
I see that, like RISC-V, there are no condition flags; fine. Is there a good way to trap integer overflow and/or do multi-precision arithmetic? It takes extra instructions in RISC-V, which is a shortcoming.
Unlike RISC-V, MRISC32 has a madd instruction (integer multiply+add), which is helpful in multi-precision multiplication, but there is little help in terms of carry-propagating addition. Together with the very verbose function prologue/epilogue (explicit stores and loads from the stack, just like RISC-V, except without the "compression trick"), this is one of the main pain points of the ISA (although I don't have extensive experience with where this really matters - except obviously for 64-bit arithmetic on a 32-bit architecture, which will be less of a problem in MRISC64).
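For comparison, this is what a 64-bit add looks like on a 32-bit ISA without carry flags (written RISC-V style; the MRISC32 sequence would be similar):
// (a1:a0) = (a1:a0) + (a3:a2), no carry flag available
add a0,a0,a2 // add the low words
sltu t0,a0,a2 // t0 = 1 if the low add wrapped around (the carry)
add a1,a1,a3 // add the high words
add a1,a1,t0 // propagate the carry into the high word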
I have a couple of ideas that I intend to explore in the future. One idea that I like, but that would be a slight paradigm shift for the ISA, is the "instruction decorators" that Mitch Alsup has developed for his My 66000 ISA: You can prefix a series of instructions with a "CARRY" meta-instruction, that basically tells the CPU that "for the N following instructions, propagate a carry via register Rx" (or something similar), effectively turning 2-in-1-out instructions into 3-in-2-out instructions. It's more of an encoding trick, and works with more instructions than just "ADD".
The tricky parts (i.e. the ones that modify the architecture) would be that 1) instructions can have two results, and 2) the decoration instruction adds state to the decoder that must be carried over during exceptions and similar events. The latter may be possible to dodge, though, if certain restrictions are put in place (e.g. no exceptions, branches, speculation or anything similar can happen during a "decorated bunch").
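A rough sketch of the decorator idea (this is NOT actual My 66000 syntax, just an approximation of the semantics described above):
carry r9,4 // decorator: the next 4 instructions chain a carry through r9
add r1,r1,r5 // r1 = r1 + r5; carry out -> r9
add r2,r2,r6 // r2 = r2 + r6 + r9; carry out -> r9
add r3,r3,r7 // r3 = r3 + r7 + r9; carry out -> r9
add r4,r4,r8 // r4 = r4 + r8 + r9 (each add is effectively 3-in-2-out)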
Also, I wonder how difficult it would be to add an MMU to MC1?