An MMU is somewhere on the TODO list. I'm still focusing on user space, and I have not yet added support for interrupts or exceptions.
Exceptions in particular are something that I want to research more before I dive into implementation. So far I am of the opinion that memory faults are the only kind of exception that you actually expect to see during normal program execution (IEEE-754 exceptions don't count; they are archaic and not really necessary). All other exceptions are truly exceptional and usually result in abnormal program termination (i.e. there's really no need to provide low-latency precise exception handling for them).
With that in mind, I want to investigate how to best handle memory faults, rather than just implement exceptions (although memory faults are usually handled by exceptions).
There's also the issue that I want to support gather-scatter vector memory operations, which can be tricky w.r.t. memory faults (a single instruction can potentially produce one memory fault for every vector element).
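To illustrate (using the RISC-V vector extension's indexed load as an analogue here - MRISC32's gather syntax would look different), a single gather instruction dereferences one address per element, so every element is a potential fault site:
// a0 = base pointer, a1 = pointer to 32-bit byte offsets, a2 = element count
vsetvli t0, a2, e32, m1, ta, ma // configure for up to t0 32-bit elements
vle32.v v4, (a1) // load the byte-offset vector
vluxei32.v v8, (a0), v4 // gather: element i loads from a0 + v4[i], so each
// of the t0 loads can fault on its own page, independently of the others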
>(understood as pointed at a popular RISC architecture) lack of addressing modes
Politely disagree. He should provide hard evidence (hard numbers), as the authors of that ISA already did the research and came to a different conclusion, documented in the spec.
>You don't want to argue with Seymour Cray. Or with John Mashey, for that matter.
Sure, yet they definitely want to argue against the third one, as unlike the other two, it isn't a historically significant ISA that has since been replaced by something else.
Particularly, if they're implying there's something wrong with it and that they can do better, they should own up to that. Document what's wrong. Prove their way actually is better.
As it is right now, I have to conclude it is just another instance of "the addressing modes I like having aren't there, so I made an ISA with them". A common occurrence, unfortunately.
I'm not alone. Many others with far more experience and know-how in both ISA design and microarchitecture design share the same opinion.
It's not that the RISC-V solution is "wrong" or "bad", it's just that it was designed to cater to all needs, including the super-simple very low end (the kind that you can implement from scratch in a weekend). I think that RISC-V does that very well, but it comes at a cost. Being good at everything pretty much precludes being the best in every segment.
MRISC32 tries to be one step up from RISC-V in terms of performance targets (well, realistically MRISC64 would be that ISA), but one step down from AArch64 in terms of complexity.
I disagree: a lot of CPU ISA design decisions are a result of the state of the technology available to implement those CPUs.
As the technology changes, design decisions should be revisited: see https://queue.acm.org/detail.cfm?id=3639445 for someone saying the same thing.
POWER is an example of a modern "RISC" architecture with lots of addressing modes.
The RISC-V authors rely heavily on the argument and assumption that any high-performance implementation will do instruction fusion, and working around the lack of indexed addressing is precisely one of the things that is expected to be handled by fusion (see page 6 in https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130...).
Any implementation that does not do instruction fusion (most implementations, especially lower-end ones) is out of luck, and has to pay the penalty of executing extra instructions.
All modern high-performance ISAs include indexed addressing modes (most also have scaled indexed addressing). Some RISC-V implementations add these as non-standard extensions.
RISC-V was not designed primarily for high performance. It targets the extremely low end (education, microcontrollers) and does that really well, and it has this clever trick (compression + instruction fusion) that makes the situation OK-ish for high-end implementations. But it's not ideal.
Also worth noting is that instruction fusion is far from free. It may be negligible in a really big implementation that already does things like instruction translation (uOP caches etc.), but it sure feels like an unnecessary burden to bring to an ISA that hasn't even had the time to build up legacy baggage yet.
Furthermore, fused instructions often "leak" data to architectural registers (e.g. temporary registers used for address calculation), which both increases register pressure and requires that fused internal instructions must produce extra outputs (often unnecessarily).
void store(int* array, int index, int value) {
array[index] = value;
}
MRISC32:
store(int*, int, int):
stw r3, [r1, r2*4]
ret
RISC-V:
store(int*, int, int):
slli a1,a1,2 // writes to a1
add a0,a0,a1 // writes to a0
sw a2,0(a0)
ret
Note how the RISC-V version clobbers both the a1 and the a0 registers. Thus, a properly fused instruction would have to read from three registers and write to two registers (that's bad!), whereas the MRISC32 solution only has to read three registers - no registers need to be written to (same thing for ARM and others). (You could teach the compiler to emit "add a1,a0,a1" instead so that only one register is clobbered, but that adds complexity to the compiler, and the microarchitecture cannot rely on such compiler optimizations - it would most likely have to support the case with two clobbered registers anyway.)
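For reference, the single-clobber variant mentioned above would be (hand-written - this is not what GCC currently emits):
store(int*, int, int):
slli a1,a1,2 // writes to a1
add a1,a0,a1 // writes to a1 again - a0 stays intact
sw a2,0(a1)
ret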
Edit 2: Another main reason for not including indexed addressing modes in RISC-V is that indexed stores require three source registers (i.e. at least three register file read ports are needed) - which is a big no-no in the RISC-V philosophy. For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
Minor note - RISC-V's Zba (included in RVA22, which AFAIK is set to be the baseline requirement for Android) has sh2add for a combined x + y*4, getting that store down to two instructions (testable with -march=rv64gc_zba); it doesn't affect the overall point about needless clobbering of registers from the instruction fusion approach, though. (And shouldn't indexed loads be even worse, having three potential output registers in baseline RV64GC, and still two with Zba?)
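For reference, the Zba version of the store example looks like this (sh2add computes rd = rs2 + (rs1 << 2), so only a0 is written):
store(int*, int, int):
sh2add a0,a1,a0 // a0 = a0 + (a1 << 2) - a single clobbered register
sw a2,0(a0)
ret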
That's the same approach that gets you from 2 to 1 clobbered registers for stores, no? I.e. usually possible, but it needs compiler participation; if we assume the compiler will optimize load register usage properly, it shouldn't have any problem reducing the store clobbers from 2 to 1 either, which I understood to be the difficult part. (Of course 1 clobber is still worse than 0, but at least it means no fused instruction with 2 writes, at least here.)
> store(int*, int, int):
> slli a1,a1,2 // writes to a1
> add a0,a0,a1 // writes to a0
> sw a2,0(a0)
> ret
Three points:
1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
2) since late 2021 RISC-V has the `sh2add` instruction (and shifts of 1 and 3, and also versions that widen and shift a 32 bit unsigned value on a 64 bit CPU) that replaces the `slli` and `add`, thus naturally clobbering only one register. These instructions are already present in the popular RISC-V boards that use the JH7110 SoC (VisionFive 2, Star64, Mars, PineTab-V, Roma)
3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code. If you care about the performance it's going to be because it's in a loop, and the function will be inlined, and strength-reduced, and the actual code will look like:
sw a2,(a0)
addi a0,a0,4
I have in the past proposed adding simple reg+reg indexing to loads (only), as that fits within the 2r1w framework and loads are more numerous than stores. Similarly, stores could write the effective address back to the base register, which would also fit in the 2r1w framework.
This would allow something like ...
void arrAdd(int *dst, int *src1, int *src2, int n) {
for (int i=0; i<n; ++i)
dst[i] = src1[i] + src2[i];
}
... to be compiled as ...
arrAdd:
addi a0,a0,-4 // bias for incr by 4 in the store
sh2add a3,a3,a0 // loop bound: a0 reaches this right after the last store
sub a1,a1,a0 // bias for indexed addressing
sub a2,a2,a0
1:
lw a4,(a1,a0)
lw a5,(a2,a0)
add a4,a4,a5
swwb a4,4(a0) // store AND a0 = a0+4
blt a0,a3,1b
ret
This is the same number of instructions as now (no code size saving) but three fewer instructions in the loop.
> For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
It gives optimal price-performance. Having three read ports is rather expensive, and nowhere near justified by the number of integer instructions that would ever actually make use of the 3rd port, even if you added indexed stores, conditional selects, double-width shifts with dynamic shift count, and multiply-add. Three read ports are justified in FP code, as multiply-add is the fundamental and by far the most popular instruction in FP code, and besides, IEEE rules require a single rounding in a fused multiply-add.
> Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code
The issue with FMADD is that it requires 4 registers, which takes up way too much encoding space. It could have been specified in the usual 3-register format as Fd ← Fd + (Fs1 * Fs2), which is how the instruction is typically implemented in DSPs. If you really need a separate Fs3 register, that's just a two-insn sequence Fd ← Fs3; Fd ← Fd + (Fs1 * Fs2), which ought to be highly amenable to insn fusion. I do wonder why it wasn't done that way originally, and why the choice was made to use so much space with a new R4 format. (Mind you, this could be fixed by standardizing the new form, adding a new Z-subset of the F extension without the FMADD encoding blocks, and soft-deprecating the original F. But still.)
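To make that concrete, the proposed fusable pair would be something like the following (the 3-operand fmadd form is hypothetical - the ratified F extension only has the 4-operand R4 encoding):
fmv.d fa0,fa3 // Fd <- Fs3 (a standard instruction)
fmadd.d fa0,fa1,fa2 // hypothetical 3-operand form: fa0 <- fa0 + fa1*fa2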
Using up a lot of encoding space for four registers is indeed a problem, but it's a different problem. Even if you restrict the instruction to Fd ← Fd + (Fs1 * Fs2) and/or Fd ← Fs1 + (Fd * Fs2) (the RISC-V Vector extension provides both) you still have to read three registers.
> 1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
Yes, as I said.
Please note that the code was generated by GCC (trunk), which has fairly mature RISC-V support (it has had RISC-V support since at least 2017, and the fusion optimization was suggested even earlier than that). If fusion of load/store instructions is indeed an important part of the RISC-V design, why has this not been fixed in the compiler during the last 6+ years? I take it that there is no canonical way that register allocation should be done in these situations, which means that HW implementations must assume that such a sequence may clobber two registers.
> 2) since late 2021 RISC-V has the `sh2add` instruction
Which is great. As you say, not really a win in terms of code density, but definitely fewer degrees of freedom and only one clobbered register. Still not as good as dedicated indexed load/store instructions, though.
BTW, while I agree that an indexed load without indexed store would be better than nothing (I actually considered that for MRISC32), I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
> 3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code
True, GCC happily and often transforms indexing inside loops into address increments rather than index increments (even for architectures that have support for indexed addressing).
The point of the example, though, was to highlight that when the compiler needs to do indexed addressing, it must clobber unrelated registers, which is a problem for instruction fusion (i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects).
> It gives optimal price-performance.
Probably. But this circles back to the question of what segments you target with your design. I am fully convinced that RISC-V is the best ISA choice for many segments, especially the cost-sensitive ones. However, I'm less convinced that RISC-V (as it is) is the optimal choice for less cost-sensitive, high-performance segments (e.g. premium products like those from Apple, gaming rigs, compute servers, and so on). It can probably do a good job there (just like x86 can do a good job, despite all its ISA bloat and shortcomings), but if the integer ISA had been designed with support for three source operands from the start, it would be an even better fit for those segments (at least that is my belief).
> If fusion of load/store instructions is indeed an important part of the RISC-V design
I don't believe that it is. Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
I have heard that some of the companies making very high performance cores (Apple M or AMD Zen class) are incorporating some fusion. Perhaps they will provide compiler patches if required.
> I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
I think they would manage. There is certainly precedent, e.g. MSP430 has 4 addressing modes for src operands but only 2 for dst (register, and register indirect with 16-bit offset). Plus, of course, every ISA where it's an addressing mode rather than a different instruction allows "immediate" on a load but not on a store.
> i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects
That can never happen. Fusion will simply not be done in that case.
> but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments
And I don't think the difference would be enough to measure or care about: a single-digit percentage in the number of clock cycles needed. And meanwhile the simpler core may allow at least the same percentage to be gained as higher clock speed, or more cores on a die, or other compensating factors.
> Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
Most of the fusion targets require the pair to write the same destination register, so implementations won't be able to fuse current codegen. I suppose there is still some time before compilers need to be ready, but it's not that much.
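E.g. the shift+add pair from earlier in the thread only matches the proposed fusion patterns when both instructions target the same register, roughly:
// fusable: the pair has a single architectural output (a1)
slli a1,a1,2
add a1,a1,a0
// not fusable: two architectural outputs (a1 and a0)
slli a1,a1,2
add a0,a0,a1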
> Perhaps they will provide compiler patches if required.
I see that, like RISC-V, there are no condition flags; fine. Is there a good way to trap integer overflow and/or do multi-precision arithmetic? It takes extra instructions in RISC-V, which is a shortcoming.
Unlike RISC-V, MRISC32 has a madd instruction (integer multiply+add), which is helpful in multi-precision multiplication, but there is little help in terms of carry-propagating addition. Together with the very verbose function prologue/epilogue (explicit stores and loads from the stack, just like RISC-V, except without the "compression trick"), this is one of the main pain points of the ISA (although I don't have extensive experience with where this really matters - except obviously for 64-bit arithmetic on a 32-bit architecture, which will be less of a problem in MRISC64).
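For comparison, this is what a 64-bit add looks like on a 32-bit ISA without carry flags (written RISC-V style; the MRISC32 sequence would be similar):
// (a1:a0) = (a1:a0) + (a3:a2), no carry flag available
add a0,a0,a2 // add the low words
sltu t0,a0,a2 // t0 = 1 if the low add wrapped around (the carry)
add a1,a1,a3 // add the high words
add a1,a1,t0 // propagate the carry into the high word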
I have a couple of ideas that I intend to explore in the future. One idea that I like, but that would be a slight paradigm shift for the ISA, is the "instruction decorators" that Mitch Alsup has developed for his My 66000 ISA: You can prefix a series of instructions with a "CARRY" meta-instruction, that basically tells the CPU that "for the N following instructions, propagate a carry via register Rx" (or something similar), effectively turning 2-in-1-out instructions into 3-in-2-out instructions. It's more of an encoding trick, and works with more instructions than just "ADD".
The tricky parts (i.e. the ones that modify the architecture) would be that 1) instructions can have two results, and 2) the decoration instruction adds state to the decoder that must be carried over during exceptions and similar events. The latter may be possible to dodge, though, if certain restrictions are put in place (e.g. no exceptions, branches, speculation or anything similar can happen during a "decorated bunch").
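A rough sketch of the decorator idea (this is NOT actual My 66000 syntax, just an approximation of the semantics described above):
carry r9,4 // decorator: the next 4 instructions chain a carry through r9
add r1,r1,r5 // r1 = r1 + r5; carry out -> r9
add r2,r2,r6 // r2 = r2 + r6 + r9; carry out -> r9
add r3,r3,r7 // r3 = r3 + r7 + r9; carry out -> r9
add r4,r4,r8 // r4 = r4 + r8 + r9 (each add is effectively 3-in-2-out)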
Also, I wonder how difficult it would be to add an MMU to MC1?