The RISC-V authors rely heavily on the argument and assumption that any high per...

dzaima · on Jan 18, 2024

Minor note - RISC-V's Zba (included in RVA22, which afaik is set to be the baseline requirement for Android) has sh2add for a combined x+y*4, getting that store down to two instructions (testable with -march=rv64gczba); doesn't affect the overall point of needless clobbering of registers from the instruction fusion approach though (and shouldn't indexed loads be even worse, having three potential output registers in baseline RV64GC, and still two with Zba?)
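
For reference, a minimal C sketch of what is being discussed. `sh2add rd, rs1, rs2` computes `rs2 + (rs1 << 2)`; the `store` function matches the signature shown in the thread, but its parameter names and the `store_via_sh2add` helper are illustrative assumptions, and the instructions a compiler actually emits depend on its version and flags.

```c
#include <stdint.h>

/* Semantics of Zba "sh2add rd, rs1, rs2": rd = rs2 + (rs1 << 2). */
uint64_t sh2add(uint64_t rs1, uint64_t rs2) {
    return rs2 + (rs1 << 2);
}

/* The indexed store under discussion (parameter names are illustrative). */
void store(int *base, int index, int value) {
    base[index] = value;
}

/* Same store, with the address formed the way sh2add would form it.
   Assumes 4-byte int and a flat address space (hypothetical helper). */
void store_via_sh2add(int *base, int index, int value) {
    /* Sign-extend the index to 64 bits, as the RV64 calling convention does. */
    uint64_t addr = sh2add((uint64_t)(int64_t)index, (uint64_t)(uintptr_t)base);
    *(int *)(uintptr_t)addr = value;
}
```

With Zba the address needs one instruction and one destination register; without it, the `slli` + `add` pair can write two.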

mbitsnbites · on Jan 18, 2024

For loads you can usually hide the clobbering by using the intended target register (the one you want to load to) as the temporary register, so e.g:

      // a0 = load [a1 + a2*4]
      slli    a0,a2,2
      add     a0,a0,a1
      lw      a0,0(a0)

...but not always.

dzaima · on Jan 18, 2024

That's the same approach that gets you from 2 to 1 clobbered registers for stores, no? i.e. usually possible, but needs compiler participation; if assuming the compiler will optimize load register usage properly, it shouldn't have any problem reducing the store clobbers from 2 to 1 either, which I understood as the difficult part. (of course 1 clobber is still worse than 0, but at least it means no 2 writes/fused instr, at least here)

mbitsnbites · on Jan 18, 2024

True, but for loads there is at least a theoretical way to avoid clobbers. For stores there is no way.

Stores: 1-2 superfluous results

Loads: 0-2 superfluous results
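
One way to make that tally concrete is a small counting sketch. The function name and the encoding are assumptions made for illustration: each instruction is reduced to its destination register number (a0 = x10, a1 = x11), with -1 meaning "no destination", and a write counts as superfluous when it targets a register other than the one the fused operation is meant to produce.

```c
/* Count distinct registers written by a sequence, excluding the wanted result.
   dest_regs[i] is the destination of instruction i, or -1 if it has none.
   wanted_reg is the architecturally intended result, or -1 for a store. */
int superfluous_results(const int *dest_regs, int count, int wanted_reg) {
    int seen[32] = {0};
    int extra = 0;
    for (int i = 0; i < count; ++i) {
        int r = dest_regs[i];
        if (r <= 0 || r >= 32)
            continue;               /* x0 or no destination: nothing visible */
        if (r == wanted_reg || seen[r])
            continue;               /* intended result, or already counted */
        seen[r] = 1;
        ++extra;
    }
    return extra;
}
```

For example, `slli a1; add a0; sw` gives `{11, 10, -1}` with no wanted register, which counts 2; the load sequence that reuses a0 throughout gives `{10, 10, 10}` with wanted register 10, which counts 0.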

brucehoult · on Jan 19, 2024

    > store(int*, int, int):
    >     slli    a1,a1,2   // writes to a1
    >     add     a0,a0,a1  // writes to a0
    >     sw      a2,0(a0)
    >     ret

Three points:

1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`

2) since late 2021 RISC-V has the `sh2add` instruction (and shifts of 1 and 3, and also versions that widen and shift a 32 bit unsigned value on a 64 bit CPU) that replaces the `slli` and `add`, thus naturally clobbering only one register. These instructions are already present in the popular RISC-V boards that use the JH7110 SoC (VisionFive 2, Star64, Mars, PineTab-V, Roma)

3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code. If you care about the performance it's going to be because it's in a loop, and the function will be inlined, and strength-reduced, and the actual code will look like:

    sw   a2,(a0)
    addi a0,a0,4
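
A rough C-level picture of that transformation, as a sketch only: the function names are made up for illustration, and what a given compiler actually emits depends on its version and flags.

```c
/* Indexed form: each iteration conceptually computes dst + i*4. */
void fill_indexed(int *dst, int n, int value) {
    for (int i = 0; i < n; ++i)
        dst[i] = value;
}

/* Strength-reduced form: the index is gone and the pointer itself
   advances by one element per iteration. */
void fill_strength_reduced(int *dst, int n, int value) {
    if (n <= 0)
        return;
    for (int *end = dst + n; dst < end; ++dst)
        *dst = value;               /* roughly: sw a2,(a0) ; addi a0,a0,4 */
}
```

Both functions write the same `n` elements; only the address computation differs.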

I have in the past proposed adding simple reg+reg indexing to loads (only), as that fits within the 2r1w framework and loads are more numerous than stores. Similarly, stores could write the effective address back to the base register, which would also fit in the 2r1w framework.

This would allow something like ...

    void arrAdd(int *dst, int *src1, int *src2, int n) {
      for (int i=0; i<n; ++i)
        dst[i] = src1[i] + src2[i];
    }

... to be compiled as ...

    arrAdd:
        addi   a0,a0,-4 // bias for incr by 4 in the store
        sh2add a3,a3,a0 // address of last dst element
        sub    a1,a1,a0 // bias for indexed addressing
        sub    a2,a2,a0
    1:
        lw     a4,(a1,a0)
        lw     a5,(a2,a0)
        add    a4,a4,a5
        swwb   a4,4(a0) // store AND a0 = a0+4
        blt    a0,a3,1b
        ret

This is the same number of instructions as now (no code size saving) but three fewer instructions in the loop.
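
The biasing trick can be checked with a small C model. This is a sketch under stated assumptions: `lw rd,(rs1,rs2)` and `swwb` are the hypothetical instructions proposed above, not existing RISC-V; the helper names are invented; registers are modelled as `uintptr_t` on a flat address space with 4-byte `int32_t` elements. The bound `a3` is taken here as the address of the last `dst` element so that exactly `n` stores happen, and an `n <= 0` guard is added that the assembly sketch does not have.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical "lw rd,(rs1,rs2)": load a word from address rs1 + rs2. */
int32_t lw_indexed(uintptr_t rs1, uintptr_t rs2) {
    int32_t v;
    memcpy(&v, (const void *)(rs1 + rs2), sizeof v);
    return v;
}

/* Hypothetical "swwb rs2,imm(rs1)": store at rs1 + imm, then rs1 = rs1 + imm. */
void swwb(uintptr_t *rs1, intptr_t imm, int32_t value) {
    *rs1 += (uintptr_t)imm;
    memcpy((void *)*rs1, &value, sizeof value);
}

void arrAdd_model(int32_t *dst, const int32_t *src1, const int32_t *src2, int n) {
    if (n <= 0)
        return;                              /* guard added for the model */
    uintptr_t a0 = (uintptr_t)dst - 4;       /* bias for the +4 in the store */
    uintptr_t a3 = a0 + 4 * (uintptr_t)n;    /* address of last dst element */
    uintptr_t a1 = (uintptr_t)src1 - a0;     /* so that a1 + a0 == src1 */
    uintptr_t a2 = (uintptr_t)src2 - a0;     /* so that a2 + a0 == src2 */
    do {
        int32_t a4 = lw_indexed(a1, a0);
        int32_t a5 = lw_indexed(a2, a0);
        a4 = (int32_t)((uint32_t)a4 + (uint32_t)a5);  /* wrapping add */
        swwb(&a0, 4, a4);                    /* store AND a0 = a0 + 4 */
    } while (a0 < a3);                       /* unsigned address compare */
}
```

Because `a0` is the only register that changes inside the loop, the single write-back in the store advances all three array positions at once.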

> For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.

It gives optimal price-performance. Having three read ports is rather expensive, and nowhere near justified by the number of integer instructions that would ever actually make use of the 3rd port, even if you added indexed stores, conditional selects, double-width shifts with dynamic shift count, multiply-add. Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code and besides IEEE rules require a single rounding in a multiply-add.

zozbot234 · on Jan 20, 2024

> Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code

The issue with FMADD is that it requires 4 registers which takes up way too much encoding space. It could've been specified in the usual 3 register format as Fd ← Fd + (Fs1 * Fs2) , which is how the instruction is typically implemented in DSPs. If you really need a separate Fs3 register, that's just a two-insn sequence Fd ← Fs3; Fd ← Fd + (Fs1 * Fs2) which ought to be highly amenable to insn fusion. I do wonder why it wasn't done that way originally, and why the choice was made to use so much space with a new R4 format. (Mind you, this could be fixed by standardizing the new form, adding a new Z-subset of the F extension without the FMADD encoding blocks, and soft-deprecating the original F. But still.)

brucehoult · on Jan 20, 2024

Using up a lot of encoding space for four registers is indeed a problem, but it's a different problem. Even if you restrict the instruction to Fd ← Fd + (Fs1 * Fs2) and/or Fd ← Fs1 + (Fd * Fs2) (the RISC-V Vector extension provides both) you still have to read three registers.

mbitsnbites · on Jan 20, 2024

> 1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`

Yes, as I said.

Please note that the code was generated by GCC (trunk) that has fairly mature RISC-V support (it has had RISC-V support since 2017 at least, and the fusion optimization was suggested even earlier than that). If fusion of load/store instructions is indeed an important part of the RISC-V design, why has this not been fixed in the compiler during the last 6+ years? I take it as there is no hard canonical way that register allocation should be done in these situations, which means that HW implementations must assume that such a sequence may clobber two registers.

> 2) since late 2021 RISC-V has the `sh2add` instruction

Which is great. As you say, not really a win in terms of code density, but definitely fewer degrees of freedom and only one clobbered register. Still not as good as dedicated indexed load/store instructions, though.

BTW, while I agree that an indexed load without indexed store would be better than nothing (I actually considered that for MRISC32), I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).

> 3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code

True that GCC happily and often transforms indexing inside loops to address increments rather than index increments (even for architectures that have support for indexed addressing).

The point of the example, though, was to highlight that when the compiler needs to do indexed addressing, it must clobber unrelated registers, which is a problem for instruction fusion (i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects).

> It gives optimal price-performance.

Probably. But this circles back to the question about what segments you target with your design. I am fully convinced that RISC-V is the best ISA choice for many segments, especially the cost sensitive segments. However I'm less convinced that RISC-V (as it is) is the optimal choice for less cost sensitive high performance segments (e.g. premium products like those from Apple, gaming rigs, compute servers, and so on). It can probably do a good job there (just like x86 can do a good job, despite all its ISA bloat and shortcomings), but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments (at least that is my belief).

brucehoult · on Jan 20, 2024

> Yes, as I said.

Not at the point I was composing my reply.

> If fusion of load/store instructions is indeed an important part of the RISC-V design

I don't believe that it is. Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).

I have heard that some of the companies making very high performance cores (Apple M or AMD Zen class) are incorporating some fusion. Perhaps they will provide compiler patches if required.

> I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).

I think they would manage. There is certainly precedent, e.g. MSP430 has 4 addressing modes for src operands but only 2 for dst (register, register indirect with 16 bit offset). Plus of course every ISA where it's an addressing mode not a different instruction allows "immediate" on a load but not on a store.

> i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects

That can never happen. Fusion will simply not be done in that case.

> but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments

And I don't think the difference would be enough to measure or care about — single digit percentage on number of clock cycles needed. And meanwhile the simpler core may allow at least the same percentage to be gained as higher clock speed, or more cores on a die, or other compensating factors.

camel-cdr · on Jan 20, 2024

> Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).

You can play around with OpenXiangShan though, they have a few fusion targets: https://github.com/OpenXiangShan/XiangShan/blob/master/src/m...

Most of the targets require the same destination, so it won't be able to fuse current codegen. I suppose there is still some time before compilers need to be ready, but it's not that much.
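
To illustrate what a same-destination requirement means in practice, here is a hedged sketch of a fusion legality check for an `slli` + `add` pair. It is not XiangShan's actual logic; the `Insn` struct, the opcode enum and `can_fuse_shift_add` are all hypothetical names invented for this example.

```c
#include <stdbool.h>

enum Opcode { OP_SLLI, OP_ADD, OP_OTHER };

struct Insn {
    enum Opcode op;
    int rd;    /* destination register number */
    int rs1;   /* first source register */
    int rs2;   /* second source register (unused for SLLI) */
    int imm;   /* shift amount for SLLI (unused for ADD) */
};

/* A pair may be fused into one shift-and-add only if the fused operation
   leaves exactly the same architectural state as the two instructions. */
bool can_fuse_shift_add(const struct Insn *first, const struct Insn *second) {
    if (first->op != OP_SLLI || second->op != OP_ADD)
        return false;
    if (first->rd == 0 || first->rd != second->rd)
        return false;   /* the intermediate value would stay visible */
    /* The add must consume the shifted value through exactly one operand. */
    bool uses_rs1 = second->rs1 == first->rd;
    bool uses_rs2 = second->rs2 == first->rd;
    return uses_rs1 != uses_rs2;
}
```

Under this check, `slli a1,a1,2; add a0,a0,a1` is rejected because the two destinations differ, while `slli a0,a2,2; add a0,a0,a1` is accepted.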

> Perhaps they will provide compiler patches if required.

I hope so, btw t-head seems to still be trying to upstream XTheadVector: https://gcc.gnu.org/pipermail/gcc-patches/2024-January/64278...

camel-cdr · on Jan 20, 2024

Looks like llvm recently got some RISC-V fusion support via -mtune: https://github.com/llvm/llvm-project/commits/main/llvm/lib/T...