The RISC-V authors rely heavily on the argument and assumption that any high performance implementation will do instruction fusion, and indexed addressing is precisely one of the things that is expected to be handled by fusion (see page 6 in https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130...).
Any implementation that does not do instruction fusion (most implementations - especially lower end ones) is out of luck, and has to pay the penalty of executing extra instructions.
All modern high performance ISAs include indexed addressing modes (most also have scaled indexed addressing). Some RISC-V implementations add these as non-standard extensions.
RISC-V was not designed primarily for high performance. It targets the extremely low end (education, microcontrollers), and does it really well, and they have this clever trick (compression + instruction fusion) that makes the situation OK-ish for high end implementations. But it's not ideal.
Also worth noting is that instruction fusion is far from free. It may be negligible in a really big implementation that already does things like instruction translation (uOP caches etc), but it sure feels like an unnecessary burden to bring to an ISA that hasn't even had the time to build up legacy baggage yet.
Furthermore, fused instructions often "leak" data to architectural registers (e.g. temporary registers used for address calculation), which both increases register pressure and requires fused internal instructions to produce extra outputs (often unnecessarily).
Edit: Consider the following code (https://godbolt.org/z/8YdfoT34o):

    void store(int* array, int index, int value) {
        array[index] = value;
    }

MRISC32:

    store(int*, int, int):
        stw r3, [r1, r2*4]
        ret

RISC-V:

    store(int*, int, int):
        slli a1,a1,2 // writes to a1
        add a0,a0,a1 // writes to a0
        sw a2,0(a0)
        ret
Note how the RISC-V version clobbers both the a1 and the a0 registers. Thus, a properly fused instruction would have to read from three registers and write to two registers (that's bad!), whereas the MRISC32 solution only has to read three registers - no registers need to be written to (same thing for ARM and others). (You could teach the compiler to do an "add a1,a0,a1" instead so that only one register is clobbered, but that adds complexity to the compiler, and the micro architecture may not be able to rely on such compiler optimizations but would most likely have to support the case with two clobbered registers anyway).
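To make the register side effects concrete, here is a minimal C sketch of a toy machine (not any real core). The `Machine` struct, the tiny word memory and the helper names are all made up for illustration. It runs the three-instruction RISC-V sequence and a hypothetical scaled indexed store on the same inputs, then counts how many registers each one changed.

```c
#include <stdint.h>

/* Toy model: 32 integer registers and a small word memory.
   Register numbers follow the RISC-V ABI (a0 = x10, a1 = x11, a2 = x12).
   Everything here is illustrative, not a model of a real implementation. */
enum { A0 = 10, A1 = 11, A2 = 12, MEM_WORDS = 16 };

typedef struct {
    uint32_t x[32];
    int32_t  mem[MEM_WORDS];   /* byte address = 4 * word index */
} Machine;

/* The three-instruction RISC-V sequence from the example. */
static void store_riscv(Machine *m) {
    m->x[A1] = m->x[A1] << 2;                   /* slli a1,a1,2  */
    m->x[A0] = m->x[A0] + m->x[A1];             /* add  a0,a0,a1 */
    m->mem[m->x[A0] / 4] = (int32_t)m->x[A2];   /* sw   a2,0(a0) */
}

/* Hypothetical scaled indexed store: reads three registers, writes none. */
static void store_indexed(Machine *m) {
    uint32_t addr = m->x[A0] + (m->x[A1] << 2);
    m->mem[addr / 4] = (int32_t)m->x[A2];
}

/* Counts how many registers differ between two machine states. */
static int clobbered_registers(const Machine *before, const Machine *after) {
    int n = 0;
    for (int i = 0; i < 32; ++i)
        n += before->x[i] != after->x[i];
    return n;
}
```

With a0 = 8, a1 = 3 and a2 = 42, both variants store 42 to the same word, but the first one leaves a0 and a1 modified while the second leaves the register file untouched.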
Edit 2: Another main reason for not including indexed addressing modes in RISC-V is that indexed stores require three source registers (i.e. at least three register file read ports are needed) - which is a big no-no in the RISC-V philosophy. For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
Minor note - RISC-V's Zba (included in RVA22, which afaik is set to be the baseline requirement for Android) has sh2add for a combined x+y*4, getting that store down to two instructions (testable with -march=rv64gczba); doesn't affect the overall point of needless clobbering of registers from the instruction fusion approach though (and shouldn't indexed loads be even worse, having three potential output registers in baseline RV64GC, and still two with Zba?)
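For reference, a small C sketch of what `sh2add` computes. The helper function is only a model of the instruction semantics (rd = rs2 + (rs1 << 2)), and the two-instruction sequence in the comment is what I would expect a Zba-enabled compiler to emit, not verified output.

```c
#include <stdint.h>

/* Model of Zba's "sh2add rd, rs1, rs2": rd = rs2 + (rs1 << 2). */
static uint64_t sh2add(uint64_t rs1, uint64_t rs2) {
    return rs2 + (rs1 << 2);
}

/* store() from the thread. With Zba, a plausible (assumed) sequence is:
       sh2add a0,a1,a0    # a0 = array + index*4, only a0 is overwritten
       sw     a2,0(a0)
       ret */
void store(int *array, int index, int value) {
    array[index] = value;
}
```

The address calculation still needs a destination register, so one register is overwritten instead of two.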
That's the same approach that gets you from 2 to 1 clobbered registers for stores, no? i.e. usually possible, but needs compiler participation; if assuming the compiler will optimize load register usage properly, it shouldn't have any problem reducing the store clobbers from 2 to 1 either, which I understood as the difficult part. (of course 1 clobber is still worse than 0, but at least it means no 2 writes/fused instr, at least here)
> store(int*, int, int):
>     slli a1,a1,2 // writes to a1
>     add a0,a0,a1 // writes to a0
>     sw a2,0(a0)
>     ret
Three points:
1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
2) since late 2021 RISC-V has the `sh2add` instruction (and shifts of 1 and 3, and also versions that widen and shift a 32 bit unsigned value on a 64 bit CPU) that replaces the `slli` and `add`, thus naturally clobbering only one register. These instructions are already present in the popular RISC-V boards that use the JH7110 SoC (VisionFive 2, Star64, Mars, PineTab-V, Roma)
3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code. If you care about the performance it's going to be because it's in a loop, and the function will be inlined, and strength-reduced, and the actual code will look like:
    sw a2,(a0)
    addi a0,a0,4
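A minimal sketch of that transformation at the C level, assuming a simple fill loop as the caller (the function names are made up for illustration). The first version indexes on every iteration; the second is roughly what the loop looks like after inlining and strength reduction, with a pointer that advances by one element.

```c
#include <stddef.h>

/* Index form: each iteration conceptually computes array + i*4. */
void fill_indexed(int *array, size_t n, int value) {
    for (size_t i = 0; i < n; ++i)
        array[i] = value;
}

/* Strength-reduced form: the shift and add are replaced by a pointer
   increment, which corresponds to
       sw   a2,(a0)
       addi a0,a0,4
   in the loop body. */
void fill_pointer(int *array, size_t n, int value) {
    for (int *end = array + n; array != end; ++array)
        *array = value;
}
```

Both functions write the same values; only the address arithmetic differs.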
I have in the past proposed adding simple reg+reg indexing to loads (only), as that fits within the 2r1w framework and loads are more numerous than stores. Similarly, stores could write the effective address back to the base register, which would also fit in the 2r1w framework.
This would allow something like ...
    void arrAdd(int *dst, int *src1, int *src2, int n) {
        for (int i=0; i<n; ++i)
            dst[i] = src1[i] + src2[i];
    }
... to be compiled as ...
    arrAdd:
        addi a0,a0,-4 // bias for incr by 4 in the store
        sh2add a3,a3,a0 // biased end: address of the last dst element
        sub a1,a1,a0 // bias for indexed addressing
        sub a2,a2,a0
    1:
        lw a4,(a1,a0)
        lw a5,(a2,a0)
        add a4,a4,a5
        swwb a4,4(a0) // store AND a0 = a0+4
        blt a0,a3,1b
        ret
This is the same number of instructions as now (no code size saving) but three fewer instructions in the loop.
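To check the bias arithmetic, here is a rough C model of the proposed loop. Everything in it is hypothetical: `lw_indexed` and `swwb` stand in for the proposed instructions, and `sim_mem` is a made-up word memory addressed by byte offsets. Note that the sketch computes the end bound from the already biased base (the `addi` before the `sh2add`), which is what makes the bottom-tested loop stop after exactly n iterations.

```c
#include <stdint.h>

enum { SIM_WORDS = 64 };
static int32_t sim_mem[SIM_WORDS];   /* byte address = 4 * word index */

/* Hypothetical "lw rd,(rs1,rs2)": reg+reg indexed load (2 reads, 1 write). */
static uint32_t lw_indexed(uint32_t rs1, uint32_t rs2) {
    return (uint32_t)sim_mem[(uint32_t)(rs1 + rs2) / 4];
}

/* Hypothetical "swwb rs2,imm(rs1)": store to rs1+imm and write the
   effective address back to rs1 (2 reads, 1 write). */
static void swwb(uint32_t value, uint32_t *base, uint32_t imm) {
    *base += imm;
    sim_mem[*base / 4] = (int32_t)value;
}

/* Model of arrAdd on byte addresses into sim_mem. Like the assembly, the
   loop is bottom-tested, so n must be at least 1. */
static void arr_add_model(uint32_t dst, uint32_t src1, uint32_t src2,
                          uint32_t n) {
    uint32_t a0 = dst, a1 = src1, a2 = src2, a3 = n;
    a0 = a0 - 4;              /* addi   a0,a0,-4 */
    a3 = a0 + (a3 << 2);      /* sh2add a3,a3,a0 : last dst element */
    a1 = a1 - a0;             /* sub    a1,a1,a0 (wraps modulo 2^32) */
    a2 = a2 - a0;             /* sub    a2,a2,a0 */
    do {
        uint32_t a4 = lw_indexed(a1, a0);   /* lw   a4,(a1,a0) */
        uint32_t a5 = lw_indexed(a2, a0);   /* lw   a5,(a2,a0) */
        a4 = a4 + a5;                       /* add  a4,a4,a5   */
        swwb(a4, &a0, 4);                   /* swwb a4,4(a0)   */
    } while ((int32_t)a0 < (int32_t)a3);    /* blt  a0,a3,1b   */
}
```

Since a1 and a2 hold differences rather than addresses, advancing a0 alone moves all three streams forward, which is why the loop body needs no separate pointer increments.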
> For any implementation above the purely educational level, 3+ register file read ports is the norm, but AFAIK the RISC-V authors have claimed that the base integer ISA will never require more than two source operands. I think that this philosophy weighs very heavily - more so than optimal performance.
It gives optimal price-performance. Having three read ports is rather expensive, and nowhere near justified by the number of integer instructions that would ever actually make use of the 3rd port, even if you added indexed stores, conditional selects, double-width shifts with dynamic shift count, and multiply-add. Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code, and besides, IEEE rules require a single rounding in a multiply-add.
> Three read ports are justified in FP code as multiply-add is the fundamental and by far most popular instruction in FP code
The issue with FMADD is that it requires 4 registers, which takes up way too much encoding space. It could've been specified in the usual 3 register format as Fd ← Fd + (Fs1 * Fs2), which is how the instruction is typically implemented in DSPs. If you really need a separate Fs3 register, that's just a two-insn sequence Fd ← Fs3; Fd ← Fd + (Fs1 * Fs2), which ought to be highly amenable to insn fusion. I do wonder why it wasn't done that way originally, and why the choice was made to use so much space with a new R4 format. (Mind you, this could be fixed by standardizing the new form, adding a new Z-subset of the F extension without the FMADD encoding blocks, and soft-deprecating the original F. But still.)
Using up a lot of encoding space for four registers is indeed a problem, but it's a different problem. Even if you restrict the instruction to Fd ← Fd + (Fs1 * Fs2) and/or Fd ← Fs1 + (Fd * Fs2) (the RISC-V Vector extension provides both) you still have to read three registers.
> 1) you could argue that's bad code generation, which could easily be fixed in the compiler, as `a1` could be used as the result of the `add`
Yes, as I said.
Please note that the code was generated by GCC (trunk), which has fairly mature RISC-V support (it has had RISC-V support since 2017 at least, and the fusion optimization was suggested even earlier than that). If fusion of load/store instructions is indeed an important part of the RISC-V design, why has this not been fixed in the compiler during the last 6+ years? I take it to mean that there is no canonical way that register allocation should be done in these situations, which means that HW implementations must assume that such a sequence may clobber two registers.
> 2) since late 2021 RISC-V has the `sh2add` instruction
Which is great. As you say, not really a win in terms of code density, but definitely fewer degrees of freedom and only one clobbered register. Still not as good as dedicated indexed load/store instructions, though.
BTW, while I agree that an indexed load without indexed store would be better than nothing (I actually considered that for MRISC32), I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
> 3) using a function like this to prove anything actually verges on dishonest (and erincandescent did the same thing) because no real program where anyone cares about performance will have such a function in hot code
It's true that GCC happily and often transforms indexing inside loops to address increments rather than index increments (even for architectures that have support for indexed addressing).
The point of the example, though, was to highlight that when the compiler needs to do indexed addressing, it must clobber unrelated registers, which is a problem for instruction fusion (i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects).
> It gives optimal price-performance.
Probably. But this circles back to the question about what segments you target with your design. I am fully convinced that RISC-V is the best ISA choice for many segments, especially the cost sensitive segments. However, I'm less convinced that RISC-V (as it is) is the optimal choice for less cost sensitive high performance segments (e.g. premium products like those from Apple, gaming rigs, compute servers, and so on). It can probably do a good job there (just like x86 can do a good job, despite all its ISA bloat and shortcomings), but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments (at least that is my belief).
> If fusion of load/store instructions is indeed an important part of the RISC-V design
I don't believe that it is. Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
I have heard that some of the companies making very high performance cores (Apple M or AMD Zen class) are incorporating some fusion. Perhaps they will provide compiler patches if required.
> I think that it would pose a problem for compilers that generally assume symmetry (addressing modes supported by loads are assumed to be supported by stores too).
I think they would manage. There is certainly precedent, e.g. MSP430 has 4 addressing modes for src operands but only 2 for dst (register, register indirect with 16 bit offset). Plus of course every ISA where it's an addressing mode, not a different instruction, allows "immediate" on a load but not on a store.
> i.e. a fused instruction does not do the same thing as a corresponding architectural instruction, as the fused instruction has more side effects
That can never happen. Fusion will simply not be done in that case.
> but if the integer ISA was designed with support for three source operands from the start, it would be an even better fit for those segments
And I don't think the difference would be enough to measure or care about: a single digit percentage on the number of clock cycles needed. And meanwhile the simpler core may allow at least the same percentage to be gained as higher clock speed, or more cores on a die, or other compensating factors.
> Certainly no RISC-V implementations that are in the hands of customers right now do any fusion and it doesn't seem to hurt their ability to match or exceed the performance of similar Arm cores (A55, A72).
Most of the fusion targets require the same destination register, so they won't be able to fuse the current codegen. I suppose there is still some time before compilers need to be ready, but it's not that much.
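As a rough illustration of the "same destination" requirement, here is a C sketch of a fusion eligibility check for a shift-and-add pair. The `Insn` struct and `can_fuse_shift_add` are invented for this example, and the rule itself (fuse only when the add overwrites the shift's destination, so the fused op has a single write) is an assumption about how such a decoder might be built, not a description of any shipping core.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { OP_SLLI, OP_ADD, OP_SW, OP_OTHER } Opcode;

/* Hypothetical decoded instruction, for illustration only. */
typedef struct {
    Opcode  op;
    uint8_t rd;    /* destination register (unused for OP_SW) */
    uint8_t rs1;   /* first source register */
    uint8_t rs2;   /* second source register (unused for OP_SLLI) */
    int32_t imm;   /* shift amount or offset */
} Insn;

/* True if "slli rd,rs1,{1,2,3}" followed by an add that overwrites rd and
   uses rd as exactly one of its sources can become one shift-and-add op
   that writes only rd. */
static bool can_fuse_shift_add(Insn first, Insn second) {
    if (first.op != OP_SLLI || second.op != OP_ADD)
        return false;
    if (first.imm < 1 || first.imm > 3)
        return false;
    if (second.rd != first.rd)
        return false;   /* the pair would need two register writes */
    return (second.rs1 == first.rd) != (second.rs2 == first.rd);
}
```

Under this rule, the pair "slli a1,a1,2; add a0,a0,a1" from the GCC output is rejected because the two instructions write different registers, while "slli a1,a1,2; add a1,a0,a1" is accepted.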
> Perhaps they will provide compiler patches if required.