I wonder how many counterparts to delay slots, stack windows, conditional moves, and other embarrassments we are inadvertently enshrining. There's nothing like hindsight to make you facepalm. (The crypto extension is my bet ATM for most-likely-to-embarrass. But that's without reading it.)
The only way to approach this project sensibly is to assume every single FPGA produced after some near future point will have at least one, and more typically dozens of RISC-V cores scattered around like the multipliers you see in them now, just to try to be competitive.
Personally, I am banking my enthusiasm for when the Bitmanip extension goes in.
Compressed versions of the load/store instructions don't support 8-bit or 16-bit values, so in the situations where you most care about code size you're forced to use the full-size encodings.
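For example (a sketch; in the base C extension, byte and halfword accesses have no compressed encodings):

  lw  a0, 0(a1)    # has a 2-byte form, c.lw
  lbu a0, 0(a1)    # byte load: full 4-byte encoding only
  lhu a0, 2(a1)    # halfword load: full 4-byte encoding only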
What's wrong with conditional moves? They're good for mispredictable branches, although I liked them better on PPC, which had 8 condition/flag registers instead of just 1.
Conditional moves depend on state left over from a previous instruction. But when you want to reorder your instructions to keep all your functional units busy, keeping the conditional moves correct makes a big mess. They're fine as micro-ops after you've scheduled everything, but they're lousy at the ISA level.
Intel and AMD make it work by throwing another 10,000 or 100,000 transistors at it.
It is better to let macro-op fusion hardware identify opportunities to convert a branch-over-move sequence, all by itself. RISC-V is supposed to be all about powerful macro-op fusion.
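Something like this (a sketch; a few shipping RISC-V cores advertise exactly this short-forward-branch optimization):

  bne a1, x0, 1f    # if a1 != 0, skip the move
  mv  a0, a2        # executed only when a1 == 0
  1:                # a fusing core can treat this pair as
                    # one conditional-move micro-op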
Clang is really aggressive about generating cmovs. With GCC you can still use (x & -c) expressions to get nicely pipelined conditional expressions, but Clang stomps them all into cmovs.
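For reference, the pattern looks something like this (a minimal sketch; the function name is mine):

  #include <stdint.h>

  /* Branchless select: returns a when c is 1, b when c is 0.
     0 - c is all-ones or all-zeros, so the masks pick exactly
     one operand, with no branch and no cmov required. */
  static inline uint32_t select_bits(uint32_t c, uint32_t a, uint32_t b)
  {
      uint32_t mask = (uint32_t)0 - c;  /* 0xFFFFFFFF or 0x0 */
      return (a & mask) | (b & ~mask);
  }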
Cmov is one of the methods to mitigate Spectre, because Intel refuses to speculate loads in them. So cmov from memory pessimizes your code in cases where you aren't worried about what might be sharing your cache.
> Cmov is one of the methods to mitigate Spectre because Intel refuses to speculate loads in them.
Sure, this is what makes it good for mispredictable branches - data decompression for instance. If you can't speculate ahead, it's a waste of space in the branch prediction buffer to do it.
But I definitely wish it was a builtin function instead of something the compiler generates on its own, because no compiler can reliably guess when to use it.
> Conditional moves depend on state left over from the last instruction.
So, literally exactly the same state that conditional branch instructions depend on? Aside from implementation details (like "Intel refuses to speculate loads in them", which, to be fair, might be its own problem), "cmovz D, S" is just "jnz skip; mov D, S; skip:" without the useless branch overhead.
Performance is 100% implementation details, from top to bottom. And the only reason anybody uses a cmov instruction is for hoped-for better performance.
There is hope that it doesn't squat a precious branch-prediction slot, hope that it doesn't incur a branch misprediction pipeline stall if it's taken, or not taken, hope that fewer instructions translates to fewer clock cycles, hope that what looks like a copy really just does a register-renaming accounting trick, ...
Having 3 input operands is definitely an issue. The Alpha EV6 microarchitecture supported only 2 inputs per instruction, so a CMOV had to be split into two internal instructions: the first set the 65th bit of one of the operands to the result of the test, and the second selected the correct output. Thus CMOV had 2-cycle latency and still required extra hardware resources to implement.
The other problem is that one of the operands is the flags register, which is affected by other instructions you might want to schedule. RISC-V might have its own way to mitigate this.
Unless I've understood you poorly, this is confused.
The issue with CMOV is primarily that it's a three-operand instruction. If it were a two-operand instruction, it would just be arithmetic, and nobody would care.
A minor, secondary issue is that conditional moves have a data dependency on both source operands, unlike a conditional branch plus move, which means they are often slower than a predictable branch and move on a fast processor. However, this doesn't matter all that much, because they're optional: just don't use them when they aren't a good fit.
I dislike the lack of arithmetic instructions with a built-in trap on overflow (even an imprecise one with a fence could be fine), but maybe that's too expensive to do in hardware.
FYI, here is the rationale section on that subject:
We did not include special instruction set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition: add t0, t1, t2; bltu t0, t1, overflow.
For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case of addition with an immediate operand.
For general signed addition, three additional instructions after the addition are required, leveraging the observation that the sum should be less than one of the operands if and only if the other operand is negative.
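If I'm remembering the spec right, the sequence for the general case is:

  add  t0, t1, t2
  slti t3, t2, 0          # t3 = (t2 < 0)
  slt  t4, t0, t1         # t4 = (sum < t1)
  bne  t3, t4, overflow   # overflow iff the two disagree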
This is kind of a poor argument, since it applies equally to any request to add instructions to a RISC that can be implemented with other instructions. A 4:1 increase in cost isn't brilliant either.
The problem is that few high level languages support this well either. Having an instruction that traps like division by zero isn't great, because it's quite expensive to turn a machine trap into a high-level exception.
What I would like to try is a separate "addsat" instruction which (a) provides saturating arithmetic and (b) sets a flag if saturation was applied, but does not clear it if it was not. That allows you to write a bunch of arithmetic in a natural way (a1*b1 + a2*b2 - a3*b3, etc.) and only test for overflow at the end. It would behave more like a propagating integer NaN. Since it's a flag rather than a trap, a high-level language can more easily convert it to an exception, or provide other processing.
(Saturating arithmetic is useful on its own in some cases, but usually for 8-bit or 16-bit values.)
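A rough C sketch of those semantics (the names and the flag-as-pointer encoding are illustrative; real hardware would keep the sticky bit in a status register):

  #include <stdint.h>

  /* Saturating add with a sticky overflow flag: saturation sets
     the flag, but a non-saturating add never clears it, so one
     test at the end covers a whole chain of operations. */
  static inline int32_t addsat(int32_t a, int32_t b, int *sticky)
  {
      int64_t s = (int64_t)a + (int64_t)b;
      if (s > INT32_MAX) { *sticky = 1; s = INT32_MAX; }
      if (s < INT32_MIN) { *sticky = 1; s = INT32_MIN; }
      return (int32_t)s;
  }

Then a1*b1 + a2*b2 - a3*b3 becomes a chain of saturating ops with a single flag check at the end.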
See also these topics: macro-op fusion, compressed instructions, extensions.
RISC-V publishes (at least for some cores) the preferred order in which to emit instructions so that they can be recognized and fused into efficient micro-ops. The compressed (C) extension means that the extra instructions don't take up too much space in the I-cache.
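One concrete example from the published fusion proposals (the exact set of pairs varies by core) is the indexed-load pattern, fusible because the add's destination is consumed and then overwritten by the load:

  add a0, a1, a2    # compute the address
  lw  a0, 0(a0)     # dependent load; the pair can fuse into
                    # one indexed-load micro-op on capable cores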
Of course we don't know -- because very high performance RISC-V chips don't (yet) exist -- if this will really work when the silicon hits the road, but the plan makes some sense and has the (IMHO) large advantage that it keeps the standard small and easy to implement (in those cases where you don't much care about performance).
There are also extensions, which RISC-V formalizes properly. You can, with relative ease, add your own addsat instruction through a private extension, and (with somewhat more difficulty) propose a public extension if it's generally useful.
For an ISA whose main use will be IoT (at least in the beginning), this isn't good.
Rust didn't put integer overflow checks in release mode because current CPUs don't provide "free" integer overflow detection, and RISC-V is even a regression over MIPS here.
> Rust didn't put integer overflow checks in release mode because current CPUs don't provide "free" integer overflow detection
Not really; the main performance drawback of mandatory overflow checks is the excessively constrained semantics of the resulting code making optimization harder, not the direct CPU cost of the checking. There are other ways of detecting possible overflows, and Rust does check for overflow in debug mode (specifically to ensure that most Rust code will work properly even when checks are enabled, and will not simply end up relying on underspecified semantics).
Carrying from one word to the next: do you have to do 32-bit multiplications so as to retain the high half of the result, or does a 64-bit multiply deliver the high 64 bits of the result somewhere?
Also, there is the question of the carry from 64-bit addition.
Specifically, I think the "sltu" (set if less than, unsigned) instruction is useful for carries. When you add two unsigned integers, the result is smaller than the inputs if and only if an overflow occurred.
  # a1:a0 * a2 --> a5:a4:a3
  # using t0 as a temp reg (the low-half multiply is plain
  # "mul"; there is no "mulu"); mulh-then-mul with the same
  # operands is the fusible order the M spec recommends
  mulhu a5, a1, a2
  mul   a4, a1, a2
  mulhu t0, a0, a2
  mul   a3, a0, a2
  add   a4, a4, t0
  # carry a 1 or a 0, overwriting t0
  sltu  t0, a4, t0
  add   a5, a5, t0
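The same sltu trick supplies the carry for the 64-bit addition question; a sketch of a full 128-bit add:

  # a1:a0 + a3:a2 --> a1:a0
  add  a0, a0, a2
  sltu t0, a0, a2    # t0 = 1 if the low-word add wrapped
  add  a1, a1, a3
  add  a1, a1, t0    # fold the carry into the high word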
I know nothing about RISC-V, but my usual way of finding out how to do something standard with a given instruction set is to ask a compiler. There are online compilers, such as this one: https://cx.rv8.io/ Just type in "__int128 add(__int128 a, __int128 b) { return a + b; }" on the left, stick "-O2" into the "Compiler options", and see what you get! I don't know whether that particular compiler is doing the right thing, of course, but it gives you an idea and it's definitely less work than reading the spec.