I wonder how many counterparts to delay slots, stack windows, conditional moves, and other embarrassments we are inadvertently enshrining. There's nothing like hindsight to make you facepalm. (The crypto extension is my bet ATM for most-likely-to-embarrass. But that's without reading it.)
The only way to approach this project sensibly is to assume every single FPGA produced after some near future point will have at least one, and more typically dozens of RISC-V cores scattered around like the multipliers you see in them now, just to try to be competitive.
Personally, I am banking my enthusiasm for when the Bitmanip extension goes in.
Compressed versions of the load/store instructions don't support 8-bit or 16-bit values, so in the situations where you most care about code size you're forced to use the full-size encodings.
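For example (a sketch; in the base C extension, byte and halfword accesses have no compressed encodings):

  lw  a0, 0(a1)    # has a 2-byte form, c.lw
  lbu a0, 0(a1)    # byte load: full 4-byte encoding only
  lhu a0, 2(a1)    # halfword load: full 4-byte encoding only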
What's wrong with conditional moves? They're good for mispredictable branches, although I liked them better on PPC, which had 8 condition/flag registers instead of just 1.
Conditional moves depend on state left over from a previous instruction. But when you want to reorder your instructions to keep all your functional units busy, keeping the conditional moves correct makes a big mess. They're fine as micro-ops after you've scheduled everything, but they're lousy at the ISA level.
Intel and AMD make it work by throwing another 10,000 or 100,000 transistors at it.
It is better to let macro-op fusion hardware identify opportunities to convert a branch-over-move sequence, all by itself. RISC-V is supposed to be all about powerful macro-op fusion.
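Something like this (a sketch; a few shipping RISC-V cores advertise exactly this short-forward-branch optimization):

  bne a1, x0, 1f    # if a1 != 0, skip the move
  mv  a0, a2        # executed only when a1 == 0
  1:                # a fusing core can treat this pair as
                    # one conditional-move micro-op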
Clang is really aggressive about generating cmovs. With GCC you can still use (x & -c) expressions to get nicely pipelined conditional expressions, but Clang stomps them all into cmovs.
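For reference, the pattern looks something like this (a minimal sketch; the function name is mine):

  #include <stdint.h>

  /* Branchless select: returns a when c is 1, b when c is 0.
     0 - c is all-ones or all-zeros, so the masks pick exactly
     one operand, with no branch and no cmov required. */
  static inline uint32_t select_bits(uint32_t c, uint32_t a, uint32_t b)
  {
      uint32_t mask = (uint32_t)0 - c;  /* 0xFFFFFFFF or 0x0 */
      return (a & mask) | (b & ~mask);
  }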
Cmov is one of the methods to mitigate Spectre, because Intel refuses to speculate loads in them. So cmov from memory pessimizes your code in cases where you aren't worried about what might be sharing your cache.
> Cmov is one of the methods to mitigate Spectre because Intel refuses to speculate loads in them.
Sure, this is what makes it good for mispredictable branches - data decompression for instance. If you can't speculate ahead, it's a waste of space in the branch prediction buffer to do it.
But I definitely wish it was a builtin function instead of something the compiler generates on its own, because no compiler can reliably guess when to use it.
> Conditional moves depend on state left over from the last instruction.
So, literally exactly the same state that conditional branch instructions depend on? Aside from implementation details (like "Intel refuses to speculate loads in them", which, to be fair, might be its own problem), "cmovz D, S" is just "jnz skip; mov D, S; skip:" without the useless branch overhead.
Performance is 100% implementation details, from top to bottom. And the only reason anybody uses a cmov instruction is for hoped-for better performance.
There is hope that it doesn't squat a precious branch-prediction slot, hope that it doesn't incur a branch misprediction pipeline stall if it's taken, or not taken, hope that fewer instructions translates to fewer clock cycles, hope that what looks like a copy really just does a register-renaming accounting trick, ...
Having 3 input operands is definitely an issue. The Alpha EV6 microarchitecture supported only 2 inputs per instruction, so a CMOV had to be split into two internal instructions: the first set the 65th bit of one of the operands to the result of the test, and the second selected the correct output. Thus CMOV had 2-cycle latency and still required extra hardware resources to implement.
The other problem is that one of the operands is the flags register, which is affected by other instructions you might want to schedule. RISC-V might have its own way to mitigate this.
Unless I've understood you poorly, this is confused.
The issue with CMOV is primarily that it's a three-operand instruction. If it were a two-operand instruction, it would just be arithmetic, and nobody would care.
A minor, secondary issue is that conditional moves have a data dependency on both source operands, unlike a conditional branch plus move, which means they are often slower than a predictable branch and move on a fast processor. However, this doesn't matter all that much, because they're optional: just don't use them when they aren't a good fit.
I dislike the lack of arithmetic instructions with a built-in trap on overflow (even an imprecise one with a fence could be fine), but maybe that's too expensive to do in hardware.
FYI, here is the rationale section on that subject:
We did not include special instruction set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition: add t0, t1, t2; bltu t0, t1, overflow.
For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case of addition with an immediate operand.
For general signed addition, three additional instructions after the addition are required, leveraging the observation that the sum should be less than one of the operands if and only if the other operand is negative.
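If I'm remembering the spec right, the sequence for the general case is:

  add  t0, t1, t2
  slti t3, t2, 0          # t3 = (t2 < 0)
  slt  t4, t0, t1         # t4 = (sum < t1)
  bne  t3, t4, overflow   # overflow iff the two disagree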
This is kind of a poor argument, since it applies equally to any request to add instructions to a RISC that can be implemented with other instructions. A 4:1 increase in cost isn't brilliant either.
The problem is that few high level languages support this well either. Having an instruction that traps like division by zero isn't great, because it's quite expensive to turn a machine trap into a high-level exception.
What I would like to try is a separate "addsat" instruction which (a) provides saturating arithmetic and (b) sets a flag if saturation was applied, but does not clear it if it was not. That allows you to write a bunch of arithmetic in a natural way (a1*b1 + a2*b2 - a3*b3, etc.) and only test for overflow at the end. It would behave more like a propagating integer NaN. Since it's a flag rather than a trap, a high-level language can more easily convert it to an exception, or provide other processing.
(Saturating arithmetic is useful on its own in some cases, but usually for 8-bit or 16-bit values.)
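A rough C sketch of those semantics (the names and the flag-as-pointer encoding are illustrative; real hardware would keep the sticky bit in a status register):

  #include <stdint.h>

  /* Saturating add with a sticky overflow flag: saturation sets
     the flag, but a non-saturating add never clears it, so one
     test at the end covers a whole chain of operations. */
  static inline int32_t addsat(int32_t a, int32_t b, int *sticky)
  {
      int64_t s = (int64_t)a + (int64_t)b;
      if (s > INT32_MAX) { *sticky = 1; s = INT32_MAX; }
      if (s < INT32_MIN) { *sticky = 1; s = INT32_MIN; }
      return (int32_t)s;
  }

Then a1*b1 + a2*b2 - a3*b3 becomes a chain of saturating ops with a single flag check at the end.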
See also these topics: macro-op fusion, compressed instructions, extensions.
RISC-V publishes (at least for some cores) the preferred order in which to emit instructions so that they can be recognized and fused into efficient micro-ops. The compressed (C) extension means that the extra instructions don't take up too much space in the I-cache.
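One concrete example from the published fusion proposals (the exact set of pairs varies by core) is the indexed-load pattern, fusible because the add's destination is consumed and then overwritten by the load:

  add a0, a1, a2    # compute the address
  lw  a0, 0(a0)     # dependent load; the pair can fuse into
                    # one indexed-load micro-op on capable cores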
Of course we don't know -- because very high performance RISC-V chips don't (yet) exist -- if this will really work when the silicon hits the road, but the plan makes some sense and has the (IMHO) large advantage that it keeps the standard small and easy to implement (in those cases where you don't much care about performance).
There are also extensions, which RISC-V formalizes properly. You can, with relative ease, add your own addsat instruction through a private extension, and (with somewhat more difficulty) propose a public extension if it's generally useful.
For an ISA whose main use will be IoT (at least in the beginning), this isn't good.
Rust didn't put integer overflow checks in release mode because current CPUs don't provide "free" integer overflow detection, and RISC-V is even a regression over MIPS here.
> Rust didn't put integer overflow checks in release mode because current CPUs don't provide "free" integer overflow detection
Not really; the main performance drawback of mandatory overflow checks is the excessively constrained semantics of the resulting code making optimization harder, not the direct CPU cost of the checking. There are other ways of detecting possible overflows, and Rust does check for overflow in debug mode (specifically to ensure that most Rust code will work properly even when checks are enabled, and will not simply end up relying on underspecified semantics).
Carrying from one word to the next: do you have to do 32-bit multiplications so as to retain the high half of the result, or does a 64-bit multiply deliver the high 64 bits of the result somewhere?
Also, there is the question of the carry from 64-bit addition.
Specifically, I think the "sltu" (set if less than, unsigned) instruction is useful for carries. When you add two unsigned integers, the result is smaller than the inputs if and only if an overflow occurred.
  # a1:a0 * a2 --> a5:a4:a3
  # using t0 as a temp reg (the low-half multiply is plain
  # "mul"; there is no "mulu"); mulh-then-mul with the same
  # operands is the fusible order the M spec recommends
  mulhu a5, a1, a2
  mul   a4, a1, a2
  mulhu t0, a0, a2
  mul   a3, a0, a2
  add   a4, a4, t0
  # carry a 1 or a 0, overwriting t0
  sltu  t0, a4, t0
  add   a5, a5, t0
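The same sltu trick supplies the carry for the 64-bit addition question; a sketch of a full 128-bit add:

  # a1:a0 + a3:a2 --> a1:a0
  add  a0, a0, a2
  sltu t0, a0, a2    # t0 = 1 if the low-word add wrapped
  add  a1, a1, a3
  add  a1, a1, t0    # fold the carry into the high word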
I know nothing about RISC-V, but my usual way of finding out how to do something standard with a given instruction set is to ask a compiler. There are online compilers, such as this one: https://cx.rv8.io/ Just type in "__int128 add(__int128 a, __int128 b) { return a + b; }" on the left, stick "-O2" into the "Compiler options", and see what you get! I don't know whether that particular compiler is doing the right thing, of course, but it gives you an idea and it's definitely less work than reading the spec.