An ex-ARM engineer critiques RISC-V (gist.github.com)
350 points by ducktective on Nov 1, 2020 | 249 comments


The lack of condition codes is a big deal for anyone relying on overflow checked arithmetic, like modern safe languages that do this for all integer math by default, or dynamic languages where it’s necessary for the JIT to speculate that the dynamic “number” type (which in those languages is either like a double or like a bigint semantically) is being used as an integer.

RISC-V means three instructions instead of two in the best case. It requires five or more instead of two in bad cases. That’s extremely annoying since these code sequences will be emitted frequently if that’s how all math in the language works.

Also worth noting, since this always comes up, that these things are super hard for a compiler to optimize away. JSC tries very aggressively but only succeeds a minority of the time (we have a backwards abstract interpreter based on how values are used, a forward interpreter that uses a simplified octagon domain to prove integer ranges, and a bunch of other things - and it’s not enough). So, even with very aggressive compilers you will often emit sequences to check overflow. It’s ideal if this is just a branch on a flag because this maximizes density. Density is especially important in JITs; more density means not just better perf but better memory usage since JITed instructions use dirty memory.
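For concreteness, here's a minimal C sketch (not JSC's actual code, just the shape of the check that gets emitted around every add):

    #include <stdbool.h>
    #include <stdint.h>

    // Checked 32-bit add, as a safe language or a JIT's speculated int path
    // conceptually lowers it. On a flags ISA the test is one branch on the
    // overflow flag; without condition codes it costs extra instructions
    // and registers.
    bool checked_add32(int32_t a, int32_t b, int32_t *out) {
        int64_t wide = (int64_t)a + (int64_t)b;   // can't overflow in 64 bits
        if (wide < INT32_MIN || wide > INT32_MAX)
            return false;                         // overflow: slow path / deopt
        *out = (int32_t)wide;
        return true;
    }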


The idea is that high end processors will recognise these sequences of instructions and optimize them (something called macro-op fusion). Whether this is a good idea is an open question because we don't yet have such high performance RISC-V chips, but that's what the RISC-V designers are thinking. At the same time it permits very simple implementations which wouldn't be possible if the base instruction set contained every pet instruction that someone thought was a good idea.

Note that macro-op fusion is already widely used in other architectures, particularly ones like x86 where what the processor actually runs looks nothing like the machine code.


Two words: instruction density.

It doesn’t matter if they’re fused or not if the reduced instruction density increases memory usage and puts more pressure on I$.

Also, I don't buy the whole fusion argument, on the grounds that having to fuse super complex (5-instruction or longer) sequences adds enough complexity that you've got opportunity cost. Much better for everyone if the CPU doesn't have to do that fusion. That's the whole point of good ISA design - to prevent the need for fusing in cases where you're doing something super common.


That's what the RISC-V compressed instruction encoding is all about. There is a paper which I can't find right now about how the compressed encoding achieves something similar to x86 code size on typical application code. As I said above, the jury is still out until we get very high performance RISC-V implementations which are equivalent to existing high end x86 and aarch64 designs.

Edit: Here's the thesis about the design decisions in the C encoding: https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.p... See also the diagram on page 62 of this document: https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...


First of all: I agree with you that the jury is out. We are all speculating. I might be wrong.

The thing I see is this: you can also add compressed encodings for any ISA. And that has its own costs (it’s harder to decode and it’s harder on software that wants to do bidirectional analysis of machine code). So “my isa has shortcomings but it’s cool because compression” isn’t a perfect argument since if your isa lacks those shortcomings then you still benefit from compression and you don’t need it as much, which is better.


> it’s harder to decode and it’s harder on software that wants to do bidirectional analysis of machine code

not necessarily, since

1. the compressed versions are basically 1-1 to the non-compressed versions

2. the uncompressed versions are more for study/academia and clarity; it's expected that in real life only compressed instructions are used (for instructions that can be compressed)

more info: https://riscv.org/wp-content/uploads/2015/11/riscv-compresse...


Do any other ISAs have compressed encodings in wide use? It seems a bit like a chicken and egg problem - why build all the silicon to handle decoding it if your base ISA is dense enough?


Arguably x86 only has a compressed encoding in the sense that super common instructions often fit in fewer than 4 bytes and some of the most used ones are 1 byte. I don’t think the compression is deliberate or optimal though.


Arguably most of the 1 byte instructions are garbage, though ;)


What, no love for AAA :-?

(IIRC they removed that one on x86-64)


Sounds pretty similar to thumb mode on ARM, no?


And ARM licensed the Hitachi SuperH patents when implementing Thumb. It's not exactly a new concept.


In addition to Thumb there are apparently at least MIPS16e ASE and PowerPC VLE.


ARM had Thumb, right? Is that a deadend at this point?


Thumb was removed from ARMv8 (AArch64). If you use the 32-bit legacy mode, it may exist on some chips, but it's no longer the benefit it once was.


It's always been of the same value. You exchange performance for code density. They want aarch64 to be high-performance, so they removed thumb.

RISC-V compact instructions don't require special modes and run in fully-mixed mode with 32-bit instructions without all the penalties Thumb has (they are literally just expanded into their 32-bit counterparts internally).


> It's always been of the same value. You exchange performance for code density.

High code density became less valuable, was the point.


> having to fuse super complex (5 instruction or more) sequences

Can you give an example of someone advocating for 5 instruction fusion? Normally it's limited to three.


They have this example for general signed overflow checking in the 2.2 spec (but I'm not sure if this counts as a recommended sequence for fusion):

    add t0, t1, t2
    slti t3, t2, 0
    slt t4, t0, t1
    bne t3, t4, overflow
So two extra registers needed also..
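In C terms, that recommended sequence is computing the standard predicate below (a sketch, with a = t1, b = t2):

    #include <stdbool.h>
    #include <stdint.h>

    // Signed addition overflows iff the sign of b disagrees with whether
    // the wrapped sum ended up below a.
    bool add_overflows(int64_t a, int64_t b) {
        int64_t sum  = (int64_t)((uint64_t)a + (uint64_t)b); // add  t0, t1, t2
        bool b_neg   = b < 0;                                // slti t3, t2, 0
        bool wrapped = sum < a;                              // slt  t4, t0, t1
        return b_neg != wrapped;                             // bne  t3, t4, overflow
    }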

edit: so I now think a good extension to add overflow checking to RISC-V is with an instruction that works like "slt"- call it "sov", set if add would overflow:

    add t0, t1, t2
    sov t3, t1, t2
    bnez t3, overflow
add/sov could be fused..


Fusing that would be very difficult, since it's hard to write that many registers with a single op, and I don't think they recommend it. This does mean signed overflow checking will be comparatively expensive. But thanks for the reference, I agree this is a weak point :).


You don't have to use real registers in a single op. Fusing means interpreting a pattern in the input stream and issuing a different instruction from a richer microarchitectural instruction set


You have to preserve the architectural registers at the end of the sequence. So if there are 5 registers, you either have to have a 5-register microinstruction, or issue multiple 3-register microinstructions instead.


You have to preserve architectural registers only when some other instruction actually depends on them. When you detect a dependency you lazily compute their content (causing a stall)


The dependency could be arbitrarily many millions of instructions away.

The only way to know there isn't a dependency is if that register gets clobbered by something else very soon afterwards.

But this whole topic of "checking for signed overflow is expensive" is overblown. It's simply not that important an operation, especially in the context of those languages that do it a lot also doing a lot of memory references, which are far more expensive.

Adding arbitrary completely unknown integers is pretty rare. If you know both numbers are greater than zero then a single compare-and-branch is all you need. If one of the numbers is a constant then a single compare-and-branch is all you need.


That sounds like an amazingly bad idea, because you would instead have to retain the two source registers until you can prove that you don't need the two output registers anymore, which can be arbitrarily far in the future.


At least in their papers and mailing list discussion, the limit for simple fusion is not on the number of source operands, but on the number of destination operands. A classic example is a PC-relative long jump:

  auipc ra, .LONG_TARGET20
  jalr  ra, ra, .LONG_TARGET12
Only one register (the link register) is clobbered, so the pair can be fused into a single wide jump-and-link.

So in parent's example, sov might fuse with the following bnez, but it likely wouldn't fuse with the preceding addition.


Yes, I missed the edit, I was only commenting on the first part.

Quite a few things determine whether a fusion is doable or not. In addition to the number of destination registers, you do, to a more relaxed extent, care about source operands, but also things like ‘does this fit nicely in a single pass through the pipeline?’ and even just ‘is this materially beneficial?’

Lots of cores (but not all) can write two registers from a fused instruction, given the right conditions, and sov does rerun the addition, so add-sov fusion sounds very doable to me.


rotate-and-xor (and xor-and-rotate) are both common operations in ARX ciphers. They demand 4 macro-ops in RISC-V, but only one in ARMv8.

Bitfield insertion is only one instruction in most RISC ISAs, but 5 or more in RISC-V.
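For reference, a typical ARX step looks like this in C (a sketch; on base RV32I/RV64I the rotate alone lowers to two shifts plus an OR, so the whole step is four instructions):

    #include <stdint.h>

    // Rotate-left, written with the standard wrap-safe idiom.
    static inline uint32_t rotl32(uint32_t x, unsigned r) {
        return (x << r) | (x >> ((32 - r) & 31));
    }

    // XOR then rotate - the basic ARX building block.
    uint32_t arx_step(uint32_t a, uint32_t b) {
        return rotl32(a ^ b, 7);
    }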


The (unfinalized) bitmanip extension has single-op rotates.


I really don't like it. It seems like they copied x86 (bext, bdep) where they should have been plagiarizing armv8 (BFM, ...).


I've watched that extension's development since it was little more than one smart guy's wish list. I don't think it's fair to say that they copied any one architecture. The authors have put in a ton of time researching the tradeoffs and investigating the trade space over the years.

That said, until it gets ratified by the consortium and implemented in silicon it's still just a (well-researched) wishlist.


I’d have to reinvestigate more than I feel like doing right now so I’ll tell you where to look.

You need fast sequences for all of these variants:

- add or sub or mul

- signed or unsigned

- 32 bit or 64 bit

Some of those need 5 instructions. I don’t remember which adventure you need to pick to get 5.


Thanks for the reply but I'm not sure what you mean. Are you saying you want to add an i32 to a u64? Or am I completely misunderstanding? I'm not sure why unsigned-unsigned or signed-signed instructions would be hard. Idk about mixed sign operations, but when do they ever occur without first casting one to the other?


I mean you will do overflow checks on the following. I’ll use the “s” and “u” prefixes to mean signed and unsigned. Unsigned matters less than signed.

sadd32, sadd64, uadd32, uadd64, ssub32, ssub64, usub32, usub64, smul32, smul64, umul32, umul64


Ah, I get you now. Yes, this looks like a weak point, and I totally get why it'd screw with Javascript optimizations.

I don't think it's a fusion problem; even if you did fuse these sequences, they'd still be bad, since they'd be writing lots of extra registers.


Yeah exactly!

And worth noting that these are just the ones where you usually have good sequences on other cpus, but even then (like signed mul) they’re not perfect. Lots of room for improvement. It’s just that risc-v didn’t seize the opportunity.


> particularly ones like x86 where what the processor actually runs looks nothing like the machine code.

(Quoted from OPs comment)

This isn't a subject I'm an expert at, but wouldn't this mean there's already some sort of translation going on on the other systems? So it's mostly just added end-user work, not a giant performance loss on RISC?

It would then simply be a layer of abstraction that is lost.


Yes. x86 processors translate instructions to internal micro opcodes before scheduling or running them. Those look nothing like the x86 instructions.


What little we've seen of x86 micro-ops (the great work reverse engineering the K10 microcode) shows that the micro-ops look very much like x86 instructions: still predominantly two-address, RMW instructions, for instance. Very similar to the original 8086's microcode structure rather than some reaction to the RISC movement, despite the common trope stating the contrary.

Can't wait to see more information about the goldmont microcode work to see if that holds for intel as well as it does for AMD.


Some huge portion of the x86 power budget is just the instruction decoder. The instruction set is so nutty that trying to implement them all directly in hardware is a non-starter.

The x86 front end breaks down big instructions into smaller RISC-like micro-ops and then fuses/re-orders/optimizes/etc. and runs those instead. There are pros and cons: the con is sheer complexity and power budget; the pros are that it's an abstraction, so the microarchitecture can change completely without recompiling your code - and you get CPU-specific optimizations too. The CPU is basically emulating x86.

You could in theory build an x86 CPU with a RISC-V core behind that decoder.


> Some huge portion of the x86 power budget is just the instruction decoder.

This is a widely held meme, but the internet at large doesn't have any evidence to back it up. A couple of publicly visible engineers that do have experience are on record as saying that cell-phone-class competitive x86 was absolutely possible. Intel and AMD chose not to pursue those markets.

The expensive parts of a high-end CPU aren't normally in the instruction decode part. They are in the branch prediction, branch mispredict recovery, forwarding networks, memory re-ordering, and so on. Anything short of a dataflow ISA has little impact on those structures.


I'm sure that you're right that the instruction decoder power budget isn't a huge issue.

I don't think it's strictly accurate to say that Intel 'chose not to pursue' the mobile SoC market - IIRC they tried, made little progress, and gave up having spent a lot of money in the process.


By large I mean up to 10%, and I have a study for you. [1]

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...


I skimmed the paper. They reported that the difference in total power spent in two(!) microbenchmarks when switching between L1I and the decode cache is 10% for one, and 3% for the other. The attribution of power was done entirely using a linear regression model on some core perf counters, and the core's own internal estimate for power consumption over the whole package.

I don't think you can generalize from that result to much of anything.


I don't have a source, but back when ARM Zen was going to be a thing (hopefully it still will), AMD was claiming a 10% uplift in performance over x64.

That struck me as one of the few apples to apples comparisons ever of the instruction sets at the high end, from a party not really incentivized to bend the truth one way or another.

But it could have easily come from something like the relaxed memory model, or they could have just been overly optimistic. The chip was cut after all.


You're running into the same problem, just from a different angle.

With the exception of Atom, all recent Intel designs have been a RISC core with a CISC decoder slapped on top. Everything else being equal, the simpler decoder will create a smaller chip. Because the decoder is always running all-out, the simpler decoder will also use less power.

x86 instructions are variable-length, from 1 to 15 bytes. The cost to slice that up is always going to be bigger than for fixed-length instructions. RISC-V has variable length in theory, but in practice compressed instructions simply expand into 32-bit instructions with some bits added (which matters for the decode cache), while longer encodings can be ignored.

Because there's a maximum tolerable decode latency of a very few cycles and latency increases with cache size, decode cache size has a definite cap. x86 has a couple orders of magnitude more potential instructions than RISC-V. More instructions translate into a lower hit rate for the same size cache, barring any heuristics (more on that below).

Matching variable-length arrays to an unknown set of arrays in cache is inherently a hard problem. Every solution has tradeoffs and the resulting heuristics are bad for computing (see below). In contrast, searching for a match on a fixed 32-bit array has much simpler general solutions that don't require tradeoffs.

C and CPU designs feed off each other. Let's say there are instructions X and Y which can do equivalent things. x86 engineers played around with both and got a lucky insight into how to make X a bit faster. Compiler writers jump on it and start using the faster solution. x86 engineers now all but stop looking to improve Y and spend their time tinkering with X instead. Compilers now focus even more heavily on not just X, but any instructions more closely associated with X.

In that entire (true) story, nobody gave a second thought to whether the final result of Y would have been faster overall if not for the lucky break with X. If x86 developers were actually free to choose whichever instructions they wanted, x86 decode would be much, much slower than it appears to be. This self-limitation argues that perhaps a more RISC-like ISA is inevitable.

A new ISA where everything is used would definitely have complexity, transistor, and power disadvantages vs a new ISA that didn't make that mistake.


> A couple of publicly visible engineers that do have experience are on record as saying that cell-phone-class competitive x86 was absolutely possible.

Weren't there Windows Phone devices with x86 SoC, but they weren't competitive?


Public Windows Phone 7, 8, and Windows 10 Mobile devices were all ARM. Intel cancelled further development on Atom for phones at roughly the same time Microsoft announced Windows 10 Mobile's desktop convergence feature (plug the phone into a monitor and get a desktop-like experience, I think there was a wireless option as well). It's pretty simple to speculate this would have been way more nifty on an x86 chip, because Windows 10 Mobile was more or less Windows 10, so you would have had a bigger software library, instead of only UWP applications.

Commercially made x86 Android phones exist; the most popular to my knowledge were some of the Asus ZenFone models.


It existed, but why bother?

When Atom was released in 2008, the Cortex-A9 had already been announced (a year before). The A9 was around 10-15% faster per clock and was often multicore, meaning a 1.5GHz chip was faster in all metrics than the 1.6GHz Atom.

A few articles came out in June/July of this year with analysts saying that Intel had spent over 10 billion dollars trying to break into the mobile market with no success. ARM's current R&D budget (according to Nvidia a month or so ago) is 0.5B. If they had spent that much every year since 2000, they would barely match Intel, yet their entire market cap stayed under 4B all the way until 2009. Remember, that R&D includes their high-end ARM cores, but also GPUs, midrange designs, various microcontroller designs, a realtime OS, ARM tooling, NPUs, various kernel support, etc.

If that much money can't fix up x86 to keep up with a budget a fraction of the size, I take that as proof that the ISA really does matter.


There never were any. Intel cancelled their Broxton phone SoC in 2016: https://www.anandtech.com/show/10288/intel-broxton-sofia-sma... (A major mistake on their part IMO.) There was some speculation back then that Microsoft might have been interested in x86 phones but no proof ever emerged.


You need the icache to be post-fusion..


It’s not just the I$ though, it’s the MLC, LLC, and I-TLB. And instruction density at those levels actually matters quite a lot for big binaries (my experience in this regard is with HHVM at FB, but it’s certainly not unique).


Yeah, pretty much agreeing with you. I think if they are going to rely on fusion in the ISA, they should limit it to the most minimal fusion implementation to have the widest possible chance of adoption. So assume the hardware can only fuse two instructions and design the ISA with this restriction in mind.


Intel and AMD's experience with complex transformations on the instruction stream suggests that its overall better to do it in a L0 decode cache.


I don't think the instruction sequence from the article would qualify for macro-op fusion. Berkeley looked at this for the simplest case of LEA [1]:

  // &(array[offset])
  slli rd, rs1, {1,2,3}
  add  rd, rd,  rs2
The sequence in the article uses what Intel calls the fast case but it still wouldn't qualify for Berkeley's two instruction fusion. Dunno if anyone does three instruction fusion.

As an aside, LEA is never getting added to the base RISC-V nor should it be. But I'm surprised it isn't considered for an extension.

[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...


> The idea is that high end processors will recognise these sequences

I'm not sure that is the idea, given that RISC-V is targeted at processors so low-end that they don't even implement multiply.


That's not true. RISC-V targets the full spectrum of processors (well, not 8/16-bit, so not the very low end). Some of the decisions they made, compared to say OpenRISC, were made precisely to make the ISA easier to implement in superscalar CPUs (e.g. not having branch delay slots).


Instead of fusing them, shouldn't it be possible to speculate that it will not overflow, process the check on a separate slow path and do a roll back in case it did overflow?


I'm not super familiar with compiler and processor design, but why do I want the processor to do this optimization instead of the compiler?


> Also worth noting, since this always comes up, that these things are super hard for a compiler to optimize away. JSC tries very aggressively but only succeeds a minority of the time (we have a backwards abstract interpreter based on how values are used, a forward interpreter that uses a simplified octagon domain to prove integer ranges,

RISC-V has some closely-related sharp corners in indexed address arithmetic as well. Some choices for the type of the index variable perform much worse on rv64.

Consider: an LP64 machine uses 32-bit integers for 'int' and 'unsigned', but 64-bit integers for `long`, `size_t`, `ptrdiff_t` and so on.

If you use an array index variable of type `unsigned`, then the compiler must prove that wraparound doesn't happen. That's pretty weird considering that half the point of using unsigned is to elide such proofs of correctness. If it cannot prove the absence of unsigned wraparound, then it will be forced to emit zero-extension sequences prior to using the index variable to generate the addresses.

ARMv8 side-steps the whole problem by providing indexed memory addressing modes that include the complete suite of zero and sign extension of a narrow-width index in the load or store instruction itself.

So here we have an example of a three-way system engineering choice.

  - Provide a small amount of hardware that performs the operation on-demand.
  - Provide new and inventive forms of value-range analysis in the compiler.  Despite decades of research into this problem, the world's best solutions still frequently saturate at "the entire width of the type the programmer requested".
  - Change the habits of the world's C programmers.
RISC-V chose options 2 and 3.
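A small C example of the pattern in question (a sketch, assuming an LP64 target like rv64 where `unsigned` is 32 bits):

    #include <stddef.h>

    // 32-bit unsigned index arithmetic: wraparound at UINT_MAX is
    // well-defined, so unless the compiler proves (off + i) never wraps,
    // it can't just fold the index into a 64-bit pointer increment and
    // must zero-extend the 32-bit value before each address computation.
    long sum_u(const long *a, unsigned off, unsigned n) {
        long s = 0;
        for (unsigned i = 0; i < n; i++)
            s += a[off + i];
        return s;
    }

    // size_t index: already register-width, nothing to prove or extend.
    long sum_z(const long *a, size_t off, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[off + i];
        return s;
    }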


> If you use an array index variable of type `unsigned`

This is usually why your array indexing should be done with an iterator or size_t :)


size_t is unsigned.


(jumping up the thread to try and hop over some confusion...)

The problem isn't with unsigned types generally. It's with subregister unsigned types. So, size_t and uintptr_t are fine. uint32_t, uint16_t, uint8_t (on LP64 ABIs) are pessimized and demand zero-extension instructions (or proofs that they can be safely elided) prior to causing side-effects. uint64_t on a LP128 ABI would also be problematic.

signed 32-bit int is also fine... because RISC-V specifically has a suite of arithmetic instructions that unconditionally sign-extend from bit 31. Even without those, it would still be fine because the carve-out for undefined behavior is wide enough for INT_MAX+1 to remain positive. Same thing for all of the other narrow-width signed integer types. If you increment SHRT_MAX and then use it to index memory, it's perfectly legal undefined behavior to access base + SHRT_MAX+1 instead of base + SHRT_MIN.

However, that's not legal for the unsigned types. They are all mandated to wrap in 2's complement. base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.


> base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

Ironically, given the topic, that's actually not true, because `base + UINT_MAX+1` is `(base + UINT_MAX)+1` (with a pointer, not an unsigned int, as the temporary value). That should probably be `base + (UINT_MAX+1)`.


This whole spiel is only relevant when the programmer specifies an array index in a distinct variable.


Right, but you don't need to sign extend it.


Eh? The problem with using unsigned is identical to using size_t. size_t doesn't have any special carveouts that eg. say that wrapping is undefined behavior. It's defined as an unsigned integer type. Meaning it has unsigned integer type behaviors.


unsigned is 32 bits, which means its wraparound has to happen in the lower 32 bits even though you're working on it in a 64-bit register. If you use a full size_t then this is not necessary.


You have a typo: size_t doesn't need zero extension. RISC-V does have 32-bit signed integer arithmetic that unconditionally sign-extends the result from bit 31. So using a signed integer as a counter (bad practice IMO, but widely recommended in style guides) doesn't need anything extra at all.


Uh, wrong subthread?


Nitpick: s/LP64 machine/LP64 C implementation/

But isn't size_t (or ptrdiff_t) the preferred indexing type in C for this reason (among others)? Sometimes you of course do want wrap around modulo semantics but that's much rarer, right?


The problem is that providing extra bits for "sign-extension mode" and "read 32b or 64b" blows through the opcode space very quickly.


The RISC-V spec includes recommended code sequences to check for overflow, so that the hardware can potentially use insn fusion as an optimization. The "bad" cases you mention can be a bit clunky, but they should also be rare.


I’m aware of those sequences and it’s a myth that they will be rare. For dynamic languages they will be super common.


We know the origins of that myth by examining the papers that the RISC-V designers wrote. They got a C compiler back-end working and didn't incorporate any other languages in their benchmarking corpus.


This reminds me of Peter Drucker, "What gets measured gets improved". Conversely, if you don't measure it, it usually doesn't get improved.


It's a tough situation to be in, especially when RISC orthodoxy elevates measurements to a paramount position in the ISA design process. I don't think they were wrong to do so, either. The error was in finalizing the ISA without more backends to work with.

How much business code out there burns most of its cycles in Java/C#/Rust/Go? Having even one of those (let's face it: Java) would have gone a very long way. How much client computing spends most of its cycles in (JITted) javascript?

I understand the tradeoff that drives this omission, though. How many VC megabucks (Tens? Hundreds?) would they have had to spend commercially on a bet? Alternatively, how many grad student-years grinding through low-payoff work to prove yea/nay on the value of checked arithmetic?

Let's say that you make a hip-shot bet on some checked arithmetic support without the supporting toolchains and application code to back up the design choice. You could end up making a different set of mistakes instead and end up in the same situation.

I think the only mistake was in finalizing the ISA without any support for checked arithmetic. My belief is that doing it well will not be orthogonal to the rest of the ISA's design, and therefore is a poor candidate for an extension.


C really did screw things up by just refusing to address overflow in any form, didn't it


I don't think C screwed up. C's objective was to standardize a set of practices based on what the hardware of the day was capable of. If they had pessimized performance on one architecture because it failed to provide efficient checked overflow (or any other feature), then C would have never made it to standardization.

ISO C is therefore a three-way compromise between three communities: Software authors, compiler authors, and CPU manufacturers. You will naturally end up with some decisions that dissatisfy some members of each.


Better would be to say that the C standard committee screwed things up by refusing to address a lot of issues over the last 35 years.


Not really, they have addressed pretty much all of the issues. Now, you can argue that they favored compiler authors when doing so, but they have considered the problems and done things in response to them.


Yes. In many ways:

1) it’s pretty important to have easy syntax for math with overflow checks. Otherwise there will be bugs. Bad bugs. We have those and it sucks.

2) it's pretty important that if overflow wraps in the ISA then it wraps in the language semantics, but C just says "meh whatever" for signed overflow. That leads to even more bugs.


The HP3000 had "SPL" - Systems Programming Language - as its way of doing what C does. It had a specific syntax for checking the error codes: "if <> then <whatever>", so whatever the previous arithmetic did, you could check the machine's flags directly. Can't do that in C. https://en.wikipedia.org/wiki/Systems_Programming_Language


At the time C was designed, not all machines represented signed integers in two's complement. Therefore it was not possible to define the behavior of signed overflow. They probably should change that in 2020 ^^

GCC has intrinsics for integer math with overflow checks.
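For example (these builtins are real GCC/Clang APIs; the wrapper function is just illustrative):

    #include <stdbool.h>

    // __builtin_add_overflow returns true if the addition overflowed and
    // stores the (wrapped) result through the pointer either way.
    bool add_checked(int a, int b, int *out) {
        return !__builtin_add_overflow(a, b, out);
    }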


C++ formalized two's complement already. Formalizing power-of-two word size and 8-bit bytes might come.


That’s my suspicion too!

I don’t think these folks have seen what modern languages do.


Maybe overflow checking could be included as an ISA extension. If it is included, what is the least impactful design?

Overflow is part of the result, so maybe add extra bits to each register that can be an arithmetic destination. These bits are not included in moves, but could be tested with new instructions.

Another way that avoids flags is new arithmetic instructions: add but jump on overflow. Maybe this is reduced to add and skip next instruction except for overflow, but maybe things are simplified if the only allowed next instruction is a jump, so the result is a single longer instruction.


After thinking about this some more: I think the extension instruction should work like "slt" (set on less than). So we have "sov"- set if add would overflow:

    add t2, t1, t0
    sov t3, t1, t0
    bnez t3, overflow
Why this way? "extra bits on destination registers"- this is really flags. The flags have to be preserved during interrupts, so extending the registers is not so easy (I think it just reduces to classic flags).

"add but jump on overflow" or "add and skip on no overflow"- I don't like this because you can not break it into separate operations without flags. I think you might have to add hidden flags in a real implementation.

An add followed by an sov could be fused, but that requires an expensive multi-register write. Fusion might be more likely if the sov result always goes to a fixed destination register:

    add t2, t1, t0
    sov tflags, t1, t0
    bnez tflags, overflow


If the destination of sov is specified to always be tflags, then you may as well combine add and sov into one instruction, with tflags implicit:

addsov t2, t1, t0


The best design is control bits which RISC-V doesn’t have but x86 and ARM do have.

The best design is part of the core ISA and not an extension, since overflow checking is fundamental to modern languages.


Agreed. Waterman's thesis says this about that:

"Several ISA features, including implicit condition codes and predicated moves, are onerous to implement in aggressive microarchitectures. Yet, their complexity often does not result in higher performance because their semantics were ill-conceived. For example, x86 provides a conditional load instruction, but, if the unconditional load were to cause an exception, it is implementation-defined whether the conditional version would do so. Thus, a compiler can only rarely use this instruction to perform the if-conversion optimization.

Recognizing the inefficiency of their conditional operations, Intel’s recent implementations go to some lengths to fuse comparison instructions and branch instructions into internal compare-and-branch operations."

Yeah, doing condition codes right is complicated but then that's the purpose of microarchitecture, to factor that complexity out.

RISC-V succeeds in having a minimalist to a fault design which is appropriate for a certain design point. ARMv8 and x86_64 are more useful for a broader set of designs. The burden now is on RISC-V to show that their minimalist approach is fast+efficient rather than just simple. You have to get something out of the simplicity; otherwise what you get is design debt.


Control bits as in ARM and x86 force serialization of arithmetic due to the read/write dependency every instruction has on those bits. There are some tricks but it still needs tracking. For wider superscalar or out-of-order processors this gets annoying.


Yes, the old, old way of having a single condition code register or the like (which dates back 40+ years) doesn't work well these days.

I like the Mill CPU approach, where every "register" (it doesn't have named registers actually) has the full set of status bits associated with it, and not just for overflow. Things like "not a result" (NaR), which can represent the result of a failed speculative load (because the process doesn't have permission to read from that page, for example).


I thought that compilers couldn't really use this effectively…


> I thought that compilers couldn't really use this effectively…

The status bits part in general, or the speculative load stuff?

They allegedly have all this working, privately. They haven't released any development tools or such to the public.

I've often toyed with the idea of writing an instruction-level simulator (as opposed to the RTL sim or whatever they have internally). But even sticking to the public information, I'd likely be infringing on their patents.


No. Control bits (status bits, flags, ...) get renamed just as registers get renamed.

Basically, if there's a bottleneck to x86 code, Intel has run into it, profiled for it, and generally optimized around it both in their microarchitectures and in their C compiler.


That's one of the tricks. But it doesn't solve the issue of clobbers, which Intel had to introduce new variants of ADD and MUL to solve. Named predicate registers make it all much easier for everyone.


Thanks. I knew about condition code renaming from discussion with an Intel compiler engineer. I didn't know about clobbers and I'll read up on that.


https://en.m.wikipedia.org/wiki/Intel_ADX is the solution Intel created.


ARM has separate instruction variants with and without setting of flags. Normally one uses the flag-less versions, so you don't have this problem.


I think what you’re saying is basically true but it’s a trade against density.

If you did overflow checks rarely then what you say is a very good point indeed. The key thing is just the frequency of this stuff in modern languages.


You need to either write a paper about this or produce some evidence.


I don’t need to do anything. You’re welcome not to take my advice. :-)

If you doubt that dynamic languages are using overflow checking in the way that I describe then it’s not because of lack of papers on the subject.


> JSC [...] a forward interpreter that uses a simplified octagon domain to prove integer ranges

Off-topic, but could you point me to more details on this? Someone (else?) recently mentioned octagon analysis in JSC in a HN thread. I grepped through the sources at the time but didn't find any indication that it exists. At least not under the name "octagon".


I didn’t call it octagon when I wrote it because I didn’t know that I reinvented a shitty/awesome version (shitty because it misses optimization opportunities, awesome because by missing them it converges super fast while still nailing the hard bits).

Look at DFGIntegerRangeOptimizationPhase.cpp


I wonder how this interacts with branch prediction. Since overflows should happen very rarely I guess the branch on overflow should almost always predict as non taken. So wouldn't it be possible to have a "branch if add would overflow" instruction or even canonical sequence that a higher end CPU can completely speculate around and just use speculation rollback if it overflows?

I think an important design point here is that the languages that need a lot of dynamic overflow checks are primarily used on beefier CPUs so if you can get around the code size issue, making it performant only on more capable designs is fine since the overflow check will be rare on simpler CPUs.


I don’t think that beefier cpu and overflow checks are that related. I mean, you’re right, I just want to place some limits on how right you are.

1. Folks totally run JS and other crazy on small CPUs.

2. Other safe languages (rust and swift I think?) also use overflow checks. It’s probably a good thing if those languages get used more on small cpus.

3. The C code that normally runs on small cpus is hella vulnerable today and probably for a long time to come. Compiling with sanitizer flags that turn on overflow checks is a valuable (and oft requested) mitigation. So there's a future where most arithmetic is checked on all cpus and with all languages.

And yeah, it’s true that the overflow check is well predicted. And yeah, it’s true that what arm and x86 do here isn’t the best thing ever, just better than risc-v.


Interestingly it seems rust only does full overflow checking in debug builds: https://huonw.github.io/blog/2016/04/myths-and-legends-about...


By default yes, but you can enable overflow checking in release mode (it’s a conf / compiler flag), and it has standard functions for checked, wrapping, and saturating ops.


Yeah, I know about 1 - e.g. also MicroPython, though I have no idea if that's used outside DIY. I agree about Rust, but I would think that with much stronger type safety and static compilation it should be able to remove a lot more of the overflow checks, and most that remain would be needed in correct C too. At least that's what I learned from my compilers prof, who worked on Ada compilers for many years, and that should be quite similar. But maybe that's my biased hope, as I really really hate working with dynamic languages.


The current world record holder (in the published literature) for branch prediction is TAGE and its derivatives. The G stands for Geometric. It is composed of a family of global predictors that increase in length with a geometric progression. That's somewhat relieving since it means that the storage growth is not unlike that of mipmapping in computer graphics. A small constant k times maximum history length N.

But to a first approximation, if you double the density of conditional branches in the program, then you will need to roughly double the size of the branch prediction tables to get the same performance, even if all of them are correctly predicted 100% of the time.


RISC-V spec is not yet finished.

Currently only the most basic extensions are available. But nothing prevents RISC-V from introducing, in the future, an extension that adds condition codes or an extension for integer/float overflow.


I’d be curious to see the instruction sequences for handling overflow without condition codes. I’m not even sure I see how to do it as efficiently as 3 or 5 instructions :-/


One example of 3 is branching on 32-bit add overflow on a 64-bit cpu where you do a 32-bit add, a 64-bit add, and compare/branch on the result.
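Roughly, in C (a sketch of that trick; on rv64, addw produces the wrapped 32-bit result sign-extended while add produces the true 64-bit sum, and they differ exactly when the 32-bit add overflowed):

    #include <stdbool.h>
    #include <stdint.h>

    bool add32_overflows(int32_t a, int32_t b) {
        int64_t wide   = (int64_t)a + (int64_t)b;               // add  (64-bit)
        int32_t narrow = (int32_t)((uint32_t)a + (uint32_t)b);  // addw (wrapped)
        return (int64_t)narrow != wide;                         // bne
    }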


I enjoyed reading this a lot, I keep seeing RISC-V being touted as a potential replacement for ARM but I had yet to read a good critique of the ISA by people who know what they're talking about.

This point I didn't quite understand:

>Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.

Most successful ISAs (including ARM) have their share of extensions, coprocessors, optional opcodes etc... ARM has the various Thumb encodings, Jazelle, VFP, NEON and more. Toolchains and embedded developers are used to dealing with optional features of computers, I'm not sure why RISC-V would fare worse here.

Beyond that I notice that many of the ascribed weaknesses are shared with other RISC ISAs like MIPS (but not ARM):

- No condition codes

- Less powerful, simpler instructions that require more opcodes to do the same thing but can potentially run faster.

- No MOV instruction

- The "unconstrained extensibility" is arguably a thing on MIPS too, with the four coprocessors that can be used to implement all sorts of custom logic.

Of course ARM has been more successful than MIPS, so maybe it's a sign that those things are indeed bad ideas, but given that this comes from an ARM dev I wonder if part of it is not just "that's not how ARM does it".

On the other hand I must say that I was surprised that RISC-V made multiplication optional; in this day and age it seems like such a useful instruction that it's well worth the die area. Optional DIV I can understand, but an ISA without MUL? That's rough, even for small microcontroller-type scenarios.


> ARM has the various Thumb encodings, Jazelle, VFP, NEON and more.

Having done just a tiny bit of compiler development for ARM, I can assure you that having all of these variants is a pain. Making compiler writers' lives harder means you're less likely to get optimal performance. At least on the more exotic variants, but possibly even on the most common ones.


>Having done just a tiny bit of compiler development for ARM, I can assure you that having all of these variants is a pain. Making compiler writers' lives harder means you're less likely to get optimal performance. At least on the more exotic variants, but possibly even on the most common ones.

I can empathize, but isn't that just part of the job of making a compiler? Any successful, long-lived ISA is going to have extensions and revisions that will need to be handled in the toolchain. I guess my point is not so much that it isn't painful, it's more that I don't really see what makes RISC-V really different besides the fact that it's a younger ISA and therefore we don't already know for sure which extensions are going to become de-facto standard and which ones will be less common.

>I believe the author doesn't identify as a "guy".

Arg, of course the one time I don't use gender-neutral language I manage to mess it up. Edited, thanks.


As a compiler pro, I view availability of better instructions to select as an asset rather than as a liability. Sure it’s more work for me and my team but if it makes shit fast then who cares how much work it was.

One of the best lessons I got when I was being inducted into the compiler club was: compilers are hard. It’s a hard job so other people can have easier jobs. It’s ok if compilers turn complex and managing that complexity is just something you have to learn to do. I don’t think it’s true that the need for that complexity leads to lower perf.


> I believe the author doesn't identify as a "guy".

I don't think this was meant as an assumption about the author's gender. The same way I wouldn't assume that there is physical, actual pain involved when you said "having all of these variants is a pain", even though you literally wrote it.


All this Thumb etc. stuff is not relevant to the 64-bit world though. AArch64 is the least fragmented of the big ISAs, with a solid list of base functionality — e.g. NEON is guaranteed to exist on everything.


That's mostly because it's new and hasn't had time to fragment. NEON everywhere is great, but eventually there's going to be a NEON 2 that obviously will only exist on newer chips. People seem to generally regard thumb encodings as a mistake so we probably won't get a repeat of that, but I'd be shocked if similar divergences don't develop over the years.


> but eventually there's going to be a NEON 2 that obviously will only exist on newer chips.

Isn't it SVE? So far, it has been only implemented AFAIK by the A64FX, which is used by the current #1 supercomputer in the TOP500 list, but it wouldn't surprise me if we start seeing it on newer 64-bit ARM chips.


SVE2 is guaranteed in ARMv9.


Neoverse V1 has SVE (with 2x256b units) and Neoverse N2 has SVE2 (w/ 2x128b units).

Note that SVE is a superset of Neon and SVE2 is a superset of SVE.


ARMv8 is 8 years old at this point.

RISC-V's 2.2 (final, stable) ISA came out in 2017 and it's already fragmented.


Thumb is very useful in things like microwaves where using less memory saves the manufacturer a little money.

You do have different chips with different instructions but in a very regularized way. An ARM v8.2-A chip is going to have the same instructions whether it's made by ARM, Samsung, or Apple. And when v9 comes out they'll have NEON's SVE replacement everywhere and you'll be able to use the same code regardless of whether the SIMD width is 128 bits or 512.


Is that SVE on ARM v9 confirmed?

I have yet to see any concrete details on ARM v9; generally speaking ARM v8 is pretty damn well designed, so I am wondering what v9 will look like.


It’s generally expected that SVE2 will be required for ARMv9. SVE2 is a more logical successor to NEON than SVE.

https://community.arm.com/developer/ip-products/processors/b...


NEON is guaranteed to exist on everything, and this means you're never going to see AArch64 replace the Cortex-M0 and M3.

That's fragmentation right there. Severe fragmentation. Two completely incompatible ISAs.

Small 32-bit RISC-V comes in smaller and lower power than an M0, and small 64-bit RISC-V is not much bigger than an M0 and is rather popular for controlling something in the corner of a larger 64-bit SoC.


I don't understand why it is a pain. Are you saying it makes choosing what encoding/extension to use more difficult? I would only see this as a pain if you had to mix/match these extensions and one isn't guaranteed to exist when another does, making you have to write several versions for every mix and match.

But I kind of assumed most ops perform well enough, and important optimizations have a test case and will get done.


Example: the Commodore 128 (or the Plus/4 works too). The C128 had a more powerful CPU capable of 16-bit work, access to more RAM, faster disk access, etc. But in general it was rarely targeted. Why? Because developers were looking for the widest degree of compatibility, and that meant restricting themselves to the minimal possible subset of compatibility between the C128 and the C64. This meant that by and large the 128KB of RAM went unused, as most applications ran in C64 mode.

The same applies here: the more extensions and things you pile on... the less likely they are to get used unless they are de-facto mandatory. Even today you can see games getting released that won't touch AVX instructions on both Intel and AMD because of compatibility reasons.

Valve provides data to developers on penetration of various ISA extensions via the hardware survey ( https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw... ) but for RISC-V there is no way to do the same. So most utility writers will be highly constrained in which extensions they can expect to use, and that will have significant harms in terms of performance. Alternatively it requires recompiling for every single different target, which is also likely.

In essence: you need a good baseline of compatibility for people to expect to use. It makes moving software easier between targets. A piece of software might certify, for example, on RV64GC but not on RV32IF, because the double-precision emulation might not work as expected or the lack of carry could be an issue, etc.


For embedded applications, which are RISC-V's bread and butter, the hardware and software are designed hand in glove. E.g. if you are designing a custom RISC-V chip for a media encoding system you will want extensions for that application, and will make use of those in your development toolchain. The software for your hardware, or at least the performance- and functionality-critical part, is often only targeted at your hardware, not any arbitrary RISC-V system.


This is just a tooling problem that can be solved trivially by having release builds build multiple binaries by default, for all the major extension profiles.


You completely missed the point: multiple release builds don't actually fix the problem of emulation not having the same behavior or timing. That has a big impact on certifying software as fit for purpose, and could completely torpedo an entire project where a dev team works on a GC profile (because actual hardware doesn't exist yet) but is supposed to release on an I profile. There is no substitute for actual hardware.


The point on unconstrained extensibility I think is largely around the fact that anyone can add extensions, whereas with ARM, the company retains control.

There are advantages to the RISC-V approach, but it is likely to lead to more fragmentation - and worse, it gives a major implementation the ability to add proprietary extensions that are not licensed to anyone else, putting smaller players at a disadvantage and leading to fragmentation not only in the hardware but also in the software ecosystems.

Whilst you may not like ARM having control, at least everyone (for a fee) has full access to the ISA and implementations.


The extensibility that leads to the danger of fragmentation for general purpose computing is a great advantage for embedded computing, where your software or firmware is targeting one particular piece of hardware and doesn't have to be compatible with anyone else. Western Digital is free to put in the mix instructions they need for their hard drive controllers, and Nvidia is free to put in the instructions they need to control their video cards, and the incompatibility between them just doesn't matter.


Which is great and fine as no-one else will be writing software for that particular hardware. I believe that ARM has been allowing some extensions for their M series designs in these circumstances - partly due to pressure from RISC-V alternatives.

I should add that I think it's possible that the Nvidia/ARM combination will remove ARM's 'level playing field' and we might see Nvidia-only extensions for their designs - which would not be good. We'll have to see.


> I had yet to read a good critique of the ISA by people who know what they're talking about.

I still wonder about RISC-V. To me, it seems pointless. But a lot of companies are buying into it so I'm wrong

Why would you ever want a standard ISA? If you're buying chips you either want a cheap standard one or a powerful efficient one. To be efficient (or cheap) you'd want to only support what's required and what works best with the implementation.

I don't really understand the point of a generic ISA. Why not have some kind of bytecode or standard format (like llvm-ir) that gets optimized for the CPU and gets a native binary that doesn't need interpretation.

Like how the f* is it easier to make something regular+generic fast rather than something custom for your hardware/chip/cpu fast?

Do you want to know how many times I used XML when it's not required? 0. Do you know how many times I used SQLite or my own binary file? I lost count. SQLite has far more constraints than XML, and custom binary files/formats aren't hard after you've done them a few times.


Betting on smart compilers without putting in the effort to build them is how the Itanic happened.


Damn dude. I never thought of that.

So are you here right now declaring that RISC-V and all those companies are in the wrong and risc-v will be a disaster?

Cause I might agree and be with you on that lol

-Edit- I have no idea what the state of the compiler is

https://github.com/riscv/riscv-gnu-toolchain

> Warning: git clone takes around 6.65 GB of disk and download size

WTF?

If you're going that complex then... wtf?


Not at all, RISC-V doesn't need extremely clever compilers, and instead it's designed to maximize what the microarchitectures (hardware implementations) can do, and reduce unnecessary overhead.

My reply was mostly aimed at the idea that you can move the complexity from hardware into compilers: it might be possible, but we know how to build out-of-order CPU better than we know how to build smart compilers, so you have to invest a lot more research time, and it's generally a lower priority.

Even innovations from the past decade or two, like VSDG, haven't made their way into "industrial" compilers yet.

As for the size thing:

GCC and Clang are huge (at least when you include their entire change history, which git does); RISC-V is comparatively only a tiny part of them. You should probably look into that further before jumping to conclusions.

You don't even need a whole separate toolchain with Clang or Rust, the whole "need to build GCC yourself to cross-compile" is outdated GNU tradition, not some kind of technical necessity.


I've never really been a fan of this take. It seems to rest a lot on the ideas that:

1) Fusion is hard. While it can be hard (fusing x86 will be), I do not see why it would be meaningfully difficult for RISC-V hardware (except on devices so small it's better to have the simpler base ISA anyway), or compilers, who can in the worst case just treat fused pairs as their own instructions.

2) There's something wrong with most software just assuming a fairly fixed set of extensions, as seems to have happened. If microcontrollers want to use a subset without multipliers, that doesn't mean anyone else has to care. If bitmanip is stabilized before RISC-V breaks into more common consumer use, why not assume it when writing code? It's only a problem if people make it one.

Most of the rest don't matter much in a global sense. The arguments about which operations go in which extensions might have meaningful merit, but it seems not very important to me.


Fusion is hard. At least, unconstrained fusion is hard. If you have a couple of instruction pairs to fuse together, that's fine, but fusing any possible compiler output together is going to make decode even more complicated in practice.

This is why Intel publishes software optimization guides that go over what their fusions are, but it doesn't seem like the RISC-V spec is going to do that yet for many cases. And compiler authors need to know which instruction stream to generate to ensure fused execution.


Mostly the concern around the lack of instructions in RISC-V revolves around a few well-known cases (eg. indexed loads) where the instructions to fuse are pretty canonical.

There is always room for creativity, but that would be the same with or without indexed loads in the base instruction set. Any non-monopolistic hardware ecosystem has this problem; we've been able to ignore it largely on x86 since Intel had had a performance monopoly for so long, but once you have multiple competing core implementations compilers will have to worry about the edge-case performance differences.

What I'm talking about is more specific to groups of instructions that are safe to treat as fused by default. Note that even if the compiler outputs a pair of instructions but the hardware running the code doesn't fuse it, out-of-order execution means the penalty will generally be extremely small versus the best unfused instruction schedule.

RISC-V does give guidelines on which instructions are good fusion candidates. See for example section 2.13 in the bitmanip extension document.

Hardware, naturally, just has a fixed set of fusions it does.
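
To make the canonical case concrete, here's a rough sketch (mine, not from the spec) of the indexed-load pattern that keeps coming up, in C with the typical generated code in the comments:

    // Indexed 64-bit load: the classic fusion candidate.
    long get(long *p, long i) {
        return p[i];
        // RV64 base ISA, roughly (the well-known slli+add+ld fusion candidate):
        //     slli a1, a1, 3        # scale the index by 8
        //     add  a1, a0, a1       # base + scaled index
        //     ld   a0, 0(a1)        # load
        // x86-64 folds the whole thing into one instruction:
        //     mov  rax, [rdi + rsi*8]
    }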


Fusion is not just hard, it's an opportunity cost. Let's say you have the budget to implement the top 10 most important fusions. On other ISAs you'd use that budget on things that aren't about condition codes or array access.


Opportunity cost in what sense? If the ISA is simple and regular, then the silicon costs should be minuscule, and the engineering costs not meaningfully larger than if those instructions were separate unfused instructions with separate encodings.


For a 5-instruction massacre that uses tmp registers along the way, I guarantee you it won’t be easy.


If I thought 5 instruction fusions were necessary, I would not be a fan of fusion either.


The checklist for what needs to be fast is every combo of:

- add or sub or mul

- 32 bit or 64 bit

- signed or unsigned.

I don’t remember which adventure you need to pick to get 5.

Source: I had to make most of these fast to make JSC competitive.

Edit: I said all, should have said most. Unsigned is less important for JS.
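
If anyone wants to see what this looks like in practice, here's a simplified sketch (my own, not JSC's actual code) of a checked signed 64-bit add and the kind of sequences each ISA ends up with:

    #include <stdbool.h>

    // Overflow-checked signed add, the way a safe language or JIT
    // checks every arithmetic op by default.
    bool checked_add(long a, long b, long *out) {
        return __builtin_add_overflow(a, b, out);
        // x86-64 / AArch64: roughly add + branch on the overflow flag (2 insns).
        // RV64 has no flags; one common general-case sequence is:
        //     add  t0, a0, a1      # sum
        //     slti t1, a1, 0       # b < 0 ?
        //     slt  t2, t0, a0      # sum < a ?
        //     bne  t1, t2, ovf     # the two tests disagree => overflow
    }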


Some discussion back when it was written in 2019: https://news.ycombinator.com/item?id=20541144


The worst issues, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and the lack of bitwise rotation instructions. The lack of shift-and-add instructions, or equivalently of addressing with shifted indexes, is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is typical in C++ or Rust.

The ugly parts are indeed all ugly, though they have now added hint instructions.


Have decent speedups been gotten by previous CPUs by the addition of conditional moves? IIRC for some the SPECcpu impact was negligible, and many RISCs don't have it. RISC is about quantifying this kind of thing and skipping marginal additions, after all.


> Have decent speedups been gotten by previous CPUs by the addition of conditional moves?

This is not a direct answer to your question, but: I recently had to tune the conditional move generation heuristics in the GraalVM Enterprise Edition compiler. My experience has been that you can absolutely get decent speedups of 10-20% or more with a few well-placed conditional moves. The cases where this matters are rare, but they do occur in some real-world software, where sticking a conditional move in some very hot place will have such an impact on the entire application. Conversely, you can get slowdowns of the same magnitude with badly placed conditional moves.

It's a difficult trade-off, since most branches are fairly predictable, and good branch prediction and speculative execution can very often beat a conditional move.
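
As a toy illustration (not GraalVM code) of the kind of placement decision involved: the two loops below compute the same thing, but the second is much more likely to become setcc/cmov or csel, which wins when the comparison is unpredictable and can lose to a well-predicted branch:

    // Branchy form: compilers will usually emit a conditional branch here.
    long count_less_branchy(const long *a, long n, long key) {
        long c = 0;
        for (long i = 0; i < n; i++)
            if (a[i] < key)
                c++;
        return c;
    }

    // Branchless form: the comparison feeds arithmetic, so compilers tend to
    // emit setcc/cmov (x86) or cset/csel (AArch64) instead of a branch.
    long count_less_branchless(const long *a, long n, long key) {
        long c = 0;
        for (long i = 0; i < n; i++)
            c += (a[i] < key);
        return c;
    }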


I'm not sure about this "RISC way" stuff. From a uarch standpoint the RISC vs CISC distinction is moot and from an ISA standpoint the only real quantifiable difference seems to be being a load-store architecture.

ISAs without conditional moves tend to have predicated instructions, which are functionally the same thing. I'm not actually aware of any traditionally RISC architectures that have neither conditional moves nor predicated instructions. While ARMv8's AArch64 removed predicated instructions as a general feature, it gained a few "conditional data processing" instructions (e.g. CSEL is basically cmov), so clearly at least ARM thinks there's a benefit even with modern branch predictors.

Conditional instructions are really, really handy when you need them. It's an escape hatch for when you have an unbiased branch and need to turn control flow into data flow.


We were talking ISAs so let's focus on that.

The quantifiability comes from measuring results when you give compilers new instructions, vs paying implementation complexity (time, money and future baggage to support the insn forever). The upsides and downsides here come in different units so it's still tricky.

Lots of instructions can be proposed with impressive qualitative speeches convincing you how dandy they are, but in the end it's down to the real world speedup yield vs the price you pay in complexity and resulting second order effects.

(In rarer cases the instructions might be added not for performance reasons but to ease complexity and cost, that's where qualitative arguments still have a place when arguing for adding instructions).

It's fine if we don't have the evidence in this thread - I was just asking on the off chance that someone can point to a reference.


It's not like someone is proposing some crazy new instruction to do vector math on binary coded decimals while also calculating CRC32 values as a byproduct. It's conditional move. Every ISA I can think of has that.


This prompted me to look through some RISC ISAs (+x86); there may be errors since I only made a cursory pass.

Seems the following have conditional moves: MIPS since MIPS IV, Alpha, x86 since the Pentium Pro, SPARC since SPARCv9

The following seem to omit conditional moves: AVR, PowerPC, Hitachi SH, MIPS I-III, x86 up to Pentium, SPARC up to SPARCv8, ARM, PA-RISC (?)

PA-RISC, PowerPC, and ARM at least do a lot of predication and make a high investment in conditional operations (by way of dedicating a lot of bits in the instruction layout to it), but they also end up using it a lot more often than conditional move tends to be used.


ARMv7's Thumb-2 has general predication of hammocks via "if-then", and ARM itself had general predication. ARMv8 has conditional select, which is quite a bit richer than conditional move. POWER has "isel". Seeing an ISA evolve a conditional move later in life is pretty strong evidence that it was useful enough to include. So I would modify your list to be:

ISAs that evolved conditional move:

  - MIPS
  - SPARC
  - x86
  - POWER (isel)
ISAs that started life with it:

  - ARM (via general predication)
  - Alpha
  - IA64 (via general predication)


Good list.

Observation on the list of ISAs that evolved conditional move vs. ISAs that omit it: MIPS, POWER, x86, and SPARC all targeted high-power "fat core" applications at the point where it got added. AVR, Hitachi SH, and PowerPC didn't add it, being driven more by low-power / embedded applications. And many ISAs continued to see wide use in their pre-cmov versions in the embedded space (e.g. MIPS) after the additions. (PowerPC even removed it when being modeled after POWER.)


To be clear for anyone not so up-to-speed on this: what AArch64 has (conditional select) is strictly less expressive than AArch32 (general predication).

The takeaway there is that general predication was found to be overly complex, when the vast (vast!) majority of the benefit can be modelled with conditional select.


It's less than general predication, but a little bit more than cmov/csel. The second argument can optionally be incremented, complemented, or negated. Combined with the dedicated zero register, you can do all sorts of interesting things to turn condition-generating instructions into data. A few interesting ones include:

   y = cond ? 0 : -1;
   y = cond ? x : -x;
   x = cond ? 0 : x+1;  //< look ma, circular addressing!
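
Roughly how those map onto the AArch64 conditional-select family (a sketch; the actual condition code depends on the preceding compare):

    cmp   x2, #0              // say cond = (x2 != 0)
    csinv x0, xzr, xzr, ne    // y = cond ? 0 : -1   (selects 0 or ~0)
    csneg x0, x1, x1, ne      // y = cond ? x : -x
    csinc x0, xzr, x1, ne     // x = cond ? 0 : x+1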


Yes. There are cases where cmov is a killer beast and for example it makes your browser faster.

JSC goes to great efforts to select it in certain cases where it’s a statistically significant overall speed up. I think the place where it’s the most effective for us is converting hole-or-undefined to undefined on array load. Even on x86 where cmov is hella weird (two operands, no immediates) it ends up being a big win.


You get 2x speedup on Quicksort and all related algorithms using CMOV instructions, so: yes.

https://cantrip.org/sortfast.html
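
For anyone curious what that looks like: a branch-free Lomuto-style partition step (a generic sketch, not the article's code) where the swap is unconditional and only the index update depends on the comparison, which compilers readily turn into setcc/cmov:

    #include <stddef.h>

    // Partition a[0..n) around pivot: elements < pivot end up in a[0..i).
    // No data-dependent branch in the loop body, so random input doesn't
    // trash the branch predictor the way the naive version does.
    size_t partition(int *a, size_t n, int pivot) {
        size_t i = 0;
        for (size_t j = 0; j < n; j++) {
            int x = a[j];
            a[j] = a[i];        // unconditional swap...
            a[i] = x;
            i += (x < pivot);   // ...only the index advance is conditional
        }
        return i;
    }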


Yeah, IDK about the RISC ISAs–they seem to be designed around being architecturally simple (and I guess easy to teach?) but they really don't seem to map back to actual code at all, nor do they seem particularly grounded in hardware design either. (Or sometimes they're too close to the hardware and burn themselves…)


Could it be because of some patents that made it impossible to do it properly?


No. They just don't share those opinions. RISC-V is designed to go from the smallest possible core to high-performance compute. Bit instructions will be in the 'B' extension.


Bitwise rotation instructions date back to at least the PDP-8. Even if they were patentable, the patents would have expired long ago.


> I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction...

Didn't check, but I suspect that decodes into at least two micro-instructions.


Not only is it a single uop for the last 10 years of Intel chips, you can also run 2 of them per cycle.


I'm assuming a is a constant in your example and that you're doing a(b, b, 8), i.e. AT&T syntax for a + b + b*8. That's one cycle on modern Intels I believe (I think the manual promised this for Nehalem). OP also alludes to this fact when talking about fusion.


Can you not add these instructions? At least when you use FPGA IP, it can be done. But you would have to update the toolchain etc to support these new instructions.


They have added them, it's just they're in bitmanip, which isn't finalized, nor is the extension mandatory.


I thought the scale factor is either 1,2,4 or 8?


You can combine them. For example, [rax+rax*8+1] (base register, plus the same register scaled by 8, plus a constant displacement).


Isn't the scale factor encoded in just two bits? (i.e. 00=1, 01=2, 10=4, 11=8)


Just edited with an example; I’m on my iPhone so assembling via nasm takes an extra minute ;)


Thanks for the example. I was assuming that a and b are variables in OP's posting.


True that. In my case it was 12+x*9, which gives log2 of the page sizes on x86 (4K, 2M, 1G) for x = 0, 1, 2.
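
That fits in one instruction only because 9*x is x + x*8, so the whole expression matches the base + index*scale + displacement form; a sketch (function name mine):

    long page_shift(long x) {
        return 12 + x * 9;   // x = 0,1,2 -> 12, 21, 30 (4K, 2M, 1G)
        // gcc/clang typically emit a single lea:
        //     lea rax, [rdi + rdi*8 + 12]
    }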


Having taken a look at the RISC-V ISA spec, I'm wondering whether they crippled LL/SC (LR/SC in RISC-V).

Basically:

- LL/SC can prevent ABA if the ABA-prone part is in between an LL and an SC instruction

- To have an ABA-prone problem you need some state implicitly dependent on the atomic state but not encoded in it. Normally (always?) the atomic state is a pointer and we depend on some state behind the pointer not changing in the context of an ABA situation (roughly: switch out ptr, change ptr target, switch back in ptr, though often more complex to prevent race conditions).

This means that, in all situations I'm aware of, LL/SC only prevents the ABA problem if you can do at least one atomic (relaxed-ordering) load that "somehow" depends on the LL load (LL loads a pointer, you load through it at some offset, or similar).

But the RISC-V spec not only doesn't guarantee forward progress in these cases (which I guess is fine) but goes as far as explicitly stating that a guaranteed lack of forward progress is OK, e.g. doing any load between the load-reserved and the store-conditional is allowed to make the store-conditional fail.

Doesn't that mean that if you target RISC-V you will not benefit from the LL/SC-based ABA workaround, and instead it's just a slightly more flexible and potentially faster compare-exchange which can spuriously fail?

The spec says you are supposed to detect whether it works and potentially switch implementations. But how can you do that reasonably if it means you have to switch to fundamentally different data structures, which isn't something easily and reasonably done at runtime?

Or am I missing something fundamental?
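
To be concrete, the pattern I mean is something like this (a sketch, not from the spec): popping the head of an intrusive list, where the next pointer must be read under the reservation. As I read it, that ld between lr.d and sc.d is exactly what disqualifies the loop from the spec's constrained-LR/SC forward-progress guarantee:

    retry:
        lr.d   a1, (a0)        # a1 = head, reservation on the head word
        beqz   a1, empty       # list empty
        ld     a2, 0(a1)       # a2 = head->next  <- dependent load inside LR/SC
        sc.d   a3, a2, (a0)    # try head = next; a3 == 0 on success
        bnez   a3, retry       # reservation lost; may keep failing forever
    empty: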


The use of LL/SC for atomics is a common mistake. It makes replay debuggers like rr impossible to implement.



I am surprised that the link you posted works…


Why?


Presumably because the title of the page contains a slash, which isn't escaped in the URL. Some web servers might have interpreted it as a path separator.


Wikipedia is not made up of a series of flat files that have the same paths as you see in the URL. The URL layout is controlled by MediaWiki and it's free to do whatever it wants with the slashes. After rewriting, the URL looks like:

  https://en.wikipedia.org/w/index.php?title=Load-link%2Fstore-conditional
You can actually visit that if you like.


It has a slash in it; I would have expected that to need escaping.


Slashes are not special in URLs (unlike #, & and semicolon). It's only servers that parse them as path separators.


See section 3.3 "Path" in RFC 3986: https://tools.ietf.org/html/rfc3986#section-3.3

> A path consists of a sequence of path segments separated by a slash ("/") character


It continues like this: "Use of the slash character to indicate hierarchy is only required when a URI will be used as the context for relative references [...] The path segments "." and ".." [...] are intended for use at the beginning of a relative-path reference (Section 4.2) to indicate relative position within the hierarchical tree of names. [...] Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax."

As long as Wikipedia doesn't use relative paths it is not a problem to have slashes in the URL.


RFC 3986 says one thing, but what the HTTP protocol allows servers to do and what they actually implement is another thing.


The HTTP protocol RFC pretty explicitly says that the transmitted Request-URI is a "path" and refers to the URI RFC to define "path". There is really no ambiguity here: HTTP as a protocol expects resources to be organized as a hierarchy and accessed via a path, which is delimited by "/" characters. As such, the protocol definitely reserves and assigns special meaning to "/" in paths.

Of course servers are perfectly free to do whatever they want, there is no HTTP police to stop them.


Wikipedia doesn't.


LL/SC is superior to CAS in that a modification to the memory will be detected even though the value has since been set back to the original value. This avoids the ABA problem.


No. CAS is superior to LL/SC. There is no possible undetected modification to the memory. That's how atomic operations work. That's the whole point of an atomic operation. It's atomic.

Botching the code can be done with either mechanism. Don't do that.


This only avoids the ABA problem if the code which is ABA-prone runs in between the LL and SC instructions.

But this is where the problem starts. E.g. for RISC-V, when using LR/SC in a way which prevents ABA you always lose all forward-progress guarantees, and it's totally valid for an implementation to be built in a way which will just never complete in such cases...


Looks like someone read the Wikipedia link posted in a sibling comment ;)


Well, actually, I discovered this when I was playing with implementing a lock free queue in shared memory. I was using singly linked lists; one for the queue and one for the free list. A node would sometimes come back to the queue's head from the free list and mess things up. It's not surprising that this is well known, but I learnt it by doing. :)
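
For readers who haven't run into it, here's a minimal sketch of that failure with a CAS-based pop (hypothetical code; a stack rather than a queue, but the shape of the bug is the same):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; };

    // Buggy CAS-based pop, shown to illustrate the ABA hazard.
    struct node *pop(_Atomic(struct node *) *head) {
        struct node *old, *next;
        do {
            old = atomic_load(head);
            if (!old)
                return NULL;
            next = old->next;   // read the link before the CAS
            // ABA: another thread pops `old`, recycles it through the free
            // list, and pushes it back. `*head == old` again, but the `next`
            // we read is stale; the CAS below still succeeds and corrupts
            // the list. LL/SC would notice the intervening stores to `head`;
            // a plain compare-and-swap cannot.
        } while (!atomic_compare_exchange_weak(head, &old, next));
        return old;
    }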


Yeah, that's fair, just poking fun at Wikipedia having the exact same thing paraphrased slightly differently ;)


Last I heard, adding a hardware counter for failed SCs may help work around this on ARM–presumably RISC-V could do the same thing here?


Not necessarily; it would need to allow trapping on failures.


Sorry in advance, this is going to be a little off topic: while I agree with the technical points in the article and the comments here about RISC-V deficiencies, I hope for a world where Free Software licenses and open hardware rule. (And I admit to some hypocrisy, since my daily drivers are the Apple closed ecosystem and Google Workspace.)

What gives me some hope is that an open hardware and Free Software world would help so many people, businesses, and governments. I think it would be a rising tide that lifted all boats except for specific tech industries.

That said, good article.


The misconception you present here seems to be widely held, so I'm actually going to upvote you. RISC-V being an open ISA just means that the standards describing the ISA are open; there is no relation to open source hardware. I.e., proprietary RISC-V cores exist, as do open source ARM cores.


These are all good points but fixing many would interfere with the ability of students to create a basic RISC-V processor in a single semester. The simplest possible in order RISC-V and the simplest possible out of order RISC-V designs are a lot easier to do than the equivalents in more common ISAs and that makes it really useful as a teaching tool.

EDIT: Also, when the B extension becomes standard that should fix some of the issues.


Good point. I think the author forgot that RISC-V was conceived as an ISA both for teaching and for giving decent performance in the real world. Seems like some things have to give in the name of simplicity.


Forgot? I think that's a fairly well-known fact about RISC-V, but a good ISA for teaching and a good ISA for the real world are not necessarily the same.


> Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)

JALR for call and return uses the same opcode, but they are two different instructions; there is no need for "extra logic" in the decode or for branch prediction.

The lack of "register + shifted" could easily be circumvented by adding an extension for "complex arithmetic instructions".

And macro-op fusion is a common technique that already exists in modern, deeply pipelined CPUs.

> Multiply and divide are part of the same extension

An extension can easily be partially supported in hardware (e.g. multiplication) with the other instructions emulated in software (e.g. division).

> No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common

But some microcontrollers do not need atomics. And if you are designing a microcontroller that does, just include the atomic extension.

Many of the criticisms made here are incorrect, and are due more to a misunderstanding of RISC-V than to RISC-V design flaws.


Every design has its flaws. Intel's instruction set is probably closer to the lower end of the spectrum than the opposite. Still, there are very performant implementations (probably at the price of increased power consumption).

Any educated opinions how bad the RISC-V problems are compared to others when looking at the big picture?


Personally I think the POWER instruction set is better in many ways. It has a proven track record of high performance and embedded implementations. A lot of ISA design is about avoiding patents and other traps. Ultimately it doesn’t really matter because most of the value is in the implementation and the software is the commoditized complement.

Truly innovative designs don't actually use RISC encoding and instead hide the details of instruction encoding. See for example NVIDIA's PTX, which gets mapped to hardware-specific instruction streams that are nothing like a traditional RISC architecture.

To me RISC-V is another example of how open source often produces lowest common denominator copy cat versions of ideas that are 30+ years old. Just like Linux. The sad thing is that this often kills innovation and locks in suboptimal designs for a long time, because it is hard to compete against something that is free.


RISC-V was not designed to be the most optimal ISA, however. It was designed to be a teachable ISA which could also be used to implement fast hardware.

The need to be easily teachable puts constraints on the design, but it is also a benefit in that many people will know this ISA and be able to make tools for it.


POWER was designed to be a compiler writer's dream and has some sharp implementation corners.

I think I would probably recycle the Alpha ISA circa 21164 (EV-5) with maybe a CAS instruction. It was pretty balanced between hardware and software and a lot of the complications in the VLSI design (dynamic logic, mostly) are moot with a modern technology if you stick with reasonable speeds.

Presumably, now that the MIPS unaligned byte access patents have expired, a whole bunch of the idiocy that Alpha had to abide by to avoid that patent can just be sidestepped.


> A lot of ISA design is about avoiding patents and other traps.

Aren't patents limited to 20 years? Can't you just ignore all inventions that are < 20 years old to be safe from patents? Then you'd still be 10 years ahead of POWER.


Power ISA had updates in 2007, 2009, 2013, 2015, and 2020.

https://en.wikipedia.org/wiki/Power_ISA#Specifications


I think anybody wanting to implement POWER nowadays means the PowerPC ISA, as implemented in modern POWER CPUs, not the original POWER ISA from 1990.


Can you summarize, or point to a discussion of, the ways in which PTX differs from a traditional RISC architecture?


I saw this when it was a Twitter thread[1] and noted then that it had some good points and some not-so-good points.

One of the things that comes up with RISC-V a lot is code density, and it is a sore spot for many CPU designers because it is shamelessly abused by marketing departments as a measure of 'goodness'. This has sort of trained these engineers to flinch when something doesn't exhibit good code density, and the author is no exception.

However, FLASH/RAM volume has gone up hugely. This is in part because once you run out of logic to lay down in a chip you flood fill the rest with RAM and/or FLASH because hey the chip has to be big enough to hold pad landings for all of its pins. This has taken a lot of pressure off the code density thing and now it seems appropriate to look at algorithmic capacity.

I agree with the author that it is a much better use of resources to put a hard multiplier on a chip than it is to have more space in flash so that you can do that with instructions. But from an algorithmic capacity standpoint? It is all Turing complete, so really, what is the practical difference?

Now I started life programming on PDP-11's that had a "native" instruction set that was not unlike RISC-V in being maximally simple. We called it "microcode" :-) And the "real" instructions were actually sequences of microcode in a microcode ROM. That was pretty cool because you could swap out floating point instructions for vector instructions or string handling instructions if you wanted.

I will not be surprised in the least if people design "custom instruction sets" that layer on top of a RISC-V core, just like layering a front-end stack on WebAssembly. And the available resources to do that are pretty plentiful.

One of the coolest architectures I got to play with was the Xerox "D" machines. They took this to an extreme and some amazing software was written for them. You could get really close to maximizing utilization. It was very interesting to load a new instruction set, recompile your Mesa code, and have it run faster with no changes to the hardware at all.

Of course DEC and Xerox didn't invent this, the IBM 360 had a big button on the front "IMPL" which was "Initial Micro Program Load" which prepped the instruction set for what ever OS you were about to start up.

And now you can build systems like this with open source tools and off the shelf FPGA dev boards. Such a great time to be interested in systems architecture.

[1] https://twitter.com/erincandescent/status/115453579942322995...


> However, FLASH/RAM volume has gone up hugely. This is in part because once you run out of logic to lay down in a chip you flood fill the rest with RAM and/or FLASH because hey the chip has to be big enough to hold pad landings for all of its pins. This has taken a lot of pressure off the code density thing and now it seems appropriate to look at algorithmic capacity.

I don't agree. I'm working on a chip right now where NVM and RAM are extremely constrained. There will probably be many millions made, so it's still relevant.

On the last chip I worked on, it was perhaps true to some degree. But that chip, even though it was a microcontroller, had a decent cache. Access to NVM is slow. And code density can have a significant impact on cache performance.


It would be nice to have some context around why these decisions were made. Are they outright mistakes? Trade-offs about the goals of the ISA? Or trade-offs that the author thinks are bad ones?


How often do you read a single array element that is not the first element? Usually you iterate over it


> How often do you read a single array element that is not the first element

Quite often in some applications, for example hash tables (a very popular data structure).


[Citation needed]

:-)


So many flaws, and without even mentioning the missing POPCOUNT.

(No, the M extension does not help.)


Considering that none of the major players in the CPU world are implementing RISC-V (nor will in the near future), this rant is the same as the "what if Hitler had developed the atomic bomb before the US" speculation that the History Channel likes to throw out sometimes. An exercise in imagination.


There are a number of serious companies using RISC-V in embedded contexts where a number of the author's criticisms don't apply.


Are you saying that RISC-V was never meant to be implemented? Or is nobody implementing it because this guy is correct and RISC-V is bad?

I mean, whatever the exact goal of RISC-V, I don't see any reason why an analysis and critique would not be relevant. Nobody invests resources into this with the intention that it gets ignored.

Also, Wikipedia has a list of RISC-V implementations.


You are so completely wrong.


When designing an ISA, I assume a basic step is figuring out which instructions to include.

For each instruction, I would guess that a 'draft' compiler is produced that can use that instruction in its code generation steps, and a few 'draft' CPU designs are made which include support for that instruction.

Then cycle accurate simulations can be done on a set of test benches to see how the addition of that instruction affects performance, power, and code size across a wide array of different usecases.

In the case of RISC-V, where some CPU extensions might be emulated, I would expect the test to also cover the performance hit of emulation of the extension for those machines without native support.

If all of that was done, and still it made sense to add the instruction, then most of these critiques aren't valid - since there will be hard data that the approach taken was the best one.

Perhaps when RISC-V was a young project, too many design decisions were made without the massive compute farm needed to do all these simulations, or before more complex multi-issue CPU designs were added to it, and therefore some decisions aren't optimal?


Academia pulled in too much.

Small ISA != Small transistor count.

People will inevitably try to throw more transistors at the ISA limitations.


This isn't true. Berkeley produced several real RISC-V ASICs and adjusted the encoding according to their findings before RISC-V got any wider attention. Also the team had a long history making real chips before RISC-V, dating back to the early 80s.


> This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns)

They are all kinds of changes in control flow; it's not crazy to have one instruction for that primitive.

He complains Risc-V needs 4 instructions to do what x86_64 and arm do in two, but... it says Risc-V. And x86_64 CISC instructions devolve to a pile of microcode anyway.

A guy invested in a powerful incumbent, using a completely different and even more established incumbent to bash the challenger over the head with: it doesn't feel like I am learning anything useful when I read it.


They got the RISC-V code slightly wrong (but the instruction count correct).

(puts on chip designer's hat) Essentially the ARM/x64 case turns the load's memory address calculation into a 4-input adder (so 2 layers of adders) and maybe a couple of extra gates because there are multiple addressing modes. RISC-V's equivalent is a 2-input adder. Those get into a critical cache (and TLB) access path, and that limits how fast your CPU's core clock can be (or forces you to split that path into 2 clocks).

Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler - getting rid of lots of addressing modes was a big part of the original RISC movement. I think all 4 of those RISC-V instructions are 16-bit ones, so they may even fit into the same space as the 2 ARM ones (haven't hacked on ARM for a while).

BTW chances are that that x86_64 mov instruction is not being devolved into more than one internal uOp (might be two if they separate off the address calculation into its own uOp).


> Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler - getting rid of lots of addressing modes was a big part of the original RISC movement.

Nobody believes this anymore, not even the RISC guys. Look at Apple chips; the advantage of a simpler instruction set is width (degree and depth of superscalar execution) and the ability to put more optimizations in hardware. Clocks are all bound by roughly the same limits nowadays.


I'd disagree - but I'm talking about the difference between a 1GHz clock and a 5GHz clock - it's always going to matter at the cutting edge - if the clocks are equal then what you're trading off is pipeline depth - which affects all sorts of stuff like the cost of mispredicted branches (and as a result the sizes and complexity of branch predictors)


> Essentially that's part of the whole RISC idea - simple means faster

The problem with this is that today, your clock speed is bound by neither your decode nor your ALUs. Only implementing weaker ALUs makes sense from an optimization standpoint if it buys you more clock speed. But as it doesn't, it just leaves you competing with another CPU that has the same clock speed as you do, and which does a lot more per clock than you do.


Turns out faster clocks don't help much if your logic isn't getting faster too. You do less work per cycle, while your cycle overhead remains fixed, resulting in less work done per unit time. If it wasn't for branch misprediction, you'd be better off with very deep pipes and slower clocks - see GPUs.


I think the point of GP is that not every instruction makes use of the more complicated (4-input) adder, so not every instruction should have to pay the associated latency cost.

But as others pointed out, this only makes sense if the ALU is on the critical path and having a two-level adder significantly impacts the latency.


Actually it's unlikely that this 4-input adder is in a general-purpose ALU; usually they're directly in the cache access path (there are games you can apply to cache and TLB accesses if the low bits of that adder come out early), or in a separate addressing ALU, which adds an extra clock to a load or store.


I assume they still count towards defining the shortest possible pipeline stage duration


It depends on how you split it up, of course, but there's another cost: if you pipeline the address calculation you effectively make the CPU's pipe longer and, more importantly, increase the cost of a mispredicted branch - which is what we're all spending our time trying to mitigate these days.


> Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler

How does this philosophy fare now that we're generally at the top end of feasible clock speeds for processors, given realistic power and cooling budgets? Intel and AMD (and I assume others, but I don't follow them as closely) have been increasing throughput by doing more per clock.

It would seem VLIW is the recipe for simple hardware doing more per clock cycle, but that hasn't had good results either.


VLIW is (IMHO, and I've built them) an architectural dead end, a bit like the original MIPS delay slots: it made sense at one point in one particular process, but it doesn't scale well over time.


x86 chips have greater top clock speeds by a significant margin. A lot of x86 CPUs run in the 4-5 GHz range, while ARM CPUs usually stick to 2-3 GHz. So the vast majority of CPU designs are not hitting their theoretical limits. They are only hitting thermal limits.

The only real benefit is that fixed length instructions are easier to decode. That's about it. If you can get away with fewer instructions then that's what you should do.


>[She] complains Risc-V needs 4 instructions to do what x86_64 and arm do in two, but... it says Risc-V.

So… what, it should take 5 instructions?

Executing more instructions for a (really) common operation doesn't mean an ISA is somehow better designed or "more RISC", it means it executes more instructions.

>And x86_64 CISC instructions devolve to a pile of microcode anyway.

Some people seem to have this impression that every x86 instruction is implemented in microcode (very, very few of them are), and even charitably interpreting that as "decodes to multiple uops" (which is completely different) is still not right. The mov in the example is 1 uop.


> Executing more instructions for a (really) common operation doesn't mean an ISA is somehow better designed or "more RISC", it means it executes more instructions.

True. But as bonzini points out (or rather, hints at) in https://news.ycombinator.com/item?id=24958644, the really common operation for array indexing is inside a counted loop, and there the compiler will optimize the address computation and not shift-and-add on every iteration.

See https://gcc.godbolt.org/z/x5Mr66 for an example:

    for (int i = 0; i < n; i++) {
        sum += p[i];
    }
compiles to a four-instruction loop on x86-64 (if you convince GCC not to unroll the loop):

    .L3:
        addsd   xmm0, QWORD PTR [rdi]
        add     rdi, 8
        cmp     rax, rdi
        jne     .L3
and also to a four-instruction loop on RISC-V:

    .L3:
        fld     fa5,0(a0)
        addi    a0,a0,8
        fadd.d  fa0,fa0,fa5
        bne     a5,a0,.L3
This isn't a complete refutation of the author's point, but it does mitigate the impact somewhat.


That's fair. It's definitely not a killer, (or even in my opinion the worst thing about RISC-V,) just another one of these random little annoyances that I'm not really sure why RISC-V doesn't include.


One common use of array indexing walks the array sequentially.

But hash tables are used here and there, also in loops.

Some people know them as "dictionaries" or "key/value stores".


The author is a woman fyi :)


According to their Twitter, the author uses they/them pronouns.

https://twitter.com/erincandescent


From a software perspective, changes in flow are identical. But from a hardware standpoint, local jumps, indirect calls and returns are all predicted differently. The spec actually has suggested forms for each kind (the author points them out in the article), so a good portion of encoding space is used on a large number of variants that will have poor performance and never be seen in practice. Especially when those bits could be used for far better purposes.


Having more instructions for common tasks also puts more pressure on your memory bandwidth and instruction cache. Execution time alone is not the only factor in performance.



