An ex-ARM engineer critiques RISC-V (gist.github.com)
350 points by ducktective on Nov 1, 2020 | 249 comments


The lack of condition codes is a big deal for anyone relying on overflow checked arithmetic, like modern safe languages that do this for all integer math by default, or dynamic languages where it’s necessary for the JIT to speculate that the dynamic “number” type (which in those languages is either like a double or like a bigint semantically) is being used as an integer.

RISC-V means three instructions instead of two in the best case. It requires five or more instead of two in bad cases. That’s extremely annoying since these code sequences will be emitted frequently if that’s how all math in the language works.

Also worth noting, since this always comes up, that these things are super hard for a compiler to optimize away. JSC tries very aggressively but only succeeds a minority of the time (we have a backwards abstract interpreter based on how values are used, a forward interpreter that uses a simplified octagon domain to prove integer ranges, and a bunch of other things - and it’s not enough). So, even with very aggressive compilers you will often emit sequences to check overflow. It’s ideal if this is just a branch on a flag because this maximizes density. Density is especially important in JITs; more density means not just better perf but better memory usage since JITed instructions use dirty memory.
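For concreteness, here's a minimal C sketch (not JSC's actual code, just the shape of the check that gets emitted around every add):

    #include <stdbool.h>
    #include <stdint.h>

    // Checked 32-bit add, as a safe language or a JIT's speculated int path
    // conceptually lowers it. On a flags ISA the test is one branch on the
    // overflow flag; without condition codes it costs extra instructions
    // and registers.
    bool checked_add32(int32_t a, int32_t b, int32_t *out) {
        int64_t wide = (int64_t)a + (int64_t)b;   // can't overflow in 64 bits
        if (wide < INT32_MIN || wide > INT32_MAX)
            return false;                         // overflow: slow path / deopt
        *out = (int32_t)wide;
        return true;
    }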


The idea is that high end processors will recognise these sequences of instructions and optimize them (something called macro-op fusion). Whether this is a good idea is an open question because we don't yet have such high performance RISC-V chips, but that's what the RISC-V designers are thinking. At the same time it permits very simple implementations which wouldn't be possible if the base instruction set contained every pet instruction that someone thought was a good idea.

Note that macro-op fusion is already widely used in other architectures, particularly ones like x86 where what the processor actually runs looks nothing like the machine code.


Two words: instruction density.

It doesn’t matter if they’re fused or not if the reduced instruction density increases memory usage and puts more pressure on I$.

Also, I don't buy the whole fusion argument, on the grounds that having to fuse super complex (5-instruction or longer) sequences adds enough complexity that you've got opportunity cost. Much better for everyone if the CPU doesn't have to do that fusion. That's the whole point of good ISA design - to prevent the need for fusing in cases where you're doing something super common.


That's what the RISC-V compressed instruction encoding is all about. There is a paper which I can't find right now about how the compressed encoding achieves something similar to x86 code size on typical application code. As I said above, the jury is still out until we get very high performance RISC-V implementations which are equivalent to existing high end x86 and aarch64 designs.

Edit: Here's the thesis about the design decisions in the C encoding: https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.p... See also the diagram on page 62 of this document: https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...


First of all: I agree with you that the jury is out. We are all speculating. I might be wrong.

The thing I see is this: you can also add compressed encodings for any ISA. And that has its own costs (it’s harder to decode and it’s harder on software that wants to do bidirectional analysis of machine code). So “my isa has shortcomings but it’s cool because compression” isn’t a perfect argument since if your isa lacks those shortcomings then you still benefit from compression and you don’t need it as much, which is better.


> it’s harder to decode and it’s harder on software that wants to do bidirectional analysis of machine code

not necessarily, since

1. the compressed versions are basically 1-1 to the non-compressed versions

2. the uncompressed versions are more for study/academia and clarity; it's expected that in real life only compressed instructions are used (for instructions that can be compressed)

more info: https://riscv.org/wp-content/uploads/2015/11/riscv-compresse...


Do any other ISAs have compressed encodings in wide use? It seems a bit like a chicken and egg problem - why build all the silicon to handle decoding it if your base ISA is dense enough?


Arguably x86 only has a compressed encoding in the sense that super common instructions often fit in fewer than 4 bytes and some of the most used ones are 1 byte. I don’t think the compression is deliberate or optimal though.


Arguably most of the 1 byte instructions are garbage, though ;)


What, no love for AAA :-?

(IIRC they removed that one on x86-64)


Sounds pretty similar to thumb mode on ARM, no?


And ARM licensed the Hitachi SuperH patents when implementing Thumb. It's not exactly a new concept.


In addition to Thumb there are apparently at least MIPS16e ASE and PowerPC VLE.


ARM had Thumb, right? Is that a deadend at this point?


Thumb was removed from ARMv8 (AArch64). If you use the 32-bit legacy mode, it may exist on some chips, but it's no longer the benefit it once was.


It's always been of the same value. You exchange performance for code density. They want aarch64 to be high-performance, so they removed thumb.

RISC-V compact instructions don't require special modes and run in fully-mixed mode with 32-bit instructions without all the penalties Thumb has (they are literally just expanded into their 32-bit counterparts internally).


> It's always been of the same value. You exchange performance for code density.

High code density became less valuable, was the point.


> having to fuse super complex (5 instruction or more) sequences

Can you give an example of someone advocating for 5 instruction fusion? Normally it's limited to three.


They have this example for general signed overflow checking in the 2.2 spec (but I'm not sure if this counts as a recommended sequence for fusion):

    add t0, t1, t2
    slti t3, t2, 0
    slt t4, t0, t1
    bne t3, t4, overflow
So two extra registers needed also..
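In C terms, that recommended sequence is computing the standard predicate below (a sketch, with a = t1, b = t2):

    #include <stdbool.h>
    #include <stdint.h>

    // Signed addition overflows iff the sign of b disagrees with whether
    // the wrapped sum ended up below a.
    bool add_overflows(int64_t a, int64_t b) {
        int64_t sum  = (int64_t)((uint64_t)a + (uint64_t)b); // add  t0, t1, t2
        bool b_neg   = b < 0;                                // slti t3, t2, 0
        bool wrapped = sum < a;                              // slt  t4, t0, t1
        return b_neg != wrapped;                             // bne  t3, t4, overflow
    }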

edit: so I now think a good extension to add overflow checking to RISC-V is with an instruction that works like "slt"- call it "sov", set if add would overflow:

    add t0, t1, t2
    sov t3, t1, t2
    bnez t3, overflow
add/sov could be fused..


Fusing that would be very difficult, since it's hard to write that many registers with a single op, and I don't think they recommend it. This does mean signed overflow checking will be comparatively expensive. But thanks for the reference, I agree this is a weak point :).


You don't have to use real registers in a single op. Fusing means interpreting a pattern in the input stream and issuing a different instruction from a richer microarchitectural instruction set


You have to preserve the architectural registers at the end of the sequence. So if there are 5 registers, you either have to have a 5-register microinstruction, or issue multiple 3-register microinstructions instead.


You have to preserve architectural registers only when some other instruction actually depends on them. When you detect a dependency you lazily compute their content (causing a stall)


The dependency could be arbitrarily many millions of instructions away.

The only way to know there isn't a dependency is if that register gets clobbered by something else very soon afterwards.

But this whole topic of "checking for signed overflow is expensive" is overblown. It's simply not that important an operation, especially in the context of those languages that do it a lot also doing a lot of memory references, which are far more expensive.

Adding arbitrary completely unknown integers is pretty rare. If you know both numbers are greater than zero then a single compare-and-branch is all you need. If one of the numbers is a constant then a single compare-and-branch is all you need.


That sounds like an amazingly bad idea, because you would instead have to retain the two source registers until you can prove that you don't need the two output registers anymore, which can be arbitrarily far in the future.


At least in their papers and mailing list discussion, the limit for simple fusion is not on the number of source operands, but on the number of destination operands. A classic example is a PC-relative long jump:

  auipc ra, .LONG_TARGET20
  jalr  ra, ra, .LONG_TARGET12
Only one register (the link register) is clobbered, so the pair can be fused into a single wide jump-and-link.

So in parent's example, sov might fuse with the following bnez, but it likely wouldn't fuse with the preceding addition.


Yes, I missed the edit, I was only commenting on the first part.

Quite a few things determine whether a fusion is doable or not. In addition to the number of destination registers, you do, to a more relaxed extent, care about source operands, but also things like ‘does this fit nicely in a single pass through the pipeline?’ and even just ‘is this materially beneficial?’

Lots of cores (but not all) can write two registers from a fused instruction, given the right conditions, and sov does rerun the addition, so add-sov fusion sounds very doable to me.


rotate-and-xor (and xor-and-rotate) are both common operations in ARX ciphers. They demand 4 macro-ops in RISC-V, but only one in ARMv8.

Bitfield insertion is only one instruction in most RISC ISAs, but 5 or more in RISC-V.
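For reference, a typical ARX step looks like this in C (a sketch; on base RV32I/RV64I the rotate alone lowers to two shifts plus an OR, so the whole step is four instructions):

    #include <stdint.h>

    // Rotate-left, written with the standard wrap-safe idiom.
    static inline uint32_t rotl32(uint32_t x, unsigned r) {
        return (x << r) | (x >> ((32 - r) & 31));
    }

    // XOR then rotate - the basic ARX building block.
    uint32_t arx_step(uint32_t a, uint32_t b) {
        return rotl32(a ^ b, 7);
    }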


The (unfinalized) bitmanip extension has single-op rotates.


I really don't like it. It seems like they copied x86 (bext, bdep) where they should have been plagiarizing armv8 (BFM, ...).


I've watched that extension's development since it was little more than one smart guy's wish list. I don't think it's fair to say that they copied any one architecture. The authors have put in a ton of time researching the tradeoffs and investigating the trade space over the years.

That said, until it gets ratified by the consortium and implemented in silicon it's still just a (well-researched) wishlist.


I’d have to reinvestigate more than I feel like doing right now so I’ll tell you where to look.

You need fast sequences for all of these variants:

- add or sub or mul

- signed or unsigned

- 32 bit or 64 bit

Some of those need 5 instructions. I don’t remember which adventure you need to pick to get 5.


Thanks for the reply but I'm not sure what you mean. Are you saying you want to add an i32 to a u64? Or am I completely misunderstanding? I'm not sure why unsigned-unsigned or signed-signed instructions would be hard. Idk about mixed sign operations, but when do they ever occur without first casting one to the other?


I mean you will do overflow checks on the following. I’ll use the “s” and “u” prefixes to mean signed and unsigned. Unsigned matters less than signed.

sadd32, sadd64, uadd32, uadd64, ssub32, ssub64, usub32, usub64, smul32, smul64, umul32, umul64


Ah, I get you now. Yes, this looks like a weak point, and I totally get why it'd screw with Javascript optimizations.

I don't think it's a fusion problem; even if you did fuse these sequences, they'd still be bad, since they'd be writing lots of extra registers.


Yeah exactly!

And worth noting that these are just the ones where you usually have good sequences on other cpus, but even then (like signed mul) they’re not perfect. Lots of room for improvement. It’s just that risc-v didn’t seize the opportunity.


> particularly ones like x86 where what the processor actually runs looks nothing like the machine code.

(Quoted from OPs comment)

This isn't a subject I'm an expert at, but wouldn't this mean there's already some sort of translation going on on the other systems? So it's mostly just added end-user work, not a giant performance loss on RISC?

It would then simply be a layer of abstraction that is lost.


Yes. x86 processors translate instructions to internal micro opcodes before scheduling or running them. Those look nothing like the x86 instructions.


What little we've seen of x86 micro-ops (the great work reverse engineering the K10 microcode) shows that the micro-ops look very much like x86 instructions: still predominantly two-address, RMW instructions, for instance. Very similar to the original 8086's microcode structure rather than some reaction to the RISC movement, despite the common trope stating the contrary.

Can't wait to see more information about the goldmont microcode work to see if that holds for intel as well as it does for AMD.


Some huge portion of the x86 power budget is just the instruction decoder. The instruction set is so nutty that trying to implement them all directly in hardware is a non-starter.

The x86 front end breaks down big instructions into smaller RISC-like micro-ops and then fuses/re-orders/optimizes/etc. and runs those instead. There are pros and cons: the con is sheer complexity and power budget; the pros are that it's an abstraction, so the microarchitecture can change completely without recompiling your code - and you get CPU-specific optimizations too. The CPU is basically emulating x86.

You could in theory build an x86 CPU with a RISC-V core behind that decoder.


> Some huge portion of the x86 power budget is just the instruction decoder.

This is a widely held meme, but the internet at large doesn't have any evidence to back it up. A couple of publicly visible engineers that do have experience are on record as saying that cell-phone-class competitive x86 was absolutely possible. Intel and AMD chose not to pursue those markets.

The expensive parts of a high-end CPU aren't normally in the instruction decode part. They are in the branch prediction, branch mispredict recovery, forwarding networks, memory re-ordering, and so on. Anything short of a dataflow ISA has little impact on those structures.


I'm sure that you're right that the instruction decoder power budget isn't a huge issue.

I don't think it's strictly accurate to say that Intel 'chose not to pursue' the mobile SoC market - IIRC they tried, made little progress, and gave up having spent a lot of money in the process.


By large I mean up to 10%, and I have a study for you. [1]

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...


I skimmed the paper. They reported that the difference in total power spent in two(!) microbenchmarks when switching between L1I and the decode cache is 10% for one, and 3% for the other. The attribution of power was done entirely using a linear regression model on some core perf counters, and the core's own internal estimate for power consumption over the whole package.

I don't think you can generalize from that result to much of anything.


I don't have a source, but back when ARM Zen was going to be a thing (hopefully it still will), AMD was claiming a 10% uplift in performance over x64.

That struck me as one of the few apples to apples comparisons ever of the instruction sets at the high end, from a party not really incentivized to bend the truth one way or another.

But it could have easily come from something like the relaxed memory model, or they could have just been overly optimistic. The chip was cut after all.


You're running into the same problem, just from a different angle.

With the exception of Atom, all recent Intel designs have been a RISC core with a CISC decoder slapped on top. Everything else being equal, the simpler decoder will create a smaller chip. Because the decoder is always running all-out, the simpler decoder will also use less power.

x86 instructions are variable-length, from 1 to 15 bytes. The cost to slice that up is always going to be bigger than for fixed-length instructions. RISC-V has variable length in theory, but in practice compressed instructions simply expand into 32-bit instructions with some bits added (which matters for the decode cache), while longer encodings can be ignored.

Because there's a maximum tolerable decode latency of a very few cycles and latency increases with cache size, decode cache size has a definite cap. x86 has a couple orders of magnitude more potential instructions than RISC-V. More instructions translate into a lower hit rate for the same size cache, barring any heuristics (more on that below).

Matching variable-length arrays to an unknown set of arrays in cache is inherently a hard problem. Every solution has tradeoffs and the resulting heuristics are bad for computing (see below). In contrast, searching for a match on a fixed 32-bit array has much simpler general solutions that don't require tradeoffs.

C and CPU designs feed off each other. Let's say there are instructions X and Y which can do equivalent things. x86 engineers played around with both and got a lucky insight into how to make X a bit faster. Compiler writers jump on it and start using the faster solution. x86 engineers now all but stop looking to improve Y and spend their time tinkering with X instead. Compilers now focus even more heavily on not just X, but any instructions more closely associated with X.

In that entire (true) story, nobody gave a second thought to whether the final result of Y would have been faster overall if not for the lucky break with X. If x86 developers were actually free to choose whichever instructions they wanted, x86 decode would be much, much slower than it appears to be. This self-limitation argues that perhaps a more RISC-like ISA is inevitable.

A new ISA where everything is used would definitely have complexity, transistor, and power disadvantages vs a new ISA that didn't make that mistake.


> A couple of publicly visible engineers that do have experience are on record as saying that cell-phone-class competitive x86 was absolutely possible.

Weren't there Windows Phone devices with x86 SoC, but they weren't competitive?


Public Windows Phone 7, 8, and Windows 10 Mobile devices were all ARM. Intel cancelled further development on Atom for phones at roughly the same time Microsoft announced Windows 10 Mobile's desktop convergence feature (plug the phone into a monitor and get a desktop-like experience, I think there was a wireless option as well). It's pretty simple to speculate this would have been way more nifty on an x86 chip, because Windows 10 Mobile was more or less Windows 10, so you would have had a bigger software library, instead of only UWP applications.

Commercially made x86 Android phones exist; the most popular to my knowledge were some of the Asus ZenFone models.


It existed, but why bother?

When Atom was released in 2008, the Cortex-A9 had already been announced (a year before). The A9 was around 10-15% faster per clock and was often multicore, meaning a 1.5GHz chip was faster in all metrics than the 1.6GHz Atom.

A few articles came out in June/July of this year with analysts saying that Intel had spent over 10 billion dollars trying to break into the mobile market with no success. ARM's current R&D budget (according to Nvidia a month or so ago) is 0.5B. If they had spent that much every year since 2000, they would barely match Intel, yet their entire market cap stayed under 4B all the way until 2009. Remember, that R&D includes their high-end ARM cores, but also GPUs, midrange designs, various microcontroller designs, a realtime OS, ARM tooling, NPUs, various kernel support, etc.

If that much money can't fix up x86 to keep up with a budget a fraction of the size, I take that as proof that the ISA really does matter.


There never were any. Intel cancelled their Broxton phone SoC in 2016: https://www.anandtech.com/show/10288/intel-broxton-sofia-sma... (A major mistake on their part IMO.) There was some speculation back then that Microsoft might have been interested in x86 phones but no proof ever emerged.


You need the icache to be post-fusion..


It’s not just the I$ though, it’s the MLC, LLC, and I-TLB. And instruction density at those levels actually matters quite a lot for big binaries (my experience in this regard is with HHVM at FB, but it’s certainly not unique).


Yeah, pretty much agreeing with you. I think if they are going to rely on fusion in the ISA, they should limit it to the most minimal fusion implementation to have the widest possible chance of adoption. So assume the hardware can only fuse two instructions and design the ISA with this restriction in mind.


Intel and AMD's experience with complex transformations on the instruction stream suggests that its overall better to do it in a L0 decode cache.


I don't think the instruction sequence from the article would qualify for macro-op fusion. Berkeley looked at this for the simplest case of LEA [1]:

  // &(array[offset])
  slli rd, rs1, {1,2,3}
  add  rd, rd,  rs2
The sequence in the article uses what Intel calls the fast case but it still wouldn't qualify for Berkeley's two instruction fusion. Dunno if anyone does three instruction fusion.

As an aside, LEA is never getting added to the base RISC-V nor should it be. But I'm surprised it isn't considered for an extension.

[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...


> The idea is that high end processors will recognise these sequences

I'm not sure that is the idea, given that RISC-V is targeted at processors so low-end that they don't even implement multiply.


That's not true. RISC-V targets the full spectrum of processors (well, not 8/16-bit, so not the very low end). Some of the decisions they made, compared to say OpenRISC, were made precisely to make the ISA easier to implement in superscalar CPUs (e.g. not having branch delay slots).


Instead of fusing them, shouldn't it be possible to speculate that it will not overflow, process the check on a separate slow path and do a roll back in case it did overflow?


I'm not super familiar with compiler and processor design, but why do I want the processor to do this optimization instead of the compiler?


> Also worth noting, since this always comes up, that these things are super hard for a compiler to optimize away. JSC tries very aggressively but only succeeds a minority of the time (we have a backwards abstract interpreter based on how values are used, a forward interpreter that uses a simplified octagon domain to prove integer ranges,

RISC-V has some closely-related sharp corners in indexed address arithmetic as well. Some choices for the type of the index variable perform much worse on rv64.

Consider: an LP64 machine uses 32-bit integers for 'int' and 'unsigned', but 64-bit integers for `long`, `size_t`, `ptrdiff_t` and so on.

If you use an array index variable of type `unsigned`, then the compiler must prove that wraparound doesn't happen. That's pretty weird considering that half the point of using unsigned is to elide such proofs of correctness. If it cannot prove the absence of unsigned wraparound, then it will be forced to emit zero-extension sequences prior to using the index variable to generate the addresses.

ARMv8 side-steps the whole problem by providing indexed memory addressing modes that include the complete suite of zero and sign extension of a narrow-width index in the load or store instruction itself.

So here we have an example of a three-way system engineering choice.

  - Provide a small amount of hardware that performs the operation on-demand.
  - Provide new and inventive forms of value-range analysis in the compiler.  Despite decades of research into this problem, the world's best solutions still frequently saturate at "the entire width of the type the programmer requested".
  - Change the habits of the world's C programmers.
RISC-V chose options 2 and 3.
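A small C example of the pattern in question (a sketch, assuming an LP64 target like rv64 where `unsigned` is 32 bits):

    #include <stddef.h>

    // 32-bit unsigned index arithmetic: wraparound at UINT_MAX is
    // well-defined, so unless the compiler proves (off + i) never wraps,
    // it can't just fold the index into a 64-bit pointer increment and
    // must zero-extend the 32-bit value before each address computation.
    long sum_u(const long *a, unsigned off, unsigned n) {
        long s = 0;
        for (unsigned i = 0; i < n; i++)
            s += a[off + i];
        return s;
    }

    // size_t index: already register-width, nothing to prove or extend.
    long sum_z(const long *a, size_t off, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[off + i];
        return s;
    }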


> If you use an array index variable of type `unsigned`

This is usually why your array indexing should be done with an iterator or size_t :)


size_t is unsigned.


(jumping up the thread to try and hop over some confusion...)

The problem isn't with unsigned types generally. It's with subregister unsigned types. So, size_t and uintptr_t are fine. uint32_t, uint16_t, uint8_t (on LP64 ABIs) are pessimized and demand zero-extension instructions (or proofs that they can be safely elided) prior to causing side-effects. uint64_t on a LP128 ABI would also be problematic.

signed 32-bit int is also fine... because RISC-V specifically has a suite of arithmetic instructions that unconditionally sign-extend from bit 31. Even without those, it would still be fine because the carve-out for undefined behavior is wide enough for INT_MAX+1 to remain positive. Same thing for all of the other narrow-width signed integer types. If you increment SHRT_MAX and then use it to index memory, it's perfectly legal undefined behavior to access base + SHRT_MAX+1 instead of base + SHRT_MIN.

However, that's not legal for the unsigned types. They are all mandated to wrap in 2's complement. base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.


> base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

Ironically, given the topic, that's actually not true, because `base + UINT_MAX+1` is `(base + UINT_MAX)+1` (with a pointer, not an unsigned int, as the temporary value). That should probably be `base + (UINT_MAX+1)`.


This whole spiel is only relevant when the programmer specifies an array index in a distinct variable.


Right, but you don't need to sign extend it.


Eh? The problem with using unsigned is identical to using size_t. size_t doesn't have any special carveouts that eg. say that wrapping is undefined behavior. It's defined as an unsigned integer type. Meaning it has unsigned integer type behaviors.


unsigned is 32 bits, which means its wraparound has to happen in the lower 32 bits even though you're working on it in a 64-bit register. If you use a full size_t then this is not necessary.


You have a typo: size_t doesn't need zero extension. RISC-V does have 32-bit signed integer arithmetic that unconditionally sign-extends the result from bit 31. So using a signed integer as a counter (bad practice IMO, but widely recommended in style guides) doesn't need anything extra at all.


Uh, wrong subthread?


Nitpick: s/LP64 machine/LP64 C implementation/

But isn't size_t (or ptrdiff_t) the preferred indexing type in C for this reason (among others)? Sometimes you of course do want wrap around modulo semantics but that's much rarer, right?


The problem is that providing extra bits for "sign-extension mode" and "read 32b or 64b" blows through the opcode space very quickly.


The RISC-V spec includes recommended code sequences to check for overflow, so that the hardware can potentially use insn fusion as an optimization. The "bad" cases you mention can be a bit clunky, but they should also be rare.


I’m aware of those sequences and it’s a myth that they will be rare. For dynamic languages they will be super common.


We know the origins of that myth by examining the papers that the RISC-V designers wrote. They got a C compiler back-end working and didn't incorporate any other languages in their benchmarking corpus.


This reminds me of Peter Drucker, "What gets measured gets improved". Conversely, if you don't measure it, it usually doesn't get improved.


It's a tough situation to be in, especially when RISC orthodoxy elevates measurements to a paramount position in the ISA design process. I don't think they were wrong to do so, either. The error was in finalizing the ISA without more backends to work with.

How much business code out there burns most of its cycles in Java/C#/Rust/Go? Having even one of those (let's face it: Java) would have gone a very long way. How much client computing spends most of its cycles in (JITted) javascript?

I understand the tradeoff that drives this omission, though. How many VC megabucks (Tens? Hundreds?) would they have had to spend commercially on a bet? Alternatively, how many grad student-years grinding through low-payoff work to prove yea/nay on the value of checked arithmetic?

Let's say that you make a hip-shot bet on some checked arithmetic support without the supporting toolchains and application code to back up the design choice. You could end up making a different set of mistakes instead and end up in the same situation.

I think the only mistake was in finalizing the ISA without any support for checked arithmetic. My belief is that doing it well will not be orthogonal to the rest of the ISA's design, and therefore is a poor candidate for an extension.


C really did screw things up by just refusing to address overflow in any form, didn't it


I don't think C screwed up. C's objective was to standardize a set of practices based on what the hardware of the day was capable of. If they had pessimized performance on one architecture because it failed to provide efficient checked overflow (or any other feature), then C would have never made it to standardization.

ISO C is therefore a three-way compromise between three communities: Software authors, compiler authors, and CPU manufacturers. You will naturally end up with some decisions that dissatisfy some members of each.


Better would be to say that the C standard committee screwed things up by refusing to address a lot of issues over the last 35 years.


Not really, they have addressed pretty much all of the issues. Now, you can argue that they favored compiler authors when doing so, but they have considered the problems and done things in response to them.


Yes. In many ways:

1) it’s pretty important to have easy syntax for math with overflow checks. Otherwise there will be bugs. Bad bugs. We have those and it sucks.

2) it's pretty important that if overflow wraps in the ISA then it wraps in the language semantics, but C just says "meh whatever" for signed overflow. That leads to even more bugs.


The HP3000 had "SPL" - Systems Programming Language - as its way of doing what C does. It had a specific syntax for checking the error codes: "if <> then <whatever>", so whatever the previous arithmetic did, you could check the machine's flags directly. Can't do that in C. https://en.wikipedia.org/wiki/Systems_Programming_Language


At the time C was designed, not all machines represented signed integers in two's complement. Therefore it was not possible to define the behavior of signed overflow. They probably should change that in 2020 ^^

GCC has intrinsics for integer math with overflow checks.
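For example (these builtins are real GCC/Clang APIs; the wrapper function is just illustrative):

    #include <stdbool.h>

    // __builtin_add_overflow returns true if the addition overflowed and
    // stores the (wrapped) result through the pointer either way.
    bool add_checked(int a, int b, int *out) {
        return !__builtin_add_overflow(a, b, out);
    }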


C++ formalized two's complement already. Formalizing power-of-two word size and 8-bit bytes might come.


That’s my suspicion too!

I don’t think these folks have seen what modern languages do.


Maybe overflow checking could be included as an ISA extension. If it is included, what is the least impactful design?

Overflow is part of the result, so maybe add extra bits to each register that can be an arithmetic destination. These bits are not included in moves, but could be tested with new instructions.

Another way that avoids flags is new arithmetic instructions: add but jump on overflow. Maybe this is reduced to add and skip next instruction except for overflow, but maybe things are simplified if the only allowed next instruction is a jump, so the result is a single longer instruction.


After thinking about this some more: I think the extension instruction should work like "slt" (set on less than). So we have "sov"- set if add would overflow:

    add t2, t1, t0
    sov t3, t1, t0
    bnez t3, overflow
Why this way? "extra bits on destination registers"- this is really flags. The flags have to be preserved during interrupts, so extending the registers is not so easy (I think it just reduces to classic flags).

"add but jump on overflow" or "add and skip on no overflow"- I don't like this because you can not break it into separate operations without flags. I think you might have to add hidden flags in a real implementation.

An add followed by an sov could be fused, but that requires an expensive multi-register write. Fusion might be more likely if the sov result always goes to a fixed destination register:

    add t2, t1, t0
    sov tflags, t1, t0
    bnez tflags, overflow


If the destination of sov is specified to always be tflags, then you may as well combine add and sov into one instruction, with tflags implicit:

addsov t2, t1, t0


The best design is control bits which RISC-V doesn’t have but x86 and ARM do have.

The best design is part of the core ISA and not an extension, since overflow checking is fundamental to modern languages.


Agreed. Waterman's thesis says this about that:

"Several ISA features, including implicit condition codes and predicated moves, are onerous to implement in aggressive microarchitectures. Yet, their complexity often does not result in higher performance because their semantics were ill-conceived. For example, x86 provides a conditional load instruction, but, if the unconditional load were to cause an exception, it is implementation-defined whether the conditional version would do so. Thus, a compiler can only rarely use this instruction to perform the if-conversion optimization.

Recognizing the inefficiency of their conditional operations, Intel’s recent implementations go to some lengths to fuse comparison instructions and branch instructions into internal compare-and-branch operations."

Yeah, doing condition codes right is complicated but then that's the purpose of microarchitecture, to factor that complexity out.

RISC-V succeeds in having a minimalist to a fault design which is appropriate for a certain design point. ARMv8 and x86_64 are more useful for a broader set of designs. The burden now is on RISC-V to show that their minimalist approach is fast+efficient rather than just simple. You have to get something out of the simplicity; otherwise what you get is design debt.


Control bits as in ARM and x86 force serialization of arithmetic due to the read/write dependency every instruction has on those bits. There are some tricks but it still needs tracking. For wider superscalar or out-of-order processors this gets annoying.


Yes, the old, old way of having a single condition code register or the like (which dates back 40+ years) doesn't work well these days.

I like the Mill CPU approach, where every "register" (it doesn't have named registers actually) has the full set of status bits associated with it, and not just for overflow. Things like "not a result" (NaR), which can represent the result of a failed speculative load (because the process doesn't have permission to read from that page, for example).


I thought that compilers couldn't really use this effectively…


> I thought that compilers couldn't really use this effectively…

The status bits part in general, or the speculative load stuff?

They allegedly have all this working, privately. They haven't released any development tools or such to the public.

I've often toyed with the idea of writing an instruction-level simulator (as opposed to the RTL sim or whatever they have internally). But even sticking to the public information, I'd likely be infringing on their patents.


No. Control bits (status bits, flags, ...) get renamed just as registers get renamed.

Basically, if there's a bottleneck to x86 code, Intel has run into it, profiled for it, and generally optimized around it both in their microarchitectures and in their C compiler.


That's one of the tricks. But it doesn't solve the issue of clobbers, which Intel had to introduce new variants of ADD and MUL to solve. Named predicate registers make it all much easier for everyone.


Thanks. I knew about condition code renaming from discussion with an Intel compiler engineer. I didn't know about clobbers and I'll read up on that.


https://en.m.wikipedia.org/wiki/Intel_ADX is the solution Intel created.


ARM has separate instruction variants with and without setting of flags. Normally one uses the flag-less versions, so you don't have this problem.


I think what you’re saying is basically true but it’s a trade against density.

If you did overflow checks rarely then what you say is a very good point indeed. The key thing is just the frequency of this stuff in modern languages.


You need to either write a paper about this or produce some evidence.


I don’t need to do anything. You’re welcome not to take my advice. :-)

If you doubt that dynamic languages are using overflow checking in the way that I describe then it’s not because of lack of papers on the subject.


> JSC [...] a forward interpreter that uses a simplified octagon domain to prove integer ranges

Off-topic, but could you point me to more details on this? Someone (else?) recently mentioned octagon analysis in JSC in a HN thread. I grepped through the sources at the time but didn't find any indication that it exists. At least not under the name "octagon".


I didn’t call it octagon when I wrote it because I didn’t know that I reinvented a shitty/awesome version (shitty because it misses optimization opportunities, awesome because by missing them it converges super fast while still nailing the hard bits).

Look at DFGIntegerRangeOptimizationPhase.cpp


I wonder how this interacts with branch prediction. Since overflows should happen very rarely I guess the branch on overflow should almost always predict as non taken. So wouldn't it be possible to have a "branch if add would overflow" instruction or even canonical sequence that a higher end CPU can completely speculate around and just use speculation rollback if it overflows?

I think an important design point here is that the languages that need a lot of dynamic overflow checks are primarily used on beefier CPUs so if you can get around the code size issue, making it performant only on more capable designs is fine since the overflow check will be rare on simpler CPUs.


I don’t think that beefier cpu and overflow checks are that related. I mean, you’re right, I just want to place some limits on how right you are.

1. Folks totally run JS and other crazy on small CPUs.

2. Other safe languages (rust and swift I think?) also use overflow checks. It’s probably a good thing if those languages get used more on small cpus.

3. The C code that normally runs on small cpus is hella vulnerable today and probably for a long time to come. Compiling with sanitizer flags that turn on overflow checks is a valuable (and oft requested) mitigation. So there's a future where most arithmetic is checked on all cpus and with all languages.

And yeah, it’s true that the overflow check is well predicted. And yeah, it’s true that what arm and x86 do here isn’t the best thing ever, just better than risc-v.


Interestingly it seems rust only does full overflow checking in debug builds: https://huonw.github.io/blog/2016/04/myths-and-legends-about...


By default yes, but you can enable overflow checking in release mode (it’s a conf / compiler flag), and it has standard functions for checked, wrapping, and saturating ops.


Yeah, I know about 1 - e.g. also MicroPython, though I have no idea if that's used outside DIY. I agree about Rust, but I would think that with much stronger type safety and static compilation it should be able to remove a lot more of the overflow checks, and most that remain would be needed in correct C too. At least that's what I learned from my compilers prof, who worked on Ada compilers for many years, and that should be quite similar. But maybe that's my biased hope, as I really really hate working with dynamic languages.


The current world record holder (in the published literature) for branch prediction is TAGE and its derivatives. The G stands for Geometric. It is composed of a family of global predictors that increase in length with a geometric progression. That's somewhat relieving since it means that the storage growth is not unlike that of mipmapping in computer graphics. A small constant k times maximum history length N.

But to a first approximation, if you double the density of conditional branches in the program, then you will need to roughly double the size of the branch prediction tables to get the same performance, even if all of them are correctly predicted 100% of the time.


RISC-V spec is not yet finished.

Currently only the most basic extensions are available. But nothing prevents RISC-V from introducing, in the future, an extension that adds condition codes or an extension for integer/float overflow.


I’d be curious to see the instruction sequences for handling overflow without condition codes. I’m not even sure I see how to do it as efficiently as 3 or 5 instructions :-/


One example of 3 is branching on 32-bit add overflow on a 64-bit cpu where you do a 32-bit add, a 64-bit add, and compare/branch on the result.
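Roughly, in C (a sketch of that trick; on rv64, addw produces the wrapped 32-bit result sign-extended while add produces the true 64-bit sum, and they differ exactly when the 32-bit add overflowed):

    #include <stdbool.h>
    #include <stdint.h>

    bool add32_overflows(int32_t a, int32_t b) {
        int64_t wide   = (int64_t)a + (int64_t)b;               // add  (64-bit)
        int32_t narrow = (int32_t)((uint32_t)a + (uint32_t)b);  // addw (wrapped)
        return (int64_t)narrow != wide;                         // bne
    }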


I enjoyed reading this a lot, I keep seeing RISC-V being touted as a potential replacement for ARM but I had yet to read a good critique of the ISA by people who know what they're talking about.

This point I didn't quite understand:

>Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.

Most successful ISAs (including ARM) have their share of extensions, coprocessors, optional opcodes etc... ARM has the various Thumb encodings, Jazelle, VFP, NEON and more. Toolchains and embedded developers are used to dealing with optional features of computers, I'm not sure why RISC-V would fare worse here.

Beyond that I notice that many of the ascribed weaknesses are shared with other RISC ISAs like MIPS (but not ARM):

- No condition codes

- Less powerful, simpler instructions that require more opcodes to do the same thing but can potentially run faster.

- No MOV instruction

- The "unconstrained extensibility" is arguably a thing on MIPS too, with the four coprocessors that can be used to implement all sorts of custom logic.

Of course ARM has been more successful than MIPS, so maybe it's a sign that those things are indeed bad ideas, but given that this comes from an ARM dev I wonder if part of it is not just "that's not how ARM does it".

On the other hand I must say that I was surprised that RISC-V made multiplication optional; in this day and age it seems like such a useful instruction that it's well worth the die area. Optional DIV I can understand, but an ISA without MUL? That's rough, even for small microcontroller-type scenarios.


> ARM has the various Thumb encodings, Jazelle, VFP, NEON and more.

Having done just a tiny bit of compiler development for ARM, I can assure you that having all of these variants is a pain. Making compiler writers' lives harder means you're less likely to get optimal performance. At least on the more exotic variants, but possibly even on the most common ones.


>Having done just a tiny bit of compiler development for ARM, I can assure you that having all of these variants is a pain. Making compiler writers' lives harder means you're less likely to get optimal performance. At least on the more exotic variants, but possibly even on the most common ones.

I can empathize, but isn't that just part of the job of making a compiler? Any successful, long-lived ISA is going to have extensions and revisions that will need to be handled in the toolchain. I guess my point is not so much that it isn't painful, it's more that I don't really see what makes RISC-V really different besides the fact that it's a younger ISA and therefore we don't already know for sure which extensions are going to become de-facto standard and which ones will be less common.

>I believe the author doesn't identify as a "guy".

Arg, of course the one time I don't use gender-neutral language I manage to mess it up. Edited, thanks.


As a compiler pro, I view availability of better instructions to select as an asset rather than as a liability. Sure it’s more work for me and my team but if it makes shit fast then who cares how much work it was.

One of the best lessons I got when I was being inducted into the compiler club was: compilers are hard. It’s a hard job so other people can have easier jobs. It’s ok if compilers turn complex and managing that complexity is just something you have to learn to do. I don’t think it’s true that the need for that complexity leads to lower perf.


> I believe the author doesn't identify as a "guy".

I don't think this was meant as an assumption about the author's gender. The same way I wouldn't assume that there is physical, actual pain involved when you said "having all of these variants is a pain", even though you literally wrote it.


All this Thumb etc. stuff is not relevant to the 64-bit world though. AArch64 is the least fragmented of the big ISAs, with a solid list of base functionality — e.g. NEON is guaranteed to exist on everything.


That's mostly because it's new and hasn't had time to fragment. NEON everywhere is great, but eventually there's going to be a NEON 2 that obviously will only exist on newer chips. People seem to generally regard thumb encodings as a mistake so we probably won't get a repeat of that, but I'd be shocked if similar divergences don't develop over the years.


> but eventually there's going to be a NEON 2 that obviously will only exist on newer chips.

Isn't it SVE? So far, it has been only implemented AFAIK by the A64FX, which is used by the current #1 supercomputer in the TOP500 list, but it wouldn't surprise me if we start seeing it on newer 64-bit ARM chips.


SVE2 is guaranteed in ARMv9.


Neoverse V1 has SVE (with 2x256b units) and Neoverse N2 has SVE2 (w/ 2x128b units).

Note that SVE is a superset of Neon and SVE2 is a superset of SVE.


ARMv8 is 8 years old at this point.

RISC-V's 2.2 (final, stable) ISA came out in 2017 and it's already fragmented.


Thumb is very useful in things like microwaves where using less memory saves the manufacturer a little money.

You do have different chips with different instructions but in a very regularized way. An ARM v8.2-A chip is going to have the same instructions whether it's made by ARM, Samsung, or Apple. And when v9 comes out they'll have NEON's SVE replacement everywhere and you'll be able to use the same code regardless of whether the SIMD width is 128 bits or 512.


Is that SVE on ARM v9 confirmed?

I have yet to see any concrete details on ARM v9; generally speaking ARM v8 is pretty damn well designed, so I am wondering what v9 will look like.


It’s generally expected that SVE2 will be required for ARMv9. SVE2 is a more logical successor to NEON than SVE.

https://community.arm.com/developer/ip-products/processors/b...


NEON is guaranteed to exist on everything, and this means you're never going to see AArch64 replace the Cortex-M0 and M3.

That's fragmentation right there. Severe fragmentation. Two completely incompatible ISAs.

Small 32-bit RISC-V comes in smaller and lower power than an M0, and small 64-bit RISC-V is not much bigger than an M0 and is rather popular for controlling something in the corner of a larger 64-bit SoC.


I don't understand why it is a pain. Are you saying it makes choosing what encoding/extension to use more difficult? I would only see this as a pain if you had to mix/match these extensions and one isn't guaranteed to exist when another does, making you have to write several versions for every mix and match.

But I kind of assumed most ops perform well enough, and important optimizations have a test case and will get done.


Example: the Commodore 128 (or the Plus/4 works too). The C128 had a more powerful CPU capable of 16-bit work, access to more RAM, faster disk access, etc. But in general it was rarely targeted. Why? Because developers were looking for the widest degree of compatibility, and that meant restricting themselves to the minimal possible subset of compatibility between the C128 and the C64. This meant that by and large the 128KB of RAM went unused, as most applications ran in C64 mode.

The same applies here: the more extensions and things you pile on... the less likely they are to get used unless they are de-facto mandatory. Even today you can see games getting released that won't touch AVX instructions on both Intel and AMD because of compatibility reasons.

Valve provides data to developers on penetration of various ISA extensions via the hardware survey ( https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw... ) but for RISC-V there is no way to do the same. So most utility writers will be highly constrained in which extensions they can expect to use, and that will have significant harms in terms of performance. Alternatively it requires recompiling for every single different target, which is also likely.

In essence: you need a good baseline of compatibility for people to expect to use. It makes moving software easier between targets. A piece of software might certify, for example, on RV64GC but not on RV32IF, because the double-precision emulation might not work as expected or the lack of carry could be an issue, etc.


For embedded applications, which are RISC-V's bread and butter, the hardware and software are designed hand in glove. E.g. if you are designing a custom RISC-V chip for a media encoding system you will want extensions for that application, and will make use of those in your development toolchain. The software for your hardware, or at least the performance- and functionality-critical part, is often only targeted at your hardware, not any arbitrary RISC-V system.


This is just a tooling problem that can be solved trivially by having release builds build multiple binaries by default, for all the major extension profiles.


You completely missed the point: multiple release builds don't actually fix the problem of emulation not having the same behavior or timing. That has a big impact on certifying software as fit for purpose, and could completely torpedo an entire project where a dev team works on a GC profile (because actual hardware doesn't exist yet) but is supposed to release on an I profile. There is no substitute for actual hardware.


The point on unconstrained extensibility I think is largely around the fact that anyone can add extensions, whereas with ARM, the company retains control.

There are advantages to the RISC-V approach, but it is likely to lead to more fragmentation - and worse, it gives a major implementation the ability to add proprietary extensions that are not licensed to anyone else, putting smaller players at a disadvantage and leading to fragmentation not only in the hardware but also in the software ecosystems.

Whilst you may not like ARM having control, at least everyone (for a fee) has full access to the ISA and implementations.


The extensibility that leads to the danger of fragmentation for general purpose computing is a great advantage for embedded computing, where your software or firmware is targeting one particular piece of hardware and doesn't have to be compatible with anyone else. Western Digital is free to put in the mix instructions they need for their hard drive controllers, and Nvidia is free to put in the instructions they need to control their video cards, and the incompatibility between them just doesn't matter.


Which is great and fine as no-one else will be writing software for that particular hardware. I believe that ARM has been allowing some extensions for their M series designs in these circumstances - partly due to pressure from RISC-V alternatives.

I should add that I think it's possible that the Nvidia/ARM combination will remove ARM's 'level playing field' and we might see Nvidia-only extensions for their designs - which would not be good. We'll have to see.


> I had yet to read a good critique of the ISA by people who know what they're talking about.

I still wonder about RISC-V. To me, it seems pointless. But a lot of companies are buying into it so I'm wrong

Why would you ever want a standard ISA? If you're buying chips you either want a cheap standard one or a powerful efficient one. To be efficient (or cheap) you'd want to only support what's required and what works best with the implementation.

I don't really understand the point of a generic ISA. Why not have some kind of bytecode or standard format (like llvm-ir) that gets optimized for the CPU and gets a native binary that doesn't need interpretation.

Like how the f* is it easier to make something regular+generic fast rather than something custom for your hardware/chip/cpu fast?

Do you want to know how many times I used XML when it's not required? 0. Do you know how many times I used SQLite or my own binary file? I lost count. SQLite has far more constraints than XML, and custom binary files/formats aren't hard after you've done them a few times.


Betting on smart compilers without putting in the effort to build them is how the Itanic happened.


Damn dude. I never thought of that.

So are you here right now declaring that RISC-V and all those companies are in the wrong and risc-v will be a disaster?

Cause I might agree and be with you on that lol

-Edit- I have no idea what the state of the compiler is

https://github.com/riscv/riscv-gnu-toolchain

> Warning: git clone takes around 6.65 GB of disk and download size

WTF?

If you're going that complex then... wtf?


Not at all, RISC-V doesn't need extremely clever compilers, and instead it's designed to maximize what the microarchitectures (hardware implementations) can do, and reduce unnecessary overhead.

My reply was mostly aimed at the idea that you can move the complexity from hardware into compilers: it might be possible, but we know how to build out-of-order CPU better than we know how to build smart compilers, so you have to invest a lot more research time, and it's generally a lower priority.

Even innovations from the past decade or two, like VSDG, haven't made their way into "industrial" compilers yet.

As for the size thing:

GCC and Clang are huge (at least when you include their entire change history, which git does); RISC-V is comparatively only a tiny part of them. You should probably look into that further before jumping to conclusions.

You don't even need a whole separate toolchain with Clang or Rust, the whole "need to build GCC yourself to cross-compile" is outdated GNU tradition, not some kind of technical necessity.


I've never really been a fan of this take. It seems to rest a lot on the ideas that:

1) Fusion is hard. While it can be hard (fusing x86 will be), I do not see why it would be meaningfully difficult for RISC-V hardware (except on devices so small it's better to have the simpler base ISA anyway), or compilers, who can in the worst case just treat fused pairs as their own instructions.

2) There's something wrong with most software just assuming a fairly fixed set of extensions, as seems to have happened. If microcontrollers want to use a subset without multipliers, that doesn't mean anyone else has to care. If bitmanip is stabilized before RISC-V breaks into more common consumer use, why not assume it when writing code? It's only a problem if people make it one.

Most of the rest don't matter much in a global sense. The arguments about which operations go in which extensions might have meaningful merit, but it seems not very important to me.


Fusion is hard. At least, unconstrained fusion is hard. If you have a couple of instruction pairs to fuse together, that's fine, but fusing any possible compiler output together is going to make decode even more complicated in practice.

This is why Intel publishes software optimization guides that go over what their fusions are, but it doesn't seem like the RISC-V spec is going to do that yet for many cases. And compiler authors need to know which instruction stream to generate to ensure fused execution.


Mostly the concern around the lack of instructions in RISC-V revolves around a few well-known cases (eg. indexed loads) where the instructions to fuse are pretty canonical.

There is always room for creativity, but that would be the same with or without indexed loads in the base instruction set. Any non-monopolistic hardware ecosystem has this problem; we've been able to ignore it largely on x86 since Intel had had a performance monopoly for so long, but once you have multiple competing core implementations compilers will have to worry about the edge-case performance differences.

What I'm talking about is more specific to groups of instructions that are safe to treat as fused by default. Note that even if the compiler outputs a pair of instructions but the hardware running the code doesn't fuse it, out-of-order execution means the penalty will generally be extremely small versus the best unfused instruction schedule.

RISC-V does give guidelines on which instructions are good fusion candidates. See for example section 2.13 in the bitmanip extension document.

Hardware, naturally, just has a fixed set of fusions it does.
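
To make the canonical case concrete, here's a rough sketch (mine, not from the spec) of the indexed-load pattern that keeps coming up, in C with the typical generated code in the comments:

    // Indexed 64-bit load: the classic fusion candidate.
    long get(long *p, long i) {
        return p[i];
        // RV64 base ISA, roughly (the well-known slli+add+ld fusion candidate):
        //     slli a1, a1, 3        # scale the index by 8
        //     add  a1, a0, a1       # base + scaled index
        //     ld   a0, 0(a1)        # load
        // x86-64 folds the whole thing into one instruction:
        //     mov  rax, [rdi + rsi*8]
    }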


Fusion is not just hard, it's an opportunity cost. Let's say you have the budget to implement the top 10 most important fusions. On other ISAs you'd use that budget on things that aren't about condition codes or array access.


Opportunity cost in what sense? If the ISA is simple and regular, then the silicon costs should be minuscule, and the engineering costs not meaningfully larger than if those instructions were separate unfused instructions with separate encodings.


For a 5-instruction massacre that uses tmp registers along the way, I guarantee you it won’t be easy.


If I thought 5 instruction fusions were necessary, I would not be a fan of fusion either.


The checklist for what needs to be fast is every combo of:

- add or sub or mul

- 32 bit or 64 bit

- signed or unsigned.

I don’t remember which adventure you need to pick to get 5.

Source: I had to make most of these fast to make JSC competitive.

Edit: I said all, should have said most. Unsigned is less important for JS.
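
If anyone wants to see what this looks like in practice, here's a simplified sketch (my own, not JSC's actual code) of a checked signed 64-bit add and the kind of sequences each ISA ends up with:

    #include <stdbool.h>

    // Overflow-checked signed add, the way a safe language or JIT
    // checks every arithmetic op by default.
    bool checked_add(long a, long b, long *out) {
        return __builtin_add_overflow(a, b, out);
        // x86-64 / AArch64: roughly add + branch on the overflow flag (2 insns).
        // RV64 has no flags; one common general-case sequence is:
        //     add  t0, a0, a1      # sum
        //     slti t1, a1, 0       # b < 0 ?
        //     slt  t2, t0, a0      # sum < a ?
        //     bne  t1, t2, ovf     # the two tests disagree => overflow
    }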


Some discussion back when it was written in 2019: https://news.ycombinator.com/item?id=20541144


The worst issues, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and the lack of bitwise rotation instructions. The lack of shift-and-add instructions, or equivalently of addressing with shifted indexes, is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is typical in C++ or Rust.

The ugly parts are indeed all ugly, though they have now added hint instructions.


Have decent speedups been gotten by previous CPUs by the addition of conditional moves? IIRC for some the SPECcpu impact was negligible, and many RISCs don't have it. RISC is about quantifying this kind of thing and skipping marginal additions, after all.


> Have decent speedups been gotten by previous CPUs by the addition of conditional moves?

This is not a direct answer to your question, but: I recently had to tune the conditional move generation heuristics in the GraalVM Enterprise Edition compiler. My experience has been that you can absolutely get decent speedups of 10-20% or more with a few well-placed conditional moves. The cases where this matters are rare, but they do occur in some real-world software, where sticking a conditional move in some very hot place will have such an impact on the entire application. Conversely, you can get slowdowns of the same magnitude with badly placed conditional moves.

It's a difficult trade-off, since most branches are fairly predictable, and good branch prediction and speculative execution can very often beat a conditional move.
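
As a toy illustration (not GraalVM code) of the kind of placement decision involved: the two loops below compute the same thing, but the second is much more likely to become setcc/cmov or csel, which wins when the comparison is unpredictable and can lose to a well-predicted branch:

    // Branchy form: compilers will usually emit a conditional branch here.
    long count_less_branchy(const long *a, long n, long key) {
        long c = 0;
        for (long i = 0; i < n; i++)
            if (a[i] < key)
                c++;
        return c;
    }

    // Branchless form: the comparison feeds arithmetic, so compilers tend to
    // emit setcc/cmov (x86) or cset/csel (AArch64) instead of a branch.
    long count_less_branchless(const long *a, long n, long key) {
        long c = 0;
        for (long i = 0; i < n; i++)
            c += (a[i] < key);
        return c;
    }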


I'm not sure about this "RISC way" stuff. From a uarch standpoint the RISC vs CISC distinction is moot and from an ISA standpoint the only real quantifiable difference seems to be being a load-store architecture.

ISAs without conditional moves tend to have predicated instructions, which are functionally the same thing. I'm not actually aware of any traditionally RISC architectures that have neither conditional moves nor predicated instructions. While ARMv8's AArch64 removed predicated instructions as a general feature, it gained a few "conditional data processing" instructions (e.g. CSEL is basically cmov), so clearly at least ARM thinks there's a benefit even with modern branch predictors.

Conditional instructions are really, really handy when you need them. It's an escape hatch for when you have an unbiased branch and need to turn control flow into data flow.


We were talking ISAs so let's focus on that.

The quantifiability comes from measuring results when you give compilers new instructions, vs paying implementation complexity (time, money and future baggage to support the insn forever). The upsides and downsides here come in different units so it's still tricky.

Lots of instructions can be proposed with impressive qualitative speeches convincing you how dandy they are, but in the end it's down to the real world speedup yield vs the price you pay in complexity and resulting second order effects.

(In rarer cases the instructions might be added not for performance reasons but to ease complexity and cost, that's where qualitative arguments still have a place when arguing for adding instructions).

It's fine if we don't have the evidence in this thread - I was just asking on the off chance that someone can point to a reference.


It's not like someone is proposing some crazy new instruction to do vector math on binary coded decimals while also calculating CRC32 values as a byproduct. It's conditional move. Every ISA I can think of has that.


This prompted me to look through some RISC ISAs (+x86); there may be errors since I only made a cursory pass.

Seems the following have conditional moves: MIPS since MIPS IV, Alpha, x86 since the Pentium Pro, SPARC since SPARCv9

The following seem to omit conditional moves: AVR, PowerPC, Hitachi SH, MIPS I-III, x86 up to Pentium, SPARC up to SPARCv8, ARM, PA-RISC (?)

PA-RISC, PowerPC, and ARM at least do a lot of predication and make a high investment in conditional operations (by way of dedicating a lot of bits in the instruction layout to it), but they also end up using it a lot more often than conditional move tends to be used.


ARMv7's Thumb-2 has general predication of hammocks via "if-then", and ARM itself had general predication. ARMv8 has conditional select, which is quite a bit richer than conditional move. POWER has "isel". Seeing an ISA evolve a conditional move later in life is pretty strong evidence that it was useful enough to include. So I would modify your list to be:

ISAs that evolved conditional move:

  - MIPS
  - SPARC
  - x86
  - POWER (isel)
ISAs that started life with it:

  - ARM (via general predication)
  - Alpha
  - IA64 (via general predication)


Good list.

Observation on the list of ISAs that evolved conditional move vs. ISAs that omit it: MIPS, POWER, x86, and SPARC all targeted high-power "fat core" applications at the point where it got added. AVR, Hitachi SH, and PowerPC didn't add it, being driven more by low-power / embedded applications. And many ISAs continued to see wide use in their pre-cmov versions in the embedded space (e.g. MIPS) after the additions. (PowerPC even removed it when being modeled after POWER.)


To be clear for anyone not so up-to-speed on this: what AArch64 has (conditional select) is strictly less expressive than AArch32 (general predication).

The takeaway there is that general predication was found to be overly complex, when the vast (vast!) majority of the benefit can be modelled with conditional select.


It's less than general predication, but a little bit more than cmov/csel. The second argument can optionally be incremented, complemented, or negated. Combined with the dedicated zero register, you can do all sorts of interesting things to turn condition-generating instructions into data. A few interesting ones include:

   y = cond ? 0 : -1;
   y = cond ? x : -x;
   x = cond ? 0 : x+1;  //< look ma, circular addressing!
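
Roughly how those map onto the AArch64 conditional-select family (a sketch; the actual condition code depends on the preceding compare):

    cmp   x2, #0              // say cond = (x2 != 0)
    csinv x0, xzr, xzr, ne    // y = cond ? 0 : -1   (selects 0 or ~0)
    csneg x0, x1, x1, ne      // y = cond ? x : -x
    csinc x0, xzr, x1, ne     // x = cond ? 0 : x+1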


Yes. There are cases where cmov is a killer beast and for example it makes your browser faster.

JSC goes to great efforts to select it in certain cases where it’s a statistically significant overall speed up. I think the place where it’s the most effective for us is converting hole-or-undefined to undefined on array load. Even on x86 where cmov is hella weird (two operands, no immediates) it ends up being a big win.


You get 2x speedup on Quicksort and all related algorithms using CMOV instructions, so: yes.

https://cantrip.org/sortfast.html
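
For anyone curious what that looks like: a branch-free Lomuto-style partition step (a generic sketch, not the article's code) where the swap is unconditional and only the index update depends on the comparison, which compilers readily turn into setcc/cmov:

    #include <stddef.h>

    // Partition a[0..n) around pivot: elements < pivot end up in a[0..i).
    // No data-dependent branch in the loop body, so random input doesn't
    // trash the branch predictor the way the naive version does.
    size_t partition(int *a, size_t n, int pivot) {
        size_t i = 0;
        for (size_t j = 0; j < n; j++) {
            int x = a[j];
            a[j] = a[i];        // unconditional swap...
            a[i] = x;
            i += (x < pivot);   // ...only the index advance is conditional
        }
        return i;
    }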


Yeah, IDK about the RISC ISAs–they seem to be designed around being architecturally simple (and I guess easy to teach?) but they really don't seem to map back to actual code at all, nor do they seem particularly grounded in hardware design either. (Or sometimes they're too close to the hardware and burn themselves…)


Could it be because of some patents that made it impossible to do it properly?


No. They just don't share those opinions. RISC-V is designed to go from the smallest possible core to high-performance compute. Bit instructions will be in the 'B' extension.


Bitwise rotation instructions date back to at least the PDP-8. Even if they were patentable, the patents would have expired long ago.


> I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction...

Didn't check, but I suspect that decodes into at least two micro-instructions.


Not only is it a single uop for the last 10 years of Intel chips, you can also run 2 of them per cycle.


I'm assuming a is a constant in your example and that you're doing a(b, b, 8), i.e. AT&T syntax for a + b + b*8. That's one cycle on modern Intels I believe (I think the manual promised this for Nehalem). OP also alludes to this fact when talking about fusion.


Can you not add these instructions? At least when you use FPGA IP, it can be done. But you would have to update the toolchain etc to support these new instructions.


They have added them, it's just they're in bitmanip, which isn't finalized, nor is the extension mandatory.


I thought the scale factor is either 1,2,4 or 8?


You can combine them. For example, [rax+rax*8+1] (base register, plus the same register scaled by 8, plus a constant displacement).


Isn't the scale factor encoded in just two bits? (i.e. 00=1, 01=2, 10=4, 11=8)


Just edited with an example; I’m on my iPhone so assembling via nasm takes an extra minute ;)


Thanks for the example. I was assuming that a and b are variables in OP's posting.


True that. In my case it was 12+x*9, which gives log2 of the page sizes on x86 (4K, 2M, 1G) for x = 0, 1, 2.
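
That fits in one instruction only because 9*x is x + x*8, so the whole expression matches the base + index*scale + displacement form; a sketch (function name mine):

    long page_shift(long x) {
        return 12 + x * 9;   // x = 0,1,2 -> 12, 21, 30 (4K, 2M, 1G)
        // gcc/clang typically emit a single lea:
        //     lea rax, [rdi + rdi*8 + 12]
    }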


Having taken a look at the RISC-V ISA spec, I'm wondering whether they crippled LL/SC (LR/SC in RISC-V).

Basically:

- LL/SC can prevent ABA if the ABA-prone part is in between an LL and an SC instruction

- To have an ABA-prone problem you need some state implicitly dependent on the atomic state but not encoded in it. Normally (always?) the atomic state is a pointer and we depend on some state behind the pointer not changing in the context of an ABA situation (roughly: switch out ptr, change ptr target, switch back in ptr, though often more complex to prevent race conditions).

This means that, in all situations I'm aware of, LL/SC only prevents the ABA problem if you can do at least one atomic (relaxed-ordering) load that "somehow" depends on the LL load (LL loads a pointer, you load through it at some offset, or similar).

But the RISC-V spec not only doesn't guarantee forward progress in these cases (which I guess is fine) but goes as far as explicitly stating that a guaranteed lack of forward progress is OK, e.g. doing any load between the load-reserved and the store-conditional is allowed to make the store-conditional fail.

Doesn't that mean that if you target RISC-V you will not benefit from the LL/SC-based ABA workaround, and instead it's just a slightly more flexible and potentially faster compare-exchange which can spuriously fail?

The spec says you are supposed to detect whether it works and potentially switch implementations. But how can you do that reasonably if it means you have to switch to fundamentally different data structures, which isn't something easily and reasonably done at runtime?

Or am I missing something fundamental?
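
To be concrete, the pattern I mean is something like this (a sketch, not from the spec): popping the head of an intrusive list, where the next pointer must be read under the reservation. As I read it, that ld between lr.d and sc.d is exactly what disqualifies the loop from the spec's constrained-LR/SC forward-progress guarantee:

    retry:
        lr.d   a1, (a0)        # a1 = head, reservation on the head word
        beqz   a1, empty       # list empty
        ld     a2, 0(a1)       # a2 = head->next  <- dependent load inside LR/SC
        sc.d   a3, a2, (a0)    # try head = next; a3 == 0 on success
        bnez   a3, retry       # reservation lost; may keep failing forever
    empty: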


The use of LL/SC for atomics is a common mistake. It makes replay debuggers like rr impossible to implement.



I am surprised that the link you posted works…


Why?


Presumably because the title of the page contains a slash, which isn't escaped in the URL. Some web servers might have interpreted it as a path separator.


Wikipedia is not made up of a series of flat files that have the same paths as you see in the URL. The URL layout is controlled by MediaWiki and it's free to do whatever it wants with the slashes. After rewriting, the URL looks like:

  https://en.wikipedia.org/w/index.php?title=Load-link%2Fstore-conditional
You can actually visit that if you like.


It has a slash in it; I would have expected that to need escaping.


Slashes are not special in URLs (unlike #, & and semicolon). It's only servers that parse them as path separators.


See section 3.3 "Path" in RFC 3986: https://tools.ietf.org/html/rfc3986#section-3.3

> A path consists of a sequence of path segments separated by a slash ("/") character


It continues like this: "Use of the slash character to indicate hierarchy is only required when a URI will be used as the context for relative references [...] The path segments "." and ".." [...] are intended for use at the beginning of a relative-path reference (Section 4.2) to indicate relative position within the hierarchical tree of names. [...] Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax."

As long as Wikipedia doesn't use relative paths it is not a problem to have slashes in the URL.


RFC 3986 says one thing, but what the HTTP protocol allows servers to do and what they actually implement is another thing.


The HTTP protocol RFC pretty explicitly says that the transmitted Request-URI is a "path" and refers to the URI RFC to define "path". There is really no ambiguity here: HTTP as a protocol expects resources to be organized as a hierarchy and accessed via a path, which is delimited by "/" characters. As such, the protocol definitely reserves and assigns special meaning to "/" in paths.

Of course servers are perfectly free to do whatever they want, there is no HTTP police to stop them.


Wikipedia doesn't.


LL/SC is superior to CAS in that a modification to the memory will be detected even though the value has since been set back to the original value. This avoids the ABA problem.


No. CAS is superior to LL/SC. There is no possible undetected modification to the memory. That's how atomic operations work. That's the whole point of an atomic operation. It's atomic.

Botching the code can be done with either mechanism. Don't do that.


This only avoids the ABA problem if the code which is ABA-prone runs in between the LL and SC instructions.

But this is where the problem starts. E.g. for RISC-V, when using LR/SC in a way which prevents ABA you always lose all forward-progress guarantees, and it's totally valid for an implementation to be built in a way which will just never complete in such cases...


Looks like someone read the Wikipedia link posted in a sibling comment ;)


Well, actually, I discovered this when I was playing with implementing a lock free queue in shared memory. I was using singly linked lists; one for the queue and one for the free list. A node would sometimes come back to the queue's head from the free list and mess things up. It's not surprising that this is well known, but I learnt it by doing. :)
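
For readers who haven't run into it, here's a minimal sketch of that failure with a CAS-based pop (hypothetical code; a stack rather than a queue, but the shape of the bug is the same):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; };

    // Buggy CAS-based pop, shown to illustrate the ABA hazard.
    struct node *pop(_Atomic(struct node *) *head) {
        struct node *old, *next;
        do {
            old = atomic_load(head);
            if (!old)
                return NULL;
            next = old->next;   // read the link before the CAS
            // ABA: another thread pops `old`, recycles it through the free
            // list, and pushes it back. `*head == old` again, but the `next`
            // we read is stale; the CAS below still succeeds and corrupts
            // the list. LL/SC would notice the intervening stores to `head`;
            // a plain compare-and-swap cannot.
        } while (!atomic_compare_exchange_weak(head, &old, next));
        return old;
    }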


Yeah, that's fair, just poking fun at Wikipedia having the exact same thing paraphrased slightly differently ;)


Last I heard, adding a hardware counter for failed SCs may help work around this on ARM–presumably RISC-V could do the same thing here?


Not necessarily; it would need to allow trapping on failures.


Sorry in advance, this is going to be a little off topic: while I agree with the technical points in the article and the comments here about RISC-V deficiencies, I hope for a world where Free Software licenses and open hardware rule. (And I admit to some hypocrisy, since my daily drivers are the Apple closed ecosystem and Google Workspace.)

What gives me some hope is that an open hardware and Free Software world would help so many people, businesses, and governments. I think it would be a rising tide that lifted all boats except for specific tech industries.

That said, good article.


The misconception you present here seems to be widely held, so I'm actually going to upvote you. RISC-V being an open ISA just means that the standards describing the ISA are open; there is no relation to open source hardware. I.e., proprietary RISC-V cores exist, as do open source ARM cores.


These are all good points but fixing many would interfere with the ability of students to create a basic RISC-V processor in a single semester. The simplest possible in order RISC-V and the simplest possible out of order RISC-V designs are a lot easier to do than the equivalents in more common ISAs and that makes it really useful as a teaching tool.

EDIT: Also, when the B extension becomes standard that should fix some of the issues.


Good point. I think the author forgot that RISC-V was conceived as an ISA both for teaching and for giving decent performance in the real world. Seems like some things have to give in the name of simplicity.


Forgot? I think that's a fairly well-known fact about RISC-V, but a good ISA for teaching and a good ISA for the real world are not necessarily the same.


> Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)

JALR for call and return uses the same opcode, but they are two different instructions; there is no need for "extra logic" in the decode or for branch prediction.

The lack of "register + shifted" could easily be circumvented by adding an extension for "complex arithmetic instructions".

And macro-op fusion is a common technique that already exists in modern, deeply pipelined CPUs.

> Multiply and divide are part of the same extension

An extension can easily be partially supported in hardware (e.g. multiplication) with the other instructions emulated in software (e.g. division).

> No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common

But some microcontrollers do not need atomics. And if you are designing a microcontroller that does, just include the atomic extension.

Many of the criticisms made here are incorrect, and are due more to a misunderstanding of RISC-V than to RISC-V design flaws.


Every design has its flaws. Intel's instruction set is probably closer to the lower end of the spectrum than the opposite. Still, there are very performant implementations (probably at the price of increased power consumption).

Any educated opinions how bad the RISC-V problems are compared to others when looking at the big picture?


Personally I think the POWER instruction set is better in many ways. It has a proven track record of high performance and embedded implementations. A lot of ISA design is about avoiding patents and other traps. Ultimately it doesn’t really matter because most of the value is in the implementation and the software is the commoditized complement.

Truly innovative designs don't actually use RISC encoding and instead hide the details of instruction encoding. See for example NVIDIA's PTX, which gets mapped to hardware-specific instruction streams that are nothing like a traditional RISC architecture.

To me RISC-V is another example of how open source often produces lowest common denominator copy cat versions of ideas that are 30+ years old. Just like Linux. The sad thing is that this often kills innovation and locks in suboptimal designs for a long time, because it is hard to compete against something that is free.


RISC-V was not designed to be the most optimal ISA, however. It was designed to be a teachable ISA which could also be used to implement fast hardware.

The need to be easily teachable puts constraints on the design, but it is also a benefit in that many people will know this ISA and be able to make tools for it.


POWER was designed to be a compiler writer's dream and has some sharp implementation corners.

I think I would probably recycle the Alpha ISA circa 21164 (EV-5) with maybe a CAS instruction. It was pretty balanced between hardware and software and a lot of the complications in the VLSI design (dynamic logic, mostly) are moot with a modern technology if you stick with reasonable speeds.

Presumably, now that the MIPS unaligned byte access patents have expired, a whole bunch of the idiocy that Alpha had to abide by to avoid that patent can just be sidestepped.


> A lot of ISA design is about avoiding patents and other traps.

Aren't patents limited to 20 years? Can't you just ignore all inventions that are < 20 years old to be safe from patents? Then you'd still be 10 years ahead of POWER.


Power ISA had updates in 2007, 2009, 2013, 2015, and 2020.

https://en.wikipedia.org/wiki/Power_ISA#Specifications


I think anybody wanting to implement POWER nowadays means the PowerPC ISA, as implemented in modern POWER CPUs, not the original POWER ISA from 1990.


Can you summarize, or point to a discussion of, the ways in which PTX differs from a traditional RISC architecture?


I saw this when it was a Twitter thread[1] and noted then that it had some good points and some not-so-good points.

One of the things that comes up with RISC-V a lot is code density, and it is a sore spot for many CPU designers because it is shamelessly abused by marketing departments as a measure of 'goodness'. This has sort of trained these engineers to flinch when something doesn't exhibit good code density, and the author is no exception.

However, FLASH/RAM volume has gone up hugely. This is in part because once you run out of logic to lay down in a chip you flood fill the rest with RAM and/or FLASH because hey the chip has to be big enough to hold pad landings for all of its pins. This has taken a lot of pressure off the code density thing and now it seems appropriate to look at algorithmic capacity.

I agree with the author that it is a much better use of resources to put a hard multiplier on a chip than it is to have more space in flash so that you can do that with instructions. But from an algorithmic capacity standpoint? It is all Turing complete, so really, what is the practical difference?

Now I started life programming on PDP-11's that had a "native" instruction set that was not unlike RISC-V in being maximally simple. We called it "microcode" :-) And the "real" instructions were actually sequences of microcode in a microcode ROM. That was pretty cool because you could swap out floating point instructions for vector instructions or string handling instructions if you wanted.

I will not be surprised in the least if people design "custom instruction sets" that layer on top of a RISC-V core, just like layering a front-end stack on WebAssembly. And the available resources to do that are pretty plentiful.

One of the coolest architectures I got to play with was the Xerox "D" machines. They took this to an extreme and some amazing software was written for them. You could get really close to maximizing utilization. It was very interesting to load a new instruction set, recompile your Mesa code, and have it run faster with no changes to the hardware at all.

Of course DEC and Xerox didn't invent this, the IBM 360 had a big button on the front "IMPL" which was "Initial Micro Program Load" which prepped the instruction set for what ever OS you were about to start up.

And now you can build systems like this with open source tools and off the shelf FPGA dev boards. Such a great time to be interested in systems architecture.

[1] https://twitter.com/erincandescent/status/115453579942322995...


> However, FLASH/RAM volume has gone up hugely. This is in part because once you run out of logic to lay down in a chip you flood fill the rest with RAM and/or FLASH because hey the chip has to be big enough to hold pad landings for all of its pins. This has taken a lot of pressure off the code density thing and now it seems appropriate to look at algorithmic capacity.

I don't agree. I'm working on a chip right now where NVM and RAM are extremely constrained. There will probably be many millions made, so it's still relevant.

On the last chip I worked on, it was perhaps true to some degree. But that chip, even though it was a microcontroller, had a decent cache. Access to NVM is slow. And code density can have a significant impact on cache performance.


It would be nice to have some context around why these decisions were made. Are they outright mistakes? Trade-offs about the goals of the ISA? Or trade-offs that the author thinks are bad ones?


How often do you read a single array element that is not the first element? Usually you iterate over it


> How often do you read a single array element that is not the first element

Quite often in some applications, for example hash tables (a very popular data structure).


[Citation needed]

:-)


So many flaws, and without even mentioning the missing POPCOUNT.

(No, the M extension does not help.)


Considering that none of the major players in the CPU world are implementing RISC-V (nor will in the near future), this rant is the same as the "what if Hitler had developed the atomic bomb before the US" speculation that the History Channel likes to throw out sometimes. An exercise in imagination.


There are a number of serious companies using RISC-V in embedded contexts where a number of the author's criticisms don't apply.


Are you saying that RISC-V was never meant to be implemented? Or is nobody implementing it because this guy is correct and RISC-V is bad?

I mean, whatever the exact goal of RISC-V, I don't see any reason why an analysis and critique would not be relevant. Nobody invests resources into this with the intention that it gets ignored.

Also, Wikipedia has a list of RISC-V implementations.


You are so completely wrong.


When designing an ISA, I assume a basic step is figuring out which instructions to include.

For each instruction, I would guess that a 'draft' compiler is produced that can use that instruction in its code generation steps, and a few 'draft' CPU designs are made which include support for that instruction.

Then cycle accurate simulations can be done on a set of test benches to see how the addition of that instruction affects performance, power, and code size across a wide array of different usecases.

In the case of RISC-V, where some CPU extensions might be emulated, I would expect the test to also cover the performance hit of emulation of the extension for those machines without native support.

If all of that was done, and still it made sense to add the instruction, then most of these critiques aren't valid - since there will be hard data that the approach taken was the best one.

Perhaps when RISC-V was a young project, too many design decisions were made without the massive compute farm needed to do all these simulations, or before more complex multi-issue CPU designs were added to it, and therefore some decisions aren't optimal?


Academia pulled in too much.

Small ISA != Small transistor count.

People will inevitably try to throw more transistors at the ISA limitations.


This isn't true. Berkeley produced several real RISC-V ASICs and adjusted the encoding according to their findings before RISC-V got any wider attention. Also the team had a long history making real chips before RISC-V, dating back to the early 80s.


> This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns)

They are all kinds of changes in control flow; it's not crazy to have one instruction for that primitive.

He complains Risc-V needs 4 instructions to do what x86_64 and arm do in two, but... it says Risc-V. And x86_64 CISC instructions devolve to a pile of microcode anyway.

A guy invested in a powerful incumbent, using a completely different and even more established incumbent to bash the challenger over the head with: it doesn't feel like I am learning anything useful when I read it.


They got the RISC-V code slightly wrong (but the instruction count correct).

(puts on chip designer's hat) Essentially the ARM/x64 case turns the load's memory address calculation into a 4-input adder (so 2 layers of adders) and maybe a couple of extra gates because there are multiple addressing modes. RISC-V's equivalent is a 2-input adder. Those get into a critical cache (and TLB) access path, and that limits how fast your CPU's core clock can be (or forces you to split that path into 2 clocks).

Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler - getting rid of lots of addressing modes was a big part of the original RISC movement. I think all 4 of those RISC-V instructions are 16-bit ones, so they may even fit into the same space as the 2 ARM ones (haven't hacked on ARM for a while).

BTW chances are that that x86_64 mov instruction is not being devolved into more than one internal uOp (might be two if they separate off the address calculation into its own uOp).


> Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler - getting rid of lots of addressing modes was a big part of the original RISC movement.

Nobody believes this anymore, not even the RISC guys. Look at Apple chips; the advantage of a simpler instruction set is width (degree and depth of superscalar execution) and the ability to put more optimizations in hardware. Clocks are all bound by roughly the same limits nowadays.


I'd disagree - but I'm talking about the difference between a 1GHz clock and a 5GHz clock - it's always going to matter at the cutting edge - if the clocks are equal then what you're trading off is pipeline depth - which affects all sorts of stuff like the cost of mispredicted branches (and as a result the sizes and complexity of branch predictors)


> Essentially that's part of the whole RISC idea - simple means faster

The problem with this is that today, your clock speed is bound by neither your decode nor your ALUs. Only implementing weaker ALUs makes sense from an optimization standpoint if it buys you more clock speed. But as it doesn't, it just leaves you competing with another CPU that has the same clock speed as you do, and which does a lot more per clock than you do.


Turns out faster clocks don't help much if your logic isn't getting faster too. You do less work per cycle, while your cycle overhead remains fixed, resulting in less work done per unit time. If it wasn't for branch misprediction, you'd be better off with very deep pipes and slower clocks - see GPUs.


I think the point of GP is that not every instruction makes use of the more complicated (4-input) adder, so not every instruction should have to pay the associated latency cost.

But as others pointed out, this only makes sense if the ALU is on the critical path and having a two-level adder significantly impacts the latency.


Actually it's unlikely that this 4-input adder is in a general-purpose ALU; usually they're directly in the cache access path (there are games you can apply to cache and TLB accesses if the low bits of that adder come out early), or in a separate addressing ALU, which adds an extra clock to a load or store.


I assume they still count towards defining the shortest possible pipeline stage duration


It depends on how you split it up, of course, but there's another cost: if you pipeline the address calculation you effectively make the CPU's pipe longer and, more importantly, increase the cost of a mispredicted branch - which is what we're all spending our time trying to mitigate these days.


> Essentially that's part of the whole RISC idea - simple means faster - you can run your core clocks faster if the decode (and address calculations etc etc) are simpler

How does this philosophy fare now that we're generally at the top end of feasible clock speeds for processors, given realistic power and cooling budgets? Intel and AMD (and I assume others, but I don't follow them as closely) have been increasing throughput by doing more per clock.

It would seem VLIW is the recipe for simple hardware doing more per clock cycle, but that hasn't had good results either.


VLIW is (IMHO, and I've built them) an architectural dead end, a bit like the original MIPS delay slots: it made sense at one point in one particular process, but it doesn't scale well over time.


x86 chips have greater top clock speeds by a significant margin. A lot of x86 CPUs run in the 4-5 GHz range, while ARM CPUs usually stick to 2-3 GHz. So the vast majority of CPU designs are not hitting their theoretical limits. They are only hitting thermal limits.

The only real benefit is that fixed length instructions are easier to decode. That's about it. If you can get away with fewer instructions then that's what you should do.


>[She] complains Risc-V needs 4 instructions to do what x86_64 and arm do in two, but... it says Risc-V.

So… what, it should take 5 instructions?

Executing more instructions for a (really) common operation doesn't mean an ISA is somehow better designed or "more RISC", it means it executes more instructions.

>And x86_64 CISC instructions devolve to a pile of microcode anyway.

Some people seem to have this impression that every x86 instruction is implemented in microcode (very, very few of them are), and even charitably interpreting that as "decodes to multiple uops" (which is completely different) is still not right. The mov in the example is 1 uop.


> Executing more instructions for a (really) common operation doesn't mean an ISA is somehow better designed or "more RISC", it means it executes more instructions.

True. But as bonzini points out (or rather, hints at) in https://news.ycombinator.com/item?id=24958644, the really common operation for array indexing is inside a counted loop, and there the compiler will optimize the address computation and not shift-and-add on every iteration.

See https://gcc.godbolt.org/z/x5Mr66 for an example:

    for (int i = 0; i < n; i++) {
        sum += p[i];
    }
compiles to a four-instruction loop on x86-64 (if you convince GCC not to unroll the loop):

    .L3:
        addsd   xmm0, QWORD PTR [rdi]
        add     rdi, 8
        cmp     rax, rdi
        jne     .L3
and also to a four-instruction loop on RISC-V:

    .L3:
        fld     fa5,0(a0)
        addi    a0,a0,8
        fadd.d  fa0,fa0,fa5
        bne     a5,a0,.L3
This isn't a complete refutation of the author's point, but it does mitigate the impact somewhat.


That's fair. It's definitely not a killer, (or even in my opinion the worst thing about RISC-V,) just another one of these random little annoyances that I'm not really sure why RISC-V doesn't include.


One common use of array indexing walks the array sequentially.

But hash tables are used here and there, also in loops.

Some people know them as "dictionaries" or "key/value stores".


The author is a woman fyi :)


According to their Twitter, the author uses they/them pronouns.

https://twitter.com/erincandescent


From a software perspective, changes in flow are identical. But from a hardware standpoint, local jumps, indirect calls and returns are all predicted differently. The spec actually has suggested forms for each kind (the author points them out in the article), so a good portion of encoding space is used on a large number of variants that will have poor performance and never be seen in practice. Especially when those bits could be used for far better purposes.


Having more instructions for common tasks also puts more pressure on your memory bandwidth and instruction cache. Execution time alone is not the only factor in performance.



