“Risc V greatly underperforms”

snvzz · on Dec 2, 2021

I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: Evidence was considered, all the factors were weighted, and decisions were made accordingly.

But ultimately, the gist of their argument is this:

>Any task will require more Risc V instructions that any contemporary instruction set.

Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

adrian_b · on Dec 2, 2021

I am sorry but saying that RISC-V is a winner in code density is beyond ridiculous.

I am familiar with many tens of instruction sets, since the first computers with vacuum tubes until all the important instruction sets that are still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for doing any task.

Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM, the so-called better results were for the compressed extension, not for the normal encoding.

Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code, if the programming language or the compiler options require run-time checks, to ensure safe behavior, then the RISC-V code size increases enormously, while for other CPUs it barely changes.

The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.

Except for this good feature, the rest of the ISA is full of bad features, which frequently require at least 2 instructions instead of 1 instruction in any other CPU, e.g. the lack of indexed addressing, which is needed in any loop that must access some aggregate data structure, in order to be able to implement the loop with a minimum number of instructions.

snvzz · on Dec 2, 2021

You seem to be making your whole argument around some facts which you got wrong. The central points of your argument are often used in FUD, thus they are definitely worth tackling here.

>Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM

the code size advantage of RISC-V is not artificial academic bullshit. It is real, it is huge, and it is trivial to verify. Just build any non-trivial application from source with a common compiler (such as GCC or LLVM's clang) and compare the sizes you get. Or look at the sizes of binaries in Linux distributions.

>the so-called better results were for the compressed extension, not for the normal encoding.

The C extension can be used anywhere, as long as the CPU supports the extension; most RISC-V profiles require it. This is in stark contrast with ARMv7's thumb, which was a literal separate CPU mode. Effort was put in making this very cheap for the decoder.

The common patterns where number of instructions is larger are made irrelevant by fusion. RISC-V has been thoroughly designed with fusion in mind, and is unique in this regard. It is within its right in calling itself the 5th generation RISC ISA because of this, even if everything else is ignored.

Fusion will turn most of these "2 instructions instead of one" into actually one instruction from the execution unit perspective. There's opportunities everywhere for fusion, the patterns are designed in. The cost of fusion on RISC-V is also very low, often quoted as 400 gates, allowing even simpler microarchitectures to implement it.

jaytaylor · on Dec 2, 2021

I was surprised to find the top gOggle hits for "RISC-V fusion" (because I don't know WTF it even is) point to HN threads. Is this not discussed prominently elsewhere on the 'net?

https://news.ycombinator.com/item?id=25554865

https://news.ycombinator.com/item?id=25554779

Is the Googrilla search engine really is starting to suck more and more, or is there something else going on in this case?

The threads read more like a incomplete explanation with a polarized view than anything useful for understanding what fusion means in this context.

Overall is give the ranking a score of D-.

mananaysiempre · on Dec 3, 2021

> ... is there something else going on in this case?

There is: the term comes from general CPU design terminology and is not specific to RISC-V, although Google does find some RISC-V-specific materials for me given your query[1–3]. Look for micro- and macro-op fusion in Agner Fog’s manuals[4] or Wikichip[5], for example.

[1]: https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fu...

[2]: https://reviews.llvm.org/D73643

[3]: https://erik-engheim.medium.com/the-genius-of-risc-v-micropr...

[4]: https://www.agner.org/optimize/#manuals

[5]: https://en.wikichip.org/wiki/macro-operation_fusion

mda · on Dec 3, 2021

Try 'risc V instructuin fusion" or "Risc V macro op fusion' Not that hard really. It is a well known subject.

I hope people could stop whining everytime they mess up a search query.

ksec · on Dec 3, 2021

>I hope people could stop whining everytime they mess up a search query.

I dont know. I have never seen anyone using the term "Fusion" by itself, may be it is specific to RISC-V crowd? It is always "macro-op fusion". So your parent 's search parameter isn't something out of order for someone how knows very little about hardware. And HN are full of Web developers so abstracted in the hierarchy they knows practically zero about hardware.

And to be quite frankly honest the GP's point about Fusion had me confused for a sec as well.

theresistor · on Dec 2, 2021

> This is in stark contrast with ARMv7's thumb, which was a literal separate CPU mode.

This is disingenuous. arm32's Thumb-2 (which has been around since 2003) supports both 16-bit and 32-bit instructions in a single mode, making it directly comparable to RV32C.

snvzz · on Dec 2, 2021

Your statement does not run counter to mine quoted.

Thumb-2 is better designed than Thumb was, but it is still a separate CPU mode.

And it got far less use than it deserved, because of this. It doesn't do everything, and switching has a significant cost. This cost is in contrast with RISC-V's C extension.

brandmeyer · on Dec 3, 2021

Comparing RISC-V's "C" extension to classic Thumb when Thumb-2 is 17 years old is like comparing RISC-V's "V" extension to classic SSE when AVX-512 and SVE2 are already available. Its an insidious form of straw-man attack that preys on the reader's ignorance.

> [Thumb-2] doesn't do everything, and switching has a significant cost.

Technically true, but irrelevant. Cortex-M is thumb-only and can't switch. Cortex-A processors that support both Thumb and ARM instructions almost never actually switch at all.

AshamedCaptain · on Dec 3, 2021

> Cortex-A processors that support both Thumb and ARM instructions almost never actually switch at all.

That is not correct. At least before ARMv8, most processors that could run both Thumb and ARM switch very frequently, up the point some libraries could be Thumb while others were ARM (i.e. within the same task!). A lot (but not all) of Android for ARMv7 is actually Thumb(-2). This is why "interworking" is such a hot topic.

Also, contrary to what the above poster says, switching does not have a "high cost", it is rather similar to the cost of a function call.

stephencanon · on Dec 3, 2021

> it is rather similar to the cost of a function call.

It literally is a function call, most of the time.

And yeah, thumb-2 was the preferred encoding for 32b iOS and Android, and the only supported encoding for Windows phone, so it was used on billions of devices.

lonjil · on Dec 2, 2021

The ARMv8-M profile is Thumb-only, so on ARM microcontroller platforms there is no switching at all, and it does do everything, or at least everything you might want to do on a microcontroller, and has of course gotten a very large amount of use, considering how widely deployed those cores are.

Dylan16807 · on Dec 2, 2021

Is thumb-only particularly good for density, compared to being able to mix instruction sizes?

lonjil · on Dec 2, 2021

Thumb has both 16-bit and 32-bit instructions.

Dylan16807 · on Dec 2, 2021

Oh, you meant thumb and thumb-2.

lonjil · on Dec 2, 2021

"thumb-2" isn't really a thing. It's just an informal name from when more instructions were being added to thumb. it's still just thumb.

brucehoult · on Dec 3, 2021

Thumb2 is a thing. Thumb is purely 16 bit instructions. Thumb2 is a mix of 16 bit and 32 bit instructions.

As an illustrative example, in Thumb when the programmer writes "BL <offset>" or "BX <offset>" the assembler creates two undocumented 16 bit instructions next to each other which together have the desired effect. If you create those instructions yourself using e.g. .half directives (or if you're writing a JIT or compiler) then you can actually put other instructions between them, as long as you don't disturb the link register.

In Thumb2 the bit patterns for BL and BX are the same, but they are an ACTUAL 32 bit instruction which can't be split up like it can in Thumb.

Taniwha · on Dec 2, 2021

The main distinction is that the 16-bit RISCV-C ISA exactly maps to existing 32-bit RISCV instructions, its implementation only occurs in the decode pipe stage

bpye · on Dec 2, 2021

The C extension is that, an extension. A RISC-V core with the C extension should still support the long encoding as well. There is no 16-bit variant specified, only 32, 64 and 128.

There is an E version of the ISA with a reduced register set, but this is a separate thing.

brucehoult · on Dec 2, 2021

You are mixing up integer register size and instruction length.

RISC-V has variants with 32 bit, 64 bit, or (not yet fully specified or implemented) 128 bit registers.

RISC-V has instructions of length 32 bits and, optionally but almost universally, 16 bit length.

bpye · on Dec 2, 2021

Ah yes, I misunderstood the original comment as implying that RISC-V C had 16 bit register length, rather than opcode length.

Taniwha · on Dec 2, 2021

The E version only has half as many registers

ddingus · on Dec 3, 2021

So there really are more instructions in RAM, right?

And then they get combined in the CPU, right?

Won't those instructions need to be fetched / occupy cache?

orra · on Dec 2, 2021

> the so-called better results were for the compressed extension, not for the normal encoding.

Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.

brucehoult · on Dec 2, 2021

Indeed.

The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it.

My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it.

adrian_b · on Dec 3, 2021

The compressed encoding has good code density, but low speed.

The compressed RISC-V encoding must be compared with the ARMv8-M encoding not with the ARMv8-A.

The base 32-bit RISC-V encoding may be compared with the ARMv8-A, because only it can have comparable performance.

All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classical example of apples-to-oranges, because the compressed encoding will never have a performance in the same league with ARMv8-A.

When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.

There are good reasons to use RISC-V for various purposes, where either the lack of royalties or the easy customization of the instruction set are important, but claiming that it should be chosen not because it is cheaper, but because it were better, looks like the story with the sour grapes.

The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.

What is valuable about RISC-V is the set of software tools, compilers, binutils, debuggers etc. While a better ISA can be done in a week, recreating the complete software environment would need years of work.

FullyFunctional · on Dec 3, 2021

> The compressed encoding has good code density, but low speed.

That's 100% nonsense. They have the same performance and in fact, some pipelines can get better performance because they fetch a fixed number of bytes and with compressed instructions, that means more instructions fetched.

The rest of the argument falls apart resting on this fallacy.

adrian_b · on Dec 3, 2021

They have the same performance only in low performance CPUs intended for embedded applications.

If you want to use a RISC-V at a performance level good enough for being used in something like a mobile phone or a personal computer, you need to simultaneously decode at least 8 instructions per clock cycle and preferably much more, because to match 8 instructions of other CPUs you need at least 10 to 12 RISC-V instructions and sometimes much more.

Nobody has succeeded to simultaneously decode a significant number of compressed RISC-V instructions and it is unlikely that anyone would attempt this, because the cost in area and power of a decoder able to do this is much larger than the cost of a decoder for simultaneous decoding of fixed-length instructions.

This is the reason why also ARM uses a compressed encoding in their -M CPUs for embedded applications but a 32-bit fixed-length encoding in their -A CPUs for applications where more than 1 watt per core is available and high performance is needed.

brucehoult · on Dec 4, 2021

You're just making stuff up.

ARM doesn't have any cores that do 8 wide decode. Neither do Intel or AMD. Apple has, but Apple is not ARM and doesn't share their designs with ARM or ARM customers.

Cortex X-1 and X-1 have 5 wide decode. Cortex A78 and Neoverse N1 have 4 wide decode.

ARM uses compressed encoding in their 32 bit A-series CPUs, for example the Cortex A7, A15 and so on. The A15 is pretty fast, running at up to 2.5 GHz. It was used in phones such as the Galaxy S4 and Note 3 back before 64 bit became a selling point.

Several organisations are making wide RISC-V implementations. Most of them aren't disclosing what they are doing, but one has actually published details of how it's 4-8 wide RISC-V decoder works -- they decode 16 bytes of code at a time, which is 4 instructions if they are all 32 bit instructions, 8 instructions if they are all 16 bit instructions, somewhere between for a mix.

https://github.com/MoonbaseOtago/vroom

Everything is there, in the open, including the GPL licensed SystemVerilog source code. It's not complex. The decode scheme is modular and extensible to as wide as you want, with no increase in complexity, just slightly longer latency.

There are practical limits to how wide is useful not because you can't build it, but because most code has a branch every 5 or 6 instructions on average. You can build a 20-wide machine if you want -- it just won't be any faster because it doesn't fit most of the code you'll be executing.

amelius · on Dec 3, 2021

Doesn't decompression imply that there is some extra latency?

socialdemocrat · on Dec 3, 2021

No, it is just part of the regular instruction decoding. It is not like it is zip compressed. It is just 400 logic gates added to the decoder… which is nothing.

FullyFunctional · on Dec 3, 2021

All the implementation I know of all does the same thing: they expand the compressed instruction into a non-compressed instruction. For all (most?), this required an additional stage in the decoder. So in that sense, supporting C mean a slight increase in the branch mispredict penalty, but the instruction itself takes the same path with the same latency regardless of it being compressed or not.

Completely aside, compressed instruction hurt in a completely different way: as specified RISC-V happily allows instructions to be split across two cache lines, which could be from two different pages even. THIS is a royal pain in the ass and rules out certain implementation tricks. Also, the variable length instructions means more stages before you can act on the stream, including for example rename them. However a key point here is that it isn't a per-instruction penalty, it a penalty paid for all instruction if the pipeline support any variable length instructions.

amelius · on Dec 3, 2021

Yes, but the logic signal needs to ripple through those gates, which takes time.

kragen · on Dec 4, 2021

It potentially adds latency, but doesn't drop throughput.

dragontamer · on Dec 2, 2021

> The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.

Which isn't really a big advantage, because ARM and x86 macro-op fuse those instructions together. (That is, those 2-instructions are decoded and executed as 1x macro-op in practice).

cmp /jnz on x86 is like, 4-bytes as well. So 4-bytes on x86 vs 4-bytes on RISC-V. 1-macro-op on x86 vs 1-instruction on RISC-V.

So they're equal in practice.

-----

ARM is 8-bytes, but macro-op decoded. So 1-macro op on ARM but 8-bytes used up.

adrian_b · on Dec 3, 2021

The fusion influences only the speed, not the code size and the discussion was about the code size.

For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed. x86 does not have address modes with auto-update, like ARM or POWER, so for a minimum number of instructions the loop counter must also be used as an index register, to eliminate the instructions for updating the indices.

Because of that, the loop counter must use the full 64-bit register even if it is certain that the loop count would fit in 32-bit. That needs the REX prefix, so the fused instruction pair needs either 5 bytes (for 7-bit branch offsets) or 9 bytes, in comparison with 4 bytes for RISC-V.

So RISC-V gains 1 byte about at every 20 bytes from the branch instructions, i.e. about 5%, but then it loses more than this at other instructions so it ends at a code size larger than Intel/AMD by between 10% and 50%.

socialdemocrat · on Dec 3, 2021

By combining instruction compression and macro-op fusion you get the net effect of looking like you have a bunch of extra higher level opcodes in your ISA.

Compress a shift and load into a 32-bit word and macro-op fuse those and you have in effect an index based load instruction, without sucking up ISA encoding space for it.

kragen · on Dec 4, 2021

I agree that this is about the code size, but you seem to be doing your back-of-the-envelope estimates based on RISC-V uncompressed instructions, which is a mistake and explains why your estimates came out with nonsense results like "code size larger than Intel/AMD".

quotemstr · on Dec 3, 2021

> For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed.

It's a shame the x32 architecture didn't catch on. https://en.wikipedia.org/wiki/X32_ABI

theresistor · on Dec 2, 2021

ARM64 has cbz/tbz compare-and-branch instructions that cover many common cases in a single 4-byte instruction as well.

adrian_b · on Dec 3, 2021

There are cases when cbz/tbz are very useful, but for loops they do not help at all.

All the ARMv8 loops need 2 instructions, i.e. 8 bytes, instead of the single compare-and-branch of RISC-V.

There are 2 ways to do simple loops in ARM, you can either use an addition that stores the flags, then a conditional branch, or you can use an addition that does not store the flags, then a CBNZ (which tests whether the loop counter is null). Both ways need a pair of instructions.

Nevertheless, ARM has an unused opcode space equal in size to the space used by CBNZ/CBZ/TBNZ/TBZ (bits 29 to 31 equal to 3 or 7 instead of 1 or 5).

In that unused opcode space, 4 pairs of compare-and-branch instructions could be encoded (3 pairs corresponding to those of RISC-V plus 1 pair of test-under-mask, corresponding to the TEST instruction of x86; each pair being for a condition and its negation).

All 4 pairs of compare-and-branch would have 14-bit offsets, like TBZ/TBNZ, i.e. a range larger than that of the RISC-V branches.

This addition to the ARM ISA would decrease the code size by 4 bytes for each 25 to 30 bytes, so a 10% to 15% improvement.

audunw · on Dec 2, 2021

> I am sorry but saying that RISC-V is a winner in code density is beyond ridiculous.

You have no idea what you're talking about. I've worked on designs with both ARM and RISC-V cores. The RISC-V code outperforms the ARM core, with smaller gate count, and has similar or higher code density in real world code, depending on the extensions supported. The only way you get much lower code density is without the C extension, but I haven't seen it not implemented in a real-world commercial core, and if it wasn't, I'm sure there was because of a benefit (FPGAs sometimes use ultra-simple cores for some tasks, and don't always care about instruction throughput or density)

It should be said that my experience is in embedded, so yes, it's unsafe code. But the embedded use-case is also the most mature. I wouldn't be surprised if extensions that help with safer programming languages would be added for desktop/server class CPUs, if they haven't already (I haven't followed the development of the spec that closely recently)

dataflow · on Dec 2, 2021

>> RISC-V has an acceptable code size only for unsafe code

> You have no idea what you're talking about.

> It should be said that my experience is in embedded, so yes, it's unsafe code.

Just going based off your reply it certainly sounds like they had at least some idea what they were talking about? In which case omitting that sentence would probably help.

voz_ · on Dec 2, 2021

Textbook example of the kind of hostility and close-mindedness that is creeping into our beloved site. Why are we dick measuring? why are we comparing experience like this? so much "I" "I" "I"...

I have no horse in the technical race here, but I certainly am put off from reading what should be an intellectually stimulating discussion by the nature of replies like this.

snvzz · on Dec 2, 2021

It was likely instigated by its parent trying to inflate himself by giving themselves some credentials, to try and give their voice more weight.

All of it, pretty sad, but I believe we should focus on the technical arguments and try to put everything else aside in order to re-conduct the discussion somewhere more useful.

wott · on Dec 3, 2021

Somehow I find it refreshing to see flamewars about ISAs, for I hadn't seen any in the last 15 years (at least). It makes me feel young again. :-)

flatiron · on Dec 2, 2021

Oh no. We don’t maintain this site with these types of comments no matter your feelings. It’s the internet. Don’t get heated!

macintux · on Dec 3, 2021

> Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.

> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

https://news.ycombinator.com/newsguidelines.html

zozbot234 · on Dec 3, 2021

Memory-safe programming does not need any special ISA extension compared to traditional, C-like unsafe code. Even the practical overhead of bounds- and overflow checking is all about how it impedes the applicability of optimizations, not about the checks themselves.

saagarjha · on Dec 4, 2021

It doesn't need them, but having hardware checks is generally more performant than having software ones.

GoblinSlayer · on Dec 3, 2021

RISC-V doesn't hinder safe code, that was an incorrect claim. Bound checks are done with one instruction - bltu for slices, bgeu for indexes. On intel processor you need cmp+jb pair for this.

The linked message is about carry propagation pattern used in gmp. AIU optimized bignum algorithms accumulate carry bits and propagate them in bulk and don't benefit from optimal one bit at a time carry propagation pattern.

zamadatix · on Dec 2, 2021

> Except for this good feature, the rest of the ISA is full of bad features

What are your thoughts on the way RISC V handled the compressed instructions subset?

hajile · on Dec 2, 2021

It's not too surprising. Load, store, move, add, subtract, shift, branch, jump. These are definitely the most common instructions used.

Put it side-by-side with Thumb and it also looks pretty similar (thumb has a multiply instruction IIRC).

Put it side-by-side with short x86 instructions accounting for the outdated ones and the list is pretty similar (down to having 8 registers).

All in all, when old and new instruction sets are taking the same approach, you can be reasonably sure it's not the absolute worst choice.

zamadatix · on Dec 2, 2021

It was more a question of the way it was handled (i.e. it's not a different mode and can be mixed) than what the opcode list looked like.

hajile · on Dec 2, 2021

Mode switching bloats the instruction count by shifting in and out. RISC-V does well here.

If there's a criticism, it's that the two bytes on 32-bit instructions mean the total instruction range is MUCH smaller overall until you switch to 48-bit instructions which are then much bigger.

brandmeyer · on Dec 2, 2021

It only addresses a subset of the available registers. Small revisions in a function which change the number of live variables will suddenly and dramatically change the compressibility of the instructions.

Higher-level languages rely heavily on inlining to reduce their abstraction penalty. Profiles which were taken from the Linux kernel and (checks notes...) Drystone are not representative of code from higher-level languages.

3/4 of the available prefix instruction space was consumed by the 16-bit extension. There have been a couple of proposals showing that even better density could be achieved using only 1/2 the space instead of 3/4, but they were struck down in order to maintain backwards compatibility.

brucehoult · on Dec 3, 2021

This is just rubbish.

Small revisions to a function that increase the number of live variables to more than the set that are covered by the C extension mean that reference to THAT VARIABLE ONLY have to use a full size instruction. There is nothing sudden or dramatic.

Note that a number of a C instructions can in fact use all 32 registers. This includes stack pointer-relative loads and stores, load immediate ({-32..+31}), load upper immediate (4096 * {-32..+31}, add immediate and add immediate word ({-32..+31}), shift left logical immediate, register to register add, and register move.

It's certainly possible that another compressed encoding might do better using fewer opcode, and I've seen the suggestions. The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

But anyway it does well, and the opcode space used is not excessive. If anything it's TOO SMALL. Thumb2 gets marginally better code size while using 7/8ths of the opcode space for the 16 bit instructions instead of RISC-V's 3/4.

brandmeyer · on Dec 3, 2021

> The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

The RISC-V Compressed Spec v1.9 documented the benchmarks which were used for the optimization. RV32 was optimized with Dhrystone, Coremark, and SPEC-2006. RV64GC was optimized using SPEC-2006 and the Linux kernel.

hajile · on Dec 3, 2021

> It's certainly possible that another compressed encoding might do better using fewer opcode, and I've seen the suggestions. The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

Javascript is limited to 64-bit floats as is lua and a couple other languages.

Sure, you can optimize to 31/32-bit ints, but not always and not before the JIT warms up.

adrian_b · on Dec 3, 2021

The compressed instruction encoding is very good and it is mandatory for any use of RISC-V in embedded computers.

With this extension, RISC-V can be competitive with ARM Cortex-M.

On the other hand, the compressed instruction encoding is useless for general-purpose computers intended as personal computers or as servers, because it limits the achievable performance to much lower levels than for ARMv8-A or Intel/AMD.

kragen · on Dec 4, 2021

This is wrong. RISC-V's instruction length encoding is designed in such a way that compressed instructions can be decoded very quickly. It doesn't pose the same kind of performance problem amd64's variable instruction length encoding does. Even if it did, it would be obvious nonsense to claim that this made it impossible for a RISC-V implementation to achieve amd64-like performance; if it were true, it would also make it impossible for amd64 implementations to achieve amd64-like performance, which is clearly a contradiction. And ILP and instruction decode speed are also improved by other aspects of the RISC-V ISA: the fused test-and-branch instructions you mentioned, but also the MIPS-like absence of status flags.

FullyFunctional · on Dec 3, 2021

This of course utter nonsense. There's nothing different about the performance of compressed instructions.

adrian_b · on Dec 3, 2021

For competitive performance in 2021 with CPUs that can be used at performance levels at least as high as those required for mobile phones, it is necessary to decode simultaneously at least 8 instructions per clock cycle (actually more for RISC-V, because its instructions do less than those of other CPUs).

The cost in area and power of a decoder for variable-length instructions increases faster with the number of simultaneously-decoded instructions than the cost of a decoder for fixed-length instructions.

This makes the compressed instruction encoding incompatible with high-performance RISC V CPUs.

For the lower performance required in microcontrollers, the compressed encoding is certainly needed for adequate code density.

The goals of minimum code size and of maximum execution speed are contradictory and the right compromise is different for an embedded computer and for a personal computer.

That is why ARM has different ISAs for the 2 domains and why also RISC-V designs must use different sets of extensions, depending on the application intended for them.

socialdemocrat · on Dec 3, 2021

RISC-C compressed instructions cannot be compared to CISC variable length instructions. The instruction boundaries are easy to determine in parallel for multiple decoders. Something which is hard for e.g. x86. Compressed instructions don’t have arbitrary length. It is two instructions fitted in a 32-bit word.

Decompression is part of the instruction decoding itself. It only requires a minuscule 400 logical gates to do.

In fact RISC-V is very well designed for doing out-of-order execution of multiple instructions as instructions have been specifically designed to share as little state as possible. No status registers or conditional execution bits. Thus most instructions can run in separate pipelines without influencing each other.

serentty · on Dec 14, 2021

8-wide is the absolute state of the art. Last I checked, AMD’s fastest core was 4-wide, and Intel’s was 6-wide. I only know of Apple doing 8-wide, and not anyone else. So branding this as the minimum necessary for mobile devices when even most desktops do not achieve it is silly.

theresistor · on Dec 2, 2021

> I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: Evidence was considered, all the factors were weighted, and decisions were made accordingly.

It's perfectly possible to have read the spec and disagree with the rationale provided. RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.

> Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

This doesn't seem to be true when you actually do an apples-to-apples comparison.

Taking as an example the build of Bash in Debian Sid (https://packages.debian.org/sid/shells/bash). I chose this because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other examples like the Linux kernel are harder to compare because the code in question is different across architectures. I saw the same trend in the GCC package, so it's not an isolated example.

riscv64 installed size: 6,157.0 kB amd64 installed size: 6,450.0 kB arm64 installed size: 6,497.0 kB armhf installed size: 6,041.0 kB

RV64 is outperforming the other 64-bit architectures, but under-performing 32-bit ARM. This is consistent with expectations: amd64 has a size penalty due to REX bytes, arm64 got rid of compressed instructions to enable higher performance, and armhf (32-bit) has smaller constants embedded in the binary.

Compressed instructions definitely do work for making code smaller, and that's part of why arm32 has been very successful in the embedded space, and why that space hasn't been rushing to adopt arm64. For arm32, however, compressed instructions proved to be a limiting factor on high performance implementation, and arm64 moved away from them because of it. Maybe that's due to some particular limitations of arm32's compressed instructions that RISC-V compressed instructions won't suffer from, but that remains to be proven.

adrian_b · on Dec 2, 2021

The size of the files can be very misleading, because a large part of the files can be filled with various tables with additional information, with strings, with debugging information, with empty spaces left for alignment to page boundaries and so on. So the size of the installed files is not necessarily correlated with the code size.

To compare the code sizes, you need tools like "size", "readelf" etc. and the data given by the tools should still be studied, to see how much of the code sections really contain code.

I have never seen until now a program where the RISC-V variant is smaller than the ARMv8 or Intel/AMD variant, and I doubt very much that such a program can exist. Except for the branches, where RISC-V frequently needs only 4 bytes instead of 5 bytes for Intel/AMD or 8 bytes for ARMv8, for all the other instructions it is very frequent to need 8 bytes for RISC-V instead of 4 bytes for ARMv8.

Moreover, choosing compiler options like -fsanitize for RISC-V increases the number of instructions dramatically, because there is no hardware support for things like overflow detection.

justinpombrio · on Dec 2, 2021

So (i) the research on RISC-V that shows it has dense code is bunk, and (ii) the fact that it compiles to a smaller binary is irrelevant, and (iii) it sounds like you're saying in advance that it might also have smaller code section sizes within the binary but that's irrelevant too.

And yet you're quite confident that RISC-V has poor code density. So you clearly have a source of knowledge that others don't. If it's a blog/article/research, could you share a link? If it's personal experimentation, you should write a blog post, I would totally read that.

pierrebai · on Dec 2, 2021

Re-read what was written. He is saying exactly that the RISCV code size is larger, but to see it you need the right tool used the right way to actually look at the code, not debug info, constant sections, etc.

snvzz · on Dec 2, 2021

I would hope there's something more to his reasoning.

There's tables showing values for just the code and RISC-V beating aarch64 and x86-64 with ample margin, in this very discussion.

serentty · on Dec 14, 2021

Once again, you’re comparing the compressed instructions without telling anyone that that’s what you’re doing, because you are convinced that the compressed instructions will never be performant on a mobile or desktop core. The foundation disagrees, and every manufacturer competing in mobile-class RISC-V chips disagrees. Even if you think that they are wrong, that is the plan they are going forward with. Real world RISC-V chips targeted at the mobile space support the compressed instruction. Even if you think that they are wrong to do so, the support is there. So what you are doing is refusing to compile for the instruction set the chips actually support, instead targeting what you think they should be doing instead, and then declaring that the code density is worse. This is nonsense.

akiselev · on Dec 2, 2021

> RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.

Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field? Obviously ISAs are a low level detail and far removed from the end product, but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon, before ubiquitous internet and the open source culture could establish itself. We've now got companies using genetic algorithms to optimize ICs and people making their own semiconductors in the 100s of microns range in their garages - maybe it's time to rethink some things!

(EE in a past life but little experience designing ICs so I feel like I'm talking out of my rear end)

lonjil · on Dec 2, 2021

> Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field?

Well, it's exactly what many RISC-V folks are trying to do. There's news about a new high performance RISC-V core on the HN front page right now!

> but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon,

I just want to note that ARM64 was a mostly clean break from prior versions of ARM. Basically a clean slate design started in the late 2000s. It's a modern design built with the same hindsight and approximate market conditions available to the designers of RISC-V.

brucehoult · on Dec 3, 2021

I agree that arm64 and RISC-V designers had basically the same hindsight and market conditions to refer to -- and they made a lot of very similar decisions.

But arm64 does seem constrained by compatibility with arm32 -- at least in that they until now (ten years later) usually have to share an execution pipeline and register set.

Is it really conceivable that the arm64 designers had free rein to make the choice whether to use condition codes or not on a purely technical basis? I don't think so. Even if they thought -- as all other designers of ISAs intended for high performance since 1990 have (Alpha, Itanium, RISC-V) -- that it's better not to use condition codes, I don't think they would have been free to make that choice.

The same goes for whether to expose instructions using the "free" shift on the 2nd ALU input. It's not really free -- it's paid for with a longer clock cycle or an extra pipeline stage or splitting instructions into uops. And since it was there for 32 bit they might as well use it in 64 bit as well. And the same for the complex addressing modes.

btdmaster · on Dec 2, 2021

Unfortunately, it seems that, at least for gmp, the shared objects balloon in comparison to all other architectures. It is about three times bigger (6000 instead of 2000kB): https://packages.debian.org/sid/libgmp-dev. I am hopeful that this may improve with extensions, though I know little about the details.

jepler · on Dec 2, 2021

       text    data     bss     dec     hex filename
     311218    2284      36  313538   4c8c2 arm-linux-gnueabihf/libgmp.so.10
     374878    4328      56  379262   5c97e riscv64-linux-gnu/libgmp.so.10
     480289    4624      56  484969   76669 aarch64-linux-gnu/libgmp.so.10
     511604    4720      72  516396   7e12c x86_64-linux-gnu/libgmp.so.10

Strange, that's not what I see.

mianos · on Dec 2, 2021

Probably because, like most applications, that one does not have a lot of wide multiplications. It is hard not the turn this point into an insult at the OP.

ant6n · on Dec 2, 2021

Perhaps thumb2 makes an 8-wide decide much harder. Plus, then you can't have 32 instead of 16 registers.

xiphias2 · on Dec 2, 2021

A company creating embedded risc-v cpus also has some added extra instruction set extensions that conflict with the floating point instructions though.

audunw · on Dec 2, 2021

Yeah, I'm not sure he takes into considered compressed instructions, which can be used anywhere, rather than being a separate mode like Thumb on ARM.

Fusing instructions isn't just theoretical either. I'm pretty sure it is or will be a common optimisation for CPUs aiming for high performance. How exactly is two easily-fused 16-bit instructions worse than one 32-bit one? Is there really a practical difference other than the name of the instruction(s)?

At the same time, the reduced transistor count you get from a simpler instruction set is not a benefit to be just dismissed either. I'm starting to see RISC-V cores being put all over the place in complex microcontrollers, because they're so damn cheap, yet have very decent performance. I know a guy developing a RISC-V core. He was involved with the proposal for a couple of instructions that would put the code density above Thumb for most code, and the performance of his core was better than Cortex-M0 at a similar or smaller gate count. I'm not sure if the instructions was added to the standard or not though.

Even for high performance CPUs, there's a case to be made for requiring fewer transistors for the base implementation. It makes it easier to make low-power low-leakage cores for the heterogeneous architecture (big.little, M1, etc.) which is becoming so popular.

phkahler · on Dec 3, 2021

>> But ultimately, the gist of their argument is this...

Funny, I thought the whole thing was bitching that RISC V has no carry flag which obviously causes multi word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for carry. There may be better solutions, but at twice the number of instructions this "dumb" method is better than what the author did.

Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language.

I still think instead of compare-and-branch they should have made "if" which would execute the following instruction only if true. But that's just just an opinion. I also hate the immediate constants (12 bits?) Inside the instruction. Nothing wrong with 16 32 or 64bit immediate data after the opcode.

I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags...

kortex · on Dec 3, 2021

So if I'm understanding this particular tussle correctly, carry flags are problematic for optimization because they create implicit mutable shared global state, which isn't necessarily reflected in the machine code.

Risc-v basically says "lets make the implicit, explicit" and you have to essentially use registers to store the carry information when operating on bigints. Which for the current impl means chaining more instructions.

Is that correct?

That sounds like what the FP crowd is always talking about - eschewing shared state so it's easier to reason about, optimize, parallelize, etc.

R0b0t1 · on Dec 4, 2021

There's more options than just a global carry flag, like a n+1th bit on each arithmetic register. Actually, that is how I assumed they would implement it.

quotemstr · on Dec 3, 2021

> I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: Evidence was considered, all the factors were weighted, and decisions were made accordingly.

Nevertheless, the ISA speaks for itself. The goal of a technical project is to produce a technical artifact, not to generate good feelings having followed a "solid" process.

If the process you followed brought you to this, of what use was the process?

kazinator · on Dec 3, 2021

> It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

Also, the godbolt.org compiler explorer has Risc-V support: useful for someone interested in comparing specific snippets of code.

robert_foss · on Dec 2, 2021

So how would you suggest re-writing their example in less than 6 instructions for RISC-V? X86/arm both have instructions that include the carry operation for long additions, and only require 2 instructions.

jolmg · on Dec 2, 2021

I don't even see the issue. RISC-V is supposed to be a RISC-type ISA. It's in the very name. That it takes more instructions when compared to a CISC-type ISA like x86 is completely normal.

https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...

theresistor · on Dec 2, 2021

The argument for RISC instructions (in high performance architectures) is that the faster decode makes up for the increase in instruction count. The problem is that a faster decode has a practical ceiling on how much faster it's going to make your processor, and it's much lower than 3x. If your workload is bottlenecked on an inner loop that got 3x larger in instruction count, no 15% improvement in decode performance is going to save you.

snvzz · on Dec 2, 2021

Amount of instructions matters much less if they can be fused into more complex instructions before execution.

RISC-V was designed with hindsight on fusion, thus it has more opportunities for doing it, and doing it at a lower cost.

And, due to the very high code density RISC-V has, the decoder can do its job while not having to look at a huge window.

saagarjha · on Dec 4, 2021

Ok, but where it the chip that can fuse these?

jolmg · on Dec 2, 2021

I don't know what the design goals of RISC-V were, but I would guess performance is not the key goal or at least not the only goal. It makes more sense that ease of implementation is a more important goal, if they want to make adoption easy. That's another argument for favoring RISC over CISC.

monocasa · on Dec 2, 2021

If that's the case, you can always stick a uop cache in after the decoder.

theresistor · on Dec 2, 2021

To everyone who's saying "But macro-fusion!" in response, see my comment here: https://news.ycombinator.com/item?id=29421107

userbinator · on Dec 2, 2021

Larger caches won't help much either; there's an old article I remember that compares the efficiency of various ARM, x86, and one MIPS CPU, and while x86 and ARM were neck-and-neck, the MIPS was dead last in all the comparisons despite having more cache than the others. RISC-V is very similar to MIPS.

snvzz · on Dec 2, 2021

Larger caches, as seen in Apple's M1 L1, are one of many tools to deal with bad code density.

RISC-V might, at first glance, look similar to MIPS, but it leads in code density among the 64 bit architectures.

User23 · on Dec 2, 2021

> [RISC-V] leads in code density among the 64 bit architectures.

You keep baldly asserting this in virtually all of your very many replies here, with a vague appeal to your own authority, but you haven't shown anything. Given that the submission is precisely an example of bad code density, if you're really here in the service of intellectual curiosity then please show instead of just telling.

brucehoult · on Dec 3, 2021

No argument from authority is needed.

Anyone is free to download the disk images for a large body of software such as the same versions of Ubuntu or Fedora, and compare the binary sizes -- using the "text" output from "size" command, not raw disk files as there are also things such as debugging info in there.

Here's an example, using (ironically) the GMP library itself.

https://news.ycombinator.com/item?id=29423324

Here we see riscv64 significantly smaller than the other 64 bit ISAs aarch64 (28.1%) and x86_64 (36.5%), and beaten by a smaller margin by 32 bit ARM Thumb2 (-17.0%)

Anyone can check the sizes of bash, perl, emacs ... whatever they want ... themselves, without relying on the word of anyone here.

astrange · on Dec 3, 2021

That could be confounded by gcc not enabling -funroll-loops, alignment of branch targets, etc for some architectures.

brucehoult · on Dec 7, 2021

Conceivably, different ISAs could have the code compiled with different size/performance tradeoffs. One assumes that the Ubuntu and Fedora etc people would make sensible choices, consistent across ISAs. I don't know any reason why they would both make the same bad choices (which by "bad" here I am talking about small code at the expense of speed).

Gcc certainly supports unrolling, alignment of labels / loops / functions on all targets I'm aware of.

rbanffy · on Dec 2, 2021

I don't think there is anything preventing the processor to fuse those instructions into a single operation once they are decoded.

adrian_b · on Dec 2, 2021

Instruction fusion is the magical rescue invoked by all those who believe that the RISC-V ISA is well designed.

Instruction fusion has no effect on code size, but only on execution speed.

For example RISC-V has combined compare-and-branch instructions, while the Intel/AMD ISA does not have such instructions, but all Intel & AMD CPUs fuse the compare and branch instruction pairs.

So there is no speed difference, but the separate compare and branch instructions of Intel/AMD remain longer at 5 bytes, instead of the 4 bytes of RISC-V.

Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.

Fusing instructions will not help RISC-V with the code density, but it is the only way available for RISC-V to match the speed of other CPUs.

Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

snvzz · on Dec 2, 2021

>Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.

Yet, as many pointed out to you already, RISC-V has the highest code density of all contemporary 64bit architectures. And aarch64, which you seem to like, is beyond bad.

>but it is the only way available for RISC-V to match the speed of other CPUs.

Higher code density and lack of flags helps the decoder a big deal. This means it is far cheaper for RISC-V to keep execution units well fed. It also enables smaller caches and conversely higher clock speeds. It's great for performance.

This, if anything, makes RISC-V the better ISA.

>Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

Grasping at straws. RISC-V has been designed for fusion, from the get-go. The cost of doing fusion with it has been quoted to be as low as 400 gates. This is something you've been told elsewhere in the discussion, but that you chose to ignore, for reasons unknown.

avianes · on Dec 2, 2021

I see that you are pretty active here in debunking anti-RISC-V attacks, thanks for that! There are a bunch of poor criticisms about RISC-V.

> This is something you've been told elsewhere in the discussion, but that you chose to ignore, for reasons unknown.

I would call it RISC-V bashing.

Everyone loves to hate RISC-V, probably because it's new and heavily hyped.

It is really common to see irrelevant and uninformed criticism about RISC-V. The article, which seems to be enjoyed by the HN audience, literally says: "I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project". How can anyone say such a thing about a collaborative project of more than 10 years, fed by many scientific works and projects and many companies in the industry?

I do not mean that RISC-V is perfect, there are some points which are source of debate (e.g. favoring a vector extension rather than the classic SIMD is a source of interesting discussion). But I would appreciate on HN to read better analysis and more interesting discussions.

imtringued · on Dec 3, 2021

The more I think about CPU implementations the more I think that what RISC-V is doing isn't as bad as many people think. Everyone is going "more instructions = worse".

But the truth is that if you can build a CPU that fetches an infinite number of instructions per cycle, your biggest bottleneck isn't going to be the number of instructions, it's going to be unpredicted branches, jumps and function calls because fetching the entire function + everything behind it doesn't help, if you're going somewhere else. But the opposite is also true, adding more instructions than you need doesn't hurt as much as many people seem to think.

In practice the code density of RISC-V is not significantly worse than other architectures. So we don't even have to imagine an infinitely large fetcher, a finite fetcher that is bigger than what x86 CPUs have is good enough.

avianes · on Dec 3, 2021

> But the truth is that if you can build a CPU that fetches an infinite number of instructions per cycle, your biggest bottleneck isn't going to be the number of instructions, it's going to be unpredicted branches

Note that: Within Superscalar processors a group of instructions that are decoded at the same cycle is called a decoding group.

Branching is a problem, but the branch predictors do an excellent job. (especially for function calls which are very well predicted by the RAS [Return Address Stack]) But the biggest bottleneck to fetch a large instruction group is decoding.

Especially the instruction size decoding. An ISA like RISC-V or ARM that drastically reduces the possible instruction sizes is a big advantage to decode large instruction groups.

And dependencies between instructions within the decoding group is also a concern. For example, register renaming will quickly require several cycles (several stages) when the decoding group scales up. RISC-V also addresses this since the register indexes are easily decoded earlier and the number of registers used can also be quickly decoded.

And you're right, these are topics that are rarely addressed by RISC-V detractors.

foxfluff · on Dec 3, 2021

> How can anyone say such a thing about a collaborative project of more than 10 years, fed by many scientific works and projects and many companies in the industry?

Well the statement you quoted might be exaggerating things quite a bit but you're also just handwaving. The base ISA isn't a result of 10 years of industry experts doing their best; it's an academic project and a large proportion of it carried out by students:

> Krste Asanović at the University of California, Berkeley, had a research requirement for an open-source computer system, and in 2010, he decided to develop and publish one in a "short, three-month project over the summer" with several of his graduate students. [..] At this stage, students provided initial software, simulations, and CPU designs

The ISA specification was released in 2011! In one year! Of course there's been revisions since then, the most substantial being 2.0 in 2014 (I think). But if you look at the changes and skip anything that isn't just renaming / reordering / clarifying things, it always fits on half a page. It's by and large the same ISA, with some nice finetuning.

And here's the thing, a lot of people who originally read the spec felt like it is what it looks like, a "textbook isa", very much the kind of thing a group of students might come up with (I wonder why?).. just taken to completion. And what I remember from the spec (I read it long ago) is that cost of implementation was almost always a primary concern (and that high performance inmplementations would have to work harder but shrug itsnobigdealright?): it smelled like a small academic ISA tuned specifically for cheap microcontrollers. Not an ISA designed by industry experts for high performance cores. But the hype party is trying to sell it as a solution for all your computing needs, and almost seem to claim that no tradeoffs have been made against performance? And this is on a submission about performance, which is of course a subject a lot of people find interesting...

So I think there's very much reason to be critical of and discuss the ISA. Some critique may come from wrong assumptions, but being critical is not just bashing, and calling (attempts at) technical criticism uninformed & irrelevant and handwaving it away with 10 years of hype isn't helping the discussion. Better contribute that better analysis you refer to. (Unfortunately it seems like mostly everyone is just arguing without posting technical analysis)

avianes · on Dec 3, 2021

> Some critique may come from wrong assumptions, but being critical is not just bashing

Focusing criticism only on this architecture, producing criticism without any solid argument, putting aside all the positive aspects, criticism when you clearly lack of expertise, sending similar criticisms in several conversations when already debunked on multiple occasions.. This is really systematic on discussion about RISC-V. That's what I call bashing, I cannot call that legitimate, correct or constructive criticism.

> The ISA specification was released in 2011! In one year! [...] But if you look at the changes and skip anything that isn't just renaming / reordering / clarifying things, it always fits on half a page.

I encourage anyone to open the original document published in 2011 [1] and compare it with current RISC-V specification documents [2].

There is very little left from the 3 month student work, it mainly remains the philosophy which is probably highly influenced by project supervisors. Moreover, it mainly consists of the basic RISC-V ISA which is indeed designed to be simple and minimalist, whereas the current RISC-V full spec. consists of a multitude of extensions.

At this stage your statement and the statement from the email is not just exaggerated, it is pure misinformation.

> Not an ISA designed by industry experts for high performance cores

Okay, the industry wasn't involved as much as today in the beginning. But RISC-V is really the product of experts in the field of high performance architectures

[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-...

[2] https://riscv.org/technical/specifications/

foxfluff · on Dec 3, 2021

> I encourage anyone to open the original document published in 2011 [1] and compare it with current RISC-V specification documents [2].

And for extra credit, look how much it changed with respect to the thing that's being discussed around this submission. Integer arithmetic and carry or overflows in base ISA. Carry flag, add? Sltu? Oh right, the base ISA hasn't changed so much after all.

> Moreover, it mainly consists of the basic RISC-V ISA which is indeed designed to be simple and minimalist, whereas the current RISC-V full spec. consists of a multitude of extensions.

Hand waving about 10 years of extensions and shuffling things around does little to address criticism concerning the base ISA. Hilariously so when these extensions fail to address the criticism at all.

Still no analysis either. I'm not going to call out misinformation in a post free of information.

audunw · on Dec 2, 2021

> Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

I'm very skeptical that a RISC-V decoder would be much more complex than an X86 one, even with instruction fusion. For the simpler fusion pairs, decoding the fused instructions wouldn't be more complex than matching some of the crazy instruction encoding in X86.

For ARM I'm not so sure, but RISC-V does have very significant instruction decoding benefits over ARM too, so my guess would be that they'd be similar enough.

socialdemocrat · on Dec 2, 2021

You ignore compressed instructions on RISC-V which 64-bit ARM does not have.

And if you compare 32-bit CPUs then RISC-V has twice as many registers reducing the number of instructions needed to read and write from memory.

RISC-V branching takes less space and so does vector instructions. There are many case like that which adds up end results in RISC-V having the most dense ISA in all studies when using compressed instructions.

Dylan16807 · on Dec 2, 2021

> Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

On the other hand just splitting up x86 instructions is very expensive, and decoding in general takes a lot of work before you even start to do fancy tricks.

lordnacho · on Dec 2, 2021

How does the instruction fusion work? It seems to be mentioned in the article and by a couple of other commenters.

volta83 · on Dec 2, 2021

The CPU executes the two (or more) dependent instructions "as if" they were one, e.g., in 1 cycle.

The CPU has a frontend, which has a decoder, which is the part that "reads" the program instructions. When it "sees" certain pattern, like "instruction x to register r followed by instruction y consuming r", it can treat this "as if" it was a single instruction if the CPU has hardware for executing that single instruction (even if the ISA doesn't have a name for that instruction).

This allows the people that build the CPU to choose whether this is something they want to add hardware for. If they don't, this runs in e.g. 2 cycles, but if they do then it runs in 1. A server CPU might want to pay the cost of running it in 1 cycle, but a micro controller CPU might not.

cpeterso · on Dec 2, 2021

Do RISC-V specs document which instruction combinations they recommend be fused? Sounds like the fused instructions are an implementation detail that must be well-documented for compiler writers to know to emit the magic instruction combinations.

mlyle · on Dec 3, 2021

For this specific case, yes, the RISC-V ISA document recommends instruction sequences for checking for overflow-- that are both amenable to fusion and are relatively high performing on implementations that don't fuse.

Section 2.4, https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2...

__init · on Dec 3, 2021

It generally goes the other way around -- programmers and compilers settle on a few idiomatic ways to do something, and new cores are built to execute those quickly. Because RISC-V is RISC, it seems likely that those few ways would be less idiomatic and more 'the only real way to do x', which would aid in the applicability of the fusions.

vitno · on Dec 2, 2021

Any != All. There is a difference between synthetic benchmarks and real world test cases.

waterhouse · on Dec 2, 2021

(RISC-V fan here) This is a real-world use case. GMP is a library for handling huge integers, and adding two huge integers is one of the operations it performs, and the way to do that is to add-with-carry one long sequence of word-sized integers into another. It's not synthetic; it's extremely specialized, but real.

api · on Dec 2, 2021

So this person found a pathological case for the RISC-V instruction set?

adrian_b · on Dec 2, 2021

This is not a pathological case, it is normal operation.

A computer is supposed to compute, but the RISC-V ISA does not provide everything that is needed for all the kinds of computations that exist.

The 2 most annoying missing features are the lack of support for multi-word operations, which are needed to compute with numbers larger than 64 bits, but also the lack of support for detecting overflow in the operations with standard-size integers.

If you either want larger integers or safe computations with normal integers, the number of RISC-V instructions needed for implementation is very large compared to any other ISA.

While there are people who do a lot of computations with large numbers, even the other users need such operations every day. Large number computations are needed at the establishment of any Internet connection, for the key exchange. For software developers, many compilers, e.g. gcc (which uses precisely libgmp), do computations with large numbers during compilation, for various kind of optimizations related to the handling of constants in the code, e.g. for sub-expression extraction or for operation complexity lowering.

So every time when some project is compiled, libgmp or other equivalent library for large numbers might be used, like also every time when you click on a new link in a browser.

So this case is not at all pathological, except in the vision of the RISC-V designers who omitted support for this case.

That was a good decision for an ISA intended only for teaching or for embedded computers, but it becomes a bad decision when someone wants to use RISC-V outside those domains, e.g. for general-purpose personal computers.

panick21_ · on Dec 2, 2021

> A computer is supposed to compute, but the RISC-V ISA does not provide everything that is needed for all the kinds of computations that exist.

This is non-sense. You can still do everything you need. Its just that in some cases the code size is a bit bigger or smaller.

And the difference with compressed instruction is not nearly as big, if you add fusion the difference is marginal.

So really its not a pathological case its 'its slightly worse' case and even that is hard to prove in the real world given the other benefit RISC-V brings that compensate.

And we can find 'slightly worse case' in the opposite direction if we would go looking for them.

If you gave 2 equal skilled teams 100M and told them to make the best possible personal computer chip, I would bet on the RISC-V team winning 90 times out of a 100.

imtringued · on Dec 3, 2021

The assembly code in the email is trivial. You don't seem to understand that the carry bit dependency exists regardless of the architecture. So ultimately, just fetching more instructions is enough to achieve optimal performance. As others said, code density of RISC-V is very reasonable on average. It's not significantly worse than x86 across an entire binary.

smoldesu · on Dec 2, 2021

I don't think you're supposed to. The compiler handles that stuff, ideally RISC-V is just another compilation target.

masklinn · on Dec 2, 2021

Did you misunderstand the issue entirely?

The context here is the implementation of one of the inner loops of a high-performance infinite-precision arithmetic library (GMP), in RISCV the loop has 3x the instruction count it has in competing architectures.

“The compiler” is not relevant, this is by design stuff that the compiler is not supposed to touch because it’s unlikely to have the necessary understanding to get it as tight and efficient as possible.

brucehoult · on Dec 3, 2021

An actual arbitrary-precision library would have a lot of loops with loops control and load and stores. Those aren't shown here. Those will dilute the effect of a few extra integer ALU instructions in RISC-V.

Also, an high performance arbitrary-precision library would not fully propagate carries in every addition. Anywhere that a number of additions are being done in a row e.g. summing an array or series, or parts of a multiplication, you would want to use carry-save format for the intermediate results and fully propagate the carries only at the final step.

bubblethink · on Dec 2, 2021

>There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

Wait, if we are talking about actual ISA instructions, why is it hard to believe that RISC-V would have more of them ? The argument in favor of RISC is to simplify the frontend because even for a complex ISA like x86, the instructions will get converted to many micro-ops. In terms of actual ISA instructions, it seems quite reasonable that x86 would have fewer of those (at the cost of frontend complexity).

snvzz · on Dec 2, 2021

Code density is about requiring the least amount of bytes of code to get something done, on average.

Doing it using a small pool of instructions, too (as RISC-V does), is the cherry on top.

socialdemocrat · on Dec 2, 2021

RISC V is an opinionated architecture and that is always going to get some people fired up. Any technology that aims for simplicity has to make hard choices and trade offs. It isn’t hard to complain about missing instructions when there are less than 100 of them. Meanwhile nobody will complain about ARM64 missing instructions because it had about 1000 of them.

Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.

RISC-V has been designed aggressively to have minimal ISA to leave plenty of room to grow, and require minimal number of transistors for a minimal solution.

Should this be a showstopper down the road, then there will be plenty of space to add an extensions that fixes this problem. Meanwhile embedded systems paying a premium for transistors are not going to have to pay for these extra instructions as only 47 instructions have to be implemented in a minimal solution.

imtringued · on Dec 3, 2021

It reminds me of nutrition advice. The 70s said X is evil or bad. Then we discover, X doesn't matter.

I think in 10-20 years everyone will agree that all the "bad" RISC-V decisions don't matter. The same way x86 (CISC) was supposed to be bad because of legacy/backwards compatibility.

DeathArrow · on Dec 3, 2021

I remember Jim Keller saying that for every ISA just 7 or 8 instructions are important and they are used for like 99% of the code. Load, store and the like.

rep_lodsb · on Dec 3, 2021

That doesn't mean we should eliminate everything else. Many operations that would require 3 or more simple instructions are easy enough to implement in hardware, so that they take no more time than one of these simple instructions. This can be a big win even if they are only used 1% of the time.

klelatti · on Dec 3, 2021

Fair point but Arm is not just Arm64 - if you want a simple low cost ISA then there is Cortex-M.

Taniwha · on Dec 2, 2021

So this is one tiny corner of the ISA, not something that makes ALL instruction sequences longer - essentially RISCV has no condition codes (they're a bit of an architectural nightmare for everyone doing any more than the simplest CPUs, they make every instruction potentially have dependencies or anti-dependencies with every other).

It's a trade off - and the one that's been made is one that makes it possible to make ALL instructions a little faster at the expense of one particular case that isn't used much - that's how you do computer architecture, you look at the whole, not just one particular case

RISCV also specifies a 128-bit variant that is of course FASTER than these examples

sanxiyn · on Dec 2, 2021

RISC-V designers optimized for C and found overflow flag isn't used much and got rid of it. It was the wrong choice: overflow flag is used a lot for JavaScript and any language with arbitrary precision integer (including GMP, the topic of OP).

aidenn0 · on Dec 2, 2021

Over just the time I've been aware of things, there's been a constant positive feedback loop of "checked overflow isn't used by software, so CPU designers make it less performant" followed by "Checked overflow is less performant so software uses it less."

I wish there was a way out.

Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement.

WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.

I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.

R0b0t1 · on Dec 3, 2021

There's a blog page somewhere that's a rant for implementing saturating and other arithmetic modes. Would be a really good idea.

Main one is interrupt on overflow.

imtringued · on Dec 3, 2021

I agree. A lot of software only wants protection against overflows but does not depend on them for functionality. If something wants to read out the carry bit, it should be explicit and although it is unfortunate, indicating that requires a full instruction.

titzer · on Dec 3, 2021

> WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.

Can you refresh my memory here? What exactly is different about Wasm return values than any other function-oriented language?

aidenn0 · on Dec 3, 2021

Common Lisp will often have auxiliary return values that are often not needed (e.g. the mathematical floor function returns the remainder as a second value). Unused extra values are silently discarded.

So you can do, for example (+ (floor x y) z) without worrying about the second value to floor.

A lisp compiler for a register machine will usually return the first N values in registers (for some small value of N), so the common case of using only the primary return value generates exactly the same code regardless of how many values are actually returned.

I don't remember the details, but however wasm happens to implement multiple return values, you can't just pretend that a function that returns 2 values only returns one.

titzer · on Dec 3, 2021

Return values are pushed onto the Wasm operand stack in left-to-right order. Engines use registers for params/returns up to a point in optimized code and then spill to the stack. So at the call site you can just drop the return values you don't want and should end up with machine code that does exactly what you want.

aidenn0 · on Dec 3, 2021

wasm functions must have a fixed number of results[1]. Lisp functions may have variadic results. A wasm function call must explicitly pop every result off of the stack[2].

These rules add significant overhead to implementing:

  (foo (bar))

Which calls foo with just the first result of bar, and the number of results yielded by bar is possibly unknown.

In psuedo-assembly for a register machine, implementing this is roughly:

  CALL bar
  MOVE ResultReg1 -> ArgumentReg1
  CALL foo

And this works regardless of how many values the function bar happens to return[3].

Any implementation that is correct for any number of results of "bar" will be slow for the common case of bar yielding one or two results.

1: https://webassembly.github.io/spec/core/syntax/types.html#sy...

2: https://webassembly.github.io/spec/core/exec/instructions.ht...

3: Showing only the caller side of things, when the caller doesn't care about extra values hides some complexity of implementation of returning values because you also need to be able to detect at run-time how many values were actually returned. e.g. SBCL on x86 uses a condition flag to indicate if there are multiple values, because branching on a condition flag lets you handle the only-1 value case efficiently.

titzer · on Dec 6, 2021

Ok, I see. Yes, there is some overhead if you want not just multiple returns, but variadic returns. Most virtual machines and programming languages make this fairly inefficient because of fixed arity of returns. AFAICT, targeting anything other than machine code is going to cost you in this case.

throwaway81523 · on Dec 3, 2021

Common Lisp supports multiple value returns, which C doesn't. Go sort of does, though.

titzer · on Dec 3, 2021

Wasm supports multiple value returns.

roca · on Dec 2, 2021

Also Rust applications are increasingly going to be built with integer overflow checking enabled, e.g. Android's Rust components are going to ship with integer overflow checking. And unlike say GMP, that poses a potential code density problem because we're not talking about inner loops that can be effectively cached, it's code bloat smeared across the entire binary.

rstuart4133 · on Dec 4, 2021

> it's code bloat smeared across the entire binary.

That's probably not true in the usual case. Most arch's are 64 bit nowadays. If you are working on something that isn't 64 it you are doing embedded stuff, and different rules and coding standards apply (like using embedded assembler rather than pure C or Rust). In 64 bit environments only pointers are 64 bits by default, almost all integers remain 32 bit. Checking for a 32 bit overflow on a 64 bit RISC-V machine takes the same amount of instructions as everywhere else. Also, in C integers are very common because they are used as iterators (ie, stepping along things in for loops). But in Rust, iterators replace integers for this sort thing. There still is an integer under the hood of course, and perhaps it will be bounds checked. But that is bounds checked - not overflow checked. 2^32 is far larger than most data structures in use. Which means while there may be some code bloat, the lack full 64 integers in your average Rust problem means it's going to be pretty rare.

Since I'm here, I'll comment on the article. It's true the lack of carry will make adds a little more difficult for multi precision libraries. But - I've written a multi precision library, and the adds are the least of your problems. Adds just generate 1 bit of carry. Multiplies generate an entire word of carry, and they almost a common as adds. Divides are no so common fortunately, but the execution time of just one divide will make all the overhead caused by a lack of carry look like insignificant noise.

I'm no CPU architect, but I gather the lack of carry and overflow bits makes life a little easier for just about every instruction other than adc and jo. If that's true, I'd be very surprised if the cumulative effect of those little gains didn't completely overwhelm the wins adc and jo gets from having them. Have a look at the code generated by a compiler some time. You will have a hard time spotting the adc's and jo's because there are bugger all of them.

roca · on Dec 4, 2021

It's a good observation that checking for 32-bit overflow is not too bad, but...

> In 64 bit ... almost all integers remain 32 bit.

I don't believe this is true for Rust, even if we exclude index integers in iterators because we think those overflow checks can be optimized out. Certainly in my Rust project (Pernosco) there is a lot of manipulation of 64-bit integer data.

Taniwha · on Dec 2, 2021

Yeah but the code required for an overflow check is just one extra instruction (3 rather than 2)

masklinn · on Dec 2, 2021

For generalised signed addition, the overhead is 3 instructions per addition. It can be one in specific contexts where more is known about the operands (e.g. addition of immediates).

It’s always 1 in x64/ARM64 as they have built-in support for overflow.

Taniwha · on Dec 2, 2021

you have to include the branch instruction too in any comparison

kccqzy · on Dec 3, 2021

In x86 it is still one instruction: jc or jo after an addition.

brucehoult · on Dec 3, 2021

And in RISC-V it's a "BLT sum,summand" after an unsigned addition, and a single branch instruction after a signed addition if you know the sign of one of them e.g. adding or subtracting a constant.

rurban · on Dec 3, 2021

built-in support for overflow is not really builtin. There's only adc, but the C library has no such function, and compilers only got them added a few years ago. Almost nobody uses them, as they introduce a dependency. Most workaround that in much slower ways, worse than RISC-V

kannanvijayan · on Dec 2, 2021

It kind of chafed when I excitedly read the ISA docs and found that overflow testing was cumbersome.

That said, I think it's less of an issue these days for JS implementors in particular. It might have mattered more back in the day when pure JS carried a lot of numeric compute load and there weren't other options. These days it's better to stow that compute code in wasm and get predictable reliable performance and move on.

The big pain points in perf optimization for JS is objects and their representation, functions and their various type-specializations.

Another factor is that JS impls use int32s as their internal integer representation, so there should be some relatively straightforward approach involving lifting to int64s and testing the high half for overflow.

Still kind of cumbersome.

There are similar issues in existing ISAs. NaN-boxing for example uses high bits to store type info for boxed values. Unboxing boxed values on amd64 involves loading an 8-byte constant into a free register and then using that to mask out the type. The register usage is mandatory because you can't use 64-bit values as immediates.

I remember trying to reduce code size and improve perf (and save a scratch register) by turning that into a left-shift right-shift sequence involving no constants, but that led to the code executing measurably slower as it introduced data dependencies.

formerly_proven · on Dec 2, 2021

> It kind of chafed when I excitedly read the ISA docs and found that overflow testing was cumbersome.

It just feels backwards to me to increase the cost of these checks in a time where we have realized that unchecked arithmetic is not a good idea in general.

kannanvijayan · on Dec 3, 2021

I think I agree it was a mistake/wart. I can understand the frustration of the GMP dev in the text - they get hit hard by this. The omission feels arbitary and capricious and a bit ideologically motivated. The defense of the choice seems like post-hoc rationalization.

RiscV looks like a nice ISA otherwise.

I wouldn't be surprised if it was eventually extended to add a set of variant instructions that wrote flags to an explicitly specified register.

R0b0t1 · on Dec 3, 2021

Shouldn't settle for less than interrupt on overflow tbh.

Dylan16807 · on Dec 3, 2021

That would probably be the easiest to add, wouldn't it?

R0b0t1 · on Dec 3, 2021

It should be, but they never do. It's really important for HPC workloads.

audunw · on Dec 2, 2021

Which is exactly the right trade-off for embedded CPUs, where RISC-V is most popular right now.

If desktop/server-class RISC-V CPUs become more common, it's not unreasonable to think they'll add an extension that covers the needs of managed/higher-level languages like RISC-V.

Even for server-class CPUs you could argue that you absolutely want this extension to be optional, as you can design more efficient CPUs for datacenters/supercomputers where you know what kind of code you'll be running.

zozbot234 · on Dec 2, 2021

They provide recommended insn sequences for overflow checking as commentary to the ISA specification, and this enables efficient implementation in hardware.

adrian_b · on Dec 2, 2021

Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder).

So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".

ansible · on Dec 3, 2021

> Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder). So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

Ah, but where you do put that bit that you got for free?

A condition codes register, global to the processor / core state? That worked terrific for single-issue microcontrollers back in the 1980's. Now you need register renaming, and all the expensive logic around that to track which overflow bit is following which previous add operation. That's what's being done now for old ISAs, and it generally disliked for several reasons (complexity being chief among them).

Well, you could stuff that bit into another general purpose register, but then you kind of want to specify 4 registers for the add command. Now where are the bits to encode a 4th register in a new instruction format. RISC-V has room to grow for extensions, but another 5 bits for another register is a big ask.

imtringued · on Dec 3, 2021

I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.

ansible · on Dec 4, 2021

> I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.

That sort of design can be done, but that just pushes the problem around. Let's look at the original ARM 64 bit code:

    adds  x12, x6, x10
    adcs  x13, x7, x11

The second add with carry uses the global carry bit, it isn't passed as an argument to the adcs instruction. So if you store the carry bit with the x12 register, you would then need to specify x12 in the adcs instruction on the next line. So you need a new instruction format for adcs that can specify four registers.

You could change the semantics, where the add instructions use one register as the source and the destination, like on x86-64, but that's a whole 'nother discussion on why that is and isn't done on various architectures.

mlyle · on Dec 3, 2021

> A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".

1. There's not multiple additions in the recommended sequences. Unsigned is add,bltu; Signed with one known sign is add, blt; Signed in general is add, slt, slti, bne.

2. These instruction sequences are specified so that an instruction decoder can treat these sequences following the add as a "very wide" instruction specifying to check an overflow flag, if a hardware implementation so chooses.

adrian_b · on Dec 3, 2021

Even if a dedicated comparator can be a little cheaper than a full adder, all CPUs already have full adders/subtractors, so all the comparisons are done by subtraction in the same adder/subtractor.

So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.

mlyle · on Dec 3, 2021

> So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.

For the signed overflow case, with compile-time unknown value added, if the processor doesn't do anything fancy to elide the recommended sequence (fusion or substitution of operation) only. Not for the other two cases, and not with fusion.

throwaway81523 · on Dec 2, 2021

> They provide recommended insn sequences for overflow checking as commentary to the ISA specification, and this enables efficient implementation in hardware.

I would like to see some benchmarks of this efficient implementation in hardware, even simulated hardware, compared against conventional architectures.

Even for C, it's a recurring source of bugs and vulnerabilities that int overflow goes undetected. What we really need is an overflow trap like the one in IEEE floating point. RISC-V went the opposite direction.

wyldfire · on Dec 3, 2021

Is it worth the encoding space to define new ALU instructions with overflow flag semantics? Or could there be an executive format that implies a different mode?

theresistor · on Dec 2, 2021

This isn't an isolated case. RISC-V makes the same basic tradeoff (simplicity above all else) across the board. You can see this in the (lack of) addressing modes, compare-and-branch, etc.

Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.

lottospm · on Dec 2, 2021

I'm not an expert on ISA and CPU internals, but an X86 instruction is not just "an instruction" anymore. Afaik, since the P6 arch Intel is using a fancy decoder to translate x86/-64 CISC into an internal RISC ISA (up to 4 u-ops per CISC instruction) and that internal ISA could be quite close to the RISC-V ISA for all I know.

Instruction decoding and memory ordering can be a bit of nightmare on CISC ISAs and fewer macro-instructions are not automatically a win. I guess we'll eventually see in benchmarks.

Even though Intel has had decades to refine their CPUs I'm quite excited to see where RISC-V is going.

theresistor · on Dec 2, 2021

As someone who is an expert on ISA and CPU internals, this meme of "X86 has an internal RISC" is an over-simplification that obscures reality. Yes, it decodes instructions into micro-ops. No, micro-ops are not "quite close to the RISC-V ISA".

Macro fusion definitely has a place in microarchitecture performance, especially when you have to deal with a legacy ISA. RISC-V makes the very unusual choice of depending on it for performance, when most ISAs prefer to fix the problem upstream.

jasonhansel · on Dec 2, 2021

Indeed. Also not an expert, but relying on macro-op fusion in hardware is tricky IIRC since different implementors will (likely) choose different macro-ops, resulting in strange performance differences between otherwise-identical chips.

Of course, you could start documenting "official" macro-ops that implementations should support, but at that point you're pretty much inventing a new ISA...

seoaeu · on Dec 2, 2021

RISC-V does document "official" macro-ops that implementations are encouraged to support.

evilos · on Dec 3, 2021

> "X86 has an internal RISC" is an over-simplification that obscures reality

Is it misleading though? I don't mind simplifications unless they are misleading. Would like to hear your criticisms of this meme.

imtringued · on Dec 3, 2021

X86 processors could have an internal VLIW for all we know. The instruction length would be very small by Itanium standards but still. It could be anything.

evilos · on Dec 3, 2021

Seems to me that the parent comment was claiming to know.

lonjil · on Dec 2, 2021

Most commonly used x86_64 instructions decode to only 1 or 2 µops, thus often also just as "complex" as the original instructions.

Tuna-Fish · on Dec 2, 2021

> but an X86 instruction is not just "an instruction" anymore.

This is technically true but not really. Decoding into many instructions is mainly used for compatibility with the crufty parts of the x86 spec. In general, for anything other than rmw or locking a competent compiler or assembly writer will only very rarely emit instructions that compile to more than one uop. The way the frontend works, microcoded instructions are extraordinarily slow on real cpus.

Modern x86 is basically a risc with a very complex decode, few extra useful complex operations tacked on, and piles and piles of old moldy cruft that no-one should ever touch.

monocasa · on Dec 2, 2021

X86 doesn't have to go to microcode to have multiple uOPs for an instruction. Most uarchs can spit out three or four uOPs per instruction before having to resort to the microcode ROM. Basically instructions that would only need one microcode ROM row in a purely microcoded design can be spit out of the fast decoders.

rstuart4133 · on Dec 4, 2021

> it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.

As someone else who replied said, I'm not a CPU architect, just software that works close to the metal. That means I pay attention to compiler output.

What you say is true in the very early says: compilers did indeed use the x86's addressing modes in all sorts of odd ways to squeeze as many calculations as possible into as few bytes as possible. Then it went in the reverse direction. You started seeing compilers emitting long series of simple instructions instead, seemingly deliberately avoiding those complex addressing modes. And now it's swung back again - I'm the complier using addressing modes to shift plus a couple of adds in one instruction is common again. I presume all these shifts were driven by speed of the resulting code.

I have no idea why one method was faster than the other - but clearly there is no hard and fast rule operating here. For some internal x86's implementations using complex addressing modes was a win. On some, for exactly the same instruction set, it wasn't. There is no cut and dried "best" way of doing it, rather it varies as the transistor and power budget changes.

One thing we do know about RISC-V is it is intended to cover a _lot_ transistor and power budgets. Where it's used now (low power / low transistor) is turned out their design decisions have turned out _very_ well, far better than x86.

More fascinatingly to me, today the biggest speed ups compilers get for super scalar arch's has nothing to do with the addressing modes so much attention is being focused on here. It comes from avoiding conditional jumps. The compilers will often emit code that evaluates both paths of the computation (thus burning 50% more ALU time on a calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases, I've seen doing that sort of thing gain them a factor of 10, which is far more than playing tiddly winks with addressing modes will get you.

I have no idea how that will pan out for RISC-V. I don't think any one has done a super scalar implementation of it yet(?) But in the non-super scalar implementations the RISC-V instruction set choices have worked out very well so far. And when someone does do a super scalar implementation (and I'm sure there will be a lot of different implementations over time), it seems very possible x86's learnings on addressing mode use will be yesterdays news.

nickez · on Dec 2, 2021

For those use cases you typically have specialised hardware or an FPGA.

vadfa · on Dec 2, 2021

So when h266 or whatever comes out you can't watch video anymore because your cpu can't decode it in software even if it tried?

lvh · on Dec 2, 2021

An FPGA can be reprogrammed, and we do really do this for standards with better longevity than video standards (e.g. cryptographic ones like AES and SHA). For standards like video codecs, we just use GPUs instead, which I assume is what OP had in mind for "specialized hardware" (specialization can still be pretty general :-)).

plorkyeran · on Dec 2, 2021

Hardware video decoding is done by a single-purpose chip on the graphics card (or dedicated hardware inside the GPU), not via software running on the GPU. Adding support for a new video codec requires buying a new video card which supports that codec.

imtringued · on Dec 3, 2021

SystemC bloat will require you to upgrade to a bigger FPGA!

classichasclass · on Dec 2, 2021

So do what Power does: most instructions that update the condition flags can do so optionally (except for instructions like stdcx. or cmpd where they're meaningless without it, and corner oddballs like andi.). For that matter, Power treats things like overflow and carry as separate from the condition register (they go in a special purpose register), so you can issue an instruction like addco. or just a regular add with no flags, and the condition register actually is divided into eight, so you can operate on separate fields without dependencies.

crest · on Dec 2, 2021

IIRC few Power(PC) cores really split the condition register nibbles into 8 renamable registers and while Power(PC) includes everything (including at least two spare kitchen sinks) only a few instructions can pick which condition register nibble to update. Most integer instructions can only update cr0 and floating point instructions cr1. On the other hand you can do nice hacks with the cornucopia of bitwise available bitwise operations on condition register bits and it's one of the architectures where (floating point) comparisons return the full set of results (less, equal, greater, unordered).

adrian_b · on Dec 2, 2021

On POWER, all the comparison instruction can store their result in any of the 8 sets of flags. The conditional branches can use any flag from any set.

The arithmetic instructions, e.g. addition or multiplication, do not encode a field for where to store the flags, so they use, like you said, an implicit destination, which is still different for integer and floating-point.

In large out-of-order CPUs, with flag register renaming, this is no longer so important, but in 1990, when POWER was introduced, the multiple sets of flags were a great advance, because they enabled the parallel execution of many instructions even in CPUs much simpler than today.

Besides POWER, the 64-bit ARMv8 also provides most of the 14 predicates that exist for a partial order relation. For some weird reason, the IEEE FP standard requires only 12 of the 14 predicates, so ARM implemented just those 12, even if they have 14 encodings, by using duplicate encodings for a pair of predicates.

I consider this stupid, because there would not have been any additional cost to gate correctly the missing predicate pair, even if it is indeed one that is only seldom needed (distinguishing between less-or-greater and equal-or-unordered).

jabl · on Dec 2, 2021

ARM also does something similar, many instructions has a flag bit specifying whether flags should be updated or not. It doesn't have the multiple flag registers of POWER though.

crest · on Dec 2, 2021

Which at least back in the day neither the IBM compilers nor GCC 2.x - 4.x made much use of. I've seen only a few handoptimzed assembler routines get decent use out of them. Easy to fuse pairs a probably a good compromise for carry calculation e.g. add + a carry instruction. That would get rid of one of the additional dependencies, but it would take a three operand addition or fusing two additions to get rid of the second RISC V specific dependency. And while GMP isn't unimportant it is still a niche use case that's probably not worth throwing that much hardware resources at to fix the ISA limitations in the uarch.

bshanks · on Dec 7, 2021

Which feature are you saying was not used much? Addition with carry; or addition without carry; or multiple flag registers?

monocasa · on Dec 2, 2021

> they're a bit of an architectural nightmare for everyone doing any more than the simplest CPUs, they make every instruction potentially have dependencies or anti-dependencies with every other

It's doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked like PowerPC) the dependency issue can go away because they're renamed and their results aren't dependent on previous data.

Now, of course, instructions that only update some condition flags and preserve others are the devil.

bitwize · on Dec 2, 2021

It's not a tiny corner. People do arithmetic with carry all the time. Arbitrary precision arithmetic is more common than you think. Congratulations, RISC-V, you've not only slowed down every bignum implementation in existence, all those extra instructions to compute carry will blow the I$ faster, potentially slowing down any code that relies on a bignum implementation as well.

andrekandre · on Dec 3, 2021

  > all those extra instructions to compute carry will blow the I$ faster

i think the idea is, as others have mentioned, the add/comp instructions are fused internally to a single instruction, so probably its not that bad for i$ as we might think?

user-the-name · on Dec 2, 2021

> RISCV also specifies a 128-bit variant that is of course FASTER than these examples

Is it actually implemented on any hardware?

ncmncm · on Dec 2, 2021

No. Mentioning it is only meant to distract.

TheCondor · on Dec 2, 2021

Is there a semi competitive Risc-V core implemented anywhere?

It all seem hypothetical to me now, fast cores would fuse the instructions together so instruction count alone isn't adequate for the original evaluation of the ISA. Now I'm not sure that there are any that really do that..