Show HN: Minimax – A Compressed-First, Microcoded RISC-V CPU (github.com/gsmecher)
171 points by gsmecher on Nov 1, 2022 | 38 comments
RISC-V's compressed instruction (RVC) extension is intended as an add-on to the regular, 32-bit instruction set, not a replacement or competitor. Its designers intended RVC instructions to be expanded into regular 32-bit RV32I equivalents via a pre-decoder.

What happens if we explicitly architect a RISC-V CPU to execute RVC instructions, and "mop up" any RV32I instructions that aren't convenient via a microcode layer? What architectural optimizations are unlocked as a result?

"Minimax" is an experimental RISC-V implementation intended to establish if an RVC-optimized CPU is, in practice, any simpler than an ordinary RV32I core with pre-decoder. While it passes a modest test suite, you should not use it without caution. (There are a large number of excellent, open source, "little" RISC-V implementations you should probably use reach for first.)




This is very impressive, especially the performance per LUT! Did I overlook a frequency spec on a given target, or did you not specify one?

Will the execute stage pipeline effectively to reach higher f_max? (Of course there will be a small logic penalty, and a larger FF penalty, but the core is small enough that it would probably be tolerable.) Or is the core's whole architecture predicated on a two stage design?


This core is targeted at "smaller-is-better" applications with few actual instruction-throughput requirements. If it reaches 200 MHz on a Xilinx KU060, I will be delighted. (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)

With that in mind: the single instruction-per-clock design is for simplicity's sake, not performance's sake. If the execution stage were pipelined, it'd be a different core. If performance is the goal, I'd start by ripping out some of the details that distinguish this core from other (excellent) RISC-V cores.


> 200 MHz on a Xilinx KU060

> (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)

Fun clue! Looks like the Xilinx KU060 is a rad-hard FPGA for space applications. Does anyone know what 200 MHz might imply? Comms maybe?


The KU060 costs a nice sum of £4,529.10 on Mouser (out of stock, of course).


A fully space-qualified version is something like $150k.


> out of stock of course

I picked probably the worst time imaginable to get into FPGAs. All of my "higher" end stuff is repurposed mining hardware...


That is actually a good time, because that hardware can otherwise get super expensive. Also, the mining boards just end up in landfills if nobody buys them.


Mining landfills is the way. Future generations will not believe how much magic we threw away.


Poor man's Tile64?


That is very cool. I'm particularly interested in the compressed-first approach, because I have some projects where minimising BRAM usage is paramount, so code density really matters. The use of microcode to emulate 32-bit instructions reminds me a lot of the ZPU (I still have a soft spot for that architecture) - was that an influence?


I've heard of the ZPU in passing but never looked at it in much detail - I didn't realize there was a GCC back-end for these machines. James Bowman's J1 CPU [0] is also stack-based and has definitely helped me shape my preferences.

[0]: https://excamera.com/files/j1.pdf


This is very nice. A couple years ago I was playing around with a hobby project I was dubbing "Retro-V", which was to be a RISC-V core tied to a 1980s-style display processor and keyboard/mouse input on a small FPGA with 512k or 1MB or so of SRAM. I was using PicoRV32 for that, but this would have been far better.


PicoRV32 and FemtoRV32 are both excellent, conventional RISC-V implementations, and are more complete and proven than Minimax. Relative to the size of any 7-series or newer Xilinx FPGA, the difference in LUT cost between any of the three is pretty minor. I think you made a perfectly defensible decision. (I love me some SERV, too, and if you are willing to spend orthodoxy to save gates, it's an excellent choice too.)


Yes, PicoRV32 is very nice. However, for what I was building, with limited RAM, compressed instructions would have made a lot of sense. I started porting a BASIC to my system (in C), and it quite easily would have filled almost the whole 512kB SRAM.

And the thought of handwriting one in RISC-V assembly convinced me that maybe RISC-V wasn't as "retro friendly" as I would have liked.


Understood. Maybe this landed after your project - but both PicoRV32 and SERV now support the compressed (RVC) extension, at some additional resource cost. FemtoRV32 Quark doesn't - which is not a knock, since it's a beautifully simple implementation and that's the point.

The retrocomputing scene looks like a ton of fun and I'd be delighted if any of my work is used there.


Ah, yes, this was 2018/19, in the Before Times, and I don't recall if PicoRV32 had compressed yet but I don't think it did.

SERV always looked intriguing, too. Though I recall maybe its build process was a hassle.

Anyways, this is neat, keep on keeping on! I'm just a software guy, so I remain amazed by the world at the gate level and what it can do. Entirely different kind of abstraction building.


What is it about RISC-V assembly you didn't like? The little I've done seems like slightly more hassle than amd64 assembly but nothing like the level of bending over backwards of 6502 assembly.


RISC ISAs are built around an assembly language that is for compilers to write, not humans. That's not to say you can't, but the dance in and out of the registers is tedious.

And RISC-V is more explicitly pure-RISC than ARM, and honestly I think ARM is "friendlier" to write with its conditional instructions and other conveniences. Also for some reason I find the ARM mnemonics easier to read. Maybe it's the 6502 influence, I dunno. FWIW I have never learned x86 assembly; when I went from 68000 to x86, I just started writing C instead.

I'm not saying any of this makes ARM superior. I think RISC-V's choices make more sense for a compiler target. It's not meant to be written by humans. But on a "retro" machine, assembly is kind of part of the deal.

Maybe a fancy macro assembler for RISC-V would be nice, to give a slightly higher level CISC-style set of semantics, but without going the whole hog to a high level language.

https://erik-engheim.medium.com/arm-x86-and-risc-v-microproc... has a nice breakdown.


No other assembly is as nice to program as 68000 assembly!

It sounds like your main frustration is saving and restoring registers on entry and exit to subroutines? I agree that that's a particular pain point in RISC-V. Have you considered using millicode for that? Even without assembly macros it's not that bad.

The article you cited, aside from being soft-paywalled, is clueless. The author says, "Actually in theory an x86 instruction could be of infinite length, but dealing with infinitely long instructions is impractical. Thus both Intel and AMD set a practical limit and refuse to process instructions which are encoded as longer than 15 bytes." This indicates they don't understand the x86 instruction encoding at all, and the article is full of careless errors: RISC-V MV and LI are canonically ADDI and not ANDI, not all RISC architectures have a zero register, 3-operand instructions don't reduce memory traffic, most RISC instruction sets don't have an instruction with a 16-bit address field (certainly RISC-V doesn't), there is no "rsp" register in the 16-bit 8086, a pipeline flush doesn't flush the cache (!), and on and on.

Try to forget everything you read in that article! Instead, read something by somebody who cares whether what they're writing is true or false.


Sounds interesting! What were you using for the display processor?


I was hand-rolling my own. I had it doing a basic 640x480 buffer with character generation, sprite support, and HDMI/DVI output.

These days I'd probably consider forking my friend Randy's C64 VICII implementation (VIC-II Kawari) and just expand framebuffer size, sprites, colours, etc, since he put so much work into it.

It was a lot of fun, but I got stalled on the SD card interface. That was more complexity than I felt like dealing with at that point. And I was working at Google at the time, so they owned all my thoughts and deeds, and going through the open-sourcing process for it would have been a hassle. If I weren't hunting for work and needing to make $$ right now, I'd pick it up again, maybe? It was more of a Verilog learning process.


The actual VHDL source is incredibly small. I would have thought that implementing a CPU, even a toy one, would take more than 500 lines. Is this normal for hardware?


What you see is all there is.

At a certain scale, hardware designs become complex enough that it's conventional to structure them hierarchically, just to maintain control. This design is small enough that none of that extra structure is essential.

It's possible to be incredibly expressive in Verilog and VHDL. This implementation is written in VHDL, which has an outdated reputation for being long-winded.

Also worth a look: FemtoRV32 Quark [0], which is written in Verilog.

[0]: https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/...


Have you seen the OPC series of CPUs? (One Page Computing - the challenge being to keep the code small enough to be printed onto a single sheet of line printer paper!)


Yup! Thanks for pointing OPC [0] out. These CPUs were a huge eye-opener - and a huge lesson about the value of using a standardized instruction set.

Building a custom CPU commits you to writing an assembler and listing generator - which is a good hobby-project job for one person who's handy with Python. After stumbling through those foothills, though, I found myself at the base of some very steep, scary GCC/binutils cliffs wondering how I could have gotten so lost, so far from home.

Even if all RISC-V does is offer a bunch of arbitrary answers to arbitrary design questions, I consider it a massive win.

[0]: https://revaldinho.github.io/opc/


Yes, I also found the idea of tangling with GCC (or even llvm) less than appealing. It's not the initial work that puts me off, but the ongoing maintenance cost. For my own project (EightThirtyTwo) I ended up writing a backend for the VBCC C compiler. The downsides are (a) no C++ support, and (b) an unusual license - but the upsides were (a) a build process that takes seconds, not hours, (b) a simple generic RISC backend one can use as a starting point, and (c) a compiler lightweight enough that it could be self-hosting. (I can compile C code for EightThirtyTwo using an Amiga!)


A traditional CPU in its most basic form is nothing more than a programmable state machine where the transition function is the series of instructions you (the programmer) write down, with local state in the form of the register file, and some ports attached to a memory controller (so it can fetch and write instructions and data). A 3-stage fetch/decode/execute pipeline can be done in a very small space if you don't get clever.
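
To make that concrete: a fetch/decode/execute loop is, at its barest, a three-state machine. Here's a toy VHDL sketch (not Minimax's actual code; the entity and signal names are made up):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Toy sketch only: fetch/decode/execute as a three-state
    -- machine stepping a program counter over memory.
    entity toy_cpu is
      port (
        clk       : in  std_logic;
        mem_addr  : out unsigned(31 downto 0);
        mem_rdata : in  std_logic_vector(15 downto 0));
    end entity;

    architecture rtl of toy_cpu is
      type state_t is (FETCH, DECODE, EXECUTE);
      signal state : state_t := FETCH;
      signal pc    : unsigned(31 downto 0) := (others => '0');
      signal insn  : std_logic_vector(15 downto 0);
    begin
      mem_addr <= pc;

      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when FETCH =>
              insn  <= mem_rdata;  -- latch the instruction at pc
              state <= DECODE;
            when DECODE =>
              -- derive control signals from insn here
              state <= EXECUTE;
            when EXECUTE =>
              -- drive the register file and ALU here
              pc    <= pc + 2;     -- step one 16-bit (RVC) slot
              state <= FETCH;
          end case;
        end if;
      end process;
    end architecture;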

This is just that. But nothing more. For example, it does not handle any RISC-V CSRs, even the most basic ones. But that's OK: for "computational" machine code kernels that aren't fancy (i.e. basic ALUs get lit up but nothing fancier), you can use software toolchains like GCC to emit compatible code.

A "real" toy CPU i.e. one that won't win awards but can boot something like Zephyr OS or a maybe a miniature OS with some form of memory protection will require many more lines; for proper exception handling, for that memory protection, timers and peripherials, for extra CPU features (atomics, debug interface, whatever.) A comparable CPU for this might be something like PicoRV32, which fits in at about ~2,000 lines of Verilog.

But that's a lot of stuff. Sometimes all you need is a programmable state machine, and with this one you can run (limited) normal C programs, built with a supported compiler, on a 32-bit machine.


I suspect some heavier lifting is done here:

    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
It looks like the VHDL source covers instruction decoding, registers, etc., but does not include things like the ALU logic. (I don't know VHDL, actually.)


Those two lines are just the VHDL equivalent of #include <stdio.h> - i.e. boilerplate that you'll see in almost every source file.

But it's true that you don't have to describe the ALU down to the bit level - thanks to those two lines you can say "q <= d1 + d2" instead of having to build an adder at the gate level. (Though you can, of course, do that if you really want to!)
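
For instance, a complete behavioral 32-bit adder (entity name made up) is little more than that one line:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity add32 is
      port (
        d1, d2 : in  unsigned(31 downto 0);
        q      : out unsigned(31 downto 0));
    end entity;

    architecture rtl of add32 is
    begin
      -- numeric_std's "+" lets the synthesizer infer the carry chain.
      q <= d1 + d2;
    end architecture;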


> RISC-V's compressed instruction (RVC) extension is intended as an add-on

Doesn’t it make this… an IISC? Increased instruction set? Asking for a friend


RISC no longer has the clear border it had 30 years ago. Nowadays RISC just means an ISA has most of the following properties:

1. Load/store architecture
2. Fixed-length instructions, or few length variations
3. Highly uniform instruction encoding
4. Mostly single-operation instructions

These four points all have direct benefits for hardware design. And compressed ISAs like RVC and Thumb check them all.

On the contrary, "fewer instruction types" and "orthogonal instructions" never had any real benefit beyond perceptual aesthetics, and as a result they have long been abandoned.


Can the address and/or data also be 16-bit, or would that violate the RISC-V spec?


AIUI the registers, and operations on them, must be 32-bit for RV32I.

The bus is up to you: should you want an 8-bit data bus and a 16-bit address bus, I don't think the spec cares.

This is akin to the 68020 (32-bit ISA) vs the 68000 (still a 32-bit ISA) or the 68008 (still a 32-bit ISA).


I don't think the RISC-V spec cares, either, since it specifies an execution environment but not interfaces.

A narrower data bus would allow a 2-cycle execution path, and would likely split the longest combinatorial path in the current design (which certainly goes through the adder tree). This could be either a 0.5 instruction-per-clock (IPC) design, or a pipelined design that maintains 1 IPC at the expense of extra pipeline hazards and corresponding bubbles.

A narrower address seems like it's only helpful as a knock-on to a split data bus.

Gut feeling: I doubt that splitting the data or address buses into additional phases would actually save resources. You would certainly need more flip-flops to maintain state, and more LUTs to manage combinatorial paths across the two execution stages. While you can sometimes add complexity and "win back" gates, it's an approach with limits. If you compare SERV's resource usage to FemtoRV32-Quark's, it's notable how much additional state (flip-flops) SERV "spends" to reduce its combinatorial logic (LUT) footprint.


Interesting that shifts are in the <1 IPC set; I thought those were fairly cheap with a barrel shifter. Does this core simply omit one for space reasons, or are they more expensive than I expect?


Barrel shifters are huge in the context of small CPUs (especially on FPGAs). To do a barrel shift, you need (input width) * (shift-amount bits) LUTs, as you need that many stages of 2:1 muxes. That means 32*5 = 160 on RV32, as the shift amount is 5 bits.

OP's CPU takes up around 400 LUTs. Since a 2:1 mux takes up 1 LUT (although it seems the numbers are for a LUT6-based device, which can take a 4:1 mux, so maybe that can make the amount a bit lower?), you would add 160 LUTs. That's quite a lot.
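
For reference, here's what those stages look like as a VHDL sketch (a 32-bit logarithmic left shifter; names are illustrative). Each of the five generate iterations is a row of 32 2:1 muxes, one row per shift-amount bit:

    library ieee;
    use ieee.std_logic_1164.all;

    entity lshift32 is
      port (
        d   : in  std_logic_vector(31 downto 0);
        amt : in  std_logic_vector(4 downto 0);
        q   : out std_logic_vector(31 downto 0));
    end entity;

    architecture rtl of lshift32 is
      type stage_t is array (0 to 5) of std_logic_vector(31 downto 0);
      signal s : stage_t;
    begin
      s(0) <= d;
      -- Stage k shifts by 2**k when amt(k) is set:
      -- 5 stages x 32 muxes = 160 2:1 muxes total.
      gen : for k in 0 to 4 generate
        s(k + 1) <= s(k)(31 - 2**k downto 0)
                    & std_logic_vector'(2**k - 1 downto 0 => '0')
                    when amt(k) = '1' else s(k);
      end generate;
      q <= s(5);
    end architecture;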


This depends on the FPGA's resources: some have barrel shifters as hard IP.


I don't think this is true - on Xilinx, you can coax a DSP48 macro into implementing a barrel shifter, but the underlying primitive is a multiplier and not a barrel shifter.

Unlike adders, a barrel shifter does not generalize well enough to be implemented as a hard block in its own right.
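
To make the coaxing concrete, the idea is that a left shift by n is a multiplication by 2^n, so you decode the shift amount to one-hot and let the multiplier do the moving. A sketch of the general idea (not a verified DSP48 mapping; names are made up):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mul_shift is
      port (
        x : in  unsigned(31 downto 0);
        n : in  unsigned(4 downto 0);
        q : out unsigned(31 downto 0));
    end entity;

    architecture rtl of mul_shift is
      signal onehot : unsigned(31 downto 0);
      signal prod   : unsigned(63 downto 0);
    begin
      -- A cheap 5-to-32 decoder: onehot = 2**n.
      onehot <= shift_left(to_unsigned(1, 32), to_integer(n));
      -- The multiplier does the actual shifting:
      -- x * 2**n = x << n; the low half is the shifted result.
      prod <= x * onehot;
      q    <= prod(31 downto 0);
    end architecture;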





