interesting write-up even outside of the risc-v context, bc most of the optimisations are the same for different platforms
one interesting difference is that memory prefetching/preloading makes such a huge difference here, whereas on arm/x86 (in my experience) it makes so little difference that it's not even worth doing
The chips you're used to have had decades of optimization put into their automatic memory prefetching. Apple's M1 etc. do even better (they can speculatively prefetch through a pointer, which can make lots of data structures a ton faster). Also, for really high performance on x86, explicit prefetching can often be a ~30% boost.
I don't believe there is any official documentation on this, but https://github.com/JuliaLang/julia/pull/49430 for example added prefetching to the marking phase of a GC which saw speedups on x86, but not on M1.
It should be pretty easy to find a MWE that shows the behavior. Something like summing an array of int* should do it.
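A rough sketch of the kind of MWE I mean (untested; __builtin_prefetch is the GCC/Clang builtin, and the lookahead distance of 8 elements is an arbitrary guess you'd have to tune per machine). Fill p with pointers into a shuffled pool so the hardware prefetcher can't predict the targets, then compare against the same loop with the prefetch line removed:

#include <stddef.h>

long sum_indirect(int **p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(p[i + 8]);  /* hint: start pulling in the target line early */
        total += *p[i];
    }
    return total;
}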
i'm using an m1 laptop, so my sgemm implementation relies on the arm neon vector extension, and i saw literally no performance improvement when i added prefetching, which could either be a result of their speculative optimisations or an issue in my own implementation
but i have noticed some ppl saying that adding prefetching on x86 has little performance improvement for sgemm, bc it's a scenario in which it's easier to predict the next memory address that's gonna be used
Probably preaching to the choir here, but if you want high-performance sgemm on M-series, you should consider using the system-provided BLAS. It will invoke the dedicated matmul hardware on Apple CPUs, which has access to significantly higher memory and compute bandwidth than Neon. SIMD-based matmul on M1 only makes sense if your matrices are small and you need the result immediately (e.g. geometric transformations).
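As a minimal sketch of what I mean by the system-provided BLAS, something like this uses Accelerate's CBLAS interface (link with -framework Accelerate; the matrix shapes and row-major layout here are arbitrary choices for illustration):

#include <Accelerate/Accelerate.h>

/* C = A * B, row-major, with A m-by-k, B k-by-n, C m-by-n */
void sgemm_accelerate(int m, int n, int k,
                      const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,   /* alpha, A, lda */
                      B, n,   /* B, ldb       */
                0.0f, C, n);  /* beta, C, ldc */
}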
I hope that Apple will implement SVE/SME one day which will allow us to leverage this hardware directly...
wait, are you suggesting apple has some hardware level GEMM instructions that are undocumented or smth? also, what do you mean by system-provided BLAS? their accelerate framework?
I am talking about the matrix/vector coprocessor (AMX). You can find some reverse-engineered documentation here: https://github.com/corsix/amx
On M3 a single matrix block can achieve ~1 TFLOPS on DGEMM; I assume it will be closer to 4 TFLOPS for SGEMM. The Max variants have two such blocks. I didn't do precise benchmarking myself, but switching Python/R matrix libraries to use Apple's BLAS resulted in a 5-6x perf improvement on matrix-heavy code for me.
yep. they have a neural engine that is separate from the CPU and GPU that does really fast matmuls https://github.com/hollance/neural-engine. it's basically completely undocumented and for programs like blas it has more horsepower than all the cores combined.
this is a great tutorial on optimizing linear algebra algorithms; the animations really help a lot and are really well done
this sgemm progress is pretty impressive. i assume 'version 0' is one of the lower lines on the first graph? he has a graph lower down that shows it getting under 300 megaflops, closer to 50 megaflops for most problem sizes. getting up to 2.2 gigaflops is seriously impressive; that's 55% of the maximum theoretical possible throughput, which is really challenging on things that aren't vector supercomputers (which don't have caches!)
with respect to the rvv assembly, it's worth noting that the allwinner d1 being optimized for here has the pre-ratification version 0.7.1 of the risc-v v extension (rvv) which has some incompatibilities with the ratified 1.0.0. i don't know the instruction set well enough to know if those incompatibilities affect this code
vaguely relatedly, i was astounded yesterday to find that gcc 12.2.0 fails at some basic optimizations on risc-v. (all of the following is with -Os)
this trivial subroutine
int sumarray(int *a, int n)
{
int total = 0;
for (int i = 0; i != n; i++) total += a[i];
return total;
}
compiles (with -Os) to a loop containing separate left shift, address-addition, counter-incrementation, and integer addition instructions, 8 instructions in all, plus a ret. omitting initialization:
1: sext.w a3,a5 ; it's my fault i declared i as int instead of size_t so really 7
bne a1,a3,2f ; return if i == n
ret
2: slli a3,a5,2 ; convert i to byte offset
add a3,a3,a4 ; compute &a[i]
lw a3,0(a3) ; load a[i]
addi a5,a5,1 ; increment i
addw a0,a0,a3 ; update total
j 1b ; lather, rinse, repeat
i can get it to compile to reasonable code (5 instructions in the loop plus a ret) by writing
int sumarray_opt(int *a, int n)
{
int total = 0;
for (int *end = a + n; a != end; a++) total += *a;
return total;
}
like the desperate, cornered animal i am
1: bne a5,a1,2f
ret
2: lw a4,0(a5)
addi a5,a5,4 ; how motherfucking annoying is it that objdump calls this an add
addw a0,a0,a4
j 1b
isn't strength-reduction something compilers know how to do already? are modern cpus so goddamned superscalar that the difference doesn't matter? (i'm pretty sure less than 0.1% of the billion-plus shipped hardware risc-v cpus out in the field are superscalar.)
needing three separate instructions in the first version for converting to a byte offset, adding the byte index to the array base register, and loading from the address is something that could be ameliorated by hardware, but would involve some tradeoffs; the zba bit-manipulation extension to the risc-v isa contains 8 instructions for that sort of thing, including a sh2add that would reduce it to 2. but reducing it to 0 inside the loop with strength reduction is better.
risc-v has a lot of talk about the great importance of macro-op fusion, but that won't help here; it could fuse the shift and add maybe but the add isn't going to go away until somebody applies some loop induction
like, optimizing code for an instruction set with literally thousands of different implementations with different performance characteristics is sort of impossible, but i dare you to find me a risc-v where the first loop runs faster than the second one. or even the same speed
on arm(32) gcc compiles both subroutines to the same code, as any reasonable human being would expect, with the loop being something equivalent to:
1: cmp r3, r1
bxeq lr
ldr r2, [r3], #4
add r0, r0, r2
b 1b
that is, the same version of gcc does do the strength-reduction optimization for arm(32) (though not for aarch64, but maybe that's defensible since aarch64 is usually superscalar: https://godbolt.org/z/1dbYhEavM). this is still 5 instructions because it can combine the load and increment into a single postincrement load, but the return-if-not-equal costs us two instructions instead of one. and the actual speed of that will depend a lot more on things like macro-op fusion and branch prediction. like, it would be easy for the bne to take a lot more than two cycles.
for cortex-m4 gcc also generates the same code for both versions, but of course there's no bxeq on cortex-m4:
so in conclusion risc-v gcc still needs a lot of work to catch up with arm gcc, even to successfully apply broadly-applicable cross-platform optimizations that it successfully applies on arm, which surprised me a lot
postscript: apparently with -O3 instead of -Os gcc does a lot better on risc-v
gcc does give the 4-instr loop on all of -O1, -O2, and -O3 for the basic loop. And, to its credit, the worse loop on -Os does result in the function being 2 bytes smaller (if I'm reading it right), which is what you ask for with -Os.
And on -march=rv64gc_zba, the -Os version is 4 bytes smaller than the -O3 one.
I've also seen that -Os with GCC produces significantly slower code in pretty much all the use cases I've looked at (game rendering code, an optimizing shader compiler, GPU drivers). And that's on mobile devices, where you might expect to be more cache/memory bandwidth limited.
I believe the general idea that -Os was often the best choice was popularized when it became the default on macOS, which uses Clang, which may bias its optimization options differently. And Clang has long had the -Oz option for when you really want to focus on total size, possibly at the cost of performance. GCC only introduced the same relatively recently, in GCC 12. So maybe that will change the bias of -Os going forward?
EDIT: Though I'm not sure about the disassembler on godbolt - I tried seeing if the c++ std::accumulate function would affect code generation, and the disassembled code looks incomplete, and doesn't match the compiler assembler output (by disabling "Compile to Binary Object") at all
I haven't bothered to use -Os much (outside of places where performance decidedly does not matter); it seems fairly clear to me that code with like any amount of hot spots (i.e. most code) would greatly suffer.
I wonder how different things would be if it were more common to select the optimization level per function instead of for the whole project (gcc allows it; clang only has a way to disable all optimizations on a function).
hey, my reply seems to have been lost! i'll try again
yes, using -Os does seem to have been my main problem! possibly in the past gcc's optimization wasn't as flexible and so couldn't trade off performance for code size as effectively
but what surprised me in this case was not that -Os was reducing size at the expense of speed; it's that it was failing to apply a well-known, widely-applicable optimization that both reduces size and improves speed
getting down to the 4-instruction form of the loop does require an optimization that increases size while (usually) improving speed: transforming while (x) y; into if (x) do y; while (x);, which costs four bytes in this case (because the beq/bne doesn't fit into a compressed instruction; risc-v 2-byte compressed branches can only test a single register, so you only have c.beqz and c.bnez)
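in c terms the transformation is roughly this (just a sketch of the shape of the rewrite, not literally what gcc emits):

int sumarray_rotated(int *a, int *end)
{
    int total = 0;
    if (a != end) {              /* guard test, paid once; costs the extra bytes */
        do {
            total += *a++;       /* loop body */
        } while (a != end);      /* single conditional branch at the bottom */
    }
    return total;
}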
You can control most of the optimization flags separately, I think most are listed under the various levels here [0], might be interesting to see what causes the difference, or if it's one of the unspecified "tuning values".
iiinteresting! i can't see how to get machine code listings on godbolt, but on my own machine (with 12.2.0) the -Os sumarray is 30 bytes, sumarray_opt is 22. with -O3 sumarray shrinks to 28 bytes but its loop is indeed only 4 instructions long, and sumarray_opt is 26 bytes (with the same inner loop with slightly different register assignments). so at least that one is smaller. and it's probably true that the 22-byte sumarray_opt with its 5-instruction loop is slower on most machines than the 26-byte version—but the 26-byte version would have been both smaller and faster than what -Os emitted for sumarray originally
with -Os (please forgive objdump's unreadable disassembly and incorrect mnemonics):
still, aside from risc-v's unnecessarily pedagogical syntax, what's wrong with this?
.globl sumarray
sumarray: mv a2, a0 # compensate for abi design blunder: first argument (array base address) in return value register
slli a1, a1, 2 # convert array size to byte count
li a0, 0 # initialize return value to 0
add a1, a1, a2 # compute array end
beq a1, a2, 1f # skip loop if empty
2: lw a3, (a2) # load array item from memory
add a0, a0, a3 # add to total
addi a2, a2, 4 # increment array pointer
bne a1, a2, 2b # continue loop unless done
1: ret
except for the snarky comments, isn't this the code you'd expect from the compiler (without zba)? i wouldn't describe it as thoroughly tested, but it does seem to work. as i count it, it's 24 bytes and so smaller than anything except the original sumarray_opt, and it contains the same 4-instruction loop as the -O3 versions. i guess it's the -O3 sumarray_opt without the useless extra ret instruction, which is presumably some kind of minor bug
Appears I was testing on a version with the int counter replaced with size_t (namely, https://godbolt.org/z/sTW64EEYv vs https://godbolt.org/z/zKqjTa5fz). Regardless, indeed, your sumarray_opt one is shorter still, and with less instructions in the loop. Though beating compilers by an instruction or two is nothing particularly new, being possible often even on x86.
i did see that option but when i selected it i got no output; maybe i have to look for the output somewhere else?
the thing i was surprised by was not really that i could beat it by an instruction or two by rewriting the c code, but that i could reduce the number of instructions in the inner loop by 37% with no apparent tradeoffs by hand-applying an optimization that the compiler already does apply, itself, on arm
Seems it doesn't like -S being present (and it makes sense as you don't want assembly, you want a disassembled object file). In general there's no need for -S in compiler explorer.
Your custom 24-byte version is just two bytes smaller than the "size_t n" gcc -Os version, probably within a rounding error of what compiler developers care to achieve.
yes, agreed! it wasn't that the code size was so important to me in the grand scheme of things, it's that i was surprised by the particular way it happened to fail
I will agree it's slightly weird. But it seems to me it's less that it "fails", more that there's heuristics at play other than "speed up loop". Perhaps on other loops it's more reasonable and makes a bigger difference.
well, i haven't looked at gcc's guts, but how would not strength-reducing pointer arithmetic ever make your loops smaller? or, what other kinds of heuristics are you thinking might be in play, that happen to have that effect here? my guess is that it's not intentional behavior at that level
hmm, maybe you're right. Looking back at the "-Os -march=rv64gc_zba" output, it does the strength reduction, even though the applicability of sh2add is the same regardless of strength reduction or not. A thing that might fit that is wanting to keep the length-0 case short, but I'd be surprised if that's the actual reason.
Then there's clang which does yet another thing of changing the loop to a decrementing one, which is smaller than all options here so far: https://godbolt.org/z/M14rdfzq3; 5-instr loop though, and it stays that way even to -O3.
interesting! only 18 bytes! all compressed instructions! i'd thought of doing that (in general it can be a very tricky transformation for a compiler to make safely, since it requires the loop reduction to be commutative, which of course it is in this case) but had come to the conclusion that it wouldn't help. incorrectly, it seems
in this case it's doing a tricky thing: because it's done the strength-reduction on the pointer arithmetic, it's free to do whatever it wants with the loop counter—not just eliminate it, but also, say, reverse it. that means it doesn't have to care about the loop reduction being commutative or not
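in c terms the loop clang ends up with is roughly this (a sketch of the effect, not clang's actual internal form):

int sumarray_downcount(int *a, int n)
{
    int total = 0;
    for (; n != 0; n--, a++)     /* count n down to zero; the pointer replaces the index */
        total += *a;
    return total;
}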
https://godbolt.org/z/M14rdfzq3 has the trailing semicolon removed, for the convenience of hypothetical other readers of this thread
here's my untested, hopefully uncorrupted reformatting of it:
.globl sumarray
sumarray: li a2, 0 # total := 0
beqz a1, 1f # skip loop if array empty
2: lw a3, (a0) # load *a
addw a2, a2, a3 # add it to total
addi a1, a1, -1 # decrement n-i (previously initialized to n)
addi a0, a0, 4 # a++
bnez a1, 2b # repeat loop unless n-i == 0
1: mv a0, a2 # move total into return value slot
ret
that's 9 instructions, 5 inside the loop, but the only output from one instruction used as an input to another is the output of the lw; plausibly a sufficiently superscalar implementation of risc-v could run an iteration of the loop per cycle? kind of funny instruction scheduling, though, with those two dependent instructions being right next to each other, which would stall a naïve pipeline. macro-op fusion won't help in this case, you really need superscalar
this is an instruction shorter than my version because it doesn't need the slli or the extra instruction to compute the loop end, though it does need the extra addi in the loop. because it's only using beqz and bnez, its branches can all be compressed. i think the main thing i wasn't appreciating was how much space i was wasting on the array end computation, because that's the kind of thing that never matters
the 'pulp' core (now core-v) got a significant speedup on dsp loops by adding special zero-overhead looping registers, so you don't have to waste cycles on decrementing and testing loop counters (on your two inner levels of looping, anyway); part of their claim was that in-order scalar execution took more time per instruction but less picojoules, so if you're limited by energy consumption rather than transistor budgets or nonparallelizable algorithms, you're better off with more, slower in-order cores. from https://docs.openhwgroup.org/projects/cv32e40p-user-manual/e... i think that would make the loop look like this:
.balign 4 # must be 4-byte aligned
.option norvc # no compressed insns!
cv.setup 0, 1f, a1 # loop starts on next insn
lw a3, (a0) # load *a
addw a2, a2, a3 # add it to total
addi a0, a0, 4 # a++
but for larger loops the loop setup can take three instructions
this is also the kind of thing that the vector extensions ought to help with, of course!
Porting it from rvv 0.7.1 to rvv 1.0 should be pretty much one to one. The harder part is that this and other gemm implementations often aren't vector-length agnostic, but that's mostly because gemm requires specific knowledge about the cpu to get good performance anyway.
Btw, with -O3 -march=rv64gcv both gcc and clang can autovectorize both of your example functions. I'd use -march=rv64gcv_zba_zbb_zbs until profiles are supported in the compilers; then we could do something like -march=rva23.
yeah, it would be hard to write a function that was easier to autovectorize. but none of the risc-v hardware i have here actually supports v (or for that matter rv64)
As well as everything else, I'd also suggest using GCC 14. GCC 12 is 2 years old, so whatever optimizations it does or doesn't do aren't that relevant.
Red Hat is scoping work upstream to make GCC perform better on RISC-V code.
usually if i'm writing c instead of js, python, or golang the reason is that i want my code to run on a microcontroller. the majority of the microcontrollers i have here have 4 kibibytes of flash, which is a pretty harsh limit, but vastly more cpu power than is needed for the things i'm doing with them, so i almost always actually want to optimize for size at the cost of nearly everything else. the only thing that's more precious than flash is ram
(why don't i write assembly? well, sometimes i do, but sometimes c is good enough, and it has major advantages. size-optimizing compilers greatly expand the scope of c's applicability.)
In theory, -Oz now serves the "smaller no matter what the cost" case, and allows -Os to make compromises. But with GCC, they still behave like synonyms in my limited experience.
i appreciate the new information! i haven't used clang much for reasons that are probably not very good
probably bytecode interpreters are a better solution for this sort of thing anyway, at least when you have more than 512 bytes of program memory or so. in forth with gforth locals the function is
: sumarray { a n } 0 n 0 ?do a i cells + @ + loop ;
and the body (after the }) consists of 12 instructions, almost all of which would be a single byte each in compact bytecodes like smalltalk-80 or emacs lisp. this might even beat clang's 18-byte rvc version above. there's a sequence of three of them that would conventionally be a single operation (cells + @), but conventionally you'd also use extra instructions to load and store the variable `total` inside and outside the loop instead of leaving it on the stack, so you'd end up with 13 instructions. and the ?do and loop instructions normally have an offset, so you're at 15 bytes, plus the subroutine header
gforth of course uses 8 bytes per instruction and doesn't have a structure like smalltalk's method header to compactly encode the number of locals, so gforth itself is not very frugal with code space; instead { a n } compiles to >l >l at the beginning and lp+2 at the end
There is no theoretical difference between "bytecode" and "machine code", other than speed.
Bytecode is just "a machine code that is designed better than those stupid hardware engineers for the computer I'm using actually did".
Historically, it made a lot of sense on crappy 8 bit CPUs such as the 6502 and z80, if only for the ability to manipulate 16 or 32 bit quantities with a single instruction. Often you also got a bigger and more orthogonal register set, or stack orientation. Bytecode was also often better than the idiosyncratic instruction sets on early minicomputers and mainframes -- usually in size, but especially in being portable between the vast array of different ISAs we used to have before the world settled on just x86 and Arm.
But I don't think there's much of a place for it in the modern world. Perhaps a little for portability, with simpler and more well-defined semantics than C or other languages you compile into bytecode, with the intention to compile it to native code before running.
But not for size.
I've yet to see a bytecode that consistently beats a well-designed modern machine code for size, over an entire real world program, not just some set of cherry-picked toy examples. I include here primarily Arm Thumb2 and RISC-V with the C extension, but there are some other less well-known ones e.g. Renesas RX which is basically an improved M68K and the only example I know of a variable-length instruction encoding at byte granularity that is actually done right and has a significant usage of 1-byte instructions in real code.
I include, especially, stack-oriented bytecodes. They look amazing for evaluating complex arithmetic expressions. But programs mostly do not consist of complex expressions; they do a lot of "a = a + 1" and "c = a + b", and needing four 1-byte instructions to implement these is no better, size-wise, than a simple RISC with 4-byte 3-address instructions. It's worse than Thumb, which does both of these in 2 bytes (assuming the variables are in registers).
> I've yet to see a bytecode that consistently beats a well-designed modern machine code for size, over an entire real world program, not just some set of cherry-picked toy examples.
This feels like a bit of a cop out to me, because one of the biggest ways you get the size advantage from a bytecode is by making it less general purpose than the machine it's running on.
Imagine a micro driving a single LED in some long arbitrary pattern. One way to do that is to just have a one-bit PCM array describing the pattern. Maybe this is just semantics... but I call that a one-bit instruction bytecode with two instructions (LED ON and LED OFF), and I guarantee you no hardware ever manufactured can beat it using native code (if the pattern is over a certain length).
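Concretely, something like this sketch (set_led and delay_ms are stand-ins for whatever the target micro actually provides, and the pattern bytes are arbitrary):

#include <stdint.h>

extern void set_led(int on);        /* assumed platform hook */
extern void delay_ms(unsigned ms);  /* assumed platform hook */

static const uint8_t pattern[] = { 0xAC, 0x17, 0x03, 0xF0 };  /* example pattern, LSB first */

void play_pattern(void)
{
    for (unsigned i = 0; i < sizeof pattern * 8; i++) {
        set_led((pattern[i / 8] >> (i % 8)) & 1);  /* one bit = one "instruction" */
        delay_ms(100);
    }
}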
Sure, that's a toy example if you compare it to a general purpose CPU, but there are a lot of real microcontrollers out there doing jobs that simple.
that's a thing you can do, and there are lots of sources of inspiration for that kind of thing; i'd point at the pico/rp2040's pioasm instruction set, the vt52's instruction set (see the dec manuals on bitsavers), and ken shirriff's analysis of the 8086's microcode http://www.righto.com/2023/04/8086-microcode-string-operatio... for examples
but i think you can also often get some advantage by going in the other direction; for example, the smalltalk-80 virtual machine has bytecode range 176–207 dedicated to 'send arithmetic message' and 'send special message' bytecodes which i think send messages with the selectors +, -, <, >, <=, >=, =, ~=, *, /, \\, @, bitShift:, //, bitAnd:, bitOr:, at:, at:put:, size, next, nextPut:, atEnd, ==, class, blockCopy:, value, value:, do:, new, new:, x, and y. almost all of the time these boil down to a regular cpu instruction or two, because the receivers are built-in classes with primitive methods implementing those messages, but a method containing them can also be applied to objects that implement different methods for them (large integers being the most common example). this implicitly makes smalltalk methods for things like sorting, data structures, and numerical algorithms polymorphic over arbitrary data types, as in ocaml. so you can get by with a single generic version of the subroutine instead of various specialized versions
> I include, especially, stack-oriented bytecodes ...
WASM uses stack-oriented bytecode for simplicity and to make static analysis easier, not just for terseness. (Arguably, the real underlying issue with stack-oriented bytecode is the amount of stack fiddling you have to include in more complex programs. It also does not generalize cleanly to parallel execution, unlike, e.g. dataflow or even register-based SSA.)
in general smalltalk-80 and similar stack bytecodes include almost no stack fiddling; the only stack-manipulation bytecodes included are dup and drop, not even swap or over. instead of stack fiddling, you have load and store bytecodes—in smalltalk, you usually have them in precisely the places the source program had atomic expressions in, respectively, rvalue or lvalue contexts. this simplifies the compiler and also allows you to use a decompiler instead of storing the source code
even on the f18a core in chuck moore's latest designs from greenarrays, there's no swap; you have dup, drop, over, and r> and >r (called push and pop). and of course the way he generalized it cleanly to parallel execution is by putting 144 computers on the same chip, each with separate ram. the code is also pretty compact but not in a way that's practical for bytecode interpretation on a byte-oriented computer
yes, i mostly agree with this, and rvc is especially competitive on this point
often the hardware engineers weren't being stupid. it is possible to some degree to trade off hardware speed and complexity against code size (the most obvious example is instruction alignment requirements), and architectures like arm64 (aarch64) have gone pretty far in the direction of higher speed at the expense of larger code size, even if not quite as far as things like itanic and the original mips. if you're using such an architecture, bytecode can give you a large compactness advantage, so designers try not to use them where that counts. still, the modern world contains vastly more avr and pic16 chips than it does x86 and arm, and those architectures are not great for code density, especially for c code that tends to use a lot of ints
but thumb2 and rvc are pretty hard to beat. i think compactness-optimized stack bytecodes can do it, but only by a little. on this example (which is admittedly a toy example, but not one i cherry-picked to showcase the merits of stack machines) rvc gets to 18 bytes, though, as you can see from the thread, how to get there was far from obvious to me. the stack bytecode is probably 15 bytes, but it probably needs a procedure header declaring the number of arguments to get there, so 17 bytes. plus probably a 2-byte entry in a global procedure table, which rvc doesn't need. you can implement the bytecode interpreter so that it doesn't have any alignment requirements, while rvc requires 2-byte alignment, which you should expect to cost you about a byte every so often on average—though i'm not sure how to account for that in a case like this. one byte per subroutine? one byte per basic block? is that already included in the 18?
note, though, that for the first hours of this thread, rvc was stuck with considerably worse code size at 22 bytes. it wasn't until dzaima whipped out their clang that we realized that by adding a redundant down-counter register (and extra decrement instruction in the inner loop) we could trim it down to 18 bytes
i'm amused that you describe rx as 'an improved m68k', because my thought when i first looked it over (on january 18) was that it looked like renesas thought, 'what if we did the 80386 right?' but it's true that, for example, its addressing mode repertoire is a lot more m68k-ish. by my crude, easily gamable measures, it looks like it's one of the most popular five cpu architectures, which makes it surprising that i hadn't heard of it before this year. my notes are in the (now misnamed) file remaining-8-bit-micros.md in http://canonical.org/~kragen/sw/pavnotes2.git
with respect to your particular examples of non-complex expressions, a large fraction of `a = a + 1` and `n = n - 1` are loop counters, which can be handled with a one-byte or two-byte `loop` instruction, as the 8086 does (though its design choice to only handle down-counting toward zero reduces its usefulness for compiling idiomatic c code), and typically in `c = a + b` it can be arranged for one or two of the three operands to be on top of the stack, so it's typically about 2.5 bytes instead of 4, which is still worse than thumb but only slightly. so typically stack bytecodes with either one operation or one local-variable load or store per byte come in pretty close to equal to rvc-style instruction sets with one operation and two register operands per 16-bit parcel.
the mention of the ubiquitous pic and avr above reminds me that i should mention the other advantage of interpreters in general, including bytecode interpreters: they make harvard-architecture machines programmable. a substantial fraction of the pic and avr machines in the field have hundreds or thousands of bytes of ram, enough for a program that does very interesting things, and an interpreter makes it possible for them to run code from that ram. as stm32 and its clones have progressively squeezed out pic and avr, this has become less important, but if i'm not mistaken, when gigadevice made a risc-v version of their popular higher-frequency gd32f clones of the stm32 (the short-lived gd32vf) they made it harvard! (or quasi-harvard: like the newer avrs, you can read data from the flash with regular instructions, but as i understand the datasheet, you can't execute instructions from the sram—though https://offzone.moscow/upload/iblock/0a5/nad1d86e3ah3ayx38ue... and https://www.eevblog.com/forum/microcontrollers/risc-v-microc... say this is wrong)
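a minimal sketch of what i mean (the opcodes here are invented, and set_led/delay_ms stand in for platform hooks): the dispatch loop itself lives in flash, but the bytecode it executes is plain data in ram, so it can be loaded or changed at runtime even on a harvard machine

#include <stdint.h>

enum { OP_HALT, OP_LED_ON, OP_LED_OFF, OP_DELAY };  /* hypothetical opcode set */

extern void set_led(int on);        /* assumed platform hook */
extern void delay_ms(unsigned ms);  /* assumed platform hook */

void run(const uint8_t *ram_code)   /* bytecode sitting in ram, not flash */
{
    for (const uint8_t *pc = ram_code;;) {
        switch (*pc++) {
        case OP_HALT:    return;
        case OP_LED_ON:  set_led(1); break;
        case OP_LED_OFF: set_led(0); break;
        case OP_DELAY:   delay_ms(*pc++ * 10u); break;  /* one-byte operand, units of 10 ms */
        }
    }
}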
some of my previous notes on this problem, in case it's a problem that interests you, are at https://dernocua.github.io/notes/c-stack-bytecode.html. in there, i took six example pieces of c code (from macpaint, the openbsd c library, the linux kernel, the implementation of php, an undergraduate cs exercise, and open-source flashlight firmware) and examined how to design a compact bytecode to encode them
i've definitely written a lot of assembly code by hand where i did far, far more perverted things to save a single byte; implementing integer division with an integer division instruction is quite vanilla
but in my risc-v case the code it emitted with -Os was not only larger than the code it emitted with -Os given hand-optimized input; it was also larger than the code it emitted with the same input and -O3. my complaint wasn't that the code was slow
> i've definitely written a lot of assembly code by hand where i did far, far more perverted things to save a single byte
I mean, obviously, me too. But this is a general purpose compiler: emitting a runtime division by a compile time constant power-of-two is just silly. If you actually care about saving one byte of text and you're writing C, you're doing it wrong. As you point out, it's pretty trivial to beat the compiler most of the time.
> one interesting difference is that memory prefetching/preloading makes such a huge difference here, whereas on arm/x86 (in my experience) it makes so little difference that it's not even worth doing
also, doesn't allwinner d1-h have simd registers?