usually if i'm writing c instead of js, python, or golang the reason is that i want my code to run on a microcontroller. the majority of the microcontrollers i have here have 4 kibibytes of flash, which is a pretty harsh limit, but vastly more cpu power than is needed for the things i'm doing with them, so i almost always actually want to optimize for size at the cost of nearly everything else. the only thing that's more precious than flash is ram
(why don't i write assembly? well, sometimes i do, but sometimes c is good enough, and it has major advantages. size-optimizing compilers greatly expand the scope of c's applicability.)
In theory, -Oz now serves the "smaller no matter what the cost" case, and allows -Os to make compromises. But with GCC, they still behave like synonyms in my limited experience.
i appreciate the new information! i haven't used clang much for reasons that are probably not very good
probably bytecode interpreters are a better solution for this sort of thing anyway, at least when you have more than 512 bytes of program memory or so. in forth with gforth locals the function is
    : sumarray { a n } 0 n 0 ?do a i cells + @ + loop ;
and the body (after the }) consists of 12 instructions, almost all of which would be a single byte each in compact bytecodes like smalltalk-80 or emacs lisp. this might even beat clang's 18-byte rvc version above. there's a sequence of three of them that would conventionally be a single operation (cells + @), but conventionally you'd also use extra instructions to load and store the variable `total` inside and outside the loop instead of leaving it on the stack, so you'd end up with 13 instructions. and the ?do and loop instructions normally have an offset, so you're at 15 bytes, plus the subroutine header
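for reference, the c function all of these versions are encoding would be something like this (a sketch; i'm assuming the int-typed signature the rvc byte counts above came from):

    /* sum the first n cells of the array a */
    int sumarray(const int *a, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += a[i];
        return total;
    }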
gforth of course uses 8 bytes per instruction and doesn't have a structure like smalltalk's method header to compactly encode the number of locals, so gforth itself is not very frugal with code space; instead `{ a n }` compiles to `>l >l` at the beginning and `lp+2` at the end
There is no theoretical difference between "bytecode" and "machine code", other than speed.
Bytecode is just "a machine code that is designed better than those stupid hardware engineers for the computer I'm using actually did".
Historically, it made a lot of sense on crappy 8-bit CPUs such as the 6502 and Z80, if only for the ability to manipulate 16- or 32-bit quantities with a single instruction. Often you also got a bigger and more orthogonal register set, or stack orientation. Bytecode was also often better than the idiosyncratic instruction sets on early minicomputers and mainframes -- usually in size, but especially in being portable across the vast array of different ISAs we used to have before the world settled on just x86 and Arm.
But I don't think there's much of a place for it in the modern world. Perhaps a little for portability, with simpler and better-defined semantics than C or the other languages you compile into bytecode, with the intention of compiling it to native code before running.
But not for size.
I've yet to see a bytecode that consistently beats a well-designed modern machine code for size, over an entire real world program, not just some set of cherry-picked toy examples. I include here primarily Arm Thumb2 and RISC-V with the C extension, but there are some other less well-known ones, e.g. Renesas RX, which is basically an improved M68K and the only example I know of a variable-length instruction encoding at byte granularity that is actually done right, with significant usage of 1-byte instructions in real code.
I include, especially, stack-oriented bytecodes. They look amazing for evaluating complex arithmetic expressions. But programs mostly do not consist of complex expressions; they do a lot of "a = a + 1" and "c = a + b", and implementing these with four 1-byte instructions is no better, size-wise, than a simple RISC with 4-byte 3-address instructions. It's worse than Thumb, which does both of these in 2 bytes (assuming the variables are in registers).
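To make that concrete, here is "c = a + b" in a hypothetical 1-byte-per-instruction stack bytecode, written out as C data; the opcode names and values are invented for illustration:

    enum op { LOAD_A, LOAD_B, OP_ADD, STORE_C };  /* invented opcodes */

    static const unsigned char stack_code[] = {
        LOAD_A,   /* push local a           (1 byte) */
        LOAD_B,   /* push local b           (1 byte) */
        OP_ADD,   /* pop both, push the sum (1 byte) */
        STORE_C,  /* pop the sum into c     (1 byte) */
    };  /* 4 bytes total; Thumb's ADDS Rd, Rn, Rm does it in 2 */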
> I've yet to see a bytecode that consistently beats a well-designed modern machine code for size, over an entire real world program, not just some set of cherry-picked toy examples.
This feels like a bit of a cop-out to me, because one of the biggest ways you get the size advantage from a bytecode is by making it less general-purpose than the machine it's running on.
Imagine a micro driving a single LED in some long arbitrary pattern. One way to do that is to just have a one-bit PCM array describing the pattern. Maybe this is just semantics... but I call that a one-bit-instruction bytecode with two instructions (LED ON and LED OFF), and I guarantee you no hardware ever manufactured can beat it using native code (if the pattern is over a certain length).
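Here's a sketch of that interpreter in C; set_led() and wait_one_tick() are hypothetical stand-ins for whatever the real board support code provides:

    #include <stdint.h>

    extern void set_led(int on);      /* hypothetical board-support call */
    extern void wait_one_tick(void);  /* hypothetical fixed-rate delay */

    /* the "program": each bit is one instruction, 1 = LED ON, 0 = LED OFF */
    static const uint8_t pattern[] = { 0xAA, 0xF0, 0x0F };

    void play_pattern(void)
    {
        for (unsigned i = 0; i < sizeof pattern * 8; i++) {
            set_led((pattern[i / 8] >> (7 - i % 8)) & 1);
            wait_one_tick();
        }
    }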
Sure, that's a toy example if you compare it to a general purpose CPU, but there are a lot of real microcontrollers out there doing jobs that simple.
that's a thing you can do, and there are lots of sources of inspiration for that kind of thing; i'd point at the pico/rp2040's pioasm instruction set, the vt52's instruction set (see the dec manuals on bitsavers), and ken shirriff's analysis of the 8086's microcode http://www.righto.com/2023/04/8086-microcode-string-operatio... for examples
but i think you can also often get some advantage by going in the other direction; for example, the smalltalk-80 virtual machine has bytecode range 176–207 dedicated to 'send arithmetic message' and 'send special message' bytecodes which i think send messages with the selectors +, -, <, >, <=, >=, =, ~=, *, /, \\, @, bitShift:, //, bitAnd:, bitOr:, at:, at:put:, size, next, nextPut:, atEnd, ==, class, blockCopy:, value, value:, do:, new, new:, x, and y. almost all of the time these boil down to a regular cpu instruction or two, because the receivers are built-in classes with primitive methods implementing those messages, but a method containing them can also be applied to objects that implement different methods for them (large integers being the most common example). this implicitly makes smalltalk methods for things like sorting, data structures, and numerical algorithms polymorphic over arbitrary data types, as in ocaml. so you can get by with a single generic version of the subroutine instead of various specialized versions
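here's a sketch in c of how a handler for one of these bytecodes typically works; the tag checks and the slow-path send are invented stand-ins, not any particular vm's api, and i'm ignoring overflow into large integers:

    typedef struct obj obj;

    enum { SELECTOR_PLUS /* ... and the other special selectors */ };

    extern int  is_small_int(obj *);              /* hypothetical tag check */
    extern long small_int_value(obj *);
    extern obj *make_small_int(long);
    extern obj *full_send(obj *rcvr, int sel, obj *arg);  /* real message send */

    /* handler for the 'send +' bytecode */
    obj *send_plus(obj *rcvr, obj *arg)
    {
        if (is_small_int(rcvr) && is_small_int(arg))
            /* the common case boils down to an ordinary add */
            return make_small_int(small_int_value(rcvr) + small_int_value(arg));
        /* other receivers, e.g. large integers, get a full polymorphic send */
        return full_send(rcvr, SELECTOR_PLUS, arg);
    }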
> I include, especially, stack-oriented bytecodes ...
WASM uses stack-oriented bytecode for simplicity and to make static analysis easier, not just for terseness. (Arguably, the real underlying issue with stack-oriented bytecode is the amount of stack fiddling you have to include in more complex programs. It also does not generalize cleanly to parallel execution, unlike, e.g. dataflow or even register-based SSA.)
in general smalltalk-80 and similar stack bytecodes include almost no stack fiddling; the only stack-manipulation bytecodes included are dup and drop, not even swap or over. instead of stack fiddling, you have load and store bytecodes—in smalltalk, you usually have them precisely where the source program had atomic expressions in rvalue or lvalue contexts, respectively. this simplifies the compiler and also allows you to use a decompiler instead of storing the source code
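for example, something like `x := x + 1` compiles to one bytecode per atomic expression in the source, with the store landing exactly where the lvalue was; the opcode values below are illustrative, not an exact transcription of the blue book encoding:

    enum { PUSH_TEMP_0 = 16, PUSH_ONE = 118,
           SEND_PLUS = 176, POP_INTO_TEMP_0 = 104 };  /* illustrative values */

    static const unsigned char incr_x[] = {
        PUSH_TEMP_0,      /* x, the rvalue              */
        PUSH_ONE,         /* 1                          */
        SEND_PLUS,        /* +                          */
        POP_INTO_TEMP_0,  /* x, the lvalue; no fiddling */
    };

a decompiler can walk this and read the source statement right back off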
even on the f18a core in chuck moore's latest designs from greenarrays, there's no swap; you have dup, drop, over, and r> and >r (called push and pop). and of course the way he generalized it cleanly to parallel execution is by putting 144 computers on the same chip, each with separate ram. the code is also pretty compact but not in a way that's practical for bytecode interpretation on a byte-oriented computer
yes, i mostly agree with this, and rvc is especially competitive on this point
often the hardware engineers weren't being stupid. it is possible to some degree to trade off hardware speed and complexity against code size (the most obvious example is instruction alignment requirements), and architectures like arm64 (aarch64) have gone pretty far in the direction of higher speed at the expense of larger code size, even if not quite as far as things like itanic and the original mips. if you're using such an architecture, bytecode can give you a large compactness advantage, so designers try not to use such architectures where code density counts. still, the modern world contains vastly more avr and pic16 chips than it does x86 and arm, and those architectures are not great for code density, especially for c code that tends to use a lot of ints
but thumb2 and rvc are pretty hard to beat. i think compactness-optimized stack bytecodes can do it, but only by a little. on this example (which is admittedly a toy example, but not one i cherry-picked to showcase the merits of stack machines) rvc gets to 18 bytes, though, as you can see from the thread, how to get there was far from obvious to me. the stack bytecode is probably 15 bytes, but it probably needs a procedure header declaring the number of arguments to get there, so 17 bytes. plus probably a 2-byte entry in a global procedure table, which rvc doesn't need. you can implement the bytecode interpreter so that it doesn't have any alignment requirements, while rvc requires 2-byte alignment, which you should expect to cost you about a byte every so often on average—though i'm not sure how to account for that in a case like this. one byte per subroutine? one byte per basic block? is that already included in the 18?
note, though, that for the first hours of this thread, rvc was stuck with considerably worse code size at 22 bytes. it wasn't until dzaima whipped out their clang that we realized that by adding a redundant down-counter register (and an extra decrement instruction in the inner loop) we could trim it down to 18 bytes
i'm amused that you describe rx as 'an improved m68k', because my thought when i first looked it over (on january 18) was that it looked like renesas thought, 'what if we did the 80386 right?' but it's true that, for example, its addressing-mode repertoire is a lot more m68k-ish. by my crude, easily gamable measures, it looks like it's one of the five most popular cpu architectures, which makes it surprising that i hadn't heard of it before this year. my notes are in the (now misnamed) file remaining-8-bit-micros.md in http://canonical.org/~kragen/sw/pavnotes2.git
with respect to your particular examples of non-complex expressions, a large fraction of `a = a + 1` and `n = n - 1` are loop counters, which can be handled with a one-byte or two-byte `loop` instruction, as the 8086 does (though its design choice to only handle down-counting toward zero reduces its usefulness for compiling idiomatic c code). and typically in `c = a + b` it can be arranged for one or two of the three operands to be on top of the stack, so it's typically about 2.5 bytes instead of 4, which is still worse than thumb but only slightly. so stack bytecodes with either one operation or one local-variable load or store per byte typically come in pretty close to equal to rvc-style instruction sets with one operation and two register operands per 16-bit parcel.
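for comparison, here's the down-counting loop shape that `loop`-style instructions want, written in c (related to the down-counter trick that got the rvc version above to 18 bytes):

    /* idiomatic c counts up, but a one-instruction loop primitive
       like 8086 LOOP wants the counter running down toward zero */
    int sumarray_down(const int *a, int n)
    {
        int total = 0;
        while (n--)
            total += *a++;
        return total;
    }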
the mention of the ubiquitous pic and avr above reminds me of the other advantage of interpreters in general, including bytecode interpreters: they make harvard-architecture machines programmable. a substantial fraction of the pic and avr machines in the field have hundreds or thousands of bytes of ram, enough for a program that does very interesting things, and an interpreter makes it possible for them to run code from that ram. as stm32 and its clones have progressively squeezed out pic and avr, this has become less important, but if i'm not mistaken, when gigadevice made a risc-v version of their popular higher-frequency gd32f clones of the stm32 (the short-lived gd32vf), they made it harvard! (or quasi-harvard: like the newer avrs, you can read data from the flash with regular instructions, but as i understand the datasheet, you can't execute instructions from the sram, though https://offzone.moscow/upload/iblock/0a5/nad1d86e3ah3ayx38ue... and https://www.eevblog.com/forum/microcontrollers/risc-v-microc... say this is wrong)
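a minimal sketch of what i mean, with an invented four-instruction bytecode: the interpreter below lives in flash, but the code buffer it executes can be sram that the program just received over, say, a uart:

    #include <stdint.h>

    enum { OP_HALT, OP_LIT, OP_ADD, OP_DROP };  /* invented opcodes */

    /* execute bytecode out of a ram buffer on a harvard machine
       whose cpu can only fetch instructions from flash */
    int16_t run(const uint8_t *code)
    {
        int16_t stack[16], *sp = stack;
        for (;;) {
            switch (*code++) {
            case OP_HALT: return sp[-1];                   /* result on top  */
            case OP_LIT:  *sp++ = (int8_t)*code++; break;  /* push a literal */
            case OP_ADD:  sp--; sp[-1] += sp[0];   break;
            case OP_DROP: sp--;                    break;
            }
        }
    }

so running { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_HALT } out of sram returns 5, no flash writes required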
some of my previous notes on this problem, in case it's a problem that interests you, are at https://dernocua.github.io/notes/c-stack-bytecode.html. in there, i took six example pieces of c code (from macpaint, the openbsd c library, the linux kernel, the implementation of php, an undergraduate cs exercise, and open-source flashlight firmware) and examined how to design a compact bytecode to encode them