What if the multi-precision code is written in C?

You can detect the carry out of (a+b) branch-free in C (for 32-bit unsigned values) with: ((a&b) | ((a|b) & ~(a+b))) >> 31

So 64-bit add in C is:

   f_low = a_low + b_low
   c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
   f_high = a_high + b_high + c_high
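Wrapped up as a compilable function (just a sketch; the add64 name and the uint32_t types are mine, the body is exactly the three lines above):

    #include <stdint.h>

    /* 64-bit add from 32-bit halves with branch-free carry detection */
    static void add64(uint32_t a_high, uint32_t a_low,
                      uint32_t b_high, uint32_t b_low,
                      uint32_t *f_high, uint32_t *f_low)
    {
        *f_low = a_low + b_low;
        /* carry out of bit 31: both top bits set, or one set and the
           sum's top bit clear */
        uint32_t c_high = ((a_low & b_low) | ((a_low | b_low) & ~*f_low)) >> 31;
        *f_high = a_high + b_high + c_high;
    }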
So for RISC-V with gcc 8.2.0 and -O2 -S -c I get:

        add     a1,a3,a2
        or      a5,a3,a2
        not     a7,a1
        and     a5,a5,a7
        and     a3,a3,a2
        or      a5,a5,a3
        srli    a5,a5,31
        add     a4,a4,a6
        add     a4,a4,a5
But for ARM I get (with gcc 9.3.1):

        add     ip, r2, r1
        orr     r3, r2, r1
        and     r1, r1, r2
        bic     r3, r3, ip
        orr     r3, r3, r1
        lsr     r3, r3, #31
        add     r2, r2, lr
        add     r2, r2, r3
It's shorter because ARM has bic. Neither compiler figures out how to use carry-related instructions.

Ah! But! There is a GCC built-in, __builtin_uadd_overflow(), that replaces the first two C lines above: c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
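Spelled out (a sketch, reusing the names above; the built-in stores the truncated 32-bit sum through the pointer and returns the carry):

    c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
    f_high = a_high + b_high + c_high;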

So with this:

RISC-V:

        add     a3,a4,a3
        sltu    a4,a3,a4
        add     a5,a5,a2
        add     a5,a5,a4
ARM:

        adds    r2, r3, r2
        movcs   r1, #1
        movcc   r1, #0
        add     r3, r3, ip
        add     r3, r3, r1
RISC-V is faster..

EDIT: Clang has one better: __builtin_addc().

    f_low = __builtin_addcl(a_low, b_low, 0, &c);
    f_high = __builtin_addcl(a_high, b_high, c, &junk);
x86:

        addl    8(%rdi), %eax
        adcl    4(%rdi), %ecx
ARM:

        adds    w8, w8, w10
        add     w9, w11, w9
        cinc    w9, w9, hs
RISC-V:

        add     a1, a4, a5
        add     a6, a2, a3
        sltu    a2, a2, a3
        add     a6, a6, a2
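For reference, the __builtin_addc version filled out with declarations looks something like this (a sketch; I've used the plain unsigned flavour, __builtin_addc, so the operands stay 32-bit halves as in the earlier snippets):

    unsigned c, junk;
    f_low  = __builtin_addc(a_low,  b_low,  0, &c);     /* carry out in c  */
    f_high = __builtin_addc(a_high, b_high, c, &junk);  /* carry in from c */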



> RISC-V is faster..

I find it funny that you fall into the same pitfall as the author did.

Faster on which CPU?

The author doesn't measure on any CPU, so here there are dozens of people hypothesizing whether fusion happens or not, and what the impact is.


All other things being equal, you would prefer smaller code for better cache use.


8x 2-byte instructions (16 bytes) lead to smaller code than 4x 8-byte instructions (32 bytes).

Counting number of instructions isn't really a good metric for that either.


> Faster on which CPU?

Perhaps faster here just means fewer instructions, considering that the number of instructions is what has been discussed?


Right, but all of these architectures can handle many combinations of instructions in one cycle, so instruction count is not really a great proxy for speed.

Same for code size. If the instructions are half the size, having 1.5x more instructions still means smaller binaries.


Note that with the newly-ratified B extension, RISC-V has BIC (called ANDN) as well as ORN and XNOR.

In addition to the actual ALU instructions doing the add with carry, for bignums it's important to include the load and store instructions. Even in L1 cache it's typically 2 or 3 or 4 cycles to do the load, which makes one or two extra instructions for the arithmetic less important. Once you get to bignums large enough to stream from RAM (e.g. calculating pi to a few billion digits) it's completely irrelevant.
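For example, the inner loop of a bignum add is one load per source limb, the add-with-carry, and a store per result limb; something like this sketch using the Clang built-in from upthread (the bignum_add name and signature are mine):

    #include <stddef.h>

    /* Add the n-limb numbers a and b into r; returns the final carry.
       Each iteration does two loads and one store besides the adds. */
    static unsigned long bignum_add(unsigned long *r, const unsigned long *a,
                                    const unsigned long *b, size_t n)
    {
        unsigned long carry = 0;
        for (size_t i = 0; i < n; i++)
            r[i] = __builtin_addcl(a[i], b[i], carry, &carry);
        return carry;
    }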



