Hacker News
Bitwise Division (p5r.org)
167 points by stek29 on March 1, 2023 | hide | past | favorite | 74 comments



Horner’s method [0] is also commonly used on small microcontrollers without divide (or multiply) instructions.

[0] https://www.ti.com/lit/an/slaa329a/slaa329a.pdf
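For a concrete picture, here's a minimal C sketch of the pattern (illustrative, not the TI app note's code): a shift-and-add software multiply, with Horner's rule layered on top:

```c
#include <stdint.h>

/* Software multiply for cores without a MUL instruction:
   classic shift-and-add over the bits of b. */
static uint32_t mul_shift_add(uint32_t a, uint32_t b) {
    uint32_t acc = 0;
    while (b) {
        if (b & 1)
            acc += a;   /* add the current shifted copy of a */
        a <<= 1;
        b >>= 1;
    }
    return acc;
}

/* Horner's rule: c[0]*x^n + c[1]*x^(n-1) + ... + c[n],
   evaluated with n multiplies and n adds. */
static uint32_t poly_horner(const uint32_t *c, int n, uint32_t x) {
    uint32_t acc = c[0];
    for (int i = 1; i <= n; i++)
        acc = mul_shift_add(acc, x) + c[i];
    return acc;
}
```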


Look out. Any time someone brings up multiplication on a microcontroller, someone is compelled to chime in with "Well, I've never needed to multiply on a microcontroller, so why do you?"


Or when the controller has a divide but it is painfully slow.


The compiler is already reducing integer division by a constant into these things.

Those algorithms become more important when you're dividing by a value known at runtime but which remains the same during parts of the program. That's where libdivide comes in.


libdivide tl;dr:

> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants. [1]

> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs, e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer division instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.

> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program.[2]

[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide
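The core trick can be sketched in a few lines of C (a simplified illustration of the idea, not libdivide's actual algorithm, which handles far more cases): precompute one magic multiplier per divisor, then each divide becomes a widening multiply plus a shift.

```c
#include <stdint.h>

/* Simplified runtime "magic number" division (GCC/Clang only: needs
   unsigned __int128). For divisors d >= 2 and 32-bit numerators, the
   multiplier M = floor(2^64 / d) + 1 makes (M * n) >> 64 equal n / d
   exactly. d == 1 must be special-cased (M would wrap to 0). */
typedef struct { uint64_t magic; } fastdiv_u32;

static fastdiv_u32 fastdiv_gen(uint32_t d) {
    fastdiv_u32 f = { UINT64_MAX / d + 1 };   /* == floor(2^64/d) + 1 */
    return f;
}

static uint32_t fastdiv_do(uint32_t n, fastdiv_u32 f) {
    return (uint32_t)(((unsigned __int128)f.magic * n) >> 64);
}
```

Generating the multiplier costs a real division, so this only pays off when the same divisor is reused many times, e.g. when hoisted out of a loop.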


The obvious question this invites: if this is generally faster for unknown values, why don't compilers use this optimization directly in emitted code?


They could, but it's less profitable, and cost models are, as always, a big problem. If we have:

  loop for i from 0 to n
      f(a[i]/y)
How big does n have to be before it makes sense to compute a reciprocal for y? And how frequently is it actually that big during the runtime of the code?


> Compilers usually do this, but only when the divisor is known at compile time

if the divisor is in a variable or otherwise 'hidden', the compiler can't deduce enough to get at it, is my guess.


Calculating those magic values is also expensive; this works when you do that work once and then do the efficient divides multiple times.


See https://godbolt.org/z/3Yqbceaza for what godbolt says clang produces.

Keep in mind that this doesn't use the fact that we know that the input is between 0 to 63.


> Keep in mind that this doesn't use the fact that we know that the input is between 0 to 63.

You can use __builtin_assume for this: https://godbolt.org/z/K4jKhxnTq


An assert() also does the trick: https://godbolt.org/z/MecvMGPdW

edit uh but when asserts are disabled it won't work: https://godbolt.org/z/4TMs1Wc5z

unless you roll your own assert: https://godbolt.org/z/4v35rrTvn


9*x/64 still reduces to 2 instructions

https://godbolt.org/z/6WsWqh4ah


I learned about shifting to divide or multiply when writing my game boy emulator. Something I find fascinating is the lack of precision needed for games of that fidelity.

Say you’re animating a jump and the character slows down towards the apex. I’ve seen an approach where it’s just “every X frames halve the speed until it’s zero.” This is done with the `SLA` (shift left, LSB to 0) and `SRL` (shift right, MSB to 0) operations, which are very fast (8 cycles!). The bonus is that these opcodes set the Zero flag when the result is zero, so you know your jump animation (up) is done, barring a ceiling collision, without having to check the value of the byte.
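In C terms, the "every X frames halve the speed" easing amounts to something like this sketch (the names and the every-8-frames cadence are illustrative, not from any particular game):

```c
#include <stdint.h>

/* Halve the vertical speed on every 8th frame; reaching 0 marks the
   end of the rising phase, mirroring how SRL sets the Zero flag. */
static uint8_t ease_jump(uint8_t speed, uint8_t frame) {
    if ((frame & 7) == 0)   /* every 8th frame */
        speed >>= 1;        /* the SRL step */
    return speed;
}
```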

(I’ve also seen a set of hard coded speeds to animate through, and cases where they don’t change the speed at all, making the animation feel very rigid. It’s amazing what little details are important)


I also wrote a game boy emulator recently with the purpose of educating myself. I haven't looked that deep into what games do wrt math, but I was really impressed with all the creative ways they abuse the PPU by changing its various registers mid-frame. "Batman return of the Joker" for example uses this interesting effect of "compressing" the upper portion of the screen when you enter the options menu. It does this by changing the scroll Y to specific values on specific lines.

Though I do have some math-related problems in there. I spent several hours on it but I couldn't figure out how the SBC instruction should affect the carry flags. No matter what I do I can't pass the test ROMs.

But even with that, many of my childhood games are 100% playable. Writing a video game console emulator is a rewarding experience!


This is reminiscent, to me, of AIM-239 [0] and Warren's "Hacker's Delight" [1]. Good stuff.

[0] https://dspace.mit.edu/handle/1721.1/6086

[1] https://en.m.wikipedia.org/wiki/Hacker%27s_Delight


> And it happens to be the case that 7 × 9 = 63, which is almost 64, a power of 2

While the article goes down a different rabbit hole, this also happens to be a neat example of (n-1)(n+1) = n² - 1, where n = 8.

Which is one of those things I haven't managed to find an actual use for, but it feels like it should have one, since you can use it with 0xFF (15 × 17), 0xFFFF (255 × 257), and so on.


It looks nicer in hex: 0xFF = 0xF*0x11 and 0xFFFF = 0xFF*0x101


You can derive a relationship between division and shift-add-multiply in a cute way by noting that

   1       1        1   1 
  ---  -  ---  =   --- --- (a-b)
   b       a        b   a

Move 1/a to the right hand side

   1      1     1   1 
  ---  = --- + --- --- (a-b)
   b      a     b   a  

               [X]
Substitute the whole right-hand side in for the 1/b marked with X and you get

   1     1    (a-b)       1
  --- = --- + ----- (1 + --- (a-b))
   b     a     a^2        b

Repeat and eventually you get

      __
  1   \    -(n+1)    n
 --- = >  a     (a-b)
  b   /_  
      n=0...inf
For example with b=3 and a=2, you get

  1     1     1     1     1
 --- = --- - --- + --- - --- + ...
  3     2     4     8     16
Word of warning though: this method tends to produce nasty carry errors.
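A quick numeric sanity check of the series in C (floating point here, sidestepping the integer carry issues mentioned above):

```c
/* Partial sums of 1/b = sum_{n>=0} a^-(n+1) * (a-b)^n.
   Converges when |a-b| < a. */
static double recip_series(double a, double b, int terms) {
    double sum = 0.0;
    double coef = 1.0 / a;          /* a^-(n+1) * (a-b)^n, starting at n=0 */
    for (int n = 0; n < terms; n++) {
        sum += coef;
        coef *= (a - b) / a;        /* advance to the next term */
    }
    return sum;
}
```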


(63 - nlz) / 7, where nlz is between 0 and 63

lookup table with 64 entries anyone?


It might be slower to use a memory access, even if you assume an average L1 cache hit. You can pipeline the bitwise version and maybe achieve an amortized divide in one cycle, where (depending on process and many other things) the lookup table might peak at like 4 cycles per divide even in unrolled/parallel situations. Plus the technique in the article is good for just about any constant divisor and if generalized works on much larger ranges, it’s why many compilers use this trick when they can.


A bitshift oneliner IS nicer.


Maybe nicer, but slower. And I'm not even sure about the former.


Lookup tables can be slower. Do not assume that memory is fast. Even if the table is small and fits in cache, it will still be displacing other useful things from the cache. If you can have a small sequence of fast instructions using only registers and no branches, that's very likely to be faster -if not much faster- than lookup tables. Just one L1 cache miss could take much longer to resolve than computing this particular sequence of instructions.


The example is literally one cache line, which probably won't affect the rest of the program too much. But given the average L1 throughput, I'd bet the bitwise version is faster


Probably faster: no memory loads, no cache pressure


Cache pressure is not really relevant, the result can be represented in 3 bits, so the entire table can fit in three 64 bit ints. It's uglier and you still need to do bitshifts, but much easier to write.


The method in the blog post uses two bit shifts and two additions. I don't really see how to beat that with this.

You can fit twenty-one 3-bit entries in a 64 bit int, with one bit to spare.

So naively you get:

   table[nlz / 21] >> (nlz % 21)
Which involves a division and a modulo. And that's assuming 63 entries in total, I'm not even trying to handle fitting the 64th entry in those three leftover bits somehow, or spread them across the three ints (again, I don't see how to do that without division).

Alternatively, each integer could contain one bit of the output, so:

   ((table[0] >> nlz) << 2) + ((table[1] >> nlz) << 1) + (table[2] >> nlz)
Which is five shifts and two additions, so more work plus lookup.

If you see another method I overlooked please tell me.


> ((table[0] >> nlz) << 2) + ((table[1] >> nlz) << 1) + (table[2] >> nlz)

> Which is five shifts and two additions, so more work plus lookup.

Something like that, perhaps with more bitmasking and less addition. There is no lookup, just three constants loaded into different registers and handled by different out of order execution units, then combined in a final addition / or-ing. So what you will "see" on the critical path is two bitshifts and three bitwise ORs/ANDs.

It's not faster; the point is that the speedup of "binary division" is probably not worth the day or so spent developing it except in extreme cases.


Ah, right, I forgot the &1 masks. And I guess that with three literals and OOOE it might be faster, yes.

But yes, extremely unlikely to be a hot path in most situations.


No need to pack bits. Use whole words. The table with 64 entries is small, and if the operation is performed often, the table will always be in L1 cache. Table lookup will take 1 cycle. The method with shifts takes 4 cycles. (It can make sense only in tight loops. The gain is really negligible in absolute terms in the large scheme of things)

Timing of instructions for Intel can be found here: www.agner.org/optimize/instruction_tables.pdf
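For reference, the plain one-entry-per-slot table approach described above would look something like this sketch:

```c
#include <stdint.h>

/* One entry per slot, filled once; each "division" is then a single
   indexed load. Bytes shown here; whole words work the same way. */
static uint8_t div7_table[64];

static void div7_table_init(void) {
    for (int x = 0; x < 64; x++)
        div7_table[x] = (uint8_t)(x / 7);
}

static uint8_t div7_lookup(uint8_t x) {   /* x must be in 0..63 */
    return div7_table[x];
}
```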


I know, but the person I replied to specifically said "it fits in three 64 bit integers" to point out you could squeeze the table into three words. I just tried to figure out how that could possibly be advantageous here (I don't think it can).

Best I can come up with is using 32 8-bit integers (so 64 bits more than three 64-bit ints in size), packing two entries per byte, and doing this:

    (table[nlz >> 1] >> ((nlz & 1) << 2)) & 7
… which is three shifts, two ANDs and one lookup. I'm fairly certain that two shifts and two additions beat that. Well, they would beat that if everything else in the code weren't likely to be a much bigger bottleneck that dwarfs the impact of this.


Before there were more complex instructions that did entire multiplications or divisions at once, computers handled this with dedicated multiply-step and divide-step shift instructions.

Compare here for the DEC PDP-1 (1959) and related algorithms:

[0] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...


Don’t modern compilers do this automatically (and aggressively) for almost any division by a constant?


Sometimes, but not always.

In cases where you need a floored result, or need it rounded in a certain direction, or know that the divisor is always positive, the compiler will often give you sub-optimal assembly. And sometimes the compiler just randomly fails to inline or constant-fold and outputs a big fat IDIV.

Also, if you have to debug with optimizations disabled, the compiler will give you deliberately garbage code, which can make the program you're debugging unusably slow. So you often end up hand-optimizing for that case.

Of course, this depends on where the hot path is, but I've had to do a lot of optimization for code that gets run billions of times per second. I used to think compilers were really smart, but after staring at enough assembly output and learning all their tricks, they don't seem that smart anymore. Especially Microsoft's compiler; I've seen it output redundant division instructions!


The author mentions that this is for a variable length (i.e. bigint) format. Sure, I bet mature bigint libraries do stuff like this automatically when sensible, but I still think it's interesting to read about someone figuring out such things on their own.


It's about variable-length encodings, not variable length integers.


Yeah. The (63-nlz)/7 is part of a piece of code implementing an encoding. nlz is an int.


Wouldn’t a lookup table be a whole lot easier? I’d think a 64 entry table of integers would be fast to access once cached.


It's possible to use a lookup table (sort of) in a register. With x86-64-v2 and later, the code is larger, but potentially faster due to the shorter dependency chain:

    int div7(int x) { return 9 - __builtin_popcountll(0x4081020408102040ULL >> x); }

https://godbolt.org/z/Wj6P6jYGM

I think any monotonic function on 0 .. 63 can be written this way, so it should be possible to fold in the outer 63 - nlz expression, too.


This is a very cool technique. Someone ought to benchmark this vs the method described in the original article.


I don't use lookup tables that much when optimizing these days, because most math instructions cost far fewer cycles than a cache miss. And even if your lookup table is in L3, it still costs ~40 cycles to retrieve each line.


That's a good instinct to have. However in this case the input is 0..63 and the output comfortably sits in a single byte. It can fit in just one cache line if you bother to align it.

The miss cost is, therefore, not really relevant: if this code is hotpath at all the cost of a single cache miss is amortized across millions of calls. Your lookup table will be in cache as surely as the code that reads it is.


Maybe easier, if limited to the specific case of an input in the range [0..63] and a divisor of 7, but not necessarily faster. Keep in mind the latency of a memory read is typically around 4 cycles, and at least some architectures have a tighter limit on the number of concurrent memory transactions than the number of concurrent integer math instructions, and keep in mind the article is generalizing beyond the initial problem involving the numbers 63 and 7- if you have more than a couple different divisors, and if you want to allow the input to go as large as possible, then a lookup table quickly becomes wasteful and impractical.


The divisor is from 0 to 63. The dividend is a variable bit length integer, i.e. a bigint.


Absolutely not, he needs to compute (63 - x) / 7, where x is computed from a 64-bit integer. There are no bigints involved.


Well, there are variable length integers involved, since it's as part of a way of representing variable length integers. But it seems you're right that x is just a regular machine int in this context, as an implementation detail. I misunderstood.


My goto would have been multiplication by (1/7)*2^16 followed by a right shift. You still need to verify the constant (or add another correction after the multiply) to get it to work in every case. I used this once to extract the digits of a 16-bit integer from left to right.
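A sketch of that approach in C (the constant and the verified range below are mine, purely illustrative):

```c
#include <stdint.h>

/* Multiply by a 16-bit fixed-point reciprocal of 7, then shift down.
   9363 = ceil(2^16 / 7). The constant must be verified over the input
   range: this one is exact for 0..63 (the article's range) but not
   for all 16-bit inputs. */
static uint32_t div7_fixed(uint32_t x) {
    return (x * 9363u) >> 16;
}
```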


Your C compiler will do this for you automatically. See eg https://godbolt.org/z/3Yqbceaza

(I don't know what language is used here. I'm just assuming something like C.)


Very nice description of the bitwise maths, however I'm curious what led to the need for `(63 - nlz) / 7` in the first place. A divisor of 7 is an odd thing to find when dealing with power-of-2-sized integers. Maybe this expression is used in some sort of expected length calculation to find the number of bytes which would be used for preallocating buffers? In most variable length encoder loops I've seen, you would just do a few bit twiddles to compute your encoded byte values and your exit condition should be a trivial check if there are remaining non-zero bits.


OP mentions "playing around with variable-length integer encodings."

`(63 - nlz) / 7` presumably tells you how many bytes you need to read for the varint. The length could be signaled by the number of leading zeros in the first byte (nlz), and each subsequent byte could set a high continuation bit (à la UTF-8 and protobuf varints), thus providing 7 bits of information per byte.

I'm totally speculating but this is generally what I'd expect from the expression `(63 - nlz) / 7` in the context of varints. Continuation bits and leading-zero-counters are redundant but this wouldn't be the first time I've seen something encode redundant information.
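Under that (speculative) reading, the length computation would be something like this hypothetical sketch; `varint_len` and its exact layout are my invention, not the article's code:

```c
#include <stdint.h>

/* Hypothetical varint length: if v needs (64 - nlz) significant bits
   and each encoded byte carries 7 payload bits, the encoded length is
   ceil(bits / 7), i.e. (63 - nlz) / 7 + 1 for nonzero v.
   __builtin_clzll is GCC/Clang-specific and undefined for 0. */
static int varint_len(uint64_t v) {
    if (v == 0)
        return 1;
    int nlz = __builtin_clzll(v);
    return (63 - nlz) / 7 + 1;
}
```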


Nice, if a little bit hand wavy. Seems a little bit of a stretch to call an operation that still includes a multiply bitwise, though.


The final non-generalized version uses two additions and two bitwise shifts, no multiplication:

    (v + (v << 3) + 9) >> 6
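It's easy to verify exhaustively in C that this matches true division over the intended range:

```c
#include <stdint.h>

/* The article's shift-add form: (9*v + 9) >> 6, valid for v in 0..63. */
static uint32_t div7_shift(uint32_t v) {
    return (v + (v << 3) + 9) >> 6;
}
```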


Naw, it’s standard to call such tricks “bitwise” even when including a multiply. Remember you’re multiplying into a specific bit range and then shifting down (dividing by a power of 2) to capture the bits you want; it’s bitwise in a very literal sense. Probably quite fair to call any expression “bitwise” if any single operation in the expression is bitwise, regardless of the other operators & functions, no? What part is hand-wavy? Variations of this technique are in standard widespread use in the compilers we use.


But that requires turning ‘multiplication by a constant’ into a bitwise trick in exactly the same way that this is doing for division by a constant (albeit without the fuzziness about rounding).


Why’s that?


Because ‘multiplication by n’ requires some sort of iterative process. Or, at least, multiplying by a k-bit number requires cascading through k repeats of a word-level bitwise operation (a shift, an add, an AND, etc)

Which is precisely the kind of process that we’re trying to come up with a shortcut to avoid in the case of this division we’re trying to produce.

Whereas multiplying by a specific number can be reduced to a concrete, limited number of shifts and adds, like this process is deriving for a specific division.


So? Why are you assuming you have to carry the bitwise operation through the multiplication? You don’t have to. You can if you want, but it would be academic and impractical. Using a multiply inside of the bitwise div operation makes the divide much faster. Deconstructing a multiply into bitwise operations makes the multiply much slower, so there’s no reason to do that, and there is no such requirement that we do before we can call the div trick a bitwise algorithm.

That said, if you want a bitwise multiply algorithm, I have written one. It’s slow, but enables 64 and 128 bit multiplies in glsl, which has no 64 or 128 bit multiply or add intrinsic. See add() and mult() in the “common” buffer. https://www.shadertoy.com/view/7dfyRM



I wonder whether the author ever ran a benchmark?

Alas, the commenting system on their website seems broken, so can't ask there.


Author here. I didn't run benchmarks. I'm suspicious of micro-benchmarks, and I don't have a context where I can try it against realistic data. Also, I just enjoy the maths of it even if it turns out not to make a huge performance difference in practice.


Thanks for replying! It's definitely a nice write-up.

You are right that micro-benchmarks are a bit suspicious, but they are better than nothing.

Btw, have a look at https://godbolt.org/z/zMarEnYP5 to see what Clang comes up with on its own.

    #include<stdint.h>

    uint64_t div(uint64_t nlz) {
        __builtin_assume(nlz <= 63);
        return (63 - nlz) / 7;
    }

    div:                                    # @div
            xori    a0, a0, 63
            andi    a1, a0, 255
            li      a2, 37
            mul     a1, a1, a2
            srli    a1, a1, 8
            subw    a0, a0, a1
            slli    a0, a0, 56
            srli    a0, a0, 57
            add     a0, a0, a1
            srli    a0, a0, 2
            ret
This uses RISC-V assembly, just for fun. x86 is also fascinating.

I haven't analysed it in detail, but it looks like Clang doesn't mind using a multiplication.


I hope this isn't considered doxing, but he has his name and location on the blurb on the side; you may be able to reach him through linkedin: https://www.linkedin.com/in/christianplesnerhansen/


Just precompute a list of values?


I always have a hard time grokking bitwise operations. Any good resources?


"Hacker's Delight" book was really useful for me even though I already had a good grasp on bitwise operators. I sat down with a pencil and went through the bit matrix transpose example and it really opened my eyes to how much interesting stuff can be done this way. I think the key to understanding is to visualize the bitwise operations.

Most of it is pattern recognition and learning common idioms, like multiplying by 0x01010101..., extracting substrings of bits ((x >> y) & z), thinking of values as sets of integers and so on.

Also bitwise manipulations lend themselves well to function composition, so try to keep an eye out for that. For example, if you have a fast function that checks if any byte is zero, you can generalize it to any value by applying xor beforehand. Similarly, the fast bit matrix transpose (which absolutely terrified me) is simply applying the least number of "shift a subset of bits" operations such that each bit travels the required distance.
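That byte-zero check is a classic from Hacker's Delight; a sketch of it, with the xor generalization described above:

```c
#include <stdint.h>

/* Nonzero iff some byte of x is 0x00: subtracting 0x01 from each byte
   borrows into bit 7 only for a zero byte (the & ~x rules out bytes
   that already had their high bit set). */
static int has_zero_byte(uint64_t x) {
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

/* Generalize to any byte value b by xoring it away first. */
static int has_byte(uint64_t x, uint8_t b) {
    return has_zero_byte(x ^ (0x0101010101010101ULL * b));
}
```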


Honestly just practice. They cost me an interview I really wanted once, so now anytime I stumble across one in a project I take extra time to read and understand it. I also make a point of using bitmasks as function arguments in places where that's a good/reasonable choice.


Start counting in binary on your fingers until you're just about "binary native". Also, perform operations, and think about why they behave the way they do.

It sounds a little woo-woo but going from representing something on paper (or worse, a display) to doing it with my hands really changed it for me. Humans likely evolved intelligence because we stood up and used our hands to manipulate our environment, I suspect we more fully engage our brains in important ways by doing something with our hands when we can.

You might start appreciating the imperial system of measurement though. You've been warned.


Think about all the tricks you can do in decimal, but in base 2 instead.

E.g. in decimal you can multiply by ten by appending a zero, so in binary you multiply by 2 by doing the same. And left-shifting by 1 is what appends that zero.

Or: you can approximate log base ten in decimal by counting the number of digits, so in binary you can approximate log2 the same way (counting up to the most significant one-bit). Etc...
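The digit-counting analogy in C (a naive sketch; real code would use a count-leading-zeros instruction):

```c
#include <stdint.h>

/* floor(log2(x)) by counting shifts down to the most significant set
   bit -- the binary analogue of counting decimal digits.
   Undefined for x == 0. */
static int ilog2_u32(uint32_t x) {
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}
```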


I think wikipedia does a quite good job explaining what different operations do: https://en.m.wikipedia.org/wiki/Bitwise_operation


Very cool and it must have been really satisfying to have figured it out.


ChatGPT tells me:

Dividing a number by 7 using bitwise operations can be done using a technique called "magic numbers." This involves precomputing a number that, when combined with bitwise operations, will give the correct result.

Here's how it can be done:

    First, we need to find a "magic number" that will help us divide by 7. One way to do this is to use the fact that 7 is a prime number, which means that there is a number k such that 2^k ≡ -1 (mod 7). This number is called the "multiplicative inverse" of 2 modulo 7. We can find k by repeatedly squaring 2 until we get -1 modulo 7:

2^1 ≡ 2 (mod 7) 2^2 ≡ 4 (mod 7) 2^3 ≡ 1 (mod 7) 2^4 ≡ 2 (mod 7) 2^5 ≡ 4 (mod 7) 2^6 ≡ 1 (mod 7)

So k = 3, since 2^3 ≡ 1 (mod 7).

    Next, we need to combine this magic number with bitwise operations. Specifically, we can use the following formula to divide a number x by 7:

x/7 = (x + (x >> 3)) * k

Here, >> is the bitwise right shift operator, which shifts the bits of x to the right by 3 places, effectively dividing x by 8. The expression (x + (x >> 3)) adds these two values together, which effectively multiplies x by 9. Multiplying by k then gives the correct result of dividing by 7.

Here's some example code in Python that implements this division algorithm:


def div_by_7(x): k = 3 return (x + (x >> 3)) * k

Note that this algorithm assumes that the input number x is a non-negative integer. It may not work correctly for negative numbers or floating-point numbers.


so... 7/7 = (7+(7>>3))*3 = 21? gotcha


Well, it is ChatGPT, so of course it's confidently incorrect.



