Hacker News
Bit Twiddling Hacks (2009) (stanford.edu)
92 points by elpocko on Aug 23, 2024 | hide | past | favorite | 25 comments


It's worth noting that modern x86 CPUs have specialized extensions for bit manipulation. Furthermore, C23 added <stdbit.h>, which provides standard functions for many of these operations.

https://en.wikipedia.org/wiki/X86_Bit_manipulation_instructi...

https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Modern%...


Absolutely, look into your standard library, CPU built-ins and intrinsics first.

However, these tricks are not useless even on modern CPUs. There are scalar instructions for a lot of this stuff, but if you are working with SIMD, several of them are not available.

Recently I wrote bit interleaving stuff with Morton codes. Maybe the scalar instruction would be faster in isolation but I already have my data in SSE registers for other arithmetic so in total it was probably faster. I used the code in this article as a basis.


The last link doesn't work.


This thing is posted so often on HN. Does anyone have any new cool tricks that aren't in the normal compendium of tricks?


Tricks that work on floating point are sometimes useful and not listed there..

   Make float sortable as integer. (The reason you might wish to do this is that integer comparisons are faster and run on more ports; also, you won't get unsortable data from NaN.) You can undo the transform by calling this function again. You only need this with a mix of positive and negative floats; if they are all positive you can skip it.

    int i = cast_int(f);
    i = (shift_right_sign_bits(i, 31) & 0x7FFFFFFF) ^ i;

   Scale float by powers of 2.
Cast a float's bits to integer and add 0x00800000 * N (N being the power of 2); subtract to divide. Fails on a zero-valued float.

Not super useful unless you already have the float in the integer domain for some other reason, also integer adds/subs run on more ports and are 1 cycle.

   The fast sqrt trick is well known, but you can do the same for other power-of-two roots:
    // An approximate pow(a, 1/4)
    float fastPow_1_4(float a)  { return asfloat((asuint(a) >> 2u) + 798700996u); }

    // An approximate pow(a, 1/8)
    float fastPow_1_8(float a)  { return asfloat((asuint(a) >> 3u) + 931853847u); }

    // An approximate pow(a, 1/16)
    float fastPow_1_16(float a) { return asfloat((asuint(a) >> 4u) + 998350438u); }

    // An approximate pow(a, 1/32)
    float fastPow_1_32(float a) { return asfloat((asuint(a) >> 5u) + 1031705320u); }

They are very low accuracy as they don't have a Newton-Raphson step; they were just intended for things like graphics where accuracy isn't always important.


Daniel Lemire's entire blog: https://lemire.me/blog/ .

It focuses on low level optimizations, often by using SIMD, avoiding division, and being generally clever. It's nothing I use in my job, but I enjoy reading every post.


agreed! great blog


There are additional tricks that are more specialized or generalizations of the ideas here beyond their presented scope, like SWAR (SIMD Within A Register) over arbitrary packed bitfields. I realize this is not that helpful but most of the tricks not listed here won’t be interesting to most people. Many of the bit twiddling tricks that are widely known are also known to compilers, which can generate them automatically. Hell, x86 has single instructions that implement many of the well-known tricks directly.

Where these tricks become handy again is SIMD vectorization. The compiler is much less clever in these situations and the scalar bit-twiddling instructions often don’t have vectorized equivalents.


Maybe http://0x80.pl -- this site features mainly SIMD/SWAR code.


I don’t have the code handy, but I know it’s pretty easy to Google up the bit hack to convert an integer to a position in a Morton order curve.

The new thing I found long ago is that once you have done that, it’s pretty easy to figure out how to incrementally advance along the curve in much fewer instructions than the initial int—>Morton conversion.


here's one or two

    shr ecx, 1
    lea eax, [rcx + 1FC00000h]
    shr eax, 1
    add eax, ecx


This reminded me of the puzzle book “xchg rax,rax”. Maybe that has some hidden insights… :P

https://www.xorpd.net/pages/xchg_rax/snip_00.html


Not sure if this counts, but I got tired of seeing all of the magic constants used in several of these tricks and not understanding how they were generated. Came up with the following via trial and error:

    static T NthFermatMask<T>(this int value) where T : IBinaryInteger<T> {
        var x = T.AllBitsSet;
        var y = T.IsNegative(value: x).As<int>();

        return ((((x >>> y) / value.NthFermatNumber<T>()) << y) | T.One);
    }
    static T NthFermatNumber<T>(this int value) where T : IBinaryInteger<T> {
        return ((T.One << (1 << value)) + T.One);
    }
NthFermatNumber generates the nth Fermat number (see https://oeis.org/A000215) and NthFermatMask generates magic constants of the form 0x55555555, 0x33333333, 0x0F0F0F0F, etc. To demonstrate usage, here is a generic interleave bits implementation:

    public static TResult BitwisePair<TInput, TResult>(this TInput value, TInput other) where TInput : IBinaryInteger<TInput> where TResult : IBinaryInteger<TResult> {
        switch (value) {
            case short:
            case ushort:
                if (Bmi2.IsSupported) {
                    return (
                        TResult.CreateTruncating(value: Bmi2.ParallelBitDeposit(mask: 0.NthFermatMask<uint>(), value: uint.CreateTruncating(value: value))) |
                        TResult.CreateTruncating(value: Bmi2.ParallelBitDeposit(mask: (0.NthFermatMask<uint>() << 1), value: uint.CreateTruncating(value: other)))
                    );
                }
                break;
            case int:
            case uint:
                if (Bmi2.X64.IsSupported) {
                    return (
                        TResult.CreateTruncating(value: Bmi2.X64.ParallelBitDeposit(mask: 0.NthFermatMask<ulong>(), value: ulong.CreateTruncating(value: value))) |
                        TResult.CreateTruncating(value: Bmi2.X64.ParallelBitDeposit(mask: (0.NthFermatMask<ulong>() << 1), value: ulong.CreateTruncating(value: other)))
                    );
                }
                break;
            default:
                break;
        }

        const int loopOffset = 7;

        int offset;
        int shift;

        var bitCountDividedByTwo = (int.CreateChecked(value: BinaryIntegerConstants<TResult>.Size) >> 1);
        var evenBits = TResult.CreateTruncating(value: other);
        var oddBits = TResult.CreateTruncating(value: value);

        if (loopOffset.NthPowerOfTwo<int>() < bitCountDividedByTwo) {
            var i = ((int.CreateChecked(value: BinaryIntegerConstants<TResult>.Log2Size) - loopOffset) - 1);

            do {
                offset = (i + (loopOffset - 1));
                shift = offset.NthPowerOfTwo<int>();

                DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift);
            } while (0 < --i);
        }

        offset = 6; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 5; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 4; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 3; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 2; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 1; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }
        offset = 0; if ((shift = offset.NthPowerOfTwo<int>()) < bitCountDividedByTwo) { DistributeBits(evenBits: ref evenBits, oddBits: ref oddBits, offset: offset, shift: shift); }

        return (oddBits | (evenBits << shift));

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        static void DistributeBits(int offset, int shift, ref TResult evenBits, ref TResult oddBits) {
            var mask = offset.NthFermatMask<TResult>();

            evenBits = ((evenBits | (evenBits << shift)) & mask);
            oddBits = ((oddBits | (oddBits << shift)) & mask);
        }
    }


That's why the example also had the bit-encoding laid out.

The magic constants start off with an every other order, then combine that comb size until you're left with front and back.

    5 0101 1010 A
    3 0011 1100 C
    F 1111 0000 0


It's nice that it expresses algorithms in C. It's more concrete than most of Hacker's Delight, which is similar and more comprehensive, but often vaguer and occasionally covers tricks used on historical architectures and systems.[0]

0. Hacker's Delight (Second Edition) by Henry S. Warren, 2013, ISBN: 0321842685


It is nice, and it's also nice that it contains notes on how right-shifting a signed integer is implementation-defined in C (it's not the much-feared undefined behavior) and thus inherently non-portable.

You should of course always check your compiler output when doing stuff like this, but doubly so in those cases.


Unfortunately it doesn’t mention how left-shifting a negative signed integer is undefined behavior. There are a few cases where constants were changed to fix this, e.g. using (1U<<n) instead of (1<<n), but strangely that change didn’t get propagated everywhere and it appears there are still a couple cases in the code that might be UB. Specifically, the variation of variable-width sign-extend in 3 operations looks crazy. Aren’t there multiple UBs in this one?

    unsigned b; // number of bits representing the number in x
    int x;      // sign extend this b-bit number to r
    int r;      // resulting sign-extended number
    // The following variation is not portable, but on architectures that employ an arithmetic right-shift, maintaining the sign, it should be fast.
    const int s = -b; // OR:  sizeof(x) * CHAR_BIT - b;
    r = (x << s) >> s;
Not sure if I’m missing something, but it looks like this relies on shift by a negative number (UB), potential left shift of a 1 bit into the sign bit (UB), and potential right shift of a negative number (implementation defined).

Isn’t this much worse than non-portable, since it’s guaranteed to be undefined?


in the same spirit and an alive project:

https://jjj.de/fxt/fxtpage.html



Decades ago, I discovered a compiler for an obscure microprocessor was generating a subtraction operation to test if two integers are equal. I then turned the optimisations on and it generated an XOR instruction instead.

I suspect a lot of these hacks are used by an optimising compiler.


I’m surprised the yearly advent of code never requires any bit twiddling.



Oh, is it that time of the year already?


with this bitwise magic, my code will find the sign faster than you can say "boolean operation!" who needs branching when you've got clever tricks up your sleeve? hate to say it, but this beats the obvious way, hands down!


would modern compilers not convert the simple boolean operations into clever bit twiddling hacks for you?



