Hacker News

Yes! This and only this! The posted "bit hacks" are little more than bit-level manipulations, exactly as intended. The Stanford hacks are truly excellent. My favourite is: https://graphics.stanford.edu/~seander/bithacks.html#RoundUp...



That's really clever, but extremely inefficient. Something like 1 << (sizeof x - __builtin_clz(x)) should be way faster.

Edit: you obviously have to multiply by 8 to convert bytes to bits


By the unwritten conventions of bit-hacks folks, '__builtin_clz' is not a primitive (and elsewhere in that same document you will find a brace of different ways of doing clz/ctz). Many of these documents are old, and the techniques in them older still (dating back to HAKMEM).

You are definitely correct on most modern architectures, which have single-instruction count-leading/trailing-zeros operations that are typically no more expensive than an integer multiply (latency = 3, throughput = 1 on most recent Intel architectures).

I think a redo assuming modern architectures is long overdue; i.e. let's work off the basis that count leading/trailing zeros, PDEP/PEXT, etc are fast and cheap and start from there.

Bonus round: if you had a lot of these operations to do, then you still might be able to go faster on x86 on SSE2 or AVX2 by using this trick, as the penalty to pass a SIMD register full of integers to a floating point op or vice versa is just one cycle. This trick becomes obsolete yet again with AVX512, which introduces parallel count leading zeros.


And neither bit-twiddling approach is useful for ARM NEON, as bit operations in vector form are very limited... (plus there are pipeline stalls)

Also, multiply-adds can be fused, so if you're doing that it can be faster to just multiply by a different number instead of bit twiddling.


I'm not sure this is true. Many of the bit twiddling hacks can be used in NEON and they have a few unusual instructions I'm dying to play with.

I'm not sure which bit hack you're talking about that's done with a multiply or multiply-add. There's a nice use involving De Bruijn sequences for taking log2 of a single set bit that's very instructive - is that what you meant?


Of course, casting that int * to a float * is undefined behavior…


Of course.

When you first learn about 'cool hacks' like these, you'll want to use them everywhere.

Then you'll realize that they don't apply to certain data types in certain environments.

Then you'll come to realize that the rules governing which hacks to use are too complex to remember.

Then you'll want to write a program that figures it out for you.

Then you'll realize that those programs already exist, and they are called compilers.

So now you are back where you started. You have made no progress, but you have gained an important insight.


And from that, at some point you realize that the corollary to your #2 says that such hacks do apply to certain data types in certain environments. And then you've gained a very powerful tool to be used in specific circumstances.

There's a reason a lot of these potentially non-portable tricks show up in high-performance computing (particularly in videogames) and in the embedded space.


That reason being that sometimes spending 10+ hours on a tweak that gives a +0.1% performance improvement is worth it.

Otherwise, remember the first rule of code optimization for beginners: "Don't."

EDIT: And by "they don't apply", I mean they can crash your program or even silently corrupt your data.


Sometimes tweaks like these might only save 10-20 cycles, which in a vacuum doesn't sound like much, until you consider that before the tweak, it started at 26 cycles and is now 6 cycles, and it's called 50 million times every second - a savings of a billion cycles per second.

For games, this can mean a higher frame rate. For data processing, it can mean the data gets processed significantly faster. It's especially important for things like video encoders, where even a 1% savings in CPU time can translate directly to 1% more potential revenue for the same operating cost at a company that relies heavily on video encoding.

Yeah, saving those cycles doesn't really mean anything to a beginner, but they can be a huge savings for high performance computation applications.


Yep the correct way is:

    union { int i; float f; } u;
    u.i = *i;

Now u.f contains the same bit pattern as the integer pointed to by "i" but interpreted as a float.

Compilers know how to generate code efficiently out of this.

IIRC this is implementation-defined behavior (e.g. things like endianness, alignment, or padding), but it's not undefined behavior, whose evil effects can bubble up in unexpected parts of the program that apparently don't even touch the bad cast.


No, that's also incorrect. The correct way is:

    float f;
    memcpy(&f, i, sizeof f);

Don't worry about the implied overhead of the call to memcpy; the compiler will replace it with an intrinsic and emit the expected movss instruction.


Accessing through a union is legal in C, no? I thought that this was only disallowed in C++.


Maybe? There is wording in the C standard (since C99 TC3, I think) that allows reading from non-active members of a union, but the wording is unclear and implementations have taken a very narrow interpretation of it (i.e. the access path needs to be explicitly through a union type and the actual object needs to be a union). See the GCC documentation on unions and aliasing, for example.


It's explicitly legal in both C99 and C11, but C99 forgot to remove it from its master (non-normative) list of undefined behavior. It's up for debate whether or not it's legal in C++ (ISTR a C++0x draft that copied the C99 changes to unions, but then they added unrestricted unions and completely changed the verbiage to have almost no relation to the C99 text).


It mostly isn't legal, though gcc explicitly promises to support it and Microsoft's compiler doesn't even do the optimizations that would cause trouble.


Wow, never realized the pointer cast version is undefined behaviour in C. Is C++ equivalent via reinterpret_cast undefined behaviour too?

Back in my C++ days, I would much prefer the pointer cast to the union-struct, because the latter was subject to the alignment/padding issues you mention, while for the former, I know of no desktop architecture that could mess up an int* -> float* cast.

I'm also doubly surprised because being able to do that stuff is what I'd consider the selling point of C/C++ over higher-level languages.


Pretty much all pointer casts where you would use reinterpret_cast in C++ are undefined behavior, in both C and C++. The underlying rule for strict aliasing is essentially the same in both languages: you can only access an lvalue via a pointer to its own type (possibly differing in signedness, const, or volatile), via a char pointer, or via a field of a struct that has the same initial layout, so long as you stay within that common initial portion.

Yeah, C and C++ aren't the language most people think they are, and the strict aliasing rules are the most clear manifestation of that.


Trouble is, C was that language. Lots of people came to depend on this. The world needs a language with this ability.

The replacement is often "gcc with inline assembly" or "gcc with lots of __attribute__ extensions such as __may_alias__" or "gcc with optimizations disabled".

The need doesn't go away, despite the wishes of a language committee now dominated by compiler suppliers. From day 1, long before the language was standardized and even before the original (not ANSI) book was published, C was being used with aliasing. It was, after all, developed to support the creation of Unics (before that became UNIX).


reinterpret_cast in C++ has essentially the same rules, with exceptions made for const qualifiers and such, as well as void *, char *, and std::byte *. Use a memcpy instead.


Nope, it's undefined:

> Accessing an object by means of any other lvalue expression (other than unsigned char) is undefined behavior

https://wiki.sei.cmu.edu/confluence/display/c/EXP39-C.+Do+no...


I don't understand your comment. The link you pasted explicitly mentions the union example as a "compliant solution".

That said, memcpy with a compiler that understands the memcpy intrinsic is probably safer; though I have a vague memory of hitting an embedded compiler where memcpy wasn't an intrinsic.


This is technically undefined in C, but most compilers define it to work as expected as an extension to the language.


While that is a clever hack if you're looking for performance (which is where I've used these before), the float-to-int conversion normally hoses your pipeline (depending on platform, obviously), and the solution below is much faster on any system I've worked with.




