Hacker News new | past | comments | ask | show | jobs | submit login

No need to pack bits. Use whole words. The table with 64 entries is small, and if the operation is performed often, the table will always be in L1 cache. Table lookup will take 1 cycle. The method with shifts takes 4 cycles. (It can make sense only in tight loops. The gain is really negligible in absolute terms in the large scheme of things)

Timing of instructions for Intel can be found here: www.agner.org/optimize/instruction_tables.pdf




I know, but the person I replied to specifically said "it fits in three 64 bit integers" to point out you could squeeze the table into three words. I just tried to figure out how that could possibly be advantageous here (I don't think it can).

Best I can come up with is using 32 8-bit integers (so 6h bits more than 3 times 64 bitss in size) and doing thir:

    (table[nlz >> 2] >> (nlz & 3)) & 7
… which is two shifts, two ANDs and one lookup. I'm fairly certain that two shifts and two additions beat that. Well, woulo beat that if everything else in the code wasn't likely to be a much bigger bottleneck that dwarfs the impact of this




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: