Fibonacci Hashing: An optimization that the world forgot (2018) (probablydance.com)
204 points by djoldman on April 29, 2023 | 63 comments



I first learned this trick for generating hues for colors when you don't know how many you'll need.

I think LibreOffice might do this. It always picks blue and orange for the first 2 series in a chart. But you can keep picking, and it always has more colors that are just slightly different from all the previous ones.
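For illustration, here is a minimal sketch of that hue trick, assuming hues in [0, 1) scaled to degrees (the starting hue and the number of series are arbitrary choices); the 0.618... step is 1/φ, the same golden-ratio idea behind the Fibonacci hashing constant:

    #include <stdio.h>

    /* Sketch: step around the hue circle by 1/phi so every new hue lands
       far from all previously chosen ones, no matter how many you need. */
    int main(void) {
        const double inv_phi = 0.6180339887498949;  /* 1 / golden ratio */
        double hue = 0.0;                           /* arbitrary starting hue */
        for (int i = 0; i < 8; i++) {
            printf("series %d: hue %.1f degrees\n", i, hue * 360.0);
            hue += inv_phi;
            if (hue >= 1.0) hue -= 1.0;             /* wrap around the circle */
        }
        return 0;
    }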


Multiplication is bad. Knuth actually also describes a hash function using random numbers and XOR. It's 10x faster than the modulo, and I believe Mikkel Thorup proved it optimal.

The idea is roughly:

Say you have a hashtable of size 1024. You then create x uint arrays of size 256. These arrays you fill up with random numbers 0-1023.

To get your hash value, you take your input and for i=0..x-1 take byte k=input[i] and look up the value in array[i][k]. These lookup values are then XORed together, giving a final random value between 0 and 1023, ready for indexing into the hash array.

No modulos. No multiplications. You only have to redo the random tables when the size changes from, say, 1024 to 2048. Easy peasy. Superfast.
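A minimal sketch of that scheme (tabulation hashing, as discussed further down the thread), assuming 8-byte keys and tables already filled with random values in [0, num_slots):

    #include <stdint.h>
    #include <stddef.h>

    #define KEY_BYTES 8

    /* tables[i][b] holds a random value in [0, num_slots); the tables are
       refilled whenever num_slots changes. Sketch only, not production code. */
    static uint32_t tables[KEY_BYTES][256];

    static uint32_t tab_hash(const uint8_t key[KEY_BYTES]) {
        uint32_t h = 0;
        for (size_t i = 0; i < KEY_BYTES; i++)
            h ^= tables[i][key[i]];   /* one table lookup per key byte, XORed */
        return h;                     /* already in [0, num_slots) */
    }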


“Superfast”, until you blow through your L1 cache, which happens pretty early on if you need 1 kB of table per byte in your key.

Even in the L1 cache, it's hard to beat the mul: A multiplication (which can hash multiple bytes in the case of Fibonacci hashing) has 3 cycles of latency on modern x86. A single load, even from L1, is 5, I believe.


How much of a problem 3 cycles of latency is depends on what else your processor is doing. It might not be any problem at all.


Word. It depends on access patterns of course but yeah, L1 is a valuable resource.


Modern CPU cores can perform a multiplication and addition every clock tick. Heck, I'd expect a modern Zen4 core to be able to do like 4 parallel 64-bit multiplications per clock tick on its integer pipelines, and maybe 32 parallel 32-bit multiplications per clock tick on its vector pipelines.

Multiplications were bad 40 years ago, but the year 2020 called and FMAC is incredibly optimized today.

You should still avoid integer division (floating-point division is commonly optimized as a reciprocal and then a multiply). But multiplications have been really, really fast since at least 2008 or so.

-------

I'm pretty sure multiplication's latency is only 5 clocks, but with all the out-of-order processing that occurs on modern cores, a latency of just 5 ticks is rarely the bottleneck. (A DDR4 memory load is like 200+ cycles of latency. You shouldn't even worry about the 5 cycles of a multiplication, especially because those out-of-order cores will find some work to parallelize in that time.)

-----

> you lookup the value in array[i][k]

You know an L1 cache lookup these days is like 4 cycles of latency, right? And I'm pretty sure you have fewer load/store units than multiplication units. So a load/store, even to L1 cache, might use more resources than the multiply.

Might, I'd have to benchmark to be sure.


Indeed, division just doesn't have a parallel algorithm, unlike mul and add. So it's bound to be 'slow'. About 2008: Intel Core 2 (2006) had a 3-cycle mul. Edit: the Pentium Pro (1995)'s imul was 4 cycles. The 386's imul was slow, though.


This sounds like Zobrist hashing, or related. https://en.wikipedia.org/wiki/Zobrist_hashing

" Zobrist hashing is the first known instance of the generally useful underlying technique called tabulation hashing. "

so, on to https://en.wikipedia.org/wiki/Tabulation_hashing

"

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; [...]

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: [...]

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including hopscotch hashing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

"

further

" Method: The basic idea is as follows:

First, divide the key to be hashed into smaller "blocks" of a chosen length. Then, create a set of lookup tables, one for each block, and fill them with random values. Finally, use the tables to compute a hash value for each block, and combine all of these hashes into a final hash value using the bitwise exclusive or operation.[1]

"


How come mul is bad? It has low latency - Skylake did 3 cycles per imul, and mul r32 has single-cycle throughput. Div is bad, but mul is great.

edit: Memory access (along with div) is pretty much the only slow operation in modern CPUs -- extra pressure on L1 just to get a random number is not smart at any rate; heck, Marsaglia's xorshift (random) is likely cheaper than accessing L1, with very likely all the latency hidden behind the memory access.


Multiplication was bad on decades-old CPUs.


Others have hinted at this, but to be clear: This algorithm is slow, even in the optimal case where the tables are in cache. On new x86 CPUs it is theoretically limited to less than 2 bytes per cycle. Probably somewhere around 1.5 for an implementation that loads 8 bytes of input at once and shifts through them in order to limit load on the load slots.

Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.

This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.


How do you fill the array though? Wouldn't filling it with random numbers give you a different hash each time you rebuild the hash function? I can see it being useful for a short-lived data structure, but you wouldn't be able to use it as a shared deterministic hash function?


For in-memory tables you rarely need determinism across instances, let alone runs.


Why couldn't you fill it deterministically?


I guess that was indeed my question: how do you do it in practice? I suppose there are pseudorandom algorithms that can easily be applied.


Pseudorandom bit sequences (PRBS) are deterministic and very easy to implement (just linear-feedback shift registers).

Something similar is actually done in communication systems, with scrambling, to prevent long strings of transmitted ones or zeros (which cause issues for some of the hardware components). Essentially you just add or multiply the data with the PRBS sequence. At the receiver you just do the reverse operation.
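For example, filling the tables deterministically from a fixed seed with a simple xorshift generator might look like this sketch (the seed and the xorshift64 shift triple are arbitrary choices):

    #include <stdint.h>
    #include <stddef.h>

    /* Marsaglia-style xorshift64: deterministic for a given seed, so every
       rebuild produces identical tables. Sketch only; any decent PRNG works. */
    static uint64_t xorshift64(uint64_t *state) {
        uint64_t x = *state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
    }

    static void fill_tables(uint32_t tables[][256], size_t key_bytes,
                            uint32_t num_slots /* power of two */) {
        uint64_t state = 0x9E3779B97F4A7C15ull;  /* fixed seed, arbitrary choice */
        for (size_t i = 0; i < key_bytes; i++)
            for (size_t b = 0; b < 256; b++)
                tables[i][b] = (uint32_t)(xorshift64(&state) & (num_slots - 1));
    }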


The replies to this comment are peak HackerNews: every commenter is a bigger expert than Donald Knuth, but nobody quotes any actual benchmark results to go with their theories.


It isn't fair to expect the same rigor from a comment.

But you don't have to be a bigger expert than Knuth to dismiss an optimization done for hardware from, say, 40 years ago (I don't know the circumstances around this particular case).

Even in that case though it might still be relevant for embedded CPUs.


So it’s just some unfounded handwaving? You can dismiss one of the greatest minds in computer science just by blathering in a comment, because then it’s ‘unfair to expect rigor’?

If it is all so clear and all the armchair experts here have ample experience in the field like they pretend, why is it so hard to run a few benchmarks?


It's not just handwaving. It's knowing the context. As late as the mid 80's it was not unusual for multiplication to take tens of cycles if your CPU even had a built in multiplication instruction (e.g. on the M68000 in the Amiga, a MULU - unsigned 16 bit -> 32 bit multiplication - took from 38 to 70 cycles plus any cycles required for memory access; if you needed to multiply numbers larger than 16 bit to depend on the overflow like in the case of this article, you'd need multiple instructions anyway), far worse if it didn't, while memory loads were often cheap in comparison (on the M68000 a memory read indirect via an address register could be down to 8 cycles), and so until years after that it did make sense to do all kinds of things to reduce the need for multiplication. But it doesn't any more.

While it'd be worthwhile doing tests to confirm a specific case, the default assumptions have changed: Today memory is slow and multiplication fast (in terms of cycles; in absolute terms both are of course far faster).

You certainly should not today pick a more complex hashing scheme to try to avoid a multiplication, without carefully measuring it, just because it was discussed even by someone as smart as Knuth in a context where the relative instruction costs were entirely different.

If you're actually using the function as the primary hash function, then the distribution of the output might well make up for a significant performance difference, so this is not to suggest that tabulation hashing isn't a worthwhile consideration.


How is it unfounded? How is it unreasonable to question the relevance of an optimization from a time where the relevant parts of computing were completely different?

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.", Donald Knuth

Maybe doing benchmarks for a comment isn't worthwhile. I guarantee you'd have to do different benchmarks for different contexts, so I couldn't blindly trust a single benchmark anyway. Not to say it wouldn't be interesting.

I wouldn't mind someone plotting the cost of instructions over time and how that affects choice of algorithms. But to expect that from a comment?


So here’s a paper where the author has run the experiment:

https://arxiv.org/pdf/1011.5200.pdf

And the answer is that the speed is comparable to other functions, which, however, produce worse results.


A couple of issues with that interpretation:

1. Dated hardware, so the hashing algo speed comparison is no longer relevant without redoing it, but even on hardware that old a 2.2x-2.8x speed advantage for mul+shift is substantial.

2. No tests with contention for the cache; no tests with different table sizes; no code given. As a result it's impossible to tell if the performance numbers are relevant and realistic.

3. If they could demonstrate substantially better distribution, it might still be very worthwhile despite how much slower it is, but they test the hashing algorithms with runs of 100 random constants. We don't know if any of those constants are any good because they've not given them, but odds are highly against 100 random constants even approaching good. As such, the comparisons of tabulation hashing with the other hashing method are meaningless in terms of performance (but see below) - it's trivial to find constants for multiplication + shift that produce pathologically bad outcomes.

What the paper does appear to show is that tabulation hashing might have more predictable runtime given the result on the specific set of structured input they test with, and that might well be a good reason to use it for some applications.

But that is tainted by the lack of transparency in what they've actually compared against.

(This is also mostly relevant if you're considering using a multiplication-shift based hash function, which is also not what the original article is advocating you use Fibonacci hash for.)


The hardware is not that dated; they specifically target then-modern hardware that has the fast multiplication everyone is on about. And while multiplication has gotten faster, of course the size of the L1 cache has also grown. It is per-core, so there is not going to be a whole lot of contention going on. A modern budget CPU has about 32 KB per core, so it’s not going to be a squeeze.

I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.

You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.

Then again, someone else has probably done the work more recently or more in line with what you want to see. I found this paper in a few minutes of web searching; I’m sure you can spare the time to find a better one.


> It is per-core so there is not going to be a whole lot of contention going on.

That's only true if nothing else happens between requests to the hash table. That might be the case, or it might not. Depending on your workload that might make no difference or totally ruin your performance characteristics.

> I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.

The problem with this is that while it may well have better characteristics than multiply and shift on average, the quality of the distribution of multiply and shift based hashes can vary by many orders of magnitude depending on the choice of multiplication factors. Put another way: Multiplying by 1 and shifting is a perfectly valid multiply and shift hash function. It's a very stupid one. The performance characteristics for a hash table doing that is nothing like the performance characteristics of what is proposed in the original article. I have no doubt that the table based approach will beat a large proportion of the multiply and shift hashes. But so does other multiply and shift hashes, by large factors. As such, without actually comparing against a known set of some of the best multiply-shift hashes we learn very little about whether or not it'll do well against good multiply-shift hashes.

To me, the fact that they chose random factors is very suspicious. Nobody uses random factors. The effort spent on choosing good factors over the years has been very extensive, and even hacky, ad hoc attempts will tend to use large prime numbers.

> You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.

I've not made demands for anything. I've pointed out that making a blanket claim that multiplication is bad, when the performance characteristics have changed as much as they have, is unreasonable, and a paper like this tells us pretty much nothing more. It's absolutely reasonable to consider table based approaches; it's quite possible, even likely, they'll have desirable properties for various sets of inputs - there is no such thing as a perfect hash function for all inputs, and sometimes you care most about pathological worst case scenarios, sometimes you care about averages, sometimes you know what data you will or won't see. That it performs as well as they've shown it to means there are almost certainly situations where it will be a good choice. But because of the choices they made in that paper we can't really tell when and where that would be, and that's a shame.

What is not reasonable is just writing off the use of multiplication on the basis of hardware characteristics that are decades out of date. That doesn't mean you should blindly use that either.

If there's one thing people should know about working with hash tables it's that you should test and measure for the actual type of data you expect to see.


Sorry, all I can do is suggest you come back tomorrow and reread the comment you wrote.


I have and stand by every word, but seeing as you have no arguments and resort to implied insults we are done here.


Sorry, if you start making up giant arguments against claims I have not made there really is no response. But clearly you don’t see that.


It is funny, the paper claims to test the 64-bit code on "Dual-core Intel Xeon 2.6 GHz 64-bit processor with 4096KB cache". That is a really poor description, as it does not tell us what architecture the processor is. But one can go through a list of all Intel Xeon processors to find the ones that match the description. Turns out that there are none.

If we broaden the search to 2.66 GHz processors there are 4: 5030, 5150, 3070 and 3075. All released in 2006 and 2007. This means it is either one of the last "NetBurst" CPUs or one of the first "Core" CPUs. Assuming "Core", the relevant operation has a 5-clock latency, as best I can tell. This is down to 3 clocks on pretty much all modern x86 CPUs. Modern CPUs also get an extra load port, so I doubt the relative difference is much different on modern CPUs.

Overall it looks like a pretty bad benchmark, thrown into a paper on collision likelihood, which itself looks like an academic exercise with no relevance for the real world.


Embedded CPUs would have much worse memory access latency, and a lot less memory to spare - so if anything wasting memory on tables is likely to perform worse as well.


How do you mean? Measured in cycles embedded devices typically have less latency to SRAM.


You're right, I guess. On devices where the memory (SRAM) tends not to have its own clock (and there is no OOO), it can effectively be one CPU cycle. PIC and ESP32 come to mind. If there is 'extra' off-chip memory, it'd be way worse, of course.


Why? Eg https://news.ycombinator.com/item?id=35760954 quotes some (vague) numbers.


Discussed at the time:

Fibonacci Hashing: The Optimization That the World Forgot - https://news.ycombinator.com/item?id=17328756 - June 2018 (75 comments)


Interesting article, but I'm not convinced about this idea.

This is just fixing up the hash function, which should have a good enough distribution from the start. And if it did, this remapping wouldn't be necessary.

The proper place to add this is in your hash function, to "fix up" the distribution. But then, if your hash function doesn't generate a good enough distribution without this trick you should probably redesign it.


This misses the point which is two-part:

1) A lot of hash tables use modulo to map a large range to a small range, and this is slow.

2) Fibonacci hashing is, according to the author, both faster and, as a side effect, happens to improve on bad inputs.

Yes, you should improve your hash function, but assuming the author is right, you should still prefer this over integer modulo as the default way to reduce the input hash value to the size of the hash table, because it's faster; the extra mixing is just a bonus.

Also, for library implementations where the implementer has no control over which hash function the user provides, being resistant to pathological inputs is good. Especially if you get it as a free side effect of a performance improvement.


> Yes, you should improve your hash function, but assuming the author is right, you should still prefer this over integer modulo as the default way to reduce the input hash value to the size of the hash table, because it's faster; the extra mixing is just a bonus.

What we're comparing here are these two:

    // We're only dealing with sizes of num_slots = 2^b

    size_t key_to_slot1(uint64_t key) {
       return hash(key) & (num_slots - 1);   // keep the low b bits
    }

    size_t key_to_slot2(uint64_t key) {
       return (hash(key) * k) >> (64 - b);   // k = the Fibonacci constant; keep the high b bits
    }
The only difference is: (1) we're doing an extra multiplication, and (2) we're using the high bits instead of the low bits.

This should _only_ help if you use a terrible hash function. I'm also not sure why they claim that integer modulo is slow when they show how to make it fast by using bitwise AND.

> Also, for library implementations where the implementer has no control over which hash function the user provides, being resistant to pathological inputs is good. Especially if you get it as a free side effect of a performance improvement.

But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function. We've learnt this before: Instead of letting the user provide `int hashCode(Key key)` (as in Java), we use `void hash(Hash state, Key key)` (as in Rust). The user will tell us which sources of data needs to be hashed, but we can provide the hash function and mixing functionality.


> This should _only_ help if you use a terrible hash function.

So in other words it helps. The point being that when you're writing a library where you have no control over what people will pass you, you can either punt on doing the best you can and leave it to be the user's problem, or, when - like in this case - there is a trivially simple option that makes things better, you can make it better.

> I'm also not sure why they claim that integer modulo is slow when the* show how to make it fast by using bitwise AND.

They show that integer modulo is slow because a lot of code uses the actual modulo rather than bitwise and (granted often because they assume they need a prime sized hash table to do well).

EDIT: Note that the article specifically also addresses the bitwise AND by pointing out that Dinkumware, in using that, 1) had to compensate by using a more expensive hash for integers, and 2) created a situation where implementers providing a custom hash had to be aware that this specific implementation would just throw away the high bits, and pointed out they'd been bitten by that issue several times. It also mentions how Google's google::dense_hash_map did the same thing, but without even providing its own integer hash replacement to compensate.

> But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function.

He's comparing it to the C++ stdlib. An implementation that is intended to be a drop-in replacement for that needs to abide by similar guarantees, so it is possible to provide a custom hash function for the set of contracts in the libraries he compares with, and it's fairly common to do so.

The logical conclusion is that there is a reason for him to make it possible for the user to provide a custom hash function, because it's what the clients he wants to support expects.

Sure when you have control over the API and you are free to refuse to let users of the API set custom hash functions, just provide a better hash function.

But that is explicitly not the case this article is addressing, so you're arguing against a strawman.

The argument that another solution to this issue is to just refuse to let users provide the hash function would be entirely reasonable, and maybe it'd have been nice if the article started by addressing that.


> But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function.

Technically Rust users are also welcome to provide a custom hash function, it's just that it's defined in terms of an API called by all the stuff which wants to be hashed via the trait named Hasher. You can get a drop-in FNV1a hash for example: https://crates.io/crates/fnv

The mechanics to make all this work in Rust involve a proc macro, so people who want to be able to put a Foozle in a HashMap (or an FNV HashMap, or their own custom hashtable) just write #[derive(Hash,Eq,PartialEq)] above the definition of Foozle telling Rust yeah, just look at the constituent parts of this type and use them to derive functions for equality and hashing the whole type. C++ macros aren't powerful enough to do that, so you'd need a more powerful reflection mechanism and templates to do it I think.


In Stockfish, multiplication is used and then the lowest 16 bits are stored in the entries to detect collisions. This is because storing the full hash, or god forbid the board state, would double or n-tuple the overhead per entry, which is itself 64 bits not counting the 16-bit key.
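Roughly the shape of that check, as a sketch (the entry layout here is hypothetical, not Stockfish's actual format): the table index comes from the full 64-bit Zobrist key via a multiplication, and only 16 low bits are kept per entry to validate a probe.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical transposition-table entry in the spirit of what the
       parent describes; the exact fields and sizes are made up here. */
    struct tt_entry {
        uint16_t key16;    /* low 16 bits of the Zobrist key, used as a check */
        uint16_t move;
        int16_t  value;
        uint8_t  depth;
        uint8_t  flags;
    };

    /* Map the 64-bit key to an entry with the "multiply and take the high
       half" trick, then verify the probe with the 16-bit check.
       (unsigned __int128 is a GCC/Clang extension.) */
    static bool tt_probe(struct tt_entry *table, uint64_t num_entries,
                         uint64_t zobrist, struct tt_entry **out) {
        uint64_t index = (uint64_t)(((unsigned __int128)zobrist * num_entries) >> 64);
        struct tt_entry *e = &table[index];
        *out = e;
        /* Two different positions can still agree in these 16 bits; that
           rare false match is the collision discussed below. */
        return e->key16 == (uint16_t)zobrist;
    }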


So does this mean Stockfish can have situations where it doesn't know the difference between board position A and board position B because they had the same key? This seems like an opportunity for some (presumably small) fraction of completely spurious losses, where Stockfish is pursuing position X (a clear winner) by taking steps towards position Y (a loss which may look very obviously different to a human) and only realises when it's too late.


Yes, but this has much less of an impact than you might think. The transposition table is used to cache the evaluation and best move of previously looked-at positions, which can speed up the search by guiding it in the right direction. But probably the worst thing that can happen if it gets data from the wrong position is that it wastes some time looking at a bad move (also quite likely is that the move is illegal and it's just skipped), and then has to re-search the position because the result made no sense. Even if in theory it could lead to a misevaluation, ultimately Stockfish's move is decided by a vote between different search threads, so there would have to be a conspiracy between multiple threads as well. It's pretty unlikely they'll all see the same collision because they're set up to emergently end up in different parts of the search space. And the transposition table is constantly being updated.


AFAIK most high-performance modern hash tables use power-of-two sizes anyway.


Using Fibonacci hashing to pick buckets fixes the "accidentally quadratic" behavior when resizing a hash table. There are other possible ways to fix this problem too, but the trick described here will work.

Using a better hash function with a better distribution is orthogonal to mapping hash values to buckets in the table.

https://www.tumblr.com/accidentallyquadratic/153545455987/ru...


One property I find interesting about this scheme is that resizing the table (doubling or halving) could be done without recomputing any index — and therefore be superfast compared to the mod-prime method.

Each bucket in the smaller table maps to precisely two buckets in the larger table. The differences are in the lowest bit in the larger table's indices.

If you'd store the product in each entry then you wouldn't need to redo the multiplication when resizing. I suppose that the product would work just as well as the hash value when looking up an entry in a bucket, so you could just store that instead.
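A sketch of that property, assuming the full 64-bit product is stored per entry: growing from 2^b to 2^(b+1) buckets just exposes one more bit of the product, so each old bucket i splits into buckets 2i and 2i+1 without redoing any multiplication.

    #include <stdint.h>

    /* With Fibonacci hashing the bucket is the top b bits of the product,
       so a stored product is all you need to re-derive indices after a resize. */
    static uint64_t bucket_of(uint64_t product, unsigned b) {
        return product >> (64 - b);
    }

    static uint64_t bucket_after_doubling(uint64_t product, unsigned b) {
        uint64_t old_index = bucket_of(product, b);
        uint64_t next_bit  = (product >> (64 - (b + 1))) & 1;
        return (old_index << 1) | next_bit;   /* same as bucket_of(product, b + 1) */
    }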


This is also true for power of two hash tables, but it doesn't work anymore once you use open addressing.


Along the same lines there’s Dijkstra’s use of the Leonardo numbers in smoothsort. I doubt that one in a hundred working computer programmers have heard of them.


Can you share a link?


The first source quoted in the Wikipedia article* for Smoothsort is a link to a transcript of Dijkstra's discussion/presentation of it.

* https://en.wikipedia.org/wiki/Smoothsort#cite_note-EWD-796a-...


This style of hash function is used by zstd, lz4, and many other LZ algorithms in their hash tables as a fast hash with good enough quality.
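For instance, LZ4-style match finders hash the next few input bytes with a multiply-and-shift of exactly this shape (a sketch, not the actual lz4 source; 2654435761 is the 32-bit analogue of the Fibonacci constant, roughly 2^32/φ):

    #include <stdint.h>

    /* Sketch of an LZ-style match-finder hash: multiply 4 input bytes by
       ~2^32/phi and keep the top hash_log bits as the match-table index. */
    static uint32_t lz_hash4(uint32_t sequence, unsigned hash_log) {
        return (sequence * 2654435761u) >> (32 - hash_log);
    }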


They hardly forgot it. It's just that today we prefer hash functions that are seeded, so we use a random number instead of the "magical" Fibonacci constant.

See for example multiply shift hashing in https://arxiv.org/abs/1504.06804


The article is not about using Fibonacci hash to replace the hash function, but to replace integer modulo when remapping the hash value to smaller space.


Yet no one in their right mind uses modulo for hashing; it's always a power-of-2 table, so the operation is just hash & (len - 1), which is absolutely trivial. It has been this way for decades.


And yet just here in this thread someone asked how it compares to a recent hash implementation that offers non-power-of-two tables and uses modulo if you let it do that, and I have seen many other recent hash table implementations do so.

But that is also not the only issue: If you use powers of two and resort to bitwise AND to pick the bucket, you're even more dependent on a good hash function, which may be impossible to guarantee if implementing APIs that let the user choose the hash function, as in the article.

Unless you follow the suggestion in this article and use this method to pick buckets.


Personally I do a murmur smear that involves a RoL, an add, and two muls on all input hashes.
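For reference, one widely used murmur-style smear is MurmurHash3's 64-bit finalizer; the parent's exact mix (with a rotate and an add) differs, but the idea is the same:

    #include <stdint.h>

    /* MurmurHash3's fmix64 finalizer: a common way to "smear" an input
       hash before mapping it to a bucket. */
    static uint64_t fmix64(uint64_t h) {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }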


To the contrary, almost everybody who knows a bit about DDoS security uses primes, not powers of two. It has been this way for decades; see the relevant hash tables for glibc, gcc, binutils, ...


The common "Multiply Shift" hash function is exactly what the article calls "Fibonacci Hashing" except for which constant is being multiplied by. That "Shift" part of "Multiply Shift" is what I assume you're referring to with "remapping".


Only partially, in that the number of bits to shift needs to change if the hash table is resized. And so if your hash table also controls the hashing of the keys, those steps can be combined. But it's often the case that it's not in control of that, and so they're often separate steps.


Knuth has a great few pages on Fibonacci hashing in TAOCP Vol. 3, pp. 510-512.


How does this compare to tsl::robin_map?


Skimming the code, it seems to use bitwise AND with a mask if the table size is a power of two, which has the same caveats as described for Dinkumware (it's fast, but if your hash is poor it can have awful results), and uses modulo if not, which has all the issues the article describes. This is encapsulated in a single file [1], so it looks like it'd be easy to improve.

[1] https://github.com/Tessil/robin-map/blob/master/include/tsl/...


TL;DR: hash the hash with a fast function, like the Fibonacci hash, which is just a single multiplication by a constant, then take the lower bits as your bucket index.


No, you take the highest bits. The product is a fixed-point real fraction; the integer part has overflowed out of the machine word, so 0..(2**64 - 1) represents 0.0 <= x < 1.0. Picking the n highest bits is like multiplying by 2**n and taking the integer part.
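Concretely, the mapping the article describes looks like this sketch (11400714819323198485 is the constant from the article, roughly 2^64/φ):

    #include <stdint.h>

    /* Fibonacci hashing: multiply by ~2^64/phi and keep the top b bits
       as the bucket index for a table of 2^b slots. */
    static uint64_t fib_hash_to_bucket(uint64_t hash, unsigned b) {
        const uint64_t fib = 11400714819323198485ULL;
        return (hash * fib) >> (64 - b);
    }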



