I Wrote My Fastest Hashtable

ot · on Feb 27, 2017

The author seems to ignore that GCC's implementation of std::hash is the identity on integral types: https://github.com/gcc-mirror/gcc/blob/cf8c140a8b8131b3ab4f5...

This is an incredibly poor design choice (at Facebook we'd say "clowntown") as it makes it unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.

The solution is to use better hash functions, not prime hash table sizes. Prime modulus will only make the problems harder to notice, but if you have a poor hash function, you're giving up every complexity guarantee. Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.

Personally I just use folly::Hash (https://github.com/facebook/folly/blob/da08e63ce3bac9196b18d...) as it uses sane defaults for hash functions, and contrary to std::hash it's a polymorphic functor so you can just drop it as template argument without having to specify the key type: dense_hash_map<int, int, folly::Hash>.

The other reason dense_hash_map uses power-of-2 sizes (other than to avoid the modulo) is that it is sufficient to guarantee that the quadratic probing schedule is transitive, that is, it can reach all the slots of the table. Quadratic probing is used because it's very effective for unsuccessful lookups. You can find a table of expected probe counts at different load factors here: https://github.com/sparsehash/sparsehash/blob/4cb924025b8c62...

Hash tables are a fine craft. While it's interesting to gain these experiences first-hand, this is an amateurish post and it should be taken with a grain of salt.

dzdt · on Feb 27, 2017

I think "amateurish" is overly harsh here. He sets out a very professionial plan of measuring and testing. It is a nice post.

Yes, probably true that requiring a non-broken hash function is strictly better than his preferred prime modulus workaround. Okay, so? If you notice, the whole post keeps comparing a power-of-two version of his hashtable as well. Other than the one graph showing pathological behaviour of dense_hash with in-order inputs, everything in the post applies assuming a reasonable hash function.

ot · on Feb 27, 2017

I didn't mean "amateurish" in a judgmental way, it's a matter of fact that the author is a novice when it comes to hash tables. He himself admits that he doesn't understand some of the details ("Why do prime numbers help? I can’t quite explain the math behind that"), so he gets some of the conclusions wrong.

> Other than the one graph showing pathological behaviour of dense_hash with in-order inputs

That's a pathological behavior of std::hash, not of dense_hash.

> everything in the post applies assuming a reasonable hash function.

No, see this paragraph:

    All of these problems are solvable if you’re careful about choosing a hash
    function that’s appropriate for your inputs. But that’s not a good user
    experience: You now always have to be vigilant when using
    hashtables. Sometimes that’s OK, but sometimes you just want to not have to
    think too much about this. You just want something that works and doesn’t
    randomly get slow. That’s why I decided to make my hashtable use prime
    number sizes by default and to only give an option for using powers of two.

The belief is that prime number sizes will defend you against poor hash function. That's a very dangerous advice.

cjhanks · on Feb 28, 2017

> This is an incredibly poor design choice (at Facebook we'd say "clowntown") as it makes it unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.

Nonsense. GCC developers have absolutely no way of knowing or guessing how the distribution of all integers map into the domain of positive integers. And any guess (without knowledge of the specific application domain) would have as many degenerate cases as the identity.

Maybe heuristically Facebook found some better general purpose hashes for their use case. But "clowntown" is a denigration.

tveita · on Feb 27, 2017

I read the section on sequential numbers as just illustrating a worst-case scenario of the case that could happen with either a naive hasher or malicious input.

As far as I can see all the graphs but one use random numbers, so hashing won't add anything given good randomness.

I've heard good arguments why linear probing should perform as well as quadratic probing when using Robin Hood bucket stealing, e.g. http://codecapsule.com/2013/11/11/robin-hood-hashing/#commen...

> Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.

Which can be great! E.g. Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.

ot · on Feb 27, 2017

> Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.

That's a good idea when you control both the hash function and the hashtable implementation, as in the Python interpreter, so you can make assumptions on both sides. My point was that when using composable templates, as in the STL, the hash function should not make any assumptions about the hash table, and vice versa.

zzleeper · on Feb 27, 2017

> Quadratic probing is used because it's very effective for unsuccessful lookups

IS it really that good?

If you end up having to probe a lot with linear probing, it is likely you have a bad hash function, which is the same reason you give against prime modulus (that they are a stopgap more than a solution).

Second, don't you end up having to jump all over the place, which means cache misses, that are way more expensive that a few probes on consecutive slots?

thomasahle · on Feb 27, 2017

I found the lecture notes that made this claim on internet archive https://web.archive.org/web/20031127062932/http://www.august... It doesn't give any sources unfortunately.

dekhn · on Feb 27, 2017

The documentation for std::hash in GCC clearly states the integral type identity property, gives a reason for it, and suggests it's not the best type for some circumstances.

That means it's not amateurish: it was an intentional decision, likely by somebody who implements hash functions for a living, who gave thought to what the appropriate default should be. It is a trivial perfect hash, of course.

dkersten · on Feb 28, 2017

ot didn't say that GCC's std::hash is amateurish, (s)he said that the article is: "...this is an amateurish post and it should be taken with a grain of salt..."

(And clarified in another reply that it was meant only in respect to hash-tables, not the authors programming skill in general)

(s)he did call it an incredibly poor design choice, however. Which it may or may not be, regardless of whether GCC developers are amateurish or not.

dekhn · on Feb 28, 2017

Yes, I misread the post and the replies. You can sed my reply s/amateurish/poor design choice/ and the point stands.

79d697i6fdif · on Feb 27, 2017

Ouch, this is much more thorough than my attempt to debunk the post.

Trundle · on March 1, 2017

Hey, this really belongs on one of your older comments about you starting a blog but the comment was too aged to reply.

Can you add me to the list of people to contact when you get it off the ground? Cheers.

prostorm.blue@gmail.com

_benedict · on Feb 27, 2017

This is great, and I don't want to take away from it, but the idea that limiting the number of probes before a rehash is a new contribution is mistaken.

I've certainly done this in hash tables I've implemented at times, and I've seen this in library hash tables too.

One public well-known example that springs to mind is Cliff Click's NonBlockingHashMap in Java.

bradleyjg · on Feb 27, 2017

One approach I've read about is to keep a metric for the number of probes as it relates to the overall capacity of the table. If a single bucket starts to fill up while the table itself is at relatively low capacity it is assumed that the data structure is under attack and it is rehashed with a DoS resistant hash function. This allows a fast, but non-resistant, hash function to be used most of the time without incurring a huge penalty in the pathological case.

3pt14159 · on Feb 27, 2017

Why is it mistaken?

tossaway1 · on Feb 27, 2017

I feel like you misread his comment if you're asking that. His next sentence explains why it's mistaken...

3pt14159 · on Feb 27, 2017

I missed the word new, thanks. Deleted the comment.

Edit: Apparantly I can no longer delete the comment. Stupid HN.

grzm · on Feb 27, 2017

As 'dpark noted, you can't delete a comment once it's been replied to: I believe this is to ensure that there aren't dangling comments. If the comment is still new enough (less than 2 hours old, I think), you may still edit the comment. For example, in this case you could chose to include your note about missing the word new in that comment itself.

dpark · on Feb 27, 2017

You cannot delete a comment once someone replies to it.

dkersten · on Feb 28, 2017

Nor should you. It breaks the conversation for anyone reading it afterwards and makes the replies essentially off topic.

setr · on Feb 27, 2017

...because its not new

cat199 · on Feb 27, 2017

And with one fell swoop, the thesis went pop

jlrubin · on Feb 27, 2017

Author may be interested in http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-m... as a way of avoiding modulo altogether.

79d697i6fdif · on Feb 27, 2017

The author seems mistaken about a couple things relating to hash tables. He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.

He is also going to run into modulo bias since using the remainder to calculate a slot position is biased toward certain positions for every table size that isn't a power of two. (see https://ericlippert.com/2013/12/16/how-much-bias-is-introduc... for cool graphs) Prime number table sizes do nothing to fix these issue. The power-of-two size with a bitshift is not just faster, it gets rid of the modulo bias.

Fastrange is much faster than modulo but if his goal is to build the fastest hashtable it's stupid to use a table size of anything except a power of two.

Source: I also wrote a hashtable :)

pacoverdi · on Feb 27, 2017

> He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.

   [...] using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits.

He seems to know this bit :)

mark-r · on Feb 27, 2017

He appears to know about bit masking but not bit shifting, which seems odd.

krajzeg · on Feb 27, 2017

There is a lot of material in the article on power-of-two vs non-power-of-two sizes and it definitely includes a discussion of how POT is convenient due to modulo becoming a bitmask (not a bitshift, which would be division).

tveita · on Feb 27, 2017

That bias graph is made assuming you are picking numbers between 0 and 32768. With a 64-bit hash the bias is negligible. With a 32-bit hash I suppose you could argue significance around 100 million items.

I think the text is clear that the reason the hash table supports power of two size tables is to avoid the modulo operation.

anon1385 · on Feb 27, 2017

>The author seems mistaken about a couple things relating to hash tables. He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.

From the article:

>This is the main reason why really fast hash tables usually use powers of two for the size of the array. Then all you have to do is mask off the upper bits, which you can do in one cycle.

hinkley · on Feb 27, 2017

Yeah I'm gonna need an explanation too. Even with hashes hardened against DOS attacks, the lower bits tend to vary with the input far faster than the upper bits.

For instance, several popular hash algorithms (of the form 31 * hash + value) will have the same upper bits for any two strings of the same length, up to about 12 characters in length when the seed value finally gets pushed out. Unless you're still using 32 bit hashes for some reason and then it's most strings under 6 characters long.

Multiply by prime and then add or xor the next value guarantees that the bottom bits are less biased and so fits with any hashtable function that uses modular math, even if the modulus is taken by bit masking. Getting the upper bits to mix would take a different operator or larger operands.

I know I've encountered at least one hash function that uses a much larger prime (16 or more bits) as its multiplier, so it's not unheard of, but it's certainly not universal.

rurban · on Feb 27, 2017

Exactly. It's a bitmask, not a shift.

But he doesn't know about ctz (counting trailing zeros) for fast load/fillrate detection. With 50% it's trivial (just a shift), but for 70% or 90% you'd need to use _builtin_ctz.

hinkley · on Feb 27, 2017

Not that I disagree with you, but a number of these so-called 'prime' hashtable implementations don't actually use prime numbers. They use 'relatively prime' numbers.

If you happened to use 2^n ± 1 you'd have very little bias according to that map, but you wouldn't strictly be using a power of 2.

Unfortunately Java's original hashtable started at 11 and increased using f(m) = 2 * f(m - 1) + 1, giving you 23, 47, 95, 181, etc. Or pretty close to right on the bad parts of the bias curve shown in that link.

Makes me wonder, if Java hashtables had been given a default size of 15 or 7, if this ever would have been fixed...

kwillets · on Feb 27, 2017

The article is specifically about when N is not a PO2.

dgacmu · on Feb 27, 2017

They may, but it's worth noting that if you're going to use Lemire's trick, you must have a good hash function -- which this post avoids. The trick you referenced (i use "trick" in a good sense) relies on the high-order bits of the multiplication changing substantially, which means that items that hash to 1, 2, 3, ... end up in the same bin. This is perfectly fine if your hash function maps uniformly to [0, max_int32_t] (or max_int64_t if you use the umul_hi variant), but it's not fine if you use the horrible default identity function std::hash.

(Speaking from experience: I used his trick to accelerate one of the TensorFlow hash tables a while ago -- https://github.com/tensorflow/tensorflow/commit/a47a30018502... )

TorKlingberg · on Feb 27, 2017

This is fine if you are working with 32-bit numbers on a CPU with fast 64-but multiplication. What if you have 64-bit numbers?

jbapple · on Feb 27, 2017

x86-64 has single instruction (MUL) that produces a 128-bit unsigned result by multiplying two unsigned 64-bit operands.

gpderetta · on Feb 27, 2017

In addition to what jbapple said, if your hash map has more than 2^32 buckets, the latency of the division is the last of your problem.

TorKlingberg · on Feb 27, 2017

Sure N (number of buckets) shouldn't be too big, but x (the input value) could be.

79d697i6fdif · on Feb 27, 2017

realistically any hash table bigger than the 32 bit boundary will be well outside any CPU caches and possibly even ram. At these sizes the memory latency will dwarf CPU speed

EGreg · on Feb 27, 2017

Speaking of avoiding non-binary logic, what do you all think of using the same simplification for doing SYNC:

http://magarshak.com/blog

jlrubin · on Feb 27, 2017

I'd be interested to see a comparison of cuckoo hashing v.s. robin hood on these benchmarks.

I chose to use Cuckoo in my implementation of a high performance fixed-size cache for bitcoin signature validation (see https://github.com/bitcoin/bitcoin/blob/master/src/cuckoocac...). Cuckoo is great in my experience if you want to use memory fully and can afford > 3 high quality hashes per entry. I also like cuckoo because you're able to use it pretty well for lockfree/high concurrency use cases (can make concurrent erases and reads lock free for example), which made it a good fit for the bitcoin workload. It seems that this hashtable's approach isn't quite amenable to concurrent operations, but perhaps there are some tricks that could work for that.

exDM69 · on Feb 27, 2017

Just to clear up a common point of confusion: the "Robin Hood" scheme that is commonly used these days is not a multiple hashing solution like the original research paper from the 1980s (and Cuckoo hash table).

What's used these days should be called "Robin Hood linear probing" or something else. It's just a variant of linear probing where an entry occupying a "better" slot than the one being inserted gets pushed back.

It might still be an interesting comparison but I'm under the impression that Robin Hood is a very good solution for a general purpose hash table.

rurban · on Feb 27, 2017

Cuckoo uses 2x memory, with constant cache trashing. It can be made fast with concurrent maps, but there Cliff Click's or Preshing's Leapfrog approaches are better.

jlrubin · on Feb 28, 2017

What makes you say that cuckoo uses 2x memory?

rurban · on March 1, 2017

The most commonly used variant uses 2 arrays, indexed by the two hash functions. Or one array twice the size.

While the current trend is to reduce the size of this array by using size-compressed indirect indices, e.g. the new ruby/python hashes.

PeterisP · on Feb 27, 2017

I wouldn't ever have thought that replacing the generic integer modulo operation with a switch statement and a modulo operation with a particular hardcoded denominator could possibly cause a performance improvement.

Especially for primes. I can imagine that various power-of-two related values could have some bit-twiddling ways to make them much faster, but assuming that the compiler will be able to improve x mod 711 to something better than using the CPU modulo operation seems weird... but it somehow works.

Sharlin · on Feb 27, 2017

Yep, there's pretty much a trick for every reasonably small divisor. For instance GCC 4.9 translates division by 711 as follows (dividend in edi):

        mov     eax, edi
        mov     edx, -1202107583
        imul    edx 
        mov     eax, edi
        sar     eax, 31
        add     edx, edi
        sar     edx, 9
        sub     edx, eax
        mov     eax, edi
        imul    edx, edx, 711
        sub     eax, edx

This is, uh, pretty deep magic. Apparently this sequence of ops, including those two imuls, is still faster than a single idiv.

PeterisP · on Feb 27, 2017

And don't forget that you still have a switch statement with multiple branching operations that pick out this code among the various options of hash table size!

This is something where function pointers / overloading would likely have even better performance than direct switching - when the hash table size changes, switch out the function from modulo_by_511 to modulo_by_711.

minimax · on Feb 27, 2017

Any reasonably good optimizing compiler will implement that switch statement as a jump table so it's really just one conditional branch even though it doesn't look like it.

rurban · on Feb 27, 2017

But this compiler will add unneeded bounds checking (3-10% perf. hit). A computed goto table would fix that. (on modern compilers).

Anyway, Lemire's fastrange modulo solved that problem better. https://github.com/lemire/fastrange There's no need for a slow modulo.

tehwalrus · on Feb 27, 2017

Is this the trick where you multiply by the two's-compliment inverse instead of dividing? That was the coolest thing I learned when I spent a few days coding assembly.

I can't see the shift that you normally need after pulling the result out of the overflow register though. My asm is a little rusty though.

tomsmeding · on Feb 27, 2017

It doesn't need the shift because it retrieves the result of the first imul from edx, which contains the upper word of the result. The lower word is placed in eax, which this code completely ignores.

tehwalrus · on Feb 27, 2017

Yes but even in the upper register you still need to lshift by one or two bits depending on the constant IIRC. Five had me shifting by (1 or 2), and seven by (2 or 1) I seem to recall. I can't remember what property of the number determines the size of the shift though.

gpderetta · on Feb 27, 2017

IIRC that only work if a%b==0 otherwise the general division trick is more complex.

caf · on Feb 28, 2017

Yes, you are correct - multiplying by the integer inverse only works for dividends that are an integer multiple of the divisor.

flamedoge · on Feb 27, 2017

Is this from r=n-x*711; x=n/711 done via long binary division?

pwdisswordfish · on Feb 27, 2017

Not quite. See https://zneak.github.io/fcd/2017/02/19/divisions.html — it was posted here some time ago.

bodyfour · on Feb 27, 2017

If you're interested in learning more: * Check out libdivide: http://libdivide.com/ * Also, the awesome book "Hacker's Delight" has extensive information about how and why these tricks work. It's the bit-twiddling bible -- probably my favorite programming book I've ever read

csl · on Feb 27, 2017

There seems to be a preprint on that chapter from here: http://www.hackersdelight.org/divcMore.pdf

I recommend the book as well, it's got a lot of really awesome tricks!

torrent-of-ions · on Feb 27, 2017

That's very cool... Some years ago I wrote my own multiple-precision arithmetic library. I implemented a lot of it in assembly so that I could access the carry bit and do full-word multiplication etc. I studied GMP and realised that my addition and subtraction were near enough perfect, but multiplication and division were far off. I now see that just because I was way off with division because I simply implemented the long division algorithm and used IDIV for words (and even that was more difficult than the other algorithms).

MaulingMonkey · on Feb 27, 2017

If you want to see what the magic looks like for yourself:

https://godbolt.org/g/5LZy2t

antirez · on Feb 27, 2017

Related to that, it's incredible how modern CPUs can perform anything in a cycle of clock or less (multiple ops at the same time), but with division. Division is still at ~100 clocks cycles! So it is definitely the first thing to look when optimizing inner loops.

gpderetta · on Feb 27, 2017

careful, a multiply usually has a latency of 3 or 4 clock cycles; the trick is that the multiply unit is fully pipelined and there are multiple units, so you can start a couple of multiplies every clock cycle. The latency is still relevant if you have low ILP.

Division is hard to fully pipeline; IIRC Skylake can start a new division every 6 clock cycles, but the latency for 32 bit integer division is 19 cycles (which is pretty amazing BTW).

antirez · on Feb 28, 2017

True, multiplication latency is a big high, 4 clocks basically in most recent Intel processors:

https://software.intel.com/sites/default/files/managed/9e/bc...

However 64 bit divisions is terrible at ~100 cycles per 64 bit division. It is true that the 32 bit division is much faster, but now that the target is 64 bit, the 64 bit divisions are very popular. It's something to take in mind when doing fast math... when 32 bits are enough and there are unavoidable divisions, use 32 bits integers.

gpderetta · on Feb 27, 2017

The Hacker's Delight has a pretty good treatment of the topic on chapter 9.

But the trick in the link shared by jlrubin is more relevant for hash tables.

kwillets · on Feb 27, 2017

It's dozens of cycles to do a DIV. On Skylake the various DIV instructions range from 23 to 95 cycles. That's longer than a pipeline stall.

In the case of a switch like this I would try something horrible like a template for each prime, so that the code is inlined.

gpderetta · on Feb 27, 2017

Hum, the nice thing with Robin Hood hashing is that you get close to 90% load factor without pathological performance degradation. If you have a fixed probe count, wouldn't you be better just grouping your buckets in fixed size sub buckets and doing (trivially parallelizable) linear searches for lookups and to find empty insert positions?

On a related note, I have in general concerns about latency spikes due to rehashing. Is there a good, general, incremental rehashing scheme?

_benedict · on Feb 27, 2017

Well, if you're willing to incur a slight latency penalty on each access, incremental rehashing is terrifically easy. Just keep both the old and new backing arrays around, and consult a pointer to determine how far in the first backing array has been invalidated.

If you follow a Shalev/Shavit hash-ordered-lists scheme the rehash doesn't even need to move anything, and can be done with a straight memcpy, or in a separate thread. The link below is an implementation that was intended for Cassandra (so is simplified and append-only, but also a concurrent map - which could be relaxed). It has the nice property that it will "fix" the backing array on-demand, so straight-up copying the existing contents of the backing array into the new region of the backing array, once doubled, will set things up for a gradual transition.

Similarly, it's trivial for the backing array to instead grow incrementally, and for an extra level of indirection to prevent ever having to reallocate older regions (since their contents never changes on rehash).

https://github.com/belliottsmith/cassandra/blob/47d970ead690...

eternalban · on Feb 27, 2017

Thanks for that SS list reference!

Paper link: http://people.csail.mit.edu/shanir/publications/Split-Ordere...

gpderetta · on Feb 27, 2017

keeping the old array around is the obvious solution, but the problem is that you might need to rehash the second array as well. Do you need to keep an arbitrary number of arrays around or there are hashing schemes that guarantee that you'll be done with rehashing by the time you need to rehash the second buffer? I'll read up on hash-ordered-lists.

_benedict · on Feb 27, 2017

Well, depending on your rehash trigger (i.e. if it is dictated by the load factor, not the number of reprobes), it can be guaranteed that the new array does not need to be rehashed before the in-progress migration, if you migrate at least one bucket per write operation. If you were to (sensibly) migrate many (say, >= 16) with each write, the likelihood of this happening is very low with any reasonable hash function, since the work of rehashing will typically be done many times faster than the skew can retrigger your rehash. This at least makes the incidence of a latency-inducing event of waiting for the prior rehash to complete very low.

If you rehash based on something like "number of reprobes" then it's probably impossible to absolutely guarantee that a rehash cannot be triggered before a prior one begins, but the linked approach would permit growing the backing array discontiguously (assuming an extra level of indirection to the backing array). But this would obviously incur some extra overheads, and is not entirely dissimilar to hash-tries, which also avoid any costly global rehash.

I think the class-level comment in the code I linked is perhaps the best introduction to split-ordered lists I know of (not hash-ordered; they're sorted by the reverse bitstring of the hash), since they're not widely discussed and not immediately intuitive.

leventov · on Feb 27, 2017

There is a good general segmented hashtable algorithm with ~100 times smaller rehash spikes than in classic hash tables: https://github.com/OpenHFT/SmoothieMap/blob/master/src/main/...

The idea is not super clever so I'm sure many people found & implemented it in papers and other languages, but I didn't find them. Will be grateful if someone points me to prior work.

throwawayish · on Feb 27, 2017

> Is there a good, general, incremental rehashing scheme?

Trees.

gpderetta · on Feb 27, 2017

As a C++ programmer, my default dictionary is usually an std::map (which is just an RB-tree), so, yes agree 100%.

But the amortized O(1) lookup of hash tables is hard to beat.

edit: s/bass/beat/

throwawayish · on Feb 27, 2017

> But the amortized O(1) lookup of hash tables is hard to beat.

On the other hand: the performance characteristics of trees are much easier to predict.

And in many cases a HT lookup isn't distinguishable from a tree lookup, at least if it's a dedicated application data structure, like some index or so.

I feel like HTs are very good for smaller structures (excellent example: dictionary/hashtable-based interpreters), but for larger data structures (e.g.: indices) ... not so much.

malisper · on Feb 28, 2017

> On a related note, I have in general concerns about latency spikes due to rehashing. Is there a good, general, incremental rehashing scheme?

One option is to move a few elements over from the old hash table on each operation[0].

A second option is to use linear hashing[1] (different from linear probing). Linear hashing let's you increase the size of your hash table by one bucket at a time. The general idea is once your hash table is full, you allocate a new bucket (generally from a preallocated array) and move half of the elements from an old bucket to the new bucket. I know Postgres used to use linear hashing, but I'm not sure about the current state. I heard they started using Robin Hood hashing, but I haven't looked at whether that is unrelated to the code that uses linear hashing.

mirekrusin · on Feb 27, 2017

Fractal Trees have interesting gradual "rehashing".

nialv7 · on Feb 27, 2017

Growing hashtable based on number of unsuccessful probes is essentially the same as grow based on load ratio (α).

Because the expected number of probes can be calculated from it, which is 1 + 1/(1 − α).

_benedict · on Feb 27, 2017

Well, you're assuming a perfect uniform distribution of your values. Skew in data is common, since cryptographic hash functions are uncommon (since they're expensive), and any way we generally store collections vanishingly smaller than infinity, in which case even a perfect distribution will produce clusters with varying frequency; and each will degrade differently under these situations.

At the extreme, if every item produces the same hash code, growing on reprobes will result in exponential insert complexity (since every insert will trigger a grow). Growing on load factor will just maintain O(n) insert/lookup complexity (or the complexity of the hash bucket, if not open-addressing).

With less extreme skew, growing on load factor will simply use less memory than growing on reprobes, while reprobes will have a slightly lower median (and possibly mean) insert/lookup costs.

amelius · on Feb 27, 2017

What objects were actually stored in the hash table for these tests?

Are they fixed size, non-heap-allocated?

And would the performance improvement still be relevant if this is not the case, e.g. for variable size strings?

ZoFreX · on Feb 27, 2017

Some of the hash tables were heap allocating others were in place. They tested a variety of sizes.

titzer · on Feb 27, 2017

Integer modulus is one of the slowest CPU operations, especially on modern CPUs (since most other operations have been heavily optimized, and there are many execution units available for them). IDIV on Nehalem can literally take up to 100 cycles (http://www.agner.org/optimize/instruction_tables.pdf), and it varies _a lot_.

Power of 2 hashtables are so much faster because indexing uses a mask. The problem is that, as others have mentioned, powers of 2 suffer a lot with bad hash functions, and people tend to write pretty bad ones, so it's generally a good idea to salt or rehash hashcodes before dropping the upper bits with a mask.

In fact it looks like the results from the author's bear that out, as the flat_hashmap_powers_of_two is almost always the fastest.

Dowwie · on Feb 28, 2017

Raymond Hettinger made an interesting talk about Python's dict (hash table) and how Python's core development team evolved it over time. He's giving the polished version of talk at PyCon this May but if you can't wait for it, here's a rough version: https://www.youtube.com/watch?v=p33CVV29OG8

silok · on Feb 28, 2017

Have you compared it to this implementation? http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-...

charisma_stupid · on Feb 27, 2017

This is good work. I wish it was extended to work under multiple threads so it could be used in databases.

markcerqueira · on Feb 27, 2017

A lesson in "Tone is everything" perhaps? You make excellent points and taught me a bunch (thanks!), but just because the author admits he doesn't understand hash tables inside and out does not make calling his work "amateurish" or "clowntown" okay.

dang · on Feb 28, 2017

We detached this subthread from https://news.ycombinator.com/item?id=13746260 and marked it off-topic.

ot · on Feb 27, 2017

If you read carefully you'll see that "clowntown" referred to GCC's implementation of std::hash, not the post. I think it's natural to have high standards for GCC.

I don't think my tone was too aggressive, maybe made exception for the word "amateurish"? I explained above what I meant, if you replace that word with a better one does it change the tone of the post?

astine · on Feb 27, 2017

"if you replace that word with a better one does it change the tone of the post?"

Yes, it does.

'Amateurish' is often used as an insult to imply that someone needn't have tried at all. If you replaced it with 'novice,' I think people would have understood your meaning without taking offense.

ot · on Feb 27, 2017

I see. I didn't know "amateurish" had such negative connotations. Unfortunately HN doesn't let me edit the post anymore.

notJim · on Feb 27, 2017

Since we're on the subject, reading the "clowntown" thing makes Facebook's tech culture sound kinda toxic. It's one thing to criticize a bad implementation, and a different thing to call the implementer a clown.

ot · on Feb 27, 2017

I might have given the wrong impression, see my other comment below. "Clowntown" is almost always directed at ourselves, in a self-mocking, playful way.

Also see answers to this Quora question: https://www.quora.com/When-people-who-work-at-Facebook-say-c...

> It's a way to bring much-needed levity to the process of solving hard problems.

> It's a playful, tongue-in-cheek, self-deprecating term to describe a situation in a company whose motto was "move fast and break things."

notJim · on Feb 28, 2017

Ahh, that does sound better.

q3r3qr3q · on Feb 27, 2017

Your tone was fine. Don't pander to these people, please.

mamp · on Feb 27, 2017

For a post on HN the tone was ok but not great, if one of my team spoke like that in a meeting I would commend them on their technical knowledge but remind them that their effectiveness and influence depends more on tone than technical skill, because people.

dang · on Feb 28, 2017

ot was being a gentleman, not pandering. It's unfortunate that the subthread went off topic but the reasons are understandable and the disagreement resolved nicely. Please don't be rude in comments here.

hubert123 · on Feb 28, 2017

Stop policing people's words, what about your tone in all this? "here is a lesson in tone for you" is not exactly nice tone either but since you are the police it doesnt matter right?

dang · on Feb 28, 2017

If you continue to post uncivil comments to HN, we will ban your account. We've asked you many times before to stop.

on Feb 27, 2017

[dead]

dang · on Feb 28, 2017

We've banned this account for trolling. Please don't create accounts to break HN's guidelines with.

We detached this subthread from https://news.ycombinator.com/item?id=13745383 and marked it off-topic.

ksk · on Feb 27, 2017

FB can have brilliant world-class engineers working on task A producing quality output, while others working on task B can over-engineer and over-abstract, producing crapola. Replace FB with Google, Apple, Microsoft, Oracle, Sun, IBM, Adobe, etc, etc.

ot · on Feb 27, 2017

That word is more often than not directed at our own work. Maintaining a self-mocking attitude can help in being more receptive to valid criticism.

cgag · on Feb 27, 2017

I hear they've got a clowntown mobile app (I don't use it) but react, flow, graphql, relay, and their Haskell spam filtering / ghc work all seems pretty great.

rudolf0 · on Feb 27, 2017

Their non-mobile quality seems to be pretty high as far as I can tell, but, yeah, I feel like he'd have to agree their iOS app is the definition of clowntown.

Ono-Sendai · on Feb 27, 2017

Well I think my hashtable is probably faster.

It's also faster than dense_hash_map.

The author claiming their hash table is the fastest ever is kind of ridiculous.

elmigranto · on Feb 27, 2017

It's fastest of all author's implementations, which is stated in first paragraph.

dsp1234 · on Feb 27, 2017

The first paragraph states very clearly:

"And by that I mean that I have the fastest lookups of any hashtable I could find, while my inserts and erases are also really fast. (but not the fastest)"

Yes, it is the fastest that author wrote. However, that's not the sole claim. But also that it has the fastest lookups of any hashtable the author could find.

The complaint being that he did not do a very thorough job of finding alternatives, and thus the title of "I wrote the fastest hashtable" is not reasonable.

aka "I'm the smartest/fastest/strongest person of the people I could find"

Ono-Sendai · on Feb 27, 2017

Yes but the title make a claim that it's the fastest.

elmigranto · on Feb 27, 2017

And it is, though, not among a set of hashtable implementations you assumed.

It is quite similar with how you wrote "fastest" instead "fastest among all existing publicly available implementations". Nothing wrong with that, since it is obvious from context, very much the same way it is obvious from article what author means.

attractivechaos · on Feb 27, 2017

In the text, it is fine to say "fastest" because there is a context, but in the title, "fastest" means above everything else in the world because the context has not been established yet to readers. If you read news with "fastest" in the title (e.g. fastest CPU or fastest compiler), it really means the fastest in the world at the time of publication.

The author could just say "very fast" or "faster than common hash table libraries" in the title. The current title, IMHO, undermines the quality of the post (which is actually good).

cat199 · on Feb 27, 2017

But I'm the context! Geez.

attractivechaos · on Feb 27, 2017

The title of this post has been changed to "my fastest" after I wrote my previous comment. The original post is still "the fastest". There was no context.

Ono-Sendai · on Feb 27, 2017

it's a bullshit clickbait title.

saalweachter · on Feb 27, 2017

As the author explains right off the bat, it's a logical clickbait title. He wrote three blog posts: "I wrote a fast hashtable", "I wrote a faster hashtable", "I wrote the fastest hashtable".

I mean, I'd still call it clickbait, but it's hardly bullshit clickbait.

cat199 · on Feb 27, 2017

there is a difference?

Beltiras · on Feb 27, 2017

I'm looking forward to your writeup.