Writing a Very Fast Hash Table with Tiny Memory Footprints

pubby · on May 9, 2017

Not to disgrace the author (it's a good article), but I've spent a lot of time studying hashtable performance recently and I groan whenever I see benchmarks against libraries like std::unordered_map or libcuckoo. No shit your hash table performs better; it's like claiming you're good at running because you can beat a crippled child in a race.

For a test to be meaningful it needs to include comparisons to linear probing (both regular, robin-hood variants, and SSE variants), and other low-cost hash tables like coalesced hashing or separate chaining with a memory pool.

Speaking of robin hood hashing, pretty much all of the existing articles about it are missing some crucial facts about it. Don't take what blog authors say for granted.

johnfn · on May 9, 2017

> pretty much all of the existing articles about it are missing some crucial facts about it.

Would you mind expanding on those facts?

pubby · on May 9, 2017

1) The linearly-probed robin hood variant is just a sorted array. That's it. It's just a sorted array. I haven't seen an article explain it so succinctly though.

2) The linearly-probed robin hood variant is only faster than regular linear probing for searches that fail. Successful searches have the same average probe count. Too many articles bullshit about how robin hood reduces variance (which is irrelevant 99% of the time) and then go on to ignore the fact that they cannot beat the performance of linear probing when it comes to successful searches. One guy I saw spent hours trying to optimize his hash table for the K-nucleotide benchmark without realizing that a trivial 5-minute implementation of linear probing would beat it because searches in that benchmark never fail.

3) There's a simple 40 year old bitwise trick by Knuth which allows one to combine the probe distance and the hash into a single value. This saves up to 4 bytes per table entry, but I haven't seen anyone use it.

There's also a few articles that get deletion straight-up wrong, but meh. That doesn't bother me too much.

Also, for double hashing I'm pretty sure a SSE/AVX variant is optimal. I haven't tried implementing one though.

frankmcsherry · on May 9, 2017

> 1) The linearly-probed robin hood variant is just a sorted array. That's it. It's just a sorted array. I haven't seen an article explain it so succinctly though.

It's not. Importantly, it leaves spaces so that you avoid what would be ~sqrt(n) probe lengths had you tightly packed everything. Also, it is a bit out of order as you wrap around the limits of the array (less of a big deal).

The sortedness property is really useful, btw. A RHH lets you get the perf of a good hash map for random access, and the throughput of a sorted list for merges / bulk processing.

> 3) There's a simple 40 year old bitwise trick by Knuth which allows one to combine the probe distance and the hash into a single value. This saves up to 4 bytes per table entry, but I haven't seen anyone use it.

When the values are sorted (modulo empties and wrapping) you don't need to store the probe length. You just maintain the invariant that things are sorted. Maybe I'm missing some detail of your implementation, but mine sure don't bother storing it.

pubby · on May 9, 2017

>When the values are sorted (modulo empties and wrapping) you don't need to store the probe length. You just maintain the invariant that things are sorted. Maybe I'm missing some detail of your implementation, but mine sure don't bother storing it.

Storing the probe count allows the lookup function to have 2 branches per loop as opposed to 3. The probe count also contains enough information to distinguish occupied buckets from unoccupied ones. I'm convinced that combining the probe count with the hash is objectively the best way to implement such tables.

frankmcsherry · on May 9, 2017

Can I convince you to downgrade to being subjectively convinced?

Storing hashes or probes wasn't faster for the use cases I evaluated. I can see reasons why you might like it, for your uses and perhaps "generally", but not in my case at least (small keys where storage bloat was worse than `fnv` recompute, and values which had invalid states which can be used to represent empties)

Edit: I'm re-reading my first response and it reads like "you never need/want the probe length", and I totally retract that position; it wasn't intended.

dryman · on May 9, 2017

1) I don't see why a linear probing robin hood is a sorted array? Can you explain more?

Robin hood hashing doesn't limit which probing scheme you use. I end up with quadratic probing which gives me both good cache locality and good probe distributions. The probing schemes I tried was omitted in this post because it would bring too much noise. But I can give you some quick summary here:

1. linear probing: probing distribution has high medium and high variance. Performance is not that great either because of the high probing numbers.

2. quadratic probing: probing distribution is not the best, but the medium sicks to 2 probes. Since the first two probes are very close in quadratic probing, its overall performance wins.

3. When probing, hash the key with the probe added to the seed. This gives very good hash distribution, but hash on each probe is very slow. Also you cannot do deletion using this scheme.

4. rotate(key, probe). This is somewhat like rehash, but way faster. The probe distribution is also very good, but items goes too far away so we lost the cache locality.

5. Combination of different schemes listed above. Still, the quadratic probing gives me best performance.

I also tried to use gcc/clang vector extension to speed up probing, but it actually slows down for 30%! I guess I have to hand tune the SSE intrinsics and measure it carefully with IACA to get the optimal performance.

Deletion itself is quite complicated and deserves its own post.

pubby · on May 9, 2017

I was actually working on a blog post about this a few months ago, but never published it because I wasn't happy with it. Here it is, anyway: https://goo.gl/W1KZ2t

What I found in testing was that linear probing beat everything when it had a sufficiently good hash function and a sufficiently low load factor. The load factor was pretty important. Even when factoring in cache misses, page faults, and the cost of allocations, a linearly probed table with a low load factor always beat more space-efficient designs.

>I also tried to use gcc/clang vector extension to speed up probing, but it actually slows down for 30%! I guess I have to hand tune the SSE intrinsics and measure it carefully with IACA to get the optimal performance.

They're only advantageous when the hash is perfect and the data is in the cache. Otherwise the performance should be the same as regular code. To clarify what I meant by SSE with double hashing, I really meant search an entire cacheline before incrementing by the second hash value. SSE can be used for this in some cases but regular code works just as well.

>Deletion itself is quite complicated and deserves its own post.

That's why I prefer the linear variant; deletion is easy!

valarauca1 · on May 9, 2017

It's a sorted array because your sorting by 2 keys (A,B). A is the hash, normally a u32 or u64. B is the probe count, also normally the same sized b/c system ints are easy.

Strictly speaking the Robin Hood "sort" isn't purely lexiconally ordered. A larger A value maybe replaced a smaller A, with a much larger B.

But this relation nonetheless is just a weird solution to build a cmp function one. One that arguably doesn't work in a strict sense. But one that _nearly_ works.

dastbe · on May 9, 2017

So if A is the hashed-to index into your table, then B is a function of A. When you are doing your linear probe and you see a value along with its probe count, it either

1. Has a higher probe count, this means it's hash is further away than yours => less than yours

2. Has a lower probe count than yours. This means its hash is closer than yours => greater than yours

3. Is equal in probe count to yours. That means it's in the same bucket => equal to yours.

dryman · on May 9, 2017

When I first see "sorted array", my first impression is it is sorted by the "original key". If we look for the hashed value in hash table, all hash tables are "sorted array" with this definition.

Also there's a finial mod in linear probing h(k, i) = (k + i) mod N. With this mod you may have different key/probe combination that messed up the order.

pkhuong · on May 9, 2017

> 1) The linearly-probed robin hood variant is just a sorted array. That's it. It's just a sorted array.

Right. I stumbled on linear probing robin hood hashing as a simplification of Bender, Demaine and Farach-Colton's Packed memory array (http://erikdemaine.org/papers/FOCS2000b/paper.pdf) when inserts are randomised.

> 2) The linearly-probed robin hood variant is only faster than regular linear probing for searches that fail.

The sortedness means Robin Hood should not only be faster in the worst case (that's what reducing variance is all about), even for successful searches, but can also be faster on average, at high enough loads: make sure empty cells compare as greater than any legitimate value, track the maximum displacement, and use binary search (or just use exponential search).

dgacmu · on May 9, 2017

One nit, as one of the authors of libcuckoo: Of the tested tables, libcuckoo is the only one that's a concurrent hash table. I'm glad that without modification, our single-threaded performance is still doing well, but this isn't the primary use case of libcuckoo.

If you want to compare it against a thread-unsafe table, it's a bit more honest to use the locked_table class, which grants ownership to all locks in the table in advance. It's a fair bit faster.

It doesn't change the memory use - I've created a stripped-down version in the past that elides all the locks, which is better for memory efficiency - but that's not really our goal.

For more on using the locked_table interface, see: https://github.com/efficient/libcuckoo/blob/master/tests/uni...

(Also, for hashing strings, please make sure you've got libcuckoo configured to use cityhash for string hashing, as you had "opic", to make the comparison equivalent.)

In fairness to the OP, however, it's worth noting that this is a pretty new addition to libcuckoo (February), and we haven't documented or pushed it hard.

dryman · on May 9, 2017

Thanks for pointing out the thread-unsafe alternative. I'll update the benchmark with it and double I configured it with cityhash. (will update in a day or two).

Speaking of cityhash, have you tried farmhash as well? I'm not sure what I did wrong, but farmhash's performance wasn't good as cityhash in my hash table. Did you experience the same problem?

dgacmu · on May 9, 2017

Thanks! Appreciate the responsiveness. :)

We haven't - I had some other issues with Farm. I tried for a while to see if I could get FarmHash included as the standard hash for TensorFlow instead of the murmur-like homebrewed one it uses, but the cross-platform build issues became too complicated to justify it. We ended up sticking with the murmur-like one for "fast but unsafe", and providing SIP via the highwayhash package as the "strong hash".

lorenzhs · on May 9, 2017

If you're interested in truly space-efficient hash tables—as in "at every point in time, at least 97% of the space of the table is used for elements"—check out this paper by a colleague of mine: https://arxiv.org/pdf/1705.00997.pdf (the relevant plots are on pages 10 and 11). He's going to publish the code soon-ish as well.

dryman · on May 9, 2017

In my experiments using linear probing in robin hood doesn't give good results as well. I hope the author has tried robin hood with quadratic probing. But it's a good reference. I didn't dive deep in the cuckoo route. This paper gives me more confidence to try it out.

dgacmu · on May 9, 2017

If you're curious about blending some of the "don't increase by a power of two" memory efficiency gains, I implemented this in a cuckoo hash table for TensorFlow, also using Lemire's trick: https://github.com/tensorflow/tensorflow/blob/master/tensorf...

I never back-ported it to libcuckoo, because it adds a fair bit of complexity (and requires an extra bit), but if there's demand for non-2^k tables, I'd love to know more, either externally or internally. (dga@, for various values of cs.cmu.edu and google)

dryman · on May 9, 2017

Lemire's fast range works perfectly with cuckoo, as it doesn't require probing (?). The fast mod and scale trick was invented to address this issue.

It's actually pretty funny. I didn't know Lemire's trick until I start to write this article. But I my first try on fixed point arithmetic was exactly the same as his work. Then figured out the lower bits would get omitted with fast range (I named it scaling). Finally I came up with this fast mod and scale idea.

I don't know for other peoples need on non power of 2 tables. My target is to build large scale hash tables and other data structures. The memory and data compactness is quite critical for this goal.

skolos · on May 9, 2017

I see often discussions about hash maps on HN. However they mostly focus on performance (memory and speed) of one huge hash map with millions of records. In my numerical calculations I often need lots of small hash maps (say up to 100 elements). Currently I use unordered_maps from C++ standard library. Does anyone know what is recommended hash map implementation for my scenario?

scottlamb · on May 9, 2017

I'd expect the same hash map implementations to work well in both scenarios (many hash maps with few entries and one hash map with many entries). I'd try dense_hash_map, and if you're feeling experimental, this new thing.

A more significant factor may be sizeof(pair<Key, Value>). std::unordered_map has a separate allocation for each entry, which is quite horrible with small values in terms of memory usage and CPU cache effectiveness. With larger values, keep in mind that because of the load factor of these hashes, you end up with more buckets than you have entries. It may be better to have that pointer indirection (either through std::unordered_map or maybe dense_hash_map<Key, std::unique_ptr<Value>>) than have the empty buckets be sizeof(pair<Key, Value>). Or if you can find something like Rust's ordermap which handles that indirection with a 32-bit index (i.e., half the size of a pointer on 64-bit platforms) into an array to be more memory-efficient. Having them be adjacent in the array

Another factor might be the cost of hashing each value. If it's cheap (ints or the like), your ideal hash map class probably doesn't store the hash values. If it's expensive (long strings, for example) and you have the memory to burn, your ideal hash map class probably does store them, so that collisions cost less on lookup and so that resizing on insert costs less.

As an alternative, if you're just holding ints or something, you might be better off with a sorted std::vector<Entry>.

dryman · on May 9, 2017

Hi I'm the author for this post. My general guideline is figure out where your bottleneck is, and optimize on the hotspot first. In your description, I can only try my best to guess where the bottleneck could possibly be...

First of all: "I often need lots of small hash maps". Do you maintain multiple small hash maps at the same time? Then the data structure holding this many hash maps may be your bottleneck instead of the hash map itself. Another possibility is you have some small hash map that get created and dropped rapidly. In this case you might want to hold it in memory so you don't have much allocation overhead.

Other than these, the remaining optimization I can think of would be making your hash table compact (like what I did in this post). Reserving buckets with max load factor should give you a nice control of your hash table size. Reference: http://www.cplusplus.com/reference/unordered_map/unordered_m...

nom · on May 9, 2017

For very small number of elements, an array or a list can be more efficient than hash maps. My rule of thumb is to avoid hash maps if you know you'll never have more than 100 entries, and then try to find the bottleneck later on. Using hash maps for everything isn't always the best solution.

oconnor0 · on May 9, 2017

Using a sequence instead of a map is fine if you know it will never be more than some small number of entries, but so often we are wrong about how many entries will be in a table that the N-squared use of a sequence as a map kills performance unexpectedly.

adrianN · on May 9, 2017

But that's what profiling is for.

arjie · on May 9, 2017

That argument favours hashmaps by default, though. After all, they're all right at small sizes and much better at big sizes.

usefulcat · on May 9, 2017

Profiling won't tell you what the chances are that the collection will someday need to contain more than N elements.

adrianN · on May 9, 2017

You don't replace every map in your code with a distributed Key-Value store, in case it might need to store terabytes in the future. Most of the time you have a pretty good idea whether a map has to store a hundred, or a hundred thousand elements.

Veedrac · on May 10, 2017

If you're looking for elements by identity, that's going to be really slow. Hash maps are designed precisely to handle that scenario, and devolving to linear scans is only going to make things worse.

alexkindel · on May 9, 2017

Just keep doing what you're doing with creating a lot of small ones. Design simplicity and consistency is more important unless you find that these hash tables are bottlenecking you (which is extremely unlikely).

resist_futility · on May 9, 2017

If you mostly do lookups then guaranteeing a perfect hashing function for the data set will improve performance. What are you looking to improve upon?

glandium · on May 9, 2017

I only skimmed over, but about memory overhead:

> The input is 8 M key-value pairs; size of each key is 6 bytes and size of each value is 8 bytes. The lower bound memory usage is (6+8)⋅2^23= 117MB

In many case, hash table implementation won't (can't) assume fixed size of keys and values, and use pointers. On 64-bit architectures, this can mean there's an uncompressible overhead of 8 * 2 = 16 bytes (one pointer for each key and value) for each item.

In fact, a quick look at the OP's benchmark code shows he's using std::string as keys and uint64 as values. With e.g. std::unordered_map, while pointers won't be used for the values, there obviously will be pointers for the key.

It's actually worse than pointers, since the std::string memory layout is a size_t and a char* pointer. And it looks like his hash table essentially uses the equivalent of char[6] as keys. So a fairer memory overhead comparison should use char[6] as keys, instead of std::string...

jsnell · on May 9, 2017

Small string optimization should kick in here, so you're probably not looking at a size_t + pointer + a object with the 6 bytes of payload data. Instead it'd be a single object of about 32 bytes (depending on the implementation).

But yes, this means that the author's own hash table implementation is getting a roughly 4-5x density benefit from using a different string representation. That totally invalidates the memory usage testing, and probably also has a major effect on the speed tests.

glandium · on May 9, 2017

Thank you for making me look at the implementation of std::string in GCC's libstdc++... The "small string optimization" indeed makes every std::string at least 1 size_t + 1 pointer + 16 bytes large (for a total of 32 bytes on 64-bits platforms).

Strings smaller than 16 bytes are stored in the 16 bytes inline buffer, which means strings larger than that actually take 32 + their length + 1 bytes in memory on 64-bits platforms.

I guess things are not packed further to allow to just directly use the pointer instead of having to do some computation for each access.

jsnell · on May 9, 2017

> I guess things are not packed further to allow to just directly use the pointer instead of having to do some computation for each access.

There are C++ standard libraries that pack things further. I think clang has a 24 byte std::string, which can store strings of up to 22 bytes inline.

dryman · on May 9, 2017

Thanks for pointing out. I used std::string mainly because I suck at C++ :(. In C I can easily define the key length to be a variable passed to the constructor, but I don't know how to do the same with C++.

faitswulff · on May 9, 2017

This note about his other project (that the robin hood hash table depends on) is quite curious:

> This robin hood hashing is implemented using my project Object Persistence In C (OPIC). OPIC is a new general serialization framework I just released. Any in-memory object created with OPIC can be serialized without knowing how it was structured. Deserializing objects from OPIC only requires one mmap syscall. That’s say, this robin hood implementation can work not only in a living process, the data it stored can be used as a key-value store after the process exits.

> Right now, the throughput of OPIC robin hood hash map on small keys (6bytes) is 9M (1048576/0.115454). This is way better than most NoSQL key-value stores. The difference might come from write ahead logs or some other IO? I’m not sure why the performance gain is so huge. My next stop is to benchmark against other embedded key-value store like rocksdb, leveldb and so forth.

dryman · on May 9, 2017

Thanks for raising this up. I spent way way more time on OPIC (almost a year) than robin hood hash map (2 weeks to complete the POC). Glad to see people noticing this project :)

The other comment pointed out MDBM, which I didn't know about. From their performance number I think this may show that why OPIC robin hood is quite optimal. https://yahooeng.tumblr.com/post/104861108931/mdbm-high-spee...

MDMB gives users raw access to the mmaped data and pointers. And from its benchmarks it results 10x faster than rocksdb and leveldb. The design of OPIC has even less overhead (may not be a good thing) than MDBM, and it also works on a mmaped file (or anonymous swap). There's no lock, transaction, or WAL in OPIC. OPIC just brings you the raw performance a hash table can gives you.

luckydude · on May 9, 2017

Look up MDBM. I spent a lot of time with a logic analyzer watching the cache misses go across the bus. I'd be pretty surprised if someone has done better.

I can find the code and repost it.

searchfaster · on May 9, 2017

> I can find the code and repost it.

Here it is.. https://github.com/yahoo/mdbm

luckydude · on May 14, 2017

So Yahoo took the code and did all sorts of stuff to it. I've got the code pre-yahoo and that's the stuff that is fast.

It's part of BitKeeper which is open source.

dryman · on May 9, 2017

I found some performance numbers from Yahoo. https://yahooeng.tumblr.com/post/104861108931/mdbm-high-spee... It has random read time 0.45 μs. On my late 2013 iMac, i5 cpu, I got 0.089 μs random read. I yet to have a benchmark that measures sequential read, but the API supports it.

dryman · on May 9, 2017

Thanks for the reference. I'm collecting a list of embedded key-value store to benchmark against. I'll tryout with this one first!

corysama · on May 9, 2017

Fun note: Because of this line in the article

> Because MDBM gives you raw pointers into the DB’s data, you have to be very careful about making sure you don’t have array over-runs, invalid pointer access, or the like.

Someone wrote a Rust binding http://erickt.github.io/blog/2014/12/13/rust-and-mdbm/

dryman · on May 9, 2017

Wow that's a very interesting post! Maybe I should try to bring a rust binding to my project as well..

posnet · on May 9, 2017

https://github.com/pmwkaa/engine.so

dom0 · on May 9, 2017

hyc_symas · on May 15, 2017

I would expect this hash table to be much faster than LMDB for tables that fit in RAM. It doesn't have transactions or any kind of concurrent access guarantees so naturally it has lower access overhead. For data sets larger than RAM, or in multithreaded environments, it will be pretty unusable.

tawayway · on May 9, 2017

Out of curiosity, what kind of logic analyzer are you using that can see events that fast?

luckydude · on May 9, 2017

I dunno, it was a long time ago at SGI, maybe 23 years ago? The CPUs were 200mhz MIPS chips. One of the hardware guys heard me talking about trying to optimize this code and he set up the analyzer and taught me how to use it.

wolfgarbe · on May 10, 2017

How does it compare to the "Fastest Hashtable" by Malte Skarupke? ( https://probablydance.com/2017/02/26/i-wrote-the-fastest-has... )