Cache tables

antirez · on Feb 13, 2019

I use this technique often and I love the simplicity it has. If you want to see a real-world usage, there is one in https://github.com/antirez/dump1090:

    /* Add the specified entry to the cache of recently seen ICAO addresses.
     * Note that we also add a timestamp so that we can make sure that the
     * entry is only valid for MODES_ICAO_CACHE_TTL seconds. */
    void addRecentlySeenICAOAddr(uint32_t addr) {
        uint32_t h = ICAOCacheHashAddress(addr);
        Modes.icao_cache[h*2] = addr;
        Modes.icao_cache[h*2+1] = (uint32_t) time(NULL);
    }

    /* Returns 1 if the specified ICAO address was seen in a DF format with
     * proper checksum (not xored with address) no more than * MODES_ICAO_CACHE_TTL
     * seconds ago. Otherwise returns 0. */
    int ICAOAddressWasRecentlySeen(uint32_t addr) {
        uint32_t h = ICAOCacheHashAddress(addr);
        uint32_t a = Modes.icao_cache[h*2];
        uint32_t t = Modes.icao_cache[h*2+1];

        return a && a == addr && time(NULL)-t <= MODES_ICAO_CACHE_TTL;
    }

As you can see adding a timestamp you can give a TTL to the cached entry validity without any additional effort: in that case you don't need to purge the hash table of old entries eventually, because the cache table is fixed size and the entires will get replaced.

jnordwick · on Feb 13, 2019

I don't really see how this applies to the post, or am I missing something?

The linked article is about having multiple entries per open indexed hash entry, and sizing them to take advantage of cache line sizes.

Your code seems like it basically turns each open index entry into a struct of hash code and timestamp and then compares the timestamp on lookup. This just isn't clear because of how you store the timestamp in odd table offsets after shifting the bottom bit off the hash.

You could take your hash code / timestamp pair and use the articles cache line packing together really (at the cost of half the line).

antirez · on Feb 13, 2019

I think that the main point of the blog post, is that instead of an exact match in O(N) worst time, you change the algorithm in order to replace the entry when adding, so that the lookup time also is always O(1) and the size is bound. The cacheline thing is just an optimization in order get the best out of the single cache miss you incurred. The table offset I use is the manual version of what you would do with a struct, no differences in the generated code more or less: the entry and the timestamp are stored at sequential addresses.

jnordwick · on Feb 13, 2019

I can see that a little. It is a good article regardless what you drew from it. I found it more of a lossy set membership test that used packed cache lines to help in performance (over chaining), but I don't really know what his use case was. Good write up none the less.

CoolGuySteve · on Feb 13, 2019

Is there a reason why you store at h*2 instead of making icao_cache an array of "struct { uint32_t a; uint32_t t; }"?

antirez · on Feb 13, 2019

Nope, I guess I was in an assembly mood :-D

jorangreef · on Feb 13, 2019

The comparison between the classes of "Las Vegas" and "Monte Carlo" algorithms is a useful distinction, along with the author's suggestion of using a cache table as a lookup table for compression match finding.

If anyone is interested, I wrote @ronomon/hash-table as a fast, scalable cuckoo hash table for Node.js.

It has high load factor, constant lookup time in the worst-case and uses a non-recursive variant of the cuckoo probing scheme, together with a bloom filter to reduce cache misses to the alternative cuckoo bucket by an order of magnitude.

Last but not least, caching is supported according to the CLOCK LRU algorithm, which is elegant and effective.

Benchmarks and design decisions are in the README:

https://github.com/ronomon/hash-table

CoolGuySteve · on Feb 13, 2019

If the number of searches for each key isn’t uniformly random, then instead of random replacement, you can bubble sort the bucket values into an LRU eviction scheme. Each time you do a lookup that succceds and the element in the hash entry isn’t index 0, swap it with index - 1. Replacement should always be done on the last element in the bucket vector.

This has 2 advantages:

- scanning the bucket aggregates to theta(1) if your searches follow the Pareto law

- popular keys are less likely to be evicted

Because your buckets are more relevant, you can also make them smaller, reducing the k time in an O(nk) complexity over n searches and decreasing the amount of cache thrashing.

jandrewrogers · on Feb 13, 2019

A concern with using LRU eviction is that the behavior is sensitive to how accesses are distributed over time. The useful property of random eviction versus LRU eviction is that performance falls within a narrow bound: best case scenario is worse, but worst case scenario is better. I think this use case is valuing predictability.

The main reason most databases use one of the myriad clock sweep eviction variants instead of true LRU eviction is that they ensure more uniform performance over all possible sequences of a given non-uniform distribution of accesses -- most of the upside of an LRU while (especially in clever variants) avoiding most of the downside as well.

jorangreef · on Feb 13, 2019

If I understand you correctly, CLOCK LRU works like this, except without the memcpy swapping cost.

https://en.wikipedia.org/wiki/Page_replacement_algorithm#Clo...

CoolGuySteve · on Feb 13, 2019

In order for the scan to be most efficient in cache, you need to store key-reference pairs in the buckets, so the swap cost should only be a couple instructions rather than a memcpy.

jnordwick · on Feb 13, 2019

I've written a couple cache-aware hashes for work (fintech), including an always correct version of this that rehashed when buckets filled. I always call them multi-buckets for lack of a better term.

Most data structures that rely on a chain of items can be improved by bucketing for cache lines (even linked lists).

I've also implemented 32 byte multi-buckets for a cuckoo hash scheme to try and balance search time and memory usages (you can get extremely high load factors using this approach). I don't think this approach is any better than a few other similar multiple-hashing schemes though.

I have tried keeping the hashs together in memory and using SSE compare instructions to do the comparison/mask of the entire bucket in a couple instructions, but have found the technique hit/miss.