Hacker News new | past | comments | ask | show | jobs | submit login

They praise bloom filters for being simple and describe how it works in two sentences. They criticize cuckoo filters for being too complicated. They praise their own algorithm for being simple, and then they don't describe it, and instead link to a repo of several hundred lines of code.

My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else. I'm not ashamed to admit that I have several such code blobs in production right now, but I won't pretend to call them simple.




The xor filter is more complex, but the concepts behind it are quite accessible. Let's give it a try.

The xor filter queries for a key by computing three hashes h1(k), h2(k), h3(k) of the key k and using those to index into three arrays a1, a2, a3 of K-bit values. The 3 values loaded from those arrays are xor'd together and compared to a fingerprint f(k).[0] If they are equal, the key is assumed to be contained in the filter.

Assuming that the fingerprint function is random, the probability of false positives is 2^-K.

Constructing the xor filter requires choosing the three hash functions at random, and solving the system of linear equations given by:

   a1[h1(k_i)] + a2[h2(k_i)] + a3[h3(k_i)] = f(k_i) for i = 1..N
If the arrays are big enough (each one a fraction larger than N/3 where N is the number of elements), then the probability is high that the system has a solution (this comes down to the hyperedges {h1(k_i), h2(k_i), h3(k_i)} being "acyclical" -- the acyclic part would be easier to understand if you only had two hash functions giving you normal undirected edges, but the proof of acyclicity only works with 3 or more hash functions). If it doesn't, you just pick new hash functions at random and retry.

The complicated part of the algorithm is solving the system of linear equations efficiently. The complicated part of the correctness proof is showing that the system has a solution with high probability.

[0] The fingerprint is also a hash function. The difference is that f(k) can be fixed for the algorithm, whereas the h1, h2, h3 need to be chosen randomly when the xor filter is built.

Edit: Note that the paper combines the three arrays into a single large one, but that's an implementation detail. Though it makes you wonder whether one couldn't have three hash functions that each cover the entire range of the larger array and still make it work. That could potentially decrease the required size of the array marginally, at the expense of a more difficult proof.


He probably did not talk about it because this post was an announcement of his paper: https://arxiv.org/abs/1912.08258, the XOR filter description starts at page 3.


There's also the linked paper that has algorithmic description (starts on p. 3).


I agree, I kept reading to see their approach and "woof, there it is, so much better, you work it out!". sigh.

Rule #1 of getting something adopted breached.


>My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else.

Or, one could just read the code instead of guessing - it's not that hard to follow, similar to Bloom idea + XORing the hashes.


> not that hard to follow

I'd say that the requirement to solve a system of linear equations that may or may not even have a solution in order to construct the filter might qualify as hard to follow.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: