Hacker News new | past | comments | ask | show | jobs | submit login

When the table is less full than some cliff-effect-threshold the probability of failure is arbitrarily negligible.

If having a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes and so scanning it on every query will cost nothing but a single well predicted branch.

If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.

[Of course, all assuming someone can't attack your hash function.]




(Co-author of the paper.) You are right. However, it is a bit hard to find the threshold. Even the original author of the cuckoo filter didn't get it right, see https://github.com/efficient/cuckoofilter/issues/9 - for this reason we also lowered the maximum load to 0.94. However, I found (running the benchmarks) that sometimes that is not sufficient (insertion still fails). So, where is the cliff exactly? Is it 0.9? Or lower? And, as you say, it also depends on the hash function. We have used the Murmur3 finalizer, which is statistically very good. What if you use a weaker hash function? For a regular cuckoo hash table, you can just re-build it. But for a cuckoo filter rebuild is not easily possible as you don't have the keys.

You are also right regarding augmenting the cuckoo filter (e.g. "cuckoo hashing with a stash"), but it slows things down. Or use a different algorithm to insert (breath first I think). But it would add complexity.

I think, as is, the cuckoo filter is quite a tricky data structure: it can do a lot in theory, but it is very tricky to get right.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: