Bloom Filters: More than a space-efficient hashmap (boyter.org)
173 points by boyter on Aug 11, 2021 | 53 comments



While bloom filters are nice for understanding, as far as I know they've been surpassed by Cuckoo filters, and now (maybe?) Vacuum Filters.

I built a torrent-like streaming application for video a few years ago on WebRTC, and bloom filters made it easy to know whether a peer would have a specific block. The false positives of the bloom filter were fine too: the ~1% of requests that came back as failures were not a bottleneck for the application.


There are many alternatives to Bloom filters, but some variants of Bloom filters are still competitive. I'm one of the authors of some benchmarks for filters: https://github.com/FastFilter/fastfilter_cpp (this is based on the cuckoo filter benchmark) and https://github.com/FastFilter/fastfilter_java

For static sets (where you construct the filter once and then use it for lookups), blocked Bloom filters are the fastest for lookup. They do need a bit more space (maybe 10% more than regular Bloom filters). Also very fast are binary fuse filters (which are new) and xor filters; they also save a lot of space compared to the others. Cuckoo filters, ribbon filters, and Bloom filters are a bit slower. It's a trade-off between space and lookup speed, really.
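
As a yardstick (textbook numbers, not from these benchmarks): a space-optimal Bloom filter needs about 1.44 * log2(1/ε) bits per key, so roughly 9.6 bits/key at a 1% false-positive rate, while the information-theoretic lower bound is log2(1/ε), about 6.6 bits/key. Xor and binary fuse filters sit much closer to that bound.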

For dynamic sets (where you can add and remove entries later), the fastest (again, for lookup) are "succinct counting blocked Bloom filters" (no paper yet for these): they combine blocked Bloom filters and counting Bloom filters, so lookup is identical to a blocked Bloom filter. Then come cuckoo filters and counting Bloom filters.


> "Succinct counting blocked Bloom filter"

Incidentally, for people who don't know this already: if you want to just generally Google about these data structures and learn about them, the search term is "succinct data structure". There's been a lot of interesting work in this area. And they are not lossy in general; obviously lossy structures are interesting, but there are some mind-blowing non-lossy succinct data structures as well.


Thank you for posting the correct incantation for one to speak at the oracle.


Where can I find more information about the "Succinct counting blocked Bloom filter"? Can you point to an implementation or a document? Thanks.


Sure. So far, there are only two Java implementations. One is using "Rank" and the other is using "Select":

https://github.com/FastFilter/fastfilter_java/blob/master/fa...

https://github.com/FastFilter/fastfilter_java/blob/master/fa...

It should be relatively easy to port it to other programming languages.

Compared to regular counting Bloom filters, there are some advantages (e.g. uses half the space, lookup is much faster, no risk of overflows). It has a disadvantage: add+remove are slower (currently). Cuckoo filters need less space, but otherwise advantages+disadvantages are the same.

For more questions, just open an issue on that project (I'm one of the authors).


Thank you!


Table 1 here shows Vacuum filters being 11x faster than Bloom filters: https://www.vldb.org/pvldb/vol13/p197-wang.pdf


This paper seems to use an un-optimized Bloom filter implementation. It also lists cuckoo filters as being 10 times faster than Bloom filters. In other papers, e.g. the cuckoo filter paper and the xor filter paper, the difference is _much_ smaller; maybe a factor of 2.

https://dl.acm.org/doi/pdf/10.1145/2674005.2674994 - figures 5 and 6

https://arxiv.org/abs/1912.08258 - figures 3b and 4b



Yes. Some variants of ribbon filters are very fast, and other variants (which are a bit slower, though) need very little space: only a few percent more than the theoretical lower bound.

For static sets, ribbon filters and binary fuse filters (e.g. here: https://github.com/FastFilter/xor_singleheader) are very competitive. Both are based on recent (2019 and newer) theoretical work from Stefan Walzer, e.g. this one https://arxiv.org/pdf/1907.04749.pdf


Is anyone curating the latest developments in this space? I feel like I only stumbled upon Vacuum filters, and now ribbon filters, because of the top comment on an HN thread. Is anyone writing a "latest developments in the data structure/algorithm space" roundup?


Bonus points if someone starts this and breaks it down for the more pedestrian IQ'd folks like myself.


Thanks for this context!

I wonder how the peer review process missed the use of an inefficient Bloom filter implementation?


That's a good question! This paper also mentions "Dele. BF [43]", i.e. "The Deletable Bloom Filter: A New Member of the Bloom Family": https://www.researchgate.net/publication/260668624_The_Delet... . As far as I can see, lookup is identical to the regular Bloom filter. Yet the vacuum filter paper lists the throughput of the Deletable Bloom Filter as almost twice that of the Bloom filter (1.95x). How can that be? Maybe I misunderstand something here.

Yes, peer review is supposed to catch completely incorrect numbers. But in many cases a reviewer doesn't have access to the source code, and has no way to re-run the tests.

There's the "Replicated Computational Results initiative": https://jea.acm.org/rcr_initiative.cfm . Papers that pass this test get a stamp certifying that their results are reproducible. The xor filter paper (where I'm a co-author) took part in this initiative.


ugh -- reading related papers on filters like the above exposed another pre-req i am missing: a math background strong enough to follow the math in the articles :/


Depending on how you look at it, bloom filters have been surpassed. In practice, though, I find there are variables driven by implementation factors which can still make bloom filters the top choice.

Starting with language, you might not find what you need (eg vacuum is C only so far afaik). The complexity of the algorithm may force you to decide porting is problematic (understandability, effort, cost of verifying the port).

Then you get to the practical things. Do you need to handle threads, merging, serialization, etc.? After all this, you might find the only unicorn is bloom, due to its simplicity and wide implementation availability.

In my case, after lots of testing I ended up extending fastfilter Java's bloom filter by adding merge (trivial in bloom), serialization, moving to the XX3 hash, and simplified constructors. Doing all this in, say, cuckoo or vacuum would have been a lot of work; I'm not certain offhand if/how merging might even work in those.
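
The Bloom merge really is just a bitwise OR of the two bit arrays, assuming both filters were built with the same size, k, and hash seed; roughly (the names here are mine, not fastfilter's):

    // Union of two Bloom filters over identical layouts: bitwise OR.
    static long[] merge(long[] bitsA, long[] bitsB) {
        if (bitsA.length != bitsB.length)
            throw new IllegalArgumentException("filter layouts must match");
        long[] merged = bitsA.clone();
        for (int i = 0; i < merged.length; i++)
            merged[i] |= bitsB[i];   // any bit set in either filter stays set
        return merged;
    }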


I like Bloom filters well enough, but I've noticed them getting cargo-culted into projects where they aren't even remotely necessary. There are a few techniques that people seem to love (Cuckoo hashes are another) and uncritically apply without much analysis.

Where they are good, they're good, but I've seen techniques with Impressive Papers being used more or less at random where the basic data structure would have been fine.


Same goes for prefix and suffix trees. And for a lot of dynamic algorithms.

It is tough, as these are definitely good skills. But they are stupid easy ways to stall out projects.


I built a beautiful dynamic algorithm a while back to assign strings to "buckets" in Hyperscan. The problem is (roughly) that you can pack lots of strings into buckets for a "bucketed super-character shift-or matcher", but if you pack strings of wildly disparate length together (e.g. "foobar" and "xyz") into the same bucket they wind up weakening the power of the matcher for the longer strings (it's still correct; the primary table is just a prefilter). But "foobar" and "xyz" together are pretty close to "bar" and "xyz" (there are some subtleties).

So the problem was you have a bunch of strings of different lengths but only 8 buckets to pack things into. I had some neat heuristics to estimate the cost of weakening strings of various lengths based on estimated false positive rates - and then a dynamic function to optimize packing for various choices. However, I buggered up the heuristic function. Due to the complexity of the dynamic algorithm, the magic numbers were pretty hard to debug or even understand, and it created a weird performance corner we didn't discover for a year or so. Most of the answers it gave were pretty plausible even with the wrong heuristic... sigh.


> I've noticed them getting cargo-culted into projects where they aren't even remotely necessary

Do you have some examples that can be shared?


That would be mean, and perhaps even career-limiting or break NDAs or both.

One example I can point to is the fact that we (Sensory Networks) had cargo-culted in a recursive hash function for a string matching engine (back when we made hardware). There was no real reason to be using a recursive hash at all, but there was a cool paper from the NSA about recursive hashing, so there it was. No-one had even evaluated anything simpler.

Again, recursive hashing is cool. Arguably Bloom filters and (maybe) Cuckoo hashes are cool too - but my problem isn't with the existence of these techniques but the way that people seem to get a rush of blood to the head and use them when they're kind of pointless and overcomplex.


What issues do you have with cuckoo hashing? Unlike bloom filters, it seems like a completely reasonable default/baseline, no?


Cuckoo hashes are great if you really understand why you're using them, but 99% of people would be fine with chaining for most applications or a simpler set-associative scheme if open addressing is not negotiable.

Their locality is problematic, they are fantastically hard to get right in a multi-reader/multi-writer case (if you don't want to just lock the whole damn table) and they're just all-round complex with surprising sharp edges. The fact that you have to go wandering off with a bunch of unpredictable data-dependent reads if you don't find what you're looking for is ... wild. Alternately you can pay the costs of multiple parallel accesses preemptively, but now you've smashed throughput in case the table is big enough to be bandwidth limited (L3, memory).

They look pretty great when the load factor is low, but everything looks great with a light load factor.

As far as open-addressing schemes go, they're OK, but I see people reaching for them for reasons that don't seem to make any sense.


The biggest problem with cuckoo tables is that if your hash function doesn't have the right properties, insertions do not converge: they literally go into an infinite loop. This makes them not viable as a generic primitive for hash tables.

Besides, as you mention, they're in fact slower than modern engineered hash table implementations that do only one random memory access in the vast majority of cases.
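
For readers who haven't seen it, the insert loop looks roughly like this; real implementations bound the number of evictions so a bad hash function fails loudly instead of spinning forever. A toy sketch (two tables, one slot each; all names are mine, and real tables use buckets of 2-8 slots):

    class ToyCuckoo {
        static final long EMPTY = 0;             // toy: key 0 is unsupported
        static final int MAX_KICKS = 500;
        final long[] table1 = new long[1024];    // power-of-two sizes
        final long[] table2 = new long[1024];

        static int hash1(long key) { return (int) ((key * 0x9E3779B97F4A7C15L) >>> 54); }
        static int hash2(long key) { return (int) ((key * 0xC2B2AE3D27D4EB4FL) >>> 54); }

        boolean insert(long key) {
            for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
                int i = hash1(key);
                if (table1[i] == EMPTY) { table1[i] = key; return true; }
                long victim = table1[i];         // kick the occupant of table1
                table1[i] = key;
                int j = hash2(victim);
                if (table2[j] == EMPTY) { table2[j] = victim; return true; }
                key = table2[j];                 // kick from table2 and loop
                table2[j] = victim;
            }
            return false;                        // give up: caller must rehash or grow
        }
    }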


Yes, exactly. Linear probing is often easier and faster. Using buckets also helps (each bucket contains multiple slots), for both cuckoo hashing and linear probing. E.g. the cuckoo filter uses a bucket size of 4. See also https://gankra.github.io/blah/hashbrown-tldr/ for a fast hash table implementation.
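
For comparison, the entire lookup for linear probing is just a forward scan (a minimal sketch; a bucketed variant would check several slots per cache line):

    // Scan from the hash slot until the key or an empty slot is found.
    // Table length is a power of two; 0 marks an empty slot.
    static int indexOf(long[] table, long key) {
        int mask = table.length - 1;
        int i = (int) ((key * 0x9E3779B97F4A7C15L) >>> 32) & mask;
        while (true) {
            if (table[i] == key) return i;   // present
            if (table[i] == 0) return -1;    // empty slot: definitely absent
            i = (i + 1) & mask;              // next slot, wrapping around
        }
    }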


Using hashbrown as evidence in a comment defending linear probing is odd when it specifically notes that it uses quadratic probing.


I'm sorry! I thought this implementation used linear probing. Yes, this is not an example of linear probing, but it is much closer to linear probing than to cuckoo hashing (I hope we can agree on this). I also found this blog entry interesting: https://probablydance.com/2017/02/26/i-wrote-the-fastest-has... (of course "fastest" is relative when it comes to benchmarks)


http://cap-lore.com/code/BloomAnno.html is a much shorter page that has a lot more interesting stuff about Bloom filters in it in a lot fewer words.


A bloom filter is very much something other than a hashmap.


Isn’t it more like a hash set than a hash map?


Yes


It's worth knowing that you don't need to generate N hashes - you can use https://en.wikipedia.org/wiki/Double_hashing to generate all N hashes from just 2 hashes.


Yes exactly. Actually you just need one 64-bit hash to generate all the hashes needed by the Bloom filter. And the somewhat slow modulo operation can be replaced with multiply+shift (fastRange). See e.g. https://github.com/FastFilter/fastfilter_cpp/blob/master/src...

    for (int i = 0; i < k; i++) {
        data[fastRange(a, data.length)] |= getBit(a);
        a += b;
    }
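
Expanded into a self-contained toy (my sketch, not the fastfilter class; the caller supplies a good 64-bit hash of the key):

    // Toy Bloom filter: k probes derived from one 64-bit hash via double
    // hashing (a += b); slots picked with multiply+shift instead of modulo.
    class ToyBloom {
        final long[] data;
        final int k;

        ToyBloom(int words, int k) { this.data = new long[words]; this.k = k; }

        static int fastRange(int x, int n) {   // maps x uniformly into [0, n)
            return (int) (((x & 0xFFFFFFFFL) * n) >>> 32);
        }

        void add(long hash64) {
            int a = (int) (hash64 >>> 32), b = (int) hash64;
            for (int i = 0; i < k; i++, a += b)
                data[fastRange(a, data.length)] |= 1L << a;   // shift uses low 6 bits of a
        }

        boolean mightContain(long hash64) {
            int a = (int) (hash64 >>> 32), b = (int) hash64;
            for (int i = 0; i < k; i++, a += b)
                if ((data[fastRange(a, data.length)] & (1L << a)) == 0) return false;
            return true;   // "maybe": could still be a false positive
        }
    }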


Sometimes I think of the human memory as behaving like cascading bloom filters, one bloom filter narrowing down the retrieval of a memory to a smaller set of bloom filters. Yet at the very bottom it is still just a bloom filter, and it still might be a false positive (false memory).


Well, the title is plainly wrong: a classic bloom filter is not a hashmap at all. It's a probabilistic membership-check data structure. It might be backed by a hashmap or some other exact-membership data structure.


It is referring to how you can use a bloom filter. You can use a bloom filter instead of a hashtable in some situations, and it will be more space-efficient at the expense of accuracy, but the article talks about other uses for them as well.


I am seeing more and more people gloss over fundamental concepts with wishy-washy analogies. I think this violates the spirit of the software engineering profession, where the rule should be that precision is paid the utmost attention and care.

I say this not because I attach some special prestige to software engineering. Rather, because software engineering works on top of a foundation that is precise in the practical sense (modern computing chips and machine architecture), precision is something that can be achieved from the ground up for the entirety of the software. In other words, precision is the most valuable virtue of the profession. Forsaking it is like a human refusing to communicate in spoken and written language (comparatively speaking; of course one can express oneself in many forms).


Bloom filters also can be used to approximate the size of the intersection and union of two sets.


Can you please expound on this a little bit more?


I believe that if you have two bloom filters containing the hashes of each set, you can use a bitwise AND or a bitwise OR to get the approximate intersection or union of the two sets.


Yes exactly. Do a bitwise AND / OR, and count the number of bits that are set. From this count, you can estimate the number of entries in the set: https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the...

HyperLogLog (and variants) can do the union part as well, and needs much less space than Bloom filters, though. For large sets, that is. For cardinalities smaller than e.g. 100, Bloom filters are better.
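
A sketch of the estimate (the formula is from the Wikipedia section linked above; the method names are mine):

    // n ~= -(m/k) * ln(1 - X/m), where X = number of set bits, m = filter
    // size in bits, k = number of hash functions.
    static double estimatedSize(long[] bits, int k) {
        long x = 0;
        for (long word : bits) x += Long.bitCount(word);
        double m = bits.length * 64.0;
        return -(m / k) * Math.log(1.0 - x / m);
    }

    // Union: OR the arrays, then estimate. Intersection follows by
    // inclusion-exclusion: |A and B| ~= |A| + |B| - |A or B|.
    static double estimatedUnionSize(long[] a, long[] b, int k) {
        long[] u = a.clone();
        for (int i = 0; i < u.length; i++) u[i] |= b[i];
        return estimatedSize(u, k);
    }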


yes!

Also, another alternative is Theta sketches (Apache DataSketches).


> Another similar one is the cuckoo filter which works in a similar way to a bloom filter, except it has a inbuilt limit after which it will no longer add items, can delete items

Bloom filters have the same "inbuilt limit" unless you're okay with your false positive rate becoming one!
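
For reference, with m bits, k hash functions, and n inserted keys, the textbook approximation for the false positive rate is p ≈ (1 - e^(-kn/m))^k. Hold m fixed and let n grow, and p climbs toward 1; that's the limit in question.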

> and has some different memory characteristics.

Yes, radically lower numbers of memory accesses needed for filter hits, especially in filters with low false positive rates.


How does one use bloom filters in practice, given their inability to remove items? Does one just toss the filter regularly and start over?


How do you use a bloom filter? Because that's never been a limitation in my use cases.

The only times I've really needed them is when I needed to know a db row has been updated in this import run. (Can't fit hash in memory, updates happen in batch/async so writes haven't been made yet)

I.e. basically using it for dedupe


The article mentions using it for indicating which things are in a cache. Not being able to delete entries seems like a severe limitation to effectiveness over time, as false positives increase.


If you need the ability to remove entries, then I would consider the counting Bloom filter (which is simpler, but needs more space), and the cuckoo filter. But you do need to be a bit careful as both can overflow if you are unlucky, or if you add too many entries. In which case you would have to re-build them. So better be prepared for this case, or prevent overflowing the filters by using a cascade of filters.
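
A minimal counting Bloom filter for illustration (one byte per counter to keep it readable; real implementations often pack 4-bit counters, which is where the space cost comes from):

    class CountingBloom {
        final byte[] counts;   // one counter per slot
        final int k;

        CountingBloom(int slots, int k) { this.counts = new byte[slots]; this.k = k; }

        private int slot(int a) {   // multiply+shift range reduction
            return (int) (((a & 0xFFFFFFFFL) * counts.length) >>> 32);
        }

        void add(long hash64) {
            int a = (int) (hash64 >>> 32), b = (int) hash64;
            for (int i = 0; i < k; i++, a += b) {
                if (counts[slot(a)] == Byte.MAX_VALUE)
                    throw new IllegalStateException("counter overflow: rebuild the filter");
                counts[slot(a)]++;
            }
        }

        void remove(long hash64) {   // only for keys that were actually added
            int a = (int) (hash64 >>> 32), b = (int) hash64;
            for (int i = 0; i < k; i++, a += b) counts[slot(a)]--;
        }

        boolean mightContain(long hash64) {
            int a = (int) (hash64 >>> 32), b = (int) hash64;
            for (int i = 0; i < k; i++, a += b)
                if (counts[slot(a)] == 0) return false;
            return true;
        }
    }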

Some clever cache implementations internally use special kinds of Bloom filters, e.g. the Caffeine cache (Java) uses TinyLFU, which "builds upon Bloom filter theory" (from the abstract of the paper): https://github.com/ben-manes/caffeine and https://arxiv.org/pdf/1512.00727.pdf


Use n copies of the filter staggered in time. Start one on first day/hour/some-time-unit, then on next tick of time unit start a new one. The one started on day 2 lacks the entries from day 1. For checking, only use the oldest one. On the nth day, as you are making yet another one, discard the oldest one and the second oldest one becomes the new oldest one. The oldest one always has n time units of data in it.
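
Roughly like this (reusing the ToyBloom sketch from upthread; the names are mine):

    // Keep n filters staggered in time: add to every live filter, look up
    // only in the oldest, and tick() once per day/hour/time-unit.
    class RotatingBloom {
        final java.util.ArrayDeque<ToyBloom> filters = new java.util.ArrayDeque<>();
        final int n, words, k;

        RotatingBloom(int n, int words, int k) {
            this.n = n; this.words = words; this.k = k;
            filters.addLast(new ToyBloom(words, k));
        }

        void add(long hash64) {
            for (ToyBloom f : filters) f.add(hash64);   // newer filters see it too
        }

        boolean mightContain(long hash64) {
            return filters.peekFirst().mightContain(hash64);   // oldest = full window
        }

        void tick() {
            filters.addLast(new ToyBloom(words, k));
            if (filters.size() > n) filters.pollFirst();   // entries older than n ticks expire
        }
    }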


That's clever! You can use this trick to effectively make items expire and be removed after a set time.

But it makes the whole thing more complicated, with parameters that need tuning, and it's no longer the simple and powerful thing that bloom filters are sold as.


If you count false positives, you can recreate your filter when you get a specific count of false positives. If you also count true positives, you can recreate the filter when the ratio between true and false positives hits some number.

Still needs tuning though, depends on how costly it is to generate (and transfer, if needed) vs how costly a false positive is.
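
Something like this (a sketch; it assumes the caller checks the backing store after each filter hit, and the thresholds are arbitrary placeholders):

    // Count verified outcomes of filter hits and signal when false
    // positives pile up, by absolute count or by ratio to true positives.
    class FpTracker {
        long falsePositives, truePositives;

        void recordHit(boolean reallyPresent) {
            if (reallyPresent) truePositives++; else falsePositives++;
        }

        boolean shouldRebuild() {
            return falsePositives > 1000
                || (truePositives > 0 && falsePositives * 10 > truePositives);
        }
    }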


Lots of caches don’t need precise invalidation, either because they have a short lifespan (e.g. request-bound) or because the removals are so rare it’s not that expensive to blow the entire cache in the grand scheme of things.


in a use case i have for something _like_ bloom filters (though i want to evaluate Cuckoo/Vacuum/etc) i have an immutable collection of changes that happened during an event, and need to query if a specific change type is in that event.

Given that it's a historical event, there's no need for removal. My concern (due to my specific application needs) is primarily storage size and query speed.



