While bloom filters are nice for understanding, as far as I know they've been surpassed by Cuckoo filters, and now (maybe?) Vacuum Filters.
I built a torrent-like streaming application for video a few years ago on WebRTC, and bloom filters made it easy to know whether a peer would have a specific block. The false positives of the bloom filter were fine too, as the 1% of the time where it requested a block and got a failure back was not a bottleneck for the application.
For static sets (where you construct the filter once and then only use it for lookups), blocked Bloom filters are the fastest at lookup. They do need a bit more space (maybe 10% more than regular Bloom filters). Also very fast are binary fuse filters (which are new) and xor filters; they also save a lot of space compared to the others. Cuckoo filters, ribbon filters, and Bloom filters are a bit slower. It's really a trade-off between space and lookup speed.
For dynamic sets (where you can add and remove entries later), the fastest (again for lookup) are "succinct counting blocked Bloom filters" (no paper on these yet): they are a combination of blocked Bloom filters and counting Bloom filters, so lookup is identical to the blocked Bloom filter. Then come cuckoo filters and counting Bloom filters.
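To make the "blocked" idea concrete (it is also why lookup in the succinct counting variant is identical): all the bits for a key go into one small block, so a lookup touches a single cache line. Below is a minimal register-blocked sketch in Java, using one 64-bit word per block and simplified hash derivation; a real implementation would typically use a full 64-byte cache line and a stronger way to derive the bit positions.

    class BlockedBloomSketch {
        private final long[] words; // each 64-bit word is one "block"
        private static final int K = 4; // bits set per key (illustrative choice)

        BlockedBloomSketch(int wordCount) {
            words = new long[wordCount];
        }

        // cheap 64-bit mixer (the splitmix64 finalizer)
        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }

        // derive K bit positions inside a single 64-bit word from one hash value
        private static long mask(long hash) {
            long m = 0;
            for (int i = 0; i < K; i++) {
                m |= 1L << (int) ((hash >>> (6 * i)) & 63);
            }
            return m;
        }

        void add(long key) {
            long h = mix(key);
            int block = (int) ((h & Long.MAX_VALUE) % words.length);
            words[block] |= mask(mix(h));
        }

        boolean mayContain(long key) {
            long h = mix(key);
            int block = (int) ((h & Long.MAX_VALUE) % words.length);
            long m = mask(mix(h));
            return (words[block] & m) == m; // a single memory access checks all K bits
        }
    }

The small space overhead mentioned above comes from exactly this blocking: confining a key's bits to one block makes collisions inside that block more likely, so the array has to be slightly larger for the same false positive rate.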
Incidentally, for people who don't know this already: if you want to just generally Google about these data structures and learn about them, the search term is "succinct data structure". There's been a lot of interesting work in this area. And they are not lossy in general; obviously lossy structures are interesting, but there are some mind-blowing non-lossy succinct data structures as well.
It should be relatively easy to port it to other programming languages.
Compared to regular counting Bloom filters, there are some advantages (e.g. uses half the space, lookup is much faster, no risk of overflows). It has a disadvantage: add+remove are slower (currently). Cuckoo filters need less space, but otherwise advantages+disadvantages are the same.
For more questions, just open an issue on that project (I'm one of the authors).
This paper seems to use an un-optimized Bloom filter implementation. It also lists cuckoo filters as being 10 times faster than Bloom filters. In other papers, e.g. the cuckoo filter paper and the xor filter paper, the difference is _much_ smaller; maybe a factor of 2.
Yes. Some variants of ribbon filters are very fast, and other variants (which are a bit slower, though) need very little space: only a few percent more than the theoretical lower bound.
Is anyone curating the latest developments in this space? I feel like I only stumbled upon Vacuum filters, and now Ribbon filters, because of the top comment on an HN thread. Is anyone doing a "latest developments in the data structure/algo space" roundup?
That's a good question! This paper also mentions "Dele. BF [43]" - "The Deletable Bloom Filter: A New Member of the Bloom Family" https://www.researchgate.net/publication/260668624_The_Delet... . As far as I can see, lookup is identical to the regular Bloom filter. Yet the vacuum filter paper lists the throughput of the Deletable Bloom Filter as almost twice that of the Bloom filter (1.95x). How can that be? Maybe I misunderstand something here.
Yes, peer review is supposed to catch completely incorrect numbers. But in many cases a reviewer doesn't have access to the source code, and has no way to re-run the tests.
There's the "Replicated Computational Results initiative": https://jea.acm.org/rcr_initiative.cfm . Papers that pass this test get a stamp certifying that their results are reproducible. The xor filter paper (where I'm a co-author) took part in this initiative.
ugh -- reading related papers on filters like the above exposed another pre-req i am missing: an understanding of math high enough to understand the math in the articles :/
Depending on how you look at it, bloom filters have been surpassed. In practice, though, I find there are implementation-driven factors which can push bloom filters back to the top choice.
Starting with language, you might not find what you need (e.g. vacuum is C-only so far, afaik). The complexity of the algorithm may force you to decide porting is problematic (understandability, effort, cost of verifying the port).
Then you get to the practical things: do you need to handle threads, merging, serialization, etc.? After all this, you might find the only unicorn is bloom, due to its simplicity and wide implementation availability.
In my case, after lots of testing I ended up extending fastfilter Java's bloom filter by adding merge (trivial in bloom), serialization, a move to the XX3 hash, and simplified constructors. Doing all this in, say, cuckoo or vacuum would have been a lot of work. I'm not certain offhand if/how merging might even work in those.
I like Bloom filters well enough, but I've noticed them getting cargo-culted into projects where they aren't even remotely necessary. There are a few techniques that people seem to love (Cuckoo hashes are another) and uncritically apply without much analysis.
Where they are good, they're good, but I've seen techniques with Impressive Papers being used more or less at random where the basic data structure would have been fine.
I built a beautiful dynamic algorithm a while back to assign strings to "buckets" in Hyperscan. The problem is (roughly) that you can pack lots of strings into buckets for a "bucketed super-character shift-or matcher", but if you pack strings of wildly disparate length together (e.g. "foobar" and "xyz") into the same bucket they wind up weakening the power of the matcher for the longer strings (it's still correct; the primary table is just a prefilter). But "foobar" and "xyz" together are pretty close to "bar" and "xyz" (there are some subtleties).
So the problem was you have a bunch of strings of different lengths but only 8 buckets to pack things into. I had some neat heuristics to estimate the cost of weakening strings of various lengths based on estimated false positive rates - and then a dynamic function to optimize packing for various choices. However, I buggered up the heuristic function. Due to the complexity of the dynamic algorithm, the magic numbers were pretty hard to debug or even understand, and it created a weird performance corner we didn't discover for a year or so. Most of the answers it gave were pretty plausible even with the wrong heuristic... sigh.
That would be mean, and perhaps even career-limiting or break NDAs or both.
One example I can point to is the fact that we (Sensory Networks) had cargo-culted in a recursive hash function for a string matching engine (back when we made hardware). There was no real reason to be using a recursive hash at all, but there was a cool paper from the NSA about recursive hashing, so there it was. No-one had even evaluated anything simpler.
Again, recursive hashing is cool. Arguably Bloom filters and (maybe) Cuckoo hashes are cool too - but my problem isn't with the existence of these techniques but the way that people seem to get a rush of blood to the head and use them when they're kind of pointless and overcomplex.
Cuckoo hashes are great if you really understand why you're using them, but 99% of people would be fine with chaining for most applications or a simpler set-associative scheme if open addressing is not negotiable.
Their locality is problematic, they are fantastically hard to get right in a multi-reader/multi-writer case (if you don't want to just lock the whole damn table) and they're just all-round complex with surprising sharp edges. The fact that you have to go wandering off with a bunch of unpredictable data-dependent reads if you don't find what you're looking for is ... wild. Alternately you can pay the costs of multiple parallel accesses preemptively, but now you've smashed throughput in case the table is big enough to be bandwidth limited (L3, memory).
They look pretty great when the load factor is low, but everything looks great with a light load factor.
As far as open-addressing schemes go, they're OK, but I see people reaching for them for reasons that don't seem to make any sense.
Biggest problem with Cuckoo tables is that if your hash function doesn't have the right properties, insertions do not converge. They literally go into an infinite loop.
This makes them not viable as a generic primitive for hash tables.
Besides, as you mention, they're in fact slower than modern engineered hash table implementations that do only one random memory access in the vast majority of cases.
Yes, exactly. Linear probing is often easier and faster. Using buckets also helps (each bucket contains multiple slots), for both cuckoo hashing and linear probing. E.g. the cuckoo filter uses a bucket size of 4. See also https://gankra.github.io/blah/hashbrown-tldr/ for a fast hash table implementation.
I'm sorry! I thought this implementation used linear probing. Yes, this is not an example of linear probing. But it is much closer to linear probing than to cuckoo hashing (I hope we can agree on this). I also found this blog entry interesting: https://probablydance.com/2017/02/26/i-wrote-the-fastest-has... (of course "fastest" is relative, when it comes to benchmarks).
Yes exactly. Actually you just need one 64-bit hash to generate all the hashes needed by the Bloom filter. And the somewhat slow modulo operation can be replaced with multiply+shift (fastRange). See e.g. https://github.com/FastFilter/fastfilter_cpp/blob/master/src...
    // a and b are the two 32-bit halves of a single 64-bit hash (classic double hashing)
    for (int i = 0; i < k; i++) {
        // fastRange maps the hash into [0, data.length) with a multiply+shift instead of a modulo
        data[fastRange(a, data.length)] |= getBit(a); // set one bit per derived hash value
        a += b; // derive the next hash value
    }
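And fastRange itself is just a multiply and a shift; for a 32-bit hash value it looks roughly like this in Java:

    // map a 32-bit hash to [0, n) without a modulo: multiply, then keep the high 32 bits
    static int fastRange(int hash, int n) {
        return (int) (((hash & 0xFFFFFFFFL) * n) >>> 32);
    }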
Sometimes I think of the human memory as behaving like cascading bloom filters, one bloom filter narrowing down the retrieval of a memory to a smaller set of bloom filters. Yet at the very bottom it is still just a bloom filter, and it still might be a false positive (false memory).
Well, the title is plainly wrong: a classic bloom filter is not a hashmap at all. It's a probabilistic membership-check data structure. It might be backed by a hashmap or some other exact membership data structure.
It is referring to how to use a bloom filter. You can use a bloom filter instead of a hashtable in some situations, and it will be more space-efficient at the expense of accuracy, but the article talks about other uses for them as well.
I am seeing more and more people glossing over fundamental concepts with wishy-washy analogies. I think this is a violation of the spirit of the software engineering profession, where the rule should be that precision is given the utmost attention and care.
I say this not because I attach some special prestige to software engineering. Rather, because software works on top of a foundation that is precise in the practical sense (modern computing chips and machine architecture), precision is something that can be achieved from the ground up for the entirety of the software. In other words, precision is the most valuable virtue of the profession. Forsaking it is like a person refusing to communicate in spoken or written language (comparatively speaking; of course one can choose to express oneself in many other forms).
I believe that if you have two bloom filters containing the hashes of each set, you can use a bitwise AND or a bitwise OR to get an approximate intersection or a union of the two sets.
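A minimal sketch of that in Java, assuming both filters were built with identical size and hash functions (otherwise combining the bit arrays is meaningless). Note that OR gives exactly the filter you would have gotten by inserting both sets, while AND only approximates the intersection (its false positive rate can be worse than a filter built from the true intersection):

    static long[] union(long[] a, long[] b) {
        long[] out = a.clone();
        for (int i = 0; i < out.length; i++) out[i] |= b[i]; // bit set in either filter
        return out;
    }

    static long[] intersection(long[] a, long[] b) {
        long[] out = a.clone();
        for (int i = 0; i < out.length; i++) out[i] &= b[i]; // bit set in both filters
        return out;
    }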
HyperLogLog (and its variants) can do the union part as well, and they need much less space than Bloom filters, though. For large sets, that is. For cardinalities smaller than e.g. 100, Bloom filters are better.
> Another similar one is the cuckoo filter which works in a similar way to a bloom filter, except it has a inbuilt limit after which it will no longer add items, can delete items
Bloom filters have the same "inbuilt limit" unless you're okay with your false positive rate becoming one!
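For reference, with m bits, k hash functions, and n inserted keys, the usual approximation for the false positive rate is p ≈ (1 - e^(-k*n/m))^k, which heads towards 1 as n keeps growing past what the filter was sized for.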
> and has some different memory characteristics.
Yes, radically lower numbers of memory accesses needed for filter hits, especially in filters with low false positive rates.
How do you use a bloom filter? Because that's never been a limitation in my use cases.
The only times I've really needed them are when I needed to know whether a db row had been updated in this import run. (Can't fit the hash in memory; updates happen in batch/async, so the writes haven't been made yet.)
The article mentions using it for indicating which things are in a cache. Not being able to delete entries seems like a severe limitation to effectiveness over time, as false positives increase.
If you need the ability to remove entries, then I would consider the counting Bloom filter (which is simpler, but needs more space), and the cuckoo filter. But you do need to be a bit careful as both can overflow if you are unlucky, or if you add too many entries. In which case you would have to re-build them. So better be prepared for this case, or prevent overflowing the filters by using a cascade of filters.
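To make the counting idea concrete, here is a toy counting Bloom filter sketch in Java; it uses a whole byte per counter to stay readable, whereas real implementations pack small (e.g. 4-bit) counters, which is exactly why overflow is the concern mentioned above:

    class CountingBloomSketch {
        private final byte[] counters;
        private final int k;

        CountingBloomSketch(int size, int k) {
            this.counters = new byte[size];
            this.k = k;
        }

        // derive the i-th cell for a key (toy double hashing with a murmur-style finalizer)
        private int cell(long key, int i) {
            long h = key + i * 0x9E3779B97F4A7C15L;
            h = (h ^ (h >>> 33)) * 0xFF51AFD7ED558CCDL;
            h ^= h >>> 33;
            return (int) ((h & Long.MAX_VALUE) % counters.length);
        }

        void add(long key) {
            for (int i = 0; i < k; i++) counters[cell(key, i)]++; // increment instead of setting a bit
        }

        void remove(long key) { // only safe for keys that were actually added before
            for (int i = 0; i < k; i++) counters[cell(key, i)]--;
        }

        boolean mayContain(long key) {
            for (int i = 0; i < k; i++) {
                if (counters[cell(key, i)] == 0) return false; // a zero counter means definitely absent
            }
            return true;
        }
    }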
Use n copies of the filter staggered in time. Start one on first day/hour/some-time-unit, then on next tick of time unit start a new one. The one started on day 2 lacks the entries from day 1. For checking, only use the oldest one. On the nth day, as you are making yet another one, discard the oldest one and the second oldest one becomes the new oldest one. The oldest one always has n time units of data in it.
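A rough sketch of that rotation in Java (the Filter interface below is just a stand-in for whichever filter implementation you actually use):

    import java.util.ArrayDeque;
    import java.util.function.Supplier;

    // placeholder for any approximate-membership filter with add/mayContain
    interface Filter {
        void add(long key);
        boolean mayContain(long key);
    }

    // Keep n staggered filters: every new item goes into all of them, lookups only
    // consult the oldest one, and each tick drops the oldest and starts a fresh one.
    // The oldest filter therefore always covers roughly the last n time units.
    class RotatingFilters {
        private final ArrayDeque<Filter> filters = new ArrayDeque<>();
        private final Supplier<Filter> freshFilter;

        RotatingFilters(int n, Supplier<Filter> freshFilter) {
            this.freshFilter = freshFilter;
            for (int i = 0; i < n; i++) filters.addLast(freshFilter.get());
        }

        void add(long key) {
            for (Filter f : filters) f.add(key); // newer filters only see a suffix of the stream
        }

        boolean mayContain(long key) {
            return filters.peekFirst().mayContain(key); // the oldest filter covers the full window
        }

        void tick() { // call once per day/hour/whatever time unit you chose
            filters.removeFirst();
            filters.addLast(freshFilter.get());
        }
    }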
That's clever! You can use this trick to effectively make items expire and be removed after a set time.
But it makes the whole thing more complicated, with parameters that need tuning, and it's no longer the simple and powerful thing that bloom filters are sold as.
If you count false positives, you can recreate your filter when you get a specific count of false positives. If you also count true positives, you can recreate the filter when the ratio between true and false positives hits some number.
Still needs tuning though, depends on how costly it is to generate (and transfer, if needed) vs how costly a false positive is.
Lots of caches don’t need precise invalidation, either because they have a short lifespan (e.g. request-bound) or because the removals are so rare it’s not that expensive to blow the entire cache in the grand scheme of things.
in a use case i have for something _like_ bloom filters (though i want to evaluate Cuckoo/Vacuum/etc) i have an immutable collection of changes that happened during an event, and need to query if a specific change type is in that event.
Given that it's a historical event, there's no need for removal. My concern (due to my specific application needs) is primarily storage size and query speed.