
Anyone got pointers to "simple" real-world use cases?



Any real-world use case where you want to check set membership and false positives are acceptable.

E.g. one particular Dutch natural language parser uses a left-corner algorithm[1]. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).

This parser is written in Prolog, and splines are represented as Prolog terms, so a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), this is not an ideal lookup:

- Asserting all the terms requires a lot of memory.

- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.

This is also an application where a false positive does not matter - it just means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since the worst case is that fewer splines are pruned, you are not throwing away good parses.

For this reason, we replaced the 'splines as Prolog facts' approach with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookups in a bit array. This change made this functionality faster and a lot more compact[2].
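To make the mechanism concrete, here is a minimal Bloom filter sketch in Python - not the parser's actual Prolog/C implementation; the class name, parameters, and the choice of SHA-256 for deriving the probe positions are all made up for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array.

    Illustrative sketch only -- the parser described above hashes
    Prolog terms; here we hash arbitrary strings instead.
    """
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one cryptographic hash.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive; never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Membership testing is a fixed number of hash-and-bit-test operations, and the whole structure is just the bit array - which is what makes the compile-time "write the array out as a C file" trick below possible.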

As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed, and then written as a C file. This is linked into the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed[3].

[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...

[2] Of course, a middle ground would be to assert the hashes as facts. But that is still nowhere near as efficient.

[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).


Jeremy Edberg mentioned in his talk on scaling reddit [0] that they used a bloom filter to provide fast (negative) lookup of whether or not a story or comment had been upvoted by the user. The negative case was important since it's often the case that the majority of stories or comments appearing on a page will not have been upvoted by the user.

[0] http://www.infoq.com/presentations/scaling-reddit


To be pedantic, what we did was take advantage of Cassandra's bloom filter.


WebKit uses a bloom filter to optimize CSS descendant selector matching: http://trac.webkit.org/changeset/77777


You could use them for rough spell-checking by hashing all the words in a dictionary. You would then need some additional data structure to handle false positives, but any word the filter rejects is necessarily not a member of the original dictionary. The BF would be much cheaper to store than the original dictionary, and checking each word would be pretty fast (though not necessarily faster than exact string lookups).
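A minimal sketch of that idea, assuming a small word list (the function names, the sizing constants, and the use of MD5 to derive probe positions are all illustrative, not from any real spell-checker):

```python
import hashlib

NUM_BITS = 1 << 20   # filter size; tune to dictionary size
NUM_HASHES = 4

def bit_positions(word):
    # Slice one 16-byte MD5 digest into four probe positions.
    digest = hashlib.md5(word.encode()).digest()
    return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % NUM_BITS
            for i in range(NUM_HASHES)]

def build_filter(dictionary):
    bits = bytearray(NUM_BITS // 8)
    for word in dictionary:
        for p in bit_positions(word):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def maybe_in_dictionary(bits, word):
    # False => definitely misspelled; True => probably correct.
    return all(bits[p // 8] & (1 << (p % 8)) for p in bit_positions(word))
```

A `False` answer is a guaranteed rejection; only `True` answers would need the secondary exact check mentioned above.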


Don't know what you consider simple, but databases use bloom filters to do faster joins for example. Chrome also uses them for malware detection I think.


Specifically, a place where DBs use bloom filters is distributed joins. This isn't a common scenario IMHO. Found a good description here:

http://www.coolsnap.net/kevin/?p=19

P.S. Loved the tutorial.


It's also useful in local hash joins because a Bloom filter check is faster than a regular hash table lookup, so you can use it to eliminate some rows before going to the big hash table.
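A toy sketch of that pattern, assuming simple (key, value) rows - the function names and the double-hashing scheme are invented for illustration, not taken from any real database engine:

```python
def bloom_bits(key, num_bits=4096, num_hashes=3):
    # Derive probe positions via double hashing; illustrative only.
    h1 = hash(key)
    h2 = hash((key, "salt"))
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def hash_join(build_rows, probe_rows):
    # Build side: hash table plus a Bloom filter over its keys.
    table = {}
    bits = [False] * 4096
    for key, val in build_rows:
        table.setdefault(key, []).append(val)
        for p in bloom_bits(key):
            bits[p] = True
    # Probe side: the cheap bit test discards most non-matching
    # rows before the more expensive hash table lookup.
    out = []
    for key, val in probe_rows:
        if all(bits[p] for p in bloom_bits(key)):
            for match in table.get(key, ()):
                out.append((key, match, val))
    return out
```

A false positive on the bit test just means one wasted hash table probe; the join result is unaffected.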


> Chrome also uses them for malware detection I think.

Used to! I had them in earlier versions of the article but they changed that code.


Do you know what they changed them to and why?


Used to! But it was quite a while ago that I noticed that they had changed it, I don't remember any longer.


To generalize, "possible match detected" for anything: you have a big, fat database about a bigger, fatter universe of identifiers, and a less-capable set of clients over a less-capable network link. They want a fast way of determining whether or not a particular identifier is worth querying the database for, and the false positives do not have a very high cost.

The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.


I've used Bloom filters to implement simple breakpoints in an emulator. It makes sense since checking for membership happens every emulator step, and is essentially constant time with a bloom filter.
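A sketch of that shape of use, assuming integer program-counter addresses (the class, the hash constants, and the 1024-bit sizing are made up here, not from any particular emulator):

```python
def bloom_positions(addr, num_bits=1024):
    # Two cheap integer hashes of the PC address; illustrative only.
    return [addr % num_bits, (addr * 2654435761) % num_bits]

class Emulator:
    def __init__(self, breakpoints):
        self.exact = set(breakpoints)   # ground truth for positives
        self.bits = [False] * 1024      # fast per-step pre-check
        for addr in breakpoints:
            for p in bloom_positions(addr):
                self.bits[p] = True
        self.pc = 0

    def step(self):
        # Hot path: the common "no breakpoint here" case costs a
        # couple of bit tests; only a (possibly false) positive
        # pays for the exact set lookup.
        if all(self.bits[p] for p in bloom_positions(self.pc)):
            if self.pc in self.exact:
                return "break"
        self.pc += 1
        return "ran"
```

Since almost every step hits no breakpoint, the filter's guaranteed-negative answer keeps the per-step cost near constant.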


Bitcoin nodes now use bloom filters to enable fast transaction lookups between peers.



