
Anyone got pointers to "simple" real-world use cases?



Any real-world use case where you want to check set membership and false positives are acceptable.

E.g. one particular Dutch natural language parser uses a left-corner algorithm[1]. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).

This parser is written in Prolog, and splines are represented as Prolog terms, so a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), this is not an ideal lookup:

- Asserting all the terms requires a lot of memory.

- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.

This is also an application where a false positive does not matter - it just means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since the worst case is that fewer splines are pruned, you are not throwing away good parses.

For this reason, we replaced the 'splines as Prolog facts' approach with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookups in a bit array. This change made this functionality faster and a lot more compact[2].
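To make the mechanism concrete, here is a minimal Bloom filter sketch in Python - not the parser's actual Prolog/C implementation; the class name, parameters, and the choice of SHA-256 for deriving the probe positions are all made up for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array.

    Illustrative sketch only -- the parser described above hashes
    Prolog terms; here we hash arbitrary strings instead.
    """
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one cryptographic hash.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive; never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Membership testing is a fixed number of hash-and-bit-test operations, and the whole structure is just the bit array - which is what makes the compile-time "write the array out as a C file" trick below possible.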

As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed, and then written as a C file. This is linked into the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed[3].

[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...

[2] Of course, a middle ground would be to assert the hashes as facts. But that is still nowhere near as efficient.

[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).


Jeremy Edberg mentioned in his talk on scaling reddit [0] that they used a bloom filter to provide fast (negative) lookup of whether or not a story or comment had been upvoted by the user. The negative case was important since it's often the case that the majority of stories or comments appearing on a page will not have been upvoted by the user.

[0] http://www.infoq.com/presentations/scaling-reddit


To be pedantic, what we did was take advantage of Cassandra's bloom filter.


WebKit uses a bloom filter to optimize CSS descendant selector matching: http://trac.webkit.org/changeset/77777


You could use them for rough spell-checking by hashing all the words in a dictionary. You would then need some additional data structure to handle false positives, but any word the filter rejects is necessarily not a member of the original dictionary. The BF would be much cheaper to store than the original dictionary, and checking each word would be pretty fast (though not necessarily faster than exact string lookups).
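A minimal sketch of that idea, assuming a small word list (the function names, the sizing constants, and the use of MD5 to derive probe positions are all illustrative, not from any real spell-checker):

```python
import hashlib

NUM_BITS = 1 << 20   # filter size; tune to dictionary size
NUM_HASHES = 4

def bit_positions(word):
    # Slice one 16-byte MD5 digest into four probe positions.
    digest = hashlib.md5(word.encode()).digest()
    return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % NUM_BITS
            for i in range(NUM_HASHES)]

def build_filter(dictionary):
    bits = bytearray(NUM_BITS // 8)
    for word in dictionary:
        for p in bit_positions(word):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def maybe_in_dictionary(bits, word):
    # False => definitely misspelled; True => probably correct.
    return all(bits[p // 8] & (1 << (p % 8)) for p in bit_positions(word))
```

A `False` answer is a guaranteed rejection; only `True` answers would need the secondary exact check mentioned above.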


Don't know what you consider simple, but databases use bloom filters to do faster joins for example. Chrome also uses them for malware detection I think.


Specifically, a place where DBs use bloom filters is distributed joins. This isn't a common scenario IMHO. Found a good description here:

http://www.coolsnap.net/kevin/?p=19

P.S. Loved the tutorial.


It's also useful in local hash joins because a Bloom filter check is faster than a regular hash table lookup, so you can use it to eliminate some rows before going to the big hash table.
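A toy sketch of that pattern, assuming simple (key, value) rows - the function names and the double-hashing scheme are invented for illustration, not taken from any real database engine:

```python
def bloom_bits(key, num_bits=4096, num_hashes=3):
    # Derive probe positions via double hashing; illustrative only.
    h1 = hash(key)
    h2 = hash((key, "salt"))
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def hash_join(build_rows, probe_rows):
    # Build side: hash table plus a Bloom filter over its keys.
    table = {}
    bits = [False] * 4096
    for key, val in build_rows:
        table.setdefault(key, []).append(val)
        for p in bloom_bits(key):
            bits[p] = True
    # Probe side: the cheap bit test discards most non-matching
    # rows before the more expensive hash table lookup.
    out = []
    for key, val in probe_rows:
        if all(bits[p] for p in bloom_bits(key)):
            for match in table.get(key, ()):
                out.append((key, match, val))
    return out
```

A false positive on the bit test just means one wasted hash table probe; the join result is unaffected.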


> Chrome also uses them for malware detection I think.

Used to! I had them in earlier versions of the article but they changed that code.


Do you know what they changed them to and why?


Used to! But it was quite a while ago that I noticed that they had changed it, I don't remember any longer.


To generalize, "possible match detected" for anything: you have a big, fat database about a bigger, fatter universe of identifiers, and a less-capable set of clients over a less-capable network link. They want a fast way of determining whether or not a particular identifier is worth querying the database for, and the false positives do not have a very high cost.

The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.


I've used Bloom filters to implement simple breakpoints in an emulator. It makes sense since checking for membership happens every emulator step, and is essentially constant time with a bloom filter.
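A sketch of that shape of use, assuming integer program-counter addresses (the class, the hash constants, and the 1024-bit sizing are made up here, not from any particular emulator):

```python
def bloom_positions(addr, num_bits=1024):
    # Two cheap integer hashes of the PC address; illustrative only.
    return [addr % num_bits, (addr * 2654435761) % num_bits]

class Emulator:
    def __init__(self, breakpoints):
        self.exact = set(breakpoints)   # ground truth for positives
        self.bits = [False] * 1024      # fast per-step pre-check
        for addr in breakpoints:
            for p in bloom_positions(addr):
                self.bits[p] = True
        self.pc = 0

    def step(self):
        # Hot path: the common "no breakpoint here" case costs a
        # couple of bit tests; only a (possibly false) positive
        # pays for the exact set lookup.
        if all(self.bits[p] for p in bloom_positions(self.pc)):
            if self.pc in self.exact:
                return "break"
        self.pc += 1
        return "ran"
```

Since almost every step hits no breakpoint, the filter's guaranteed-negative answer keeps the per-step cost near constant.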


Bitcoin nodes now use bloom filters to enable fast transaction lookups between peers.



