Bloom Filters by example (billmill.org)
167 points by gpsarakis on Oct 28, 2013 | 61 comments



Interesting anecdote: an alumnus of my university was chatting with me recently about his work at Google. Essentially, he replaced what he described as a "large, complex machine learning pipeline" with a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.

For example, if you search for "iPhone 5S", the filter determines whether to show something like this http://i.imgur.com/Dp3y1Gi.png (not sure if sponsored makes a difference here, possibly a bad example).


Are the EE shop sites ones you would have visited before the search?

> a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.

"have searched" is not clear to me. I was expecting Google would just spit out a JSON response with all the search query for page 1 (including sponsored ads). Can you please elaborate on this please?


> Are the EE shop sites ones you would have visited before the search?

No. I think you might be parsing his statement wrong - it's not "have searched" as in searched for previously, it's "have searched for a PRODUCT" as in, your search is for a product (something purchasable in a store, preferably an online store) as opposed to something else.

The Bloom filter figures out if your search is for a product, and if it is, puts that box w/ sponsored links to stores that sell that product up at the top of your results.


Jason Davies made a nice visual explanation of bloom filters: https://www.jasondavies.com/bloomfilter/


His demo is much smoother than mine, I'm jealous.


I always loved bloom filters, because to me, they showed something that was very fundamentally important.

Namely, that the amount of information required to say whether an element is in a set or not is significantly less than the set itself.

That is just weird, but amazing.

If someone asks me about weird and wonderful things in computer science, this is what I talk about.


Well, it doesn't tell you the element's membership definitively. It can only tell you that it might be in, or that it definitely isn't.

Like reviewing a passenger list where everyone's name is abbreviated to initials.


Now we just need HyperLogLog by example.

(HyperLogLogs are vaguely similar data structures in that you hash things a bunch to store them, except instead of answering set-membership queries, they're built for cardinality estimation, i.e. counting distinct elements.)
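For the impatient, here's a rough sketch of the idea in Python (heavily simplified: a fixed register count, the standard bias-correction constant, and none of the small/large-range corrections a real implementation needs):

    import hashlib

    # Simplified HyperLogLog: M registers, each holding the maximum
    # "leading-zero count + 1" seen among hashes routed to that register.
    M = 1024                            # number of registers (a power of two)
    P = 10                              # log2(M): bits used to pick a register
    ALPHA = 0.7213 / (1 + 1.079 / M)    # standard bias-correction constant for large M

    registers = [0] * M

    def add(item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h & (M - 1)                      # low P bits choose the register
        rest = h >> P                          # the remaining 54 bits are the "random" stream
        rank = 54 - rest.bit_length() + 1      # leading zeros in those 54 bits, plus one
        registers[idx] = max(registers[idx], rank)

    def estimate():
        # Harmonic mean of 2^register, scaled by ALPHA * M^2.
        return ALPHA * M * M / sum(2.0 ** -r for r in registers)

    for i in range(50000):
        add("user-%d" % i)
    print(round(estimate()))    # should land within a few percent of 50000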


Not in the style of the submitted article, but HyperLogLog was covered [0] last year in the Damn Cool Algorithms series on Nick's Blog [1]. Discussion on HN [2].

[0] http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...

[1] http://blog.notdot.net/tag/damn-cool-algorithms

[2] https://news.ycombinator.com/item?id=4488946


Well, there you are: a nice explanation of HyperLogLog, along with graphs, and a javascript demo to play with. And a tribute to Philippe Flajolet who came up with the algorithm.

http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-...

The Aggregate Knowledge blog is mind blowing for everything sketch related.


Is that demo widget custom, or did they use a tool to generate it? Do you know?


According to this comment http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-... , it's custom made and based on d3.


It would make the world a lot better if somebody made it easy to create that widget in IPython Notebook or some similar tool.

#IShouldDoThatButIProbablyWont


I hope you write it! I also imagine an encyclopedia of many different algorithm examples, each with demonstrations and graphical explanations. Bret Victor has talked about that sort of thing, and it greatly intrigues me.


I'm not sure what Bret Victor had in mind, but the second part of Skiena's text The Algorithm Design Manual already does a good job at this. Similar write-ups are available online [0], in less detail than the text, e.g. [1]. They all open with a pictorial representation of possible input and corresponding output to aid in scanning when you have a problem to solve. There is also the Encyclopedia of Algorithms [2], and the NIST Dictionary of Algorithms and Data Structures [3], with some history of it at [4].

[0] http://www.cs.sunysb.edu/~algorith/

[1] http://www.cs.sunysb.edu/~algorith/files/dictionaries.shtml

[2] https://www.springer.com/computer/theoretical+computer+scien...

[3] http://xlinux.nist.gov/dads/

[4] https://en.wikipedia.org/wiki/Dictionary_of_Algorithms_and_D...


Question: instead of using two hash functions, why not use just a single function, with a = hash(foo) and b = hash(foo+"something")? A well-distributed hash function should give drastically different results for a and b for different values of foo, which is what we want, correct?


It's standard to have at least two hashing functions. You can then simulate having x hash functions simply by combining those two.[1]

But, your question is entirely valid and I'm not really sure why you couldn't do that. It might simply be that the kinds of hash functions used in bloom filters (non-cryptographic) are not well distributed.

[1]: http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060...
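The trick in that paper is to build all k bit positions from just two base hashes, g_i(x) = h1(x) + i * h2(x) mod m, which is close in spirit to the hash(foo) / hash(foo+"something") idea above. A rough sketch (the two base hashes here are just two halves of a single SHA-1 digest, purely for illustration):

    import hashlib

    def bit_positions(item, k, m):
        # Kirsch & Mitzenmacher: g_i(x) = h1(x) + i * h2(x) (mod m)
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # keep h2 odd so it never collapses to one position
        return [(h1 + i * h2) % m for i in range(k)]

    print(bit_positions("iphone 5s", k=4, m=1000))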


Fascinating, thank you! I'll take a look at the paper when I get back home.


That's what I do, and it works just fine. I was going to raise this point but you beat me to it.


Chapter 26 of 'Real World Haskell' shows how to design a library by implementing a bloom filter: http://book.realworldhaskell.org/read/advanced-library-desig...


bloomd[1] is a "high performance c server for bloom filters" for those looking for a low-overhead and deployable implementation with quite a few clients, like Python and Ruby.

[1] https://github.com/armon/bloomd


Wow, I didn't realize they were this simple to implement and reason about!
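They really are. A toy version in Python is only a handful of lines (a sketch for illustration, not the article's code, and not tuned at all):

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: m bits, k hash-derived positions per item."""

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, item):
            # Derive k positions by re-hashing the item with a counter salt.
            for i in range(self.k):
                h = hashlib.sha1(("%d:%s" % (i, item)).encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            # False means "definitely not present"; True means "probably present".
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    bf = BloomFilter(m=10000, k=4)
    bf.add("hello")
    print("hello" in bf, "goodbye" in bf)   # True False (barring a false positive)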


This looks very cool; however, I was able to get lots of false positives using the first few letters of the alphabet as test data. Could you reduce these by using k hashes and k bit vectors with m bits, i.e. one bit vector for each hash algorithm, at the expense of a little extra computation?


That's probably because it's a tiny filter. In reality you'd calculate the size more appropriately:

http://hur.st/bloomfilter?n=4&p=1.0E-20

Here you can give it a false positive rate you're happy with and a number of elements. It'll give you the optimal size and number of hashes to use. Your suggestion would use far too much space.
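The formulas behind that calculator are short enough to write out: for n items and a target false-positive rate p, the optimal bit count is m = -n * ln(p) / (ln 2)^2 and the optimal number of hashes is k = (m/n) * ln 2. In Python:

    from math import log, ceil

    def bloom_parameters(n, p):
        """Optimal bit-array size m and hash count k for n items at false-positive rate p."""
        m = ceil(-n * log(p) / (log(2) ** 2))
        k = max(1, round((m / n) * log(2)))
        return m, k

    print(bloom_parameters(4, 1e-20))          # the tiny example from the link above
    print(bloom_parameters(1000000, 0.01))     # roughly 9.6M bits (~1.2 MB) and 7 hashes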


The expense is space, not so much the extra computation.

The nice thing about Bloom filters is that (if your hash functions are good) you can compute the vector length from the number of set elements and the false-positive probability that is acceptable in your application.


Google Chrome not only uses bloom filters for malicious URL filtering but also as part of the CSS selector matching process. I'm not sure how exactly, but allegedly it gives them a nice speed bump in the majority of cases without compromising on generality.


BFs are getting pretty popular in bioinformatics. The post mentions one example, here is my favorite recent one: http://minia.genouest.org/files/minia.pdf


Very cool – I hadn't seen this paper before (haven't done any real work on de novo assembly or otherwise requiring de Bruijn graphs).

Here's a paper using Bloom filters in metagenomic classification (my relative* area): http://bioinformatics.oxfordjournals.org/content/26/13/1595....

In this vein, a friend and I are researching/implementing a probabilistic key-value store for some bioinformatics applications (one is metagenomic organism and gene identification). It's fast and space-efficient, just like a BF (though obviously less so as it stores keys and not single-set-membership). Any use cases for that kind of thing in your sub-field? Always trying to figure out interesting new applications (we aren't ready to write it all up yet, but hope to at some point not too far down the road).

* I'm really just a dabbler / don't have a formal bioinformatics background. My friend is the genetics PhD.


Can you explain "probabilistic key-value store?" Would it be that each gene has some defined probability of belonging to a given organism, or is it probabilistic in the sense of having a defined error rate as BFs do?


The latter - probabilistic in the sense of having a defined error rate. In one of our use cases, the keys are simply kmers and the values are organism or gene IDs.


not sure, but would definitely take a look.


Contact info?


You'll find it here: http://tau.ac.il/~rozovr/


Coincidence: used this page on Friday to implement bloom filters for a kata at Software Craftsmanship 2013 (in Bletchley Park of all places).

Fun to implement, I'd recommend having a go. Here's our terrible clojure code: https://gist.github.com/timruffles/7195405


Interesting. I'd never looked into how BFs work before and this is a nice clear explanation.

One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?


This is kind of a fundamental part of bloom filters. They're probabilistic data structures, which is the trade you make for space efficiency.

If you get a false back, you can guarantee that some item is not in the filter. If you get a true back, you can be fairly certain it is. Once you have a degree of certainty, you can do a more thorough check for membership while cutting out large numbers of negatives.


I get the concept - you're trading space for certainty. It's more the bit where it says:

    "The more hash functions you have, the slower your bloom filter, and the quicker it fills up. *If you have too few, however, you may suffer too many false positives.*"
Just made me wonder - couldn't badly chosen hash functions actually give you more false positives?


I agree with what chamblin said, and to add one more point to that, you can choose the amount of uncertainty by using a larger or smaller bit array relative to the number of items you will have.


Another great Bloom Filter library not mentioned in this article is Bitly's implementation on GitHub[0]. It has the best of both worlds: a C implementation and higher-level wrappers (Golang, PHP, Python, and Common Lisp). If you don't want to play around with a fancier DBMS and really just want the damn filters, I would look here.

[0] https://github.com/bitly/dablooms


It is elliptically mentioned, via the link to this wonderful pull request: https://github.com/bitly/dablooms/pull/19


There is a nice summary of other interesting probabilistic data structures here: http://highlyscalable.wordpress.com/2012/05/01/probabilistic...


Last time I checked, the Cassandra implementation was really messed up. Another dev at our last company and I spent a few weeks debugging, adding test cases and fixing the code to get it to work correctly.


Cassandra user in production here. Could you elaborate a little bit on what was messed up (size, false positives etc.) and which version this was? Did you have trouble with Cassandra itself that you tracked down to the bloom filters, or were you trying to re-use the bloom filters in another project?


I'm only familiar with a Python clone of the Cassandra implementation ("hydra" -- used it a while back), but two issues I do remember are: 1) I believe it only uses 32-bit ints for the bit array addressing, so you can overflow it (and this also may be less-than-ideal from a hash distribution perspective, but I don't know offhand); and 2) as someone coming from a different background, I found the whole thing to have a bit too much "OO" magic, with several helper classes to set up the filter that (to me) obfuscate what's really going on.


Anyone got pointers to "simple" real-world use cases?


Any real-world use case where you want to check set membership when false positives are allowed.

E.g. one particular Dutch natural language parser uses a left-corner algorithm[1]. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).

This parser is written in Prolog and splines are represented as Prolog terms, so a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), it's not an ideal lookup:

- Asserting all the terms requires a lot of memory.

- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.

This is also an application where a false positive does not matter - it just means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since it purges fewer splines, you are not throwing away parses.

For this reason, we replaced the 'splines as Prolog facts' with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookup in a bit array. This change made this functionality faster and a lot more compact[2].

As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed and then written out as a C file. This is linked into the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is never computed at runtime[3].

[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...

[2] Of course, a middle way would be to assert the hashes as facts. But it's still not nearly as efficient.

[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).


Jeremy Edberg mentioned in his talk on scaling reddit [0] that they used a bloom filter to provide fast (negative) lookup of whether or not a story or comment had been upvoted by the user. The negative case was important since the majority of stories or comments appearing on a page will typically not have been upvoted by the user.

[0] http://www.infoq.com/presentations/scaling-reddit


To be pedantic, what we did was take advantage of Cassandra's bloom filter.


WebKit uses a bloom filter to optimize CSS descendant selector matching: http://trac.webkit.org/changeset/77777
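As I understand that change, the renderer keeps a Bloom filter of identifiers (tag names, ids, classes) of the current element's ancestors; a descendant selector can then be rejected quickly whenever an identifier it requires is definitely not among the ancestors. A very loose sketch of the idea in Python, nothing like WebKit's actual code:

    import hashlib

    def _hash(identifier):
        return int.from_bytes(hashlib.sha1(identifier.encode()).digest()[:4], "big")

    class AncestorFilter:
        # Bloom filter over identifiers (tag, #id, .class) of an element's ancestors.
        SIZE = 1 << 12

        def __init__(self):
            self.bits = [False] * self.SIZE

        def add(self, identifier):
            h = _hash(identifier)
            self.bits[h % self.SIZE] = True
            self.bits[(h >> 12) % self.SIZE] = True

        def may_contain(self, identifier):
            h = _hash(identifier)
            return self.bits[h % self.SIZE] and self.bits[(h >> 12) % self.SIZE]

    # Ancestors of the element being styled: <div id="main"><ul class="nav"> ... </ul></div>
    f = AncestorFilter()
    for ident in ["div", "#main", "ul", ".nav"]:
        f.add(ident)

    # Selector "section.article p" needs ancestors matching "section" and ".article".
    # ".article" is definitely absent, so the slow ancestor walk can be skipped.
    print(all(f.may_contain(i) for i in ["section", ".article"]))   # False -> fast reject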


You could use them for rough spell-checking by hashing all the words in a dictionary. You would then need some additional data structure to overcome false positives, but any word rejected would necessarily not be a member of the original dictionary. The BF would be much cheaper to store than the original dictionary and searching for each word would be pretty fast (though not necessarily faster than exact string lookups).
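Something like this, say (a toy sketch; the word-list path is just an assumption, and a plain set would stand in for the exact structure used to confirm positives):

    import hashlib

    M, K = 1 << 21, 7               # ~256 KB of bits, comfortable for a ~100-250k word list

    def positions(word):
        for i in range(K):
            d = hashlib.sha1(("%d:%s" % (i, word)).encode()).digest()
            yield int.from_bytes(d[:8], "big") % M

    words = open("/usr/share/dict/words").read().split()   # path is an assumption
    bits = bytearray(M // 8)
    for w in words:
        for p in positions(w):
            bits[p // 8] |= 1 << (p % 8)

    def maybe_a_word(w):
        # False means definitely not in the dictionary; True means "probably", so confirm.
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(w))

    print(maybe_a_word("filter"), maybe_a_word("fliter"))
    print("filter: %d bytes, word list: %d bytes" % (len(bits), sum(len(w) + 1 for w in words)))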


Don't know what you consider simple, but databases use bloom filters to do faster joins for example. Chrome also uses them for malware detection I think.


Specifically, a place where DBs use bloom filters is distributed joins. This isn't a common scenario IMHO. Found a good description here:

http://www.coolsnap.net/kevin/?p=19

P.S. Loved the tutorial.


It's also useful in local hash joins, because probing the filter is faster than a regular hash table lookup. So you can use it to eliminate some rows before going to the big hash table.
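Roughly this pattern, say, with a plain dict standing in for the real hash table (a sketch; in a database engine the win is that probe rows with no possible match never touch the big, possibly spilled, hash table):

    import hashlib

    M, K = 1 << 17, 4               # 16 KB of filter bits, 4 positions per key

    def positions(key):
        d = hashlib.md5(str(key).encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

    # Build side of the join: the smaller table, keyed by customer id.
    customers = {cid: "customer-%d" % cid for cid in range(0, 100000, 7)}

    bloom = bytearray(M // 8)
    for cid in customers:
        for p in positions(cid):
            bloom[p // 8] |= 1 << (p % 8)

    # Probe side: rows rejected by the filter skip the hash-table lookup entirely.
    joined = skipped = 0
    for order_cid in range(50000):
        if all(bloom[p // 8] & (1 << (p % 8)) for p in positions(order_cid)):
            if order_cid in customers:
                joined += 1
        else:
            skipped += 1

    print("%d rows joined, %d probe rows rejected by the filter" % (joined, skipped))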


> Chrome also uses them for malware detection I think.

Used to! I had them in earlier versions of the article but they changed that code.


Do you know what they changed them to and why?


I used to! But it was quite a while ago that I noticed they had changed it; I don't remember any longer.


To generalize, "possible match detected" for anything: you have a big, fat database about a bigger, fatter universe of identifiers, and a less-capable set of clients over a less-capable network link. They want a fast way of determining whether or not a particular identifier is worth querying the database for, and the false positives do not have a very high cost.

The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.


I've used Bloom filters to implement simple breakpoints in an emulator. It makes sense since checking for membership happens every emulator step, and is essentially constant time with a bloom filter.
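Concretely, something like this (a toy sketch: the step loop and addresses are made up, only the membership test is the point):

    import hashlib

    M = 1 << 12                     # a 4096-bit filter is plenty for a handful of breakpoints
    bits = bytearray(M // 8)
    breakpoints = set()             # exact set, consulted only on a filter hit

    def _positions(addr):
        d = hashlib.md5(addr.to_bytes(4, "little")).digest()
        return (int.from_bytes(d[:4], "big") % M, int.from_bytes(d[4:8], "big") % M)

    def add_breakpoint(addr):
        breakpoints.add(addr)
        for p in _positions(addr):
            bits[p // 8] |= 1 << (p % 8)

    add_breakpoint(0x8000)
    add_breakpoint(0x80F4)

    pc = 0x7FF0
    for _ in range(100):            # the emulator's main loop
        # Cheap check every step; nearly every address misses and falls straight through.
        if all(bits[p // 8] & (1 << (p % 8)) for p in _positions(pc)):
            if pc in breakpoints:   # confirm, since the filter can give false positives
                print("hit breakpoint at 0x%04X" % pc)
        pc += 4                     # stand-in for actually executing an instruction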


Bitcoin nodes now use bloom filters to enable fast transaction lookups between peers.


I had just put down this bloom filter tutorial by Jason Davies, and now another one comes up. It's a sign. www.jasondavies.com/bloomfilter/


As an example of real usage: in my Chromebook home folder, there are several "bloom" filter files for various predefined filtering in Chrome.



