The Jaccard similarity (or Jaccard index) mentioned in the article is used for finding similar items in a big set. E.g. Google News uses it for aggregating articles on the same subject.
Indeed, bloom filters work but, as with all things, do have limitations in practice. You have to make specific choices as to which type of bloom filter you will implement. You have to optimize for variables like n, m or k or you quickly begin to lose confidence in your responses. Also need to optimize for space vs speed.
Sergei Vassilvitskii (of Yahoo! Research) went over this first class of "COMS 6998-12: Dealing with Massive Data", http://www.cs.columbia.edu/~coms699812/ just a few weeks ago. Very interesting stuff. The class is dealing specifically with large data handling in both parallel processing and streaming techniques.
It looks as if a major component of the class is mathematical distillation of complex problems trading exact matching for acceptable probabilistic bounds. When you are able to do these math tricks you can cut an O(n^2) or worse problem down to O(n log n) or better problem which is the real win.
Good article. I like that it was more approachable than a journal paper -- as a developer, I am interested in these kind of algorithms but sometimes I don't even know where to begin searching. And as someone without a math PhD, I appreciate the simplifications and real-world applications.
I don't know what you have planned for the next article in the series, but I would recommend adding code snippets in a language of your choice. I would have liked to see your example implemented in python or ruby to make it more concrete.
I'm planning to discuss a couple of different techniques (using shingling to actually create the set elements from items is the next one) and then demonstrate them on a large dataset.
I'll start including some Python in the follow up ones.
"Suppose you have a huge number of items that you would like to group together by a fuzzy notion of similarity. Suppose the only tool available to you is a key-value store. Suppose you only have the resources to consider each object once. Never fear, simhashing is here!"
"It's been raining all day and you find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items are arriving every minute and you'd really like a way of finding similar items that already exist in your dataset (either for duplication detection or finding related items). Clearly we don't want to scan our entire existing database of items every time we receive a new item but how do we avoid doing so? Minhashing to the rescue!"
Heh, you're right, it does look pretty similar. I wasn't aware of that article until someone posted it on Disqus comments just now.
The article came out of my reading the data mining slides from the Ullman - Stanford course, maybe there's common terminology? I know I tried to keep it consistent with the wikipedia Jaccard Index.
It is clear that the author does not understand what a hashing function is. He says, "Your programming language of choice's API will almost certainly have a few acceptable ones within arm's reach", but that's not true. Firstly, what is a programming language's "API" ? Secondly: most languages that I'm aware of have no builtin hash functions; they are usually libraries (I know, I'm getting pedantic here). And thirdly: most hash functions that you'd find in a library don't return a floating point value that you can operate on with a "min" function (or even an int value). They typically return a string of 40-, 128 or 256 bits (depending on the function). So you'll need to map the hash returned by your hash function to a value that you can compare; for example, by grabbing the last 4 bytes and treating them as an unsigned integer.
Its obvious he means standard library by 'API'. Also the type returned doesn't need to be an integer, or even a number. If you need to write a min function then the type merely needs to form a partially ordered set. A string could fulfill this requirement - simply compare an integer representation of the first character if they're equal move on to the following character. Alphabetical ordering, taught to children at all good primary schools, is a restricted version of this comparison algorithm.
Hmm, I know it's a bit of an esoteric language, but Java has a method called "hashCode()" which is common to all objects, and does indeed return an int value.
That's points 2 and 3 covered, I'll leave it up to you whether or not you want to split hairs on what's part of Java's "API" (a sensible reader would probably take that as the Java Class Library in the case of Java).
Basically it is a probabilistic bitmap index for quickly telling if a certain elem is in a set. It is used in BigTable, Hbase etc.
http://en.wikipedia.org/wiki/Bloom_filter
The Jaccard similarity (or Jaccard index) mentioned in the article is used for finding similar items in a big set. E.g. Google News uses it for aggregating articles on the same subject.