Hacker News new | past | comments | ask | show | jobs | submit login
Finding similar items using minhashing (toao.com)
81 points by sadiq on Feb 14, 2011 | hide | past | favorite | 20 comments



In practice there is a certain kind of minhashing that is used in key-value stores, which is called a Bloom-filter.

Basically it is a probabilistic bitmap index for quickly telling if a certain elem is in a set. It is used in BigTable, Hbase etc.

http://en.wikipedia.org/wiki/Bloom_filter

The Jaccard similarity (or Jaccard index) mentioned in the article is used for finding similar items in a big set. E.g. Google News uses it for aggregating articles on the same subject.


Indeed, bloom filters work but, as with all things, do have limitations in practice. You have to make specific choices as to which type of bloom filter you will implement. You have to optimize for variables like n, m or k or you quickly begin to lose confidence in your responses. Also need to optimize for space vs speed.


Sergei Vassilvitskii (of Yahoo! Research) went over this first class of "COMS 6998-12: Dealing with Massive Data", http://www.cs.columbia.edu/~coms699812/ just a few weeks ago. Very interesting stuff. The class is dealing specifically with large data handling in both parallel processing and streaming techniques.

It looks as if a major component of the class is mathematical distillation of complex problems trading exact matching for acceptable probabilistic bounds. When you are able to do these math tricks you can cut an O(n^2) or worse problem down to O(n log n) or better problem which is the real win.


Good article. I like that it was more approachable than a journal paper -- as a developer, I am interested in these kind of algorithms but sometimes I don't even know where to begin searching. And as someone without a math PhD, I appreciate the simplifications and real-world applications.

I don't know what you have planned for the next article in the series, but I would recommend adding code snippets in a language of your choice. I would have liked to see your example implemented in python or ruby to make it more concrete.


Thanks.

I'm planning to discuss a couple of different techniques (using shingling to actually create the set elements from items is the next one) and then demonstrate them on a large dataset.

I'll start including some Python in the follow up ones.


Looks really, REALLY, similar to this article: http://knol.google.com/k/simple-simhashing

Hmmmm...


"Suppose you have a huge number of items that you would like to group together by a fuzzy notion of similarity. Suppose the only tool available to you is a key-value store. Suppose you only have the resources to consider each object once. Never fear, simhashing is here!"


"It's been raining all day and you find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items are arriving every minute and you'd really like a way of finding similar items that already exist in your dataset (either for duplication detection or finding related items). Clearly we don't want to scan our entire existing database of items every time we receive a new item but how do we avoid doing so? Minhashing to the rescue!"


Heh, you're right, it does look pretty similar. I wasn't aware of that article until someone posted it on Disqus comments just now.

The article came out of my reading the data mining slides from the Ullman - Stanford course, maybe there's common terminology? I know I tried to keep it consistent with the wikipedia Jaccard Index.


How much Jaccard similar ?



You can also find similar items using neural networks. There was an awesome talk about that at pycon 2009 : http://us.pycon.org/2009/conference/schedule/event/71/

Neural network data mining can also sometimes survive missing data or extrapolate from unknown instances.


You might want to take a look at http://code.google.com/p/scipy-cluster/ for some code in Python that allows systems like this to be built quickly.


Looking into it, I found a Ruby implementation:

https://github.com/jmhodges/minhash

Which depends on a Ruby implemenation of MurmurHash2:

https://github.com/jmhodges/murmur_hash

Anyone have any idea what the 23 is for?

  # 23 can be any unsigned 32-bit integer (i.e. from 0 to 2**32 - 1)
  hash_number = MurmurHash.murmur_hash("somestring", 23)


23 is the seed s/he's using for the hasing function it seems.


There are also a number of similarity search indexes available: http://obsearch.net/ http://www.sisap.org/Metric_Space_Library.html http://lshkit.sourceforge.net/


I'm sorry, but this article is poorly written. The Google Knol article, http://knol.google.com/k/simple-simhashing , is much better.

It is clear that the author does not understand what a hashing function is. He says, "Your programming language of choice's API will almost certainly have a few acceptable ones within arm's reach", but that's not true. Firstly, what is a programming language's "API" ? Secondly: most languages that I'm aware of have no builtin hash functions; they are usually libraries (I know, I'm getting pedantic here). And thirdly: most hash functions that you'd find in a library don't return a floating point value that you can operate on with a "min" function (or even an int value). They typically return a string of 40-, 128 or 256 bits (depending on the function). So you'll need to map the hash returned by your hash function to a value that you can compare; for example, by grabbing the last 4 bytes and treating them as an unsigned integer.


Its obvious he means standard library by 'API'. Also the type returned doesn't need to be an integer, or even a number. If you need to write a min function then the type merely needs to form a partially ordered set. A string could fulfill this requirement - simply compare an integer representation of the first character if they're equal move on to the following character. Alphabetical ordering, taught to children at all good primary schools, is a restricted version of this comparison algorithm.


Hmm, I know it's a bit of an esoteric language, but Java has a method called "hashCode()" which is common to all objects, and does indeed return an int value.

That's points 2 and 3 covered, I'll leave it up to you whether or not you want to split hairs on what's part of Java's "API" (a sensible reader would probably take that as the Java Class Library in the case of Java).





Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: