
I've implemented a MinHash-like system in BigQuery and was able to calculate cosine similarities (within a certain similarity threshold) between all Stack Overflow posts in reasonable time. To give you a broad idea of the procedure:

1. Concatenate all text fields and split the result into an array of ngrams (2...n chars)

2. Declare two global arrays A and B of length n and fill them with random 32-64 bit integers

3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and take the result modulo each random value in array B. Take the minimum value across all ngrams. Per row, your aim is to end up with an array of 'minhashed' integers of equal length to the arrays in step 2: if you declare the global arrays to be of length 64, the minhashed array for each row will also be of length 64.

4. Bucketise the resulting signature arrays by summing N consecutive minhash values using a window function (e.g. sum each group of 4 consecutive values); see the sketch after this list.
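To make steps 1-4 concrete, here is a minimal Python sketch of the same idea (the actual implementation was BigQuery SQL; the 64 hash functions, bands of 4, 3-char ngrams and the MD5-based ngram hash are just assumptions for illustration):

    import hashlib
    import random

    # Assumed parameters: 64 hash functions, bands of 4 consecutive values.
    NUM_HASHES = 64
    BAND_SIZE = 4
    MAX_HASH = 2 ** 32

    random.seed(42)
    A = [random.randrange(1, MAX_HASH) for _ in range(NUM_HASHES)]
    B = [random.randrange(1, MAX_HASH) for _ in range(NUM_HASHES)]

    def ngrams(text, n=3):
        # Step 1: split the concatenated text fields into character ngrams
        # (assumes the text is at least n characters long).
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def ngram_hash(gram):
        # Step 3a: hash each ngram to an integer (stand-in for a SQL hash function).
        return int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "big")

    def minhash_signature(text):
        # Step 3b: for each (a, b) pair take min((hash * a) % b) over all ngrams,
        # giving a signature of the same length as A and B.
        hashes = [ngram_hash(g) for g in ngrams(text)]
        return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

    def buckets(signature):
        # Step 4: sum non-overlapping groups of BAND_SIZE consecutive values,
        # yielding one bucket id per band (one reading of the windowed sum).
        return [sum(signature[i:i + BAND_SIZE])
                for i in range(0, len(signature), BAND_SIZE)]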

If all went well, you can now unnest these arrays (call them 'source rows') and self-join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.
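In Python terms, that unnest + self-join + group-by amounts to indexing each row under its (band, bucket) pairs and counting shared buckets per pair; the rows dict here is just a stand-in for the table:

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(rows):
        # rows: dict of row_id -> concatenated text.
        # Index every row under each (band, bucket value) it lands in, then
        # count how many buckets each pair shares -- the self-join + group-by.
        bucket_index = defaultdict(list)
        for row_id, text in rows.items():
            for band, bucket in enumerate(buckets(minhash_signature(text))):
                bucket_index[(band, bucket)].append(row_id)

        collision_counts = defaultdict(int)  # (source, target) -> shared buckets
        for members in bucket_index.values():
            for src, tgt in combinations(sorted(members), 2):
                collision_counts[(src, tgt)] += 1
        return collision_counts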

In essence, the more often two items hash to the same bucket, the more similar they are, and it's up to you to define the cut-off beyond which you calculate actual pairwise Jaccard (or cosine) similarities between items.
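A sketch of that last filtering step, with made-up thresholds, built on the helpers above:

    def jaccard(text_a, text_b, n=3):
        # Exact Jaccard similarity on character ngram sets.
        a, b = ngrams(text_a, n), ngrams(text_b, n)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def similar_pairs(rows, min_collisions=2, min_jaccard=0.6):
        # Thresholds are illustrative; only pairs that collided in at least
        # min_collisions buckets get the exact pairwise check.
        results = []
        for (src, tgt), count in candidate_pairs(rows).items():
            if count < min_collisions:
                continue
            sim = jaccard(rows[src], rows[tgt])
            if sim >= min_jaccard:
                results.append((src, tgt, sim))
        return results

Calling similar_pairs on a small dict of row_id -> text should return (source, target, similarity) tuples for the pairs that survive both the collision cut-off and the exact check.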
