One thing many people miss about set based metrics like the jaccard similarity (aka Tanimoto coefficient) and F1 score (aka Dice coefficient), is that they can also be used identically with fuzzy sets.
The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise what most people do instead is to simply threshold at 0.5 to obtain binary sets for use with the binary variants of the jaccard / dice coefficients. Which apparently decreases the precision of your validation operator by 2 orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...
For deduplicating people, the Fellegi Sunter model is also a powerful approach. Splink[0] is a free Python library that implements this for big datasets. Probably you could combine parts of both approaches as well. Full disclosure: I am the lead author.
I've also written up some interactive tutorials on how the method works [1] if anyone's interested
The Fellegi Sunter model is able to estimate the importance of different types of information from the data itself (i.e. unsupervised learning).
For instance, a match on a date of birth column lends a greater weight of evidence in favour of a match than a match on first name (since dob has higher cardinality).
The method is also able to estimate weights for fuzzy matches (how much evidence in favour of a match is close match on dob with one character difference), and also how much evidence against a match a mismatch is.
For instance, if you have very high data quality on gender, then a match on gender doesn't tell you much, but a mismatch on gender is quite strong evidence against the idea two records match.
It doesn't sound like the approaches are incompatible. You can use minhash LSH to search a large set and get a top-k list for any individual, then use a weighted average with penalty rules to decide which of those qualifies as a dupe or not. Weighted minhash can also be used to efficiently add repeats to give some additional weighting.
The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by
(1) The blocking approach you choose (how wide you cast the net in searching for matches. This is actually somewhere were minhash can be used in conjunction
(2) whether you choose to use complex fuzzy matching functions and how many - this is chosen by the user.
Overall it's an approach which makes a pretty good trade off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Aus, Germany etc.) - because it's capable of working on country-population sized datasets.
There is also gaoya ( I am the author ), which is also written in Rust and has python bindings. datasketch is great, but it is not performant enough for my use case. gaoya is used in a production system for large scale clustering
https://github.com/serega/gaoya
Holy synchronicity Batman! I just implemented a minhash system that some may find interesting.
The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.
Certain matrix identities (Woodbury, banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.
So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.
This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.
In my particular implementation I did a “multi-resolution” hash so that I can change the selectiveness of the search as the number of prior computed inverses grows
a little backstory (that seems to be missing from the blog post). this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in jeffrey ullman's free book "mining massive datasets" which describes many of the really cool and impressive techniques that were used to prepare an index build for the entire internet when such a thing was really hard.
you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"
enjoy!
edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well.
It's not really complete, I'd like to show Jaccard Similarity calculations among other things, but this already allows you to enter multiple strings and see with your own eyes what a "minhash" actually is.
Hashing or tiny neural nets combined with a Vector Search engine with Tanimoto/Jaccard is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.
There is a nice Google project using 0.5 M parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
I have this problem right now in postgres. I have 600000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer) The content and summary columns especially for some of the news items are highly similar but not equal. For any given 2 such news items, I am trying to cut them down to 1. Any ideas?
I've implemented a minhash-like system in Bigquery, and was able to calculate cosine similarities (within a certain similarity threshold) between all Stack Overflow in reasonable time. To give you a broad idea of the procedure:
1. Concatenate and split all text fields into an array of ngrams (2...n chars)
2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers
3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and modulo the resulting output with each random value in array B. Take the minimum value. Per row, your aim is to end up with an array of 'minhashed' integers of equal length to the arrays in step 2. If you declare the global arrays to be length of 64, the minhashed array for each row will also be of length 64.
4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive rows)
If all went well, you can now unnest these arrays (call them 'source rows') and self join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by source, target columns and count the occurances to get an indication of how likely two rows are similar.
In essence, the more often two items hash to a similar bucket, the more similar they are, and it's up to you to define the cut-off from where to calculate actual pairwise Jaccard (or cosine) similarities between items.
love this and have been using tf/idf for embeddings and various measures of similarity for some personal pet projects. one thing i came across in my research is that cosine similarity was more useful for vectors of different lengths and that euclidean distance was useful for vectors of similar length but simon alludes to a same-length requirement. i’m not formally trained in this area so i was hoping someone could shed some light on this for me.
You can use cosine similarity with embedding vectors of different lengths (or better, the vectors have all the same length, but they are sparse with most components being 0).
You can use min hash to avoid the full O(N^2) distance matrix, but with just 600000 items you might just brute force compute the full matrix for simiplicity. What's your time budget?
Love the article! My team at nvidia recently released a GPU accelerated version of the fuzzy deduplication algorithm described, and I figure this community might be interested.
This is one of those techniques I don't understand when I read it as text, but will instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the working process.
I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimiztion would work better.
i think the key insight, at least for me, is that if you break things into piles of lots of little pieces and then you come up with n different ways of ordering those piles, similar things are going to have the same piece show up on top across those orderings.
add in the banding stuff and some simple probability and voila! a cheap and embarrassingly parallel way to approximate jaccard similarity across huge datasets.
As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, and then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?
Using an LLM is just one of the ways to generate embedding. To do k-means you still need to pick a distance function, like jaccard; k-means is probably not ideal for near duplicates, and you can use min-hash to speed up k-means as a pre-pass.
I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.
I was picturing doing the deduplication as a map-reduce process over a huge (petabytes) dataset, where every worker is blind to what embeddings other workers have already generated. In such a case, shoving the embeddings generated by each worker into a shared vector DB, and having it (maybe incrementally) clustering the vectors as it receives them, would be acting as the "reduce" step.
Each time you embed a document you search for approximate nearest neighbours before adding it, so it is O(N) like MinHash.
Vector indexes like HNSW and PQ have better performance/quality tradeoffs than SimHash LSH which is the analogue of MinHash for cosine distance.
The quality depends on what you mean by near duplicate and the embedding model you use. Current models work well, and if you have labelled data you can fine tune them to be better.
The main drawback is the additional cost of embedding all the documents, especially for longer documents. But this cost has dropped really quickly with smaller models, better optimisations, and faster hardware.
If this is true, then why do you think that — as the OP article states — the developers of GPT3 chose to use non-ML-based techniques to deduplicate their dataset, when they would be the most equipped to use ML-based approaches?
Just the pure compute cost of needing to run an ML encoder over petabytes of data?
Or maybe because for their use-case — eliminating redundancy to reduce total dataset size and therefore training time — a non-domain-specific vectorization with a high-false-negative cluster-discovery rate was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so get slightly more training time? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully-general technique later in the pipeline for countering that.)
It’s worth noticing that this definition is not in general transitive: we may well have three documents A,B,C such that S(A,B)≥S_{crit} and S(B,C)≥S_{crit} but S(A,B)<S _{crit}
That last S(A,B) should be S(A,C).
(Unless I'm very much mistaken ... and I could be)
I had to do this with sport teams but I used levenshtein for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).
there was a really simple string sort by similarity oneliner I can't find in my history, but the basic idea was to take a wordlist, split each word in half, reverse the halves, concatenate them back together, sort the resulting strings, then flip the strings back to their original words.
this simple shuffle was a poor man's similarity sort that I used to find sound-alike words. the thing about the quality of similarity is that it's hard to know for sure if it worked, but this was sufficient.
Do the minhash-based algorithms in the article give exactly the same results (but faster) as pairwise calculation of J(a, b), or are the minhash-based results approximate?
The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise what most people do instead is to simply threshold at 0.5 to obtain binary sets for use with the binary variants of the jaccard / dice coefficients. Which apparently decreases the precision of your validation operator by 2 orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...
[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...