
The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by:

(1) The blocking approach you choose (how wide you cast the net in searching for matches). This is actually somewhere where minhash can be used in conjunction.

(2) Whether you choose to use complex fuzzy matching functions, and how many. This is chosen by the user.
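
To make those two points concrete, here's a rough sketch in plain Python. It is not the library's actual API: the field names, the m/u probabilities and the helper functions are all made up for illustration. It assumes a Fellegi-Sunter style model where each field comparison contributes log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement, with exact-equality comparisons only. Blocking decides which pairs get scored at all, and scoring a pair is just a sum.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical parameters (assumed, not estimated from real data):
# m = P(field agrees | records match), u = P(field agrees | records do not match)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.9, "u": 0.005},
    "dob": {"m": 0.95, "u": 0.001},
}

def match_weight(field, agrees):
    """Log2 Bayes factor contributed by one field comparison."""
    p = PARAMS[field]
    if agrees:
        return math.log2(p["m"] / p["u"])
    return math.log2((1 - p["m"]) / (1 - p["u"]))

def score_pair(a, b):
    """Prediction for a pair: add up the per-field match weights."""
    return sum(match_weight(f, a[f] == b[f]) for f in PARAMS)

def candidate_pairs(records, block_key):
    """Blocking: only compare records that share the same blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"first_name": "jon", "surname": "smith", "dob": "1990-01-01"},
    {"first_name": "mary", "surname": "jones", "dob": "1985-05-05"},
]

# Blocking on surname scores 3 pairs instead of all 6 possible pairs.
pairs = list(candidate_pairs(records, lambda r: r["surname"]))
scores = [score_pair(a, b) for a, b in pairs]
```

A looser blocking key casts a wider net (more pairs to score, fewer missed matches), and swapping the equality check for fuzzy comparison functions makes each pair more expensive to score. Those are the two knobs described above.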

There are some benchmarking results here: https://www.robinlinacre.com/fast_deduplication/

Overall it's an approach that makes a pretty good trade-off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Aus, Germany etc.): it's capable of working on country-population-sized datasets.



