Hacker News new | past | comments | ask | show | jobs | submit login

Very nice! If you're looking for one more variant to add: LD with a limit on the acceptable distance can be done more efficiently, and is nearly always what you want in practice.

I wrote a commented version years back that might be a handy reference; it's since been deprecated because we moved it out of StringUtils, but the original code is here https://github.com/apache/commons-lang/blob/master/src/main/...




> This implementation only computes the distance if it's less than or equal to the threshold value, returning -1 if it's greater. The advantage is performance: unbounded distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only computing a diagonal stripe of width 2k + 1 of the cost table. It is also possible to use this to compute the unbounded Levenshtein distance by starting the threshold at 1 and doubling each time until the distance is found; this is O(dm), where d is the distance.

That's pretty cool, especially the doubling scheme. I'm using a modified form of Levenshtein Distance for comparing lines when diffing files, and that's pretty expensive since code files that are thousands of lines long are not uncommon. Since you are usually comparing one file to another version of itself, the differences are often small though, so an incremental approach would really pay off.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: