I've worked on this problem before, and it seems like there are also simpler (and less memory-intensive) ways of speeding up the computation. For example, you can use properties like string lengths (or character distributions within strings) to get results that are just as good, if not better. If you have a corpus of 1 million strings and you want to find all strings within an edit distance of 2 of a query, you can automatically eliminate any strings in the corpus whose length differs by more than 2 from the length of the query string. Also, if you are willing to store 1-2 bytes of metadata about each string, you can use that metadata to very quickly compute lower bounds on edit distances, avoiding the O(N^2) distance computation whenever the lower bound is greater than your maximum tolerable distance. A good example of metadata would be to store (for each string in the corpus) the number of characters with ASCII codes that are 0 mod 5, 1 mod 5, ..., 4 mod 5.
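To make the length filter concrete, here's a minimal sketch; `levenshtein` stands in for whatever edit-distance implementation you already have, and the function name is just mine:

    # Minimal sketch of the length-based pre-filter: any candidate whose
    # length differs from the query's by more than the tolerance cannot be
    # within that edit distance, so skip the expensive computation for it.
    # `levenshtein` is assumed to be an existing edit-distance function.
    def candidates_within(corpus, query, max_dist, levenshtein):
        qlen = len(query)
        for s in corpus:
            if abs(len(s) - qlen) > max_dist:
                continue  # cheap rejection: length gap alone exceeds the budget
            if levenshtein(s, query) <= max_dist:
                yield s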
What information about 'character distributions within strings' would be useful?
The idea of Levenshtein distance constituting a metric space is something I'd never thought of. I implemented a simple "Did You Mean?" using L.D. in Ruby/C but didn't build this type of structure, and now that I've read this it seems obvious.
Let's say you plan to keep 2 bytes/16 bits of metadata for each string in the corpus. You can use those 16 bits to store five 3-bit counters for the number of characters in a string whose ASCII values are 0, 1, ..., 4 mod 5. Each counter can go up to 7 (binary 111), at which point further increments are no-ops (so a string with 7 characters whose ASCII value is 0 mod 5 will have the same value in the first counter as a string with 20 characters whose ASCII value is 0 mod 5).
When you want to see if the Levenshtein distance between two strings is less than some threshold (say 2), you first compare the two sets of 5 counters. The difference in values for corresponding counters gives you a lower bound on the Levenshtein distance, since a single edit can change any one counter by at most 1. For example, if the first string has 6 characters that are 1 mod 5, and the second string has 2 characters that are 1 mod 5, then your edit distance will be at least 4, so if you were only looking for strings within a distance of 1, you no longer need to bother computing Levenshtein for your two strings. Comparing counters is just a few bit operations, so it's super fast.
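A rough sketch of what the packing and the lower-bound check might look like, assuming the five saturating 3-bit counters described above (the function names are mine, not from the comment):

    # Sketch: pack five saturating 3-bit counters (characters whose byte
    # value is 0..4 mod 5) into one 16-bit integer, and compare two such
    # signatures to get a lower bound on the Levenshtein distance.
    def signature(s):
        counts = [0] * 5
        for ch in s:
            bucket = ord(ch) % 5
            if counts[bucket] < 7:        # saturate at 7 (binary 111)
                counts[bucket] += 1
        sig = 0
        for i, c in enumerate(counts):
            sig |= c << (3 * i)           # 5 counters * 3 bits = 15 bits, fits in 16
        return sig

    def lower_bound(sig_a, sig_b):
        # Each edit changes any one counter by at most 1, so the largest
        # per-counter difference is a lower bound on the edit distance.
        # Saturation can only weaken the bound, never make it incorrect.
        lb = 0
        for i in range(5):
            a = (sig_a >> (3 * i)) & 0b111
            b = (sig_b >> (3 * i)) & 0b111
            lb = max(lb, abs(a - b))
        return lb

    # Usage: skip the full Levenshtein computation whenever
    # lower_bound(signature(s), signature(query)) > max_dist.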
That reminds me of the Boyer-Moore string search algorithm. One way to implement B-M is with a bitmap of the characters in the query string, so that regions of the search space that can't possibly match the query will be skipped efficiently.
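Not the full Boyer-Moore machinery, but a sketch of just that bitmap-skip idea: if the text character under the end of the current alignment never appears in the pattern, every alignment overlapping it can be skipped. The function name is made up for illustration:

    # Sketch of the skip idea: precompute a bitmap of the byte values that
    # appear in the pattern; when the text character under the pattern's last
    # position isn't in the pattern at all, no alignment overlapping that
    # position can match, so jump the whole pattern length past it.
    # This is only the "bad character is absent" shortcut, not full Boyer-Moore.
    def find_with_bitmap_skip(text, pattern):
        m = len(pattern)
        if m == 0:
            return 0
        in_pattern = bytearray(256)          # crude bitmap: one flag per byte value
        for ch in pattern:
            in_pattern[ord(ch) & 0xFF] = 1   # collisions only make the skip conservative
        i = 0
        while i + m <= len(text):
            if not in_pattern[ord(text[i + m - 1]) & 0xFF]:
                i += m                       # character can't appear anywhere in a match
                continue
            if text[i:i + m] == pattern:
                return i
            i += 1
        return -1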
The quick-select algorithm is a really simple and effective way of selecting the k-th smallest element and one that would be nice to cover. It's especially useful in situations where you need O(n) selection of the median element.
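For reference, a minimal (not in-place) quickselect sketch with a random pivot and 0-indexed k:

    import random

    # Sketch of quickselect: pick a pivot, three-way partition, and recurse
    # only into the part that contains the k-th smallest element (0-indexed k).
    # Expected O(n) time with a random pivot; worst case is O(n^2).
    def quickselect(items, k):
        if not 0 <= k < len(items):
            raise IndexError("k out of range")
        pivot = random.choice(items)
        less    = [x for x in items if x < pivot]
        equal   = [x for x in items if x == pivot]
        greater = [x for x in items if x > pivot]
        if k < len(less):
            return quickselect(less, k)
        if k < len(less) + len(equal):
            return pivot
        return quickselect(greater, k - len(less) - len(equal))

    # e.g. median of an odd-length list: quickselect(xs, len(xs) // 2)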
Both the article and other sites (https://issues.apache.org/jira/browse/LUCENE-2230 is an example) suggest that 3-20x performance improvements are typical.