Damn Cool Algorithms: Levenshtein Automata

sqrt17 · on July 29, 2010

Actually, if you want to see whether the edit distance (Levenshtein) is within a fixed bound, say k, it's no longer O(mn), but around O(kn) (since you're only ever filling a narrow band of width ~k along the diagonal instead of the whole rectangle).

Nonetheless, the DFA construction is priceless if you want flexible matching against a dictionary.

viggity · on July 29, 2010

I used Levenshtein Automata when I wrote Atomiq Duplicate Code Finder (http://www.getatomiq.com). It is some pretty cool stuff!

Confusion · on July 29, 2010

Something I wonder about sometimes: how did you know about the existence of this algorithm? Do you:

- Simply know about it, because it was mentioned once, somewhere in a book on algorithms or a class and you remembered it when you knew what you needed?

- Knew what you needed and searched for it. If so, how do you know where to search and what terms do you use?

- Rediscover the algorithm, because you knew what you needed, implemented it and recognized it only later?

I don't have a formal CS background and I sometimes get stuck on: I know what I need and I'm sure someone has implemented it before (and better), but I don't know how to get at that information.

sqrt17 · on July 30, 2010

There's several central ideas in CS that most people know. (Not everyone knows every single of them, but most people know many of them). One of them is the idea of finite-state automata. If you have a finite set of strings, or a regular language, you can build a DFA to recognize that.

Once you know know about DFAs, and you realize that there are only finitely many words that have a Levenshtein distance of less than k, it's just a question of (i) practicality and (ii) technical details. But when you have the idea to combine concepts X and Y, you know what terms to search for and what papers you want to look at.

As to practicality, there's potentially many words that have a Levenshtein distance of at most two for some word, but you don't care about the bits that are different, so the automaton is much smaller.

Part of being a successful CS background is to know all the kinds of hammers that are around and having a feel for the right hammer to use given a particular nail.

In this case, the hammers are abstractions. Algorithms are fundamentally driven by abstractions into which you then fill necessary technical detail - getting the technical details wrong may leave you with a correct but inefficient algorithm at worst (but enable you to recognize a more efficient way when you come across it), whereas you're bound to reinvent a new duct-taped makeshift wheel wherever you go when you don't know the right abstractions.

Confusion · on July 30, 2010

Thanks for your explanation. It's a reasonably comforting answer, since it suggests that as long as I spend time picking up concepts (which I have been doing ever since I went into software engineering) and connecting my problems to those concepts, there is a reasonable chance I will stumble upon a beautiful solution like a Levenshtein automaton, if it exists for the problem at hand.

viggity · on July 30, 2010

I read a lot, that is how I know about most of the algorithms in my arsenal. But in this particular instance we dealt with Levenshtein Distance in a cryptography graduate class at Iowa State and I've used it in a couple other places other than Atomiq.

I'd recommend reading as much as you can. Go out to wikipedia and start with the article on Donald Knuth, anything cool worth noting about algorithms isn't more than 2 or 3 links removed from his page.

chime · on July 29, 2010

Atomiq is awesome! Could you make it work with PHP files (.php, .class, .inc)?

viggity · on July 29, 2010

:) We've been getting a lot of requests for PHP, it is next on our list.

moultano · on July 29, 2010

Saving this to read for later. I just wanted to say in advance that it's really refreshing to see a blog so well designed that I feel like I've _already_ clicked my Readability bookmarklet when I first arrive.

eru · on July 29, 2010

Hmm, the first pictures (that floats right from the text) touches the text. That's a bit ugly. Otherwise, it's a nice and readable design.

caf · on July 29, 2010

This is one of those algorithms that's so clever that the end result almost feels like magic.

torpor · on July 29, 2010

Great article, and as it happens right now I'm using this algorithm in a hardware MIDI sequencer I'm working on .. it works very well for phrase-based recognition of patterns in a MIDI context, also ..

Confusion · on July 29, 2010

I've posed the same question to viggity, but in case you only see this in your 'threads' view: how did you know about the existence of this algorithm? Do you:

- Simply know about it, because it was mentioned once, somewhere in a book on algorithms or a class and you remembered it when you knew what you needed?

- Knew what you needed and searched for it. If so, how do you know where to search and what terms do you use?

- Rediscover the algorithm, because you knew what you needed, implemented it and recognized it only later?

I don't have a formal CS background and I sometimes get stuck on: I know what I need and I'm sure someone has implemented it before (and better), but I don't know how to get at that information.

torpor · on Aug 3, 2010

I first encountered the Levenshtein distance at work (where I am part of a team developing SIL-4 level operating systems for use in the transportation industry), as part of our coding rules which describe the differences in variable names allowed (e.g. "char fic; char foc; // ") .. After I looked up the algorithm on my own intuition, I realized it had applicability to MIDI sequences in unique ways, and thats how it came about.