
I was once asked in an interview how I'd find a row in a relational database where the value in one column was the shortest levenshtein distance away from a search key, and I was absolutely stumped when I realized I didn't really know a sort/search that optimized that way. Of course, by the time I realized I could email the interviewer for the answer an awkward amount of time had passed.

So now I ask HN: is there any "good" way to do this short of some kind of answer like "find a database that natively offers levenshtein distance search"? Are there actually sorting algorithms for sorting by Levenshtein distance (other than just using the distance as the comparison function in a traditional sort)? Are there search algorithms that work well for searching by Levenshtein distance?




Depending on the RDBMS and the target text (ideally, you’re looking to compare an entire column to the search text, and in a known character set, and known casing, etc.), I’d take one of two approaches:

1) write a UDF—in C, or for SQL Server, in pretty much any managed language—that returns the Levenshtein distance between a passed param and a column value, with a “max distance” param to aid in optimizing (e.g., maybe it doesn’t bother calculating the distance if the length difference already exceeds the max Levenshtein distance; a rough sketch of this is at the end of this comment)

Then you could write a query that looks something like:

    SELECT product, description FROM products WHERE levenshtein('fobar', product, 2) <= 2
-or-

2) if you only care about a distance of 1 or 2, create a set of secondary tables where you’d pre-calculate every variant of every word within that distance, with a FK pointing back to the PK of the actual row.

Alternately, a combination of a Levenshtein distance function and an RDBMS that already supports fuzzy text matching seems (to me) to be ideal.
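
For concreteness, here is a minimal sketch of the capped distance function from option 1, written as plain Python rather than as an actual UDF; the function name and early-exit details are just assumptions:

    # Sketch: Levenshtein distance with a cap, as in option 1 above.
    # Returns max_dist + 1 as soon as the strings cannot be within max_dist,
    # so a caller can filter with levenshtein(a, b, k) <= k.
    def levenshtein(a, b, max_dist):
        if abs(len(a) - len(b)) > max_dist:
            return max_dist + 1                 # cheap length check, no DP needed
        prev = list(range(len(b) + 1))          # standard Wagner-Fischer first row
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            if min(cur) > max_dist:             # whole row already over the cap
                return max_dist + 1
            prev = cur
        return prev[-1]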


I think sometimes it is easy to forget that SQL is but an abstraction over algorithmic processing and shaping of data. A UDF is only a way to extend the built-in algorithms, albeit one which may not be fully integrated with a given query engine architecture and therefore cannot be fully optimized without access to the engine's internals.

I would argue your answer is the best as far as doing an actual Levenshtein distance calculation goes.

That being said, sort orders are implemented algorithmically by the query engine. ORDER BY in SQL Server, as I recall, is implemented with up to 7 different built-in algorithms, depending on the case. But those algorithms are optimized for the engine and have access to the internals.

I can think of two interesting ways to look at this. One is how you could use an open source database to extend the built-in sort algorithms in the query engine, perhaps controlled by collation settings, to get a native Levenshtein sort that has access to the internals. Collations define the rules for sort order, though this would probably apply within a column, not against a value (although if you’re extending the query engine, why not). See [1] for an example of collation settings and how they control the rules of sort order; imagine a SET command to define Levenshtein order plus a native extension to the Postgres query engine, and now you can sort a column by Levenshtein distance.

The other thing I was thinking of was word vectors, but this is similar to Levenshtein distance in that it still requires something special in a relational database. Maybe some of the full-text search implementations in databases could do this out of the box.

[1] https://docs.microsoft.com/en-us/sql/relational-databases/co...


I don't think that there is a way to index by levenshtein distance generally because there is a many-to-many mapping between search terms and keys to match and there is potentially an infinite number of either.

What I've seen databases that need to do fuzzy matching on strings do more often is index a list of substrings of keys and match on those. So hypothetically if I had a row identified with the string 'abc', I would have a separate table with the strings: 'a', 'b', 'c', 'ab', 'bc', 'abc' all mapped to the row in question and an index built on them.

If you had a cap on the edit distance you knew was acceptable for searches, you could do something similar with levenshtein distance, where you could save every string with an edit distance less than say, 3, to a separate table along with the edit distance. This would be expensive in terms of space though, if your keys were very large. I suppose you could optimize it by just pre-computing the distance on a prefix (say the first three or four characters) and using that to narrow the search space.
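
A rough sketch of how such a side table could be populated, assuming a lowercase ASCII alphabet and a cap of 1 (distance 2 would just apply the same expansion to each variant):

    # Sketch: every string within Levenshtein distance 1 of `word`, assuming
    # a lowercase ASCII alphabet. Each variant would go into the side table
    # with its edit distance and a FK back to the original row.
    import string

    def variants_within_1(word):
        alphabet = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes     = {left + right[1:] for left, right in splits if right}
        substitutes = {left + c + right[1:]
                       for left, right in splits if right for c in alphabet}
        inserts     = {left + c + right for left, right in splits for c in alphabet}
        return deletes | substitutes | inserts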


Isn't that an n-gram search? I think postgres can do this.


I know that Elasticsearch does it natively. I'm not sure about Postgres. SQL Server is my main RDBMS currently.


yep - using the pg_trgm extension. https://www.postgresql.org/docs/current/pgtrgm.html


Wouldn't call it 'good' but it would be better than applying L on each row.

Instead, enumerate all unique variants of the search term, calculate each one's Levenshtein distance, and then search for a match, doing it one by one starting from distance 0 up to the max and returning the first match.

If the dominant variable in the complexity is the number of rows, then it's better to enumerate all the variants up front and do a single search over all rows, ordering results by distance and returning only the first match.
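
A sketch of the first variation (probe cheapest distances first), assuming a hypothetical exact-match index probe lookup_exact() and a distance-1 variant generator like the one sketched elsewhere in the thread:

    # Sketch: probe an exact-match index with every variant of the search
    # term, cheapest distance first, and stop at the first hit.
    # lookup_exact() and variants_within_1() are assumed helpers.
    def fuzzy_find(term, max_dist, lookup_exact):
        seen = {term}
        ring = [term]                      # new strings at the current distance
        for dist in range(max_dist + 1):
            for candidate in ring:
                row = lookup_exact(candidate)
                if row is not None:
                    return row, dist
            ring = [v for c in ring for v in variants_within_1(c) if v not in seen]
            seen.update(ring)
        return None, None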


I wonder if the Wagner-Fischer algorithm could be vectorized using substring functions... It doesn't seem trivial. https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...


Postgres has a function to calculate levenshtein distance.


Levenshtein distance is a metric [1] and discrete metrics can be indexed with BK trees, taking advantage of the triangle inequality. For a detailed explanation, see [2].

I know of no real world implementation of this, however.

[1] https://en.wikipedia.org/wiki/Metric_(mathematics) [2] http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...
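
For illustration, a minimal in-memory sketch of the data structure (not a database index; dist_fn would be any Levenshtein implementation):

    # Sketch of a BK-tree keyed on Levenshtein distance. Each child edge is
    # labelled with the distance to its parent; search uses the triangle
    # inequality to prune whole subtrees.
    class BKTree:
        def __init__(self, dist_fn):
            self.dist = dist_fn
            self.root = None                       # (word, {edge_distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                if d in node[1]:
                    node = node[1][d]              # descend along the matching edge
                else:
                    node[1][d] = (word, {})        # attach as a new child
                    return

        def search(self, query, max_dist):
            results, stack = [], [self.root] if self.root else []
            while stack:
                word, children = stack.pop()
                d = self.dist(query, word)
                if d <= max_dist:
                    results.append((d, word))
                # triangle inequality: only subtrees whose edge label is
                # within max_dist of d can contain matches
                for edge, child in children.items():
                    if d - max_dist <= edge <= d + max_dist:
                        stack.append(child)
            return sorted(results)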


If you only care about the row with the shortest distance (and assuming no precomputed optimizations), you can save compute cycles by exiting the algorithm as soon as the distance is guaranteed to be larger than or equal to the shortest distance thus far. For very large searches the algorithmic complexity drops to roughly linear in the number of rows, the length of the search string, and the shortest distance thus far (the distance to any string whose length differs by more than the shortest distance thus far need not be computed at all).
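
A sketch of that scan, assuming a capped levenshtein(a, b, cap) helper (like the one sketched further up the thread) that gives up early once the cap is exceeded:

    # Sketch: linear scan keeping the best distance so far and using it both
    # to skip rows by length difference and as the cap for the next row's
    # distance computation.
    def closest_row(search_key, values):
        best_value, best_dist = None, None
        for value in values:
            # a length difference of best_dist or more can never win
            if best_dist is not None and abs(len(value) - len(search_key)) >= best_dist:
                continue
            cap = best_dist if best_dist is not None else max(len(value), len(search_key))
            d = levenshtein(search_key, value, cap)
            if best_dist is None or d < best_dist:
                best_value, best_dist = value, d
                if best_dist == 0:
                    break                          # exact match, cannot improve
        return best_value, best_dist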


Not sure how it's supposed to be solved, but I thought of a cute hack in the special case where the character set of the strings is limited, e.g. lowercase ASCII only.

Let A be the alphabet used. Precompute the "letter sum" of each string and store it. Then given two strings with letter sums c1 and c2, they cannot be within Levenshtein distance k if |c1-c2| > k|A|. This could allow you to skip a lot of rows.


I may be misunderstanding you, but an orderless sum of characters in a string won't be an effective prefilter for any string similarity algo.

Think "atc" and "cat". Same sum.


The sum forms a necessary but not sufficient condition for being within a certain Levenshtein distance. In your example, the inequality I gave above does not apply since |c1-c2| = 0. You would have to calculate the Levenshtein distance. In cases where the inequality is satisfied, you do not have to calculate the Levenshtein distance at all.

The idea is that any edit operation on a string will at most change the letter sum by |A|.
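
A small sketch of the prefilter, assuming lowercase ASCII mapped to 1..26 so that any single edit changes the sum by at most 26:

    # Sketch: letter-sum prefilter. The sum is precomputed and stored per
    # row; the real Levenshtein distance is only computed for rows that pass.
    ALPHABET_SIZE = 26                            # |A| for lowercase ASCII

    def letter_sum(s):
        return sum(ord(c) - ord('a') + 1 for c in s)

    def may_be_within(sum_a, sum_b, k):
        # necessary but not sufficient condition for Levenshtein distance <= k
        return abs(sum_a - sum_b) <= k * ALPHABET_SIZE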


That may be the wrong question to ask. If you do not need the exact Levenshtein distance, but some measure of similarity, then there are a lot of options. For example, Postgres has index-supported similarity based on the number of common trigrams.

https://www.postgresql.org/docs/11/pgtrgm.html


yeah pgtrgm works really well for detecting typos or almost-matches.


I can't think of any clever optimizations, unless there are some convenient constraints on the database or search query values. You could just precompute a lot of stuff to help you narrow down the search, but that's really not going to be feasible unless you have some constraints or some metrics on what sort of searches are being performed.


Suffix arrays are a cool data structure for fuzzy text matching. There is a Linux tool called 'sary' that implements approximate grep, but it takes some time and space to convert plain text to suffix array format.

http://sary.sourceforge.net/


Sort/index the db table by string length, start your search with the closest-length strings (bigger and smaller), and then stop if you find a Levenshtein distance that's smaller than the length difference between the key and the next closest-length string. In many (most?) cases you'd still end up searching the whole table, but at least in some cases you could end the search early.
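
A sketch of that search order, using a hypothetical full_levenshtein() for the actual distance:

    # Sketch: visit candidates ordered by how far their length is from the
    # key's length, and stop once that gap alone is at least as large as the
    # best distance found so far. full_levenshtein() is an assumed helper.
    def closest_by_length(key, values):
        ordered = sorted(values, key=lambda v: abs(len(v) - len(key)))
        best_value, best_dist = None, None
        for value in ordered:
            gap = abs(len(value) - len(key))
            if best_dist is not None and gap >= best_dist:
                break                             # remaining rows can only be worse
            d = full_levenshtein(key, value)
            if best_dist is None or d < best_dist:
                best_value, best_dist = value, d
        return best_value, best_dist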



BTW, this algorithm is implemented in ClickHouse for fuzzy string matching, employing a rather clever technique for SIMD optimization: https://github.com/yandex/ClickHouse/blob/a9cfe4ce91b5cdfedb...



