
I was once asked in an interview how I'd find a row in a relational database where the value in one column was the shortest levenshtein distance away from a search key, and I was absolutely stumped when I realized I didn't really know a sort/search that optimized that way. Of course, by the time I realized I could email the interviewer for the answer an awkward amount of time had passed.

So now I ask HN: is there any "good" way to do this short of some kind of answer like "find a database that natively offers levenshtein distance search"? Are there actually sorting algorithms for sorting by Levenshtein distance (other than just using the distance as the comparison function in a traditional sort)? Are there search algorithms that work well for searching by Levenshtein distance?




Depending on the RDBMS and the target text (ideally, you’re looking to compare an entire column to the search text, and in a known character set, and known casing, etc.), I’d take one of two approaches:

1) write a UDF—in C, or for SQL Server, in pretty much any managed language—that returns the Levenshtein distance between a passed param and a column value, with a “max distance” param to aid in optimizing (e.g., maybe it doesn’t bother calculating the distance if the length difference already exceeds the max Levenshtein distance; a rough sketch of this is at the end of this comment)

Then you could write a query that looks something like:

    SELECT product, description FROM products WHERE levenshtein('fobar', product, 2) <= 2
-or-

2) if you only care about a distance of 1 or 2, create a set of secondary tables where you’d pre-calculate every variant of every word within that distance, with a FK pointing back to the PK of the actual row.

Alternately, a combination of a Levenshtein distance function and an RDBMS that already supports fuzzy text matching seems (to me) to be ideal.
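
For concreteness, here is a minimal sketch of the capped distance function from option 1, written as plain Python rather than as an actual UDF; the function name and early-exit details are just assumptions:

    # Sketch: Levenshtein distance with a cap, as in option 1 above.
    # Returns max_dist + 1 as soon as the strings cannot be within max_dist,
    # so a caller can filter with levenshtein(a, b, k) <= k.
    def levenshtein(a, b, max_dist):
        if abs(len(a) - len(b)) > max_dist:
            return max_dist + 1                 # cheap length check, no DP needed
        prev = list(range(len(b) + 1))          # standard Wagner-Fischer first row
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            if min(cur) > max_dist:             # whole row already over the cap
                return max_dist + 1
            prev = cur
        return prev[-1]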


I think sometimes it is easy to forget that SQL is but an abstraction over algorithmic processing and shaping of data. A UDF is only a way to extend the built-in algorithms, albeit one which may not be fully integrated with a given query engine architecture and therefore cannot be fully optimized without access to the engine's internals.

I would argue your answer is the best as far as doing an actual Levenshtein distance calculation goes.

That being said, sort orders are implemented algorithmically by the query engine. ORDER BY in SQL Server, as I recall, is implemented with up to 7 different built-in algorithms, depending on the case. But those algorithms are optimized for the engine and have access to the internals.

I can think of two interesting ways to look at this. One is how you could use an open source database to extend the built-in sort algorithms in the query engine, perhaps controlled by collation settings, to get a native Levenshtein sort that has access to the internals. Collations define the rules for sort order, though this would probably apply within a column, not against a value (although if you’re extending the query engine, why not). See [1] for an example of collation settings and how they control the rules of sort order; imagine a SET command to define Levenshtein order plus a native extension to the Postgres query engine, and now you can sort a column by Levenshtein distance.

The other thing I was thinking of was word vectors, but this is similar to Levenshtein distance in that it still requires something special in a relational database. Maybe some of the full-text search implementations in databases could do this out of the box.

[1] https://docs.microsoft.com/en-us/sql/relational-databases/co...


I don't think that there is a way to index by levenshtein distance generally because there is a many-to-many mapping between search terms and keys to match and there is potentially an infinite number of either.

What I've seen databases that need to do fuzzy matching on strings do more often is index a list of substrings of keys and match on those. So hypothetically if I had a row identified with the string 'abc', I would have a separate table with the strings: 'a', 'b', 'c', 'ab', 'bc', 'abc' all mapped to the row in question and an index built on them.

If you had a cap on the edit distance you knew was acceptable for searches, you could do something similar with levenshtein distance, where you could save every string with an edit distance less than say, 3, to a separate table along with the edit distance. This would be expensive in terms of space though, if your keys were very large. I suppose you could optimize it by just pre-computing the distance on a prefix (say the first three or four characters) and using that to narrow the search space.
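
A rough sketch of how such a side table could be populated, assuming a lowercase ASCII alphabet and a cap of 1 (distance 2 would just apply the same expansion to each variant):

    # Sketch: every string within Levenshtein distance 1 of `word`, assuming
    # a lowercase ASCII alphabet. Each variant would go into the side table
    # with its edit distance and a FK back to the original row.
    import string

    def variants_within_1(word):
        alphabet = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes     = {left + right[1:] for left, right in splits if right}
        substitutes = {left + c + right[1:]
                       for left, right in splits if right for c in alphabet}
        inserts     = {left + c + right for left, right in splits for c in alphabet}
        return deletes | substitutes | inserts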


Isn't that an n-gram search? I think postgres can do this.


I know that Elasticsearch does it natively. I'm not sure about Postgres. SQL Server is my main RDBMS currently.


yep - using the pg_trgm extension. https://www.postgresql.org/docs/current/pgtrgm.html


Wouldn't call it 'good' but it would be better than applying L on each row.

Instead, enumerate all unique variants of the search term, calculate each one's Levenshtein distance, and then search for a match, doing it one by one starting from distance 0 up to the max and returning the first match.

If the dominant variable in the complexity is the number of rows, then it's better to enumerate all the variants up front and do a single search over all rows, ordering results by distance and returning only the first match.
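
A sketch of the first variation (probe cheapest distances first), assuming a hypothetical exact-match index probe lookup_exact() and a distance-1 variant generator like the one sketched elsewhere in the thread:

    # Sketch: probe an exact-match index with every variant of the search
    # term, cheapest distance first, and stop at the first hit.
    # lookup_exact() and variants_within_1() are assumed helpers.
    def fuzzy_find(term, max_dist, lookup_exact):
        seen = {term}
        ring = [term]                      # new strings at the current distance
        for dist in range(max_dist + 1):
            for candidate in ring:
                row = lookup_exact(candidate)
                if row is not None:
                    return row, dist
            ring = [v for c in ring for v in variants_within_1(c) if v not in seen]
            seen.update(ring)
        return None, None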


I wonder if the Wagner-Fischer algorithm could be vectorized using substring functions... It doesn't seem trivial. https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...


Postgres has a function to calculate levenshtein distance.


Levenshtein distance is a metric [1] and discrete metrics can be indexed with BK trees, taking advantage of the triangle inequality. For a detailed explanation, see [2].

I know of no real world implementation of this, however.

[1] https://en.wikipedia.org/wiki/Metric_(mathematics) [2] http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...
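
For illustration, a minimal in-memory sketch of the data structure (not a database index; dist_fn would be any Levenshtein implementation):

    # Sketch of a BK-tree keyed on Levenshtein distance. Each child edge is
    # labelled with the distance to its parent; search uses the triangle
    # inequality to prune whole subtrees.
    class BKTree:
        def __init__(self, dist_fn):
            self.dist = dist_fn
            self.root = None                       # (word, {edge_distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                if d in node[1]:
                    node = node[1][d]              # descend along the matching edge
                else:
                    node[1][d] = (word, {})        # attach as a new child
                    return

        def search(self, query, max_dist):
            results, stack = [], [self.root] if self.root else []
            while stack:
                word, children = stack.pop()
                d = self.dist(query, word)
                if d <= max_dist:
                    results.append((d, word))
                # triangle inequality: only subtrees whose edge label is
                # within max_dist of d can contain matches
                for edge, child in children.items():
                    if d - max_dist <= edge <= d + max_dist:
                        stack.append(child)
            return sorted(results)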


If you only care about the row with the shortest distance (and assuming no precomputed optimizations), you can save compute cycles by exiting the algorithm as soon as the distance is guaranteed to be larger than or equal to the shortest distance thus far. For very large searches the algorithmic complexity drops to roughly linear in the number of rows, the length of the search string, and the shortest distance thus far (the distance to any string whose length differs by more than the shortest distance thus far need not be computed at all).
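
A sketch of that scan, assuming a capped levenshtein(a, b, cap) helper (like the one sketched further up the thread) that gives up early once the cap is exceeded:

    # Sketch: linear scan keeping the best distance so far and using it both
    # to skip rows by length difference and as the cap for the next row's
    # distance computation.
    def closest_row(search_key, values):
        best_value, best_dist = None, None
        for value in values:
            # a length difference of best_dist or more can never win
            if best_dist is not None and abs(len(value) - len(search_key)) >= best_dist:
                continue
            cap = best_dist if best_dist is not None else max(len(value), len(search_key))
            d = levenshtein(search_key, value, cap)
            if best_dist is None or d < best_dist:
                best_value, best_dist = value, d
                if best_dist == 0:
                    break                          # exact match, cannot improve
        return best_value, best_dist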


Not sure how it's supposed to be solved, but I thought of a cute hack in the special case where the character set of the strings is limited, e.g. lowercase ASCII only.

Let A be the alphabet used. Precompute the "letter sum" of each string and store it. Then given two strings with letter sums c1 and c2, they cannot be within Levenshtein distance k if |c1-c2| > k|A|. This could allow you to skip a lot of rows.


I may be misunderstanding you, but an orderless sum of characters in a string won't be an effective prefilter for any string similarity algo.

Think "atc" and "cat". Same sum.


The sum forms a necessary but not sufficient condition for being within a certain Levenshtein distance. In your example, the inequality I gave above does not apply since |c1-c2| = 0. You would have to calculate the Levenshtein distance. In cases where the inequality is satisfied, you do not have to calculate the Levenshtein distance at all.

The idea is that any edit operation on a string will at most change the letter sum by |A|.
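
A small sketch of the prefilter, assuming lowercase ASCII mapped to 1..26 so that any single edit changes the sum by at most 26:

    # Sketch: letter-sum prefilter. The sum is precomputed and stored per
    # row; the real Levenshtein distance is only computed for rows that pass.
    ALPHABET_SIZE = 26                            # |A| for lowercase ASCII

    def letter_sum(s):
        return sum(ord(c) - ord('a') + 1 for c in s)

    def may_be_within(sum_a, sum_b, k):
        # necessary but not sufficient condition for Levenshtein distance <= k
        return abs(sum_a - sum_b) <= k * ALPHABET_SIZE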


That may be the wrong question to ask. If you do not need the exact Levenshtein distance, but some measure of similarity, then there are a lot of options. For example, Postgres has index-supported similarity based on the number of common trigrams.

https://www.postgresql.org/docs/11/pgtrgm.html


yeah pgtrgm works really well for detecting typos or almost-matches.


I can't think of any clever optimizations, unless there are some convenient constraints on the database or search query values. You could just precompute a lot of stuff to help you narrow down the search, but that's really not going to be feasible unless you have some constraints or some metrics on what sort of searches are being performed.


Suffix arrays are a cool data structure for fuzzy text matching. There is a Linux tool called 'sary' that implements approximate grep, but it takes some time and space to convert plain text to suffix array format.

http://sary.sourceforge.net/


Sort/index the db table by string length, start your search with the closest-length strings (bigger and smaller), and then stop if you find a Levenshtein distance that's smaller than the length difference between the key and the next closest-length string. In many (most?) cases you'd still end up searching the whole table, but at least in some cases you could end the search early.
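
A sketch of that search order, using a hypothetical full_levenshtein() for the actual distance:

    # Sketch: visit candidates ordered by how far their length is from the
    # key's length, and stop once that gap alone is at least as large as the
    # best distance found so far. full_levenshtein() is an assumed helper.
    def closest_by_length(key, values):
        ordered = sorted(values, key=lambda v: abs(len(v) - len(key)))
        best_value, best_dist = None, None
        for value in ordered:
            gap = abs(len(value) - len(key))
            if best_dist is not None and gap >= best_dist:
                break                             # remaining rows can only be worse
            d = full_levenshtein(key, value)
            if best_dist is None or d < best_dist:
                best_value, best_dist = value, d
        return best_value, best_dist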



BTW, this algorithm is implemented in ClickHouse for fuzzy string matching, employing a rather clever technique for SIMD optimization: https://github.com/yandex/ClickHouse/blob/a9cfe4ce91b5cdfedb...



