Soundex – a phonetic algorithm for indexing names by sound (wikipedia.org)
106 points by jstrieb on Oct 20, 2019 | 41 comments



Metaphone is a much better alternative to Soundex for phonetic indexing of English words. Phonetic indexing is an interesting concept, and inspired by Metaphone, I was able to write algorithms for Malayalam[1] and Kannada[2], two Dravidian (Indic) languages. They work well for searches and suggestions.

[1] https://github.com/knadh/mlphone

[2] https://github.com/knadh/knphone


I've tried using Soundex in practice, and I found it to be too coarse-grained to be really useful. Metaphone is much better, even if it's more complex.


Thank you, I am a frequent user of your olam dictionary. Is there a JavaScript port of mlphone available?


Thanks :) There isn't one in JS, but it should be really easy to port.


Anyone interested in this should also look at the Metaphone and Double Metaphone algorithms. Soundex handles American English names decently well, but Metaphone and Double Metaphone work better for more general cases.
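
For a quick feel for the difference, here's a minimal sketch using the third-party jellyfish library (an assumption; any library exposing soundex/metaphone functions would do):

  import jellyfish  # pip install jellyfish (assumed available)

  names = ["Smith", "Smythe", "Catherine", "Kathryn", "Knight", "Night"]

  for name in names:
      # Soundex: coarse four-character code; Metaphone: finer-grained key
      print(name, jellyfish.soundex(name), jellyfish.metaphone(name))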


Also, Kölner Phonetik if you work in a German-language context.



It is important to distinguish two use cases of phonetic algorithms:

- Recovering English spelling from English pronunciation. Soundex and Metaphone are good at this.

- Undoing the torments of Eastern-European sibilants in various transliterations of surnames, e.g. Tchaikovsky, Tschaikowski, Tchaïkovski, Ciajkovskij, Tsjaikovski, Tjajkovskij, Tsjajkovskij, Csajkovszkij, and Czajkowski. Here, Double Metaphone or Daitch–Mokotoff Soundex do a better job.

[1] https://en.wikipedia.org/wiki/Metaphone [2] https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Sounde...


The second category essentially concerns spelling variations. Years and years ago I built a tool for searching old documents, and to deal with all the spelling variants these contain I implemented something called the Gloria Guts algorithm, which ranks words based on how much their spelling differs. As I recall, it worked much better than Soundex for our data set.


Or Beider-Morse.

Solr, for example, supports quite a few phonetic matching algorithms by default[1].

[1]: https://lucene.apache.org/solr/guide/7_4/phonetic-matching.h...


> as pronounced in English

I was curious about whether or not there was an IPA version (which would then work for any language) -- one of the first results I found was Eudex, which looks a little different (since it's a hash of the pronunciation) but more versatile.

https://github.com/ticki/eudex


But how do you map from written form to IPA phonemes? For instance, French is going to have a big problem there. Foie /fwa/ is a good example, as it has o, i and e written, but you pronounce a and some sort of u...


Why? All you need are the rules that govern French writing. They clearly exist, else humans couldn't read French either. It can't be harder than English, where the rules amount to "look it up in the dictionary".


eSpeak (http://espeak.sourceforge.net/) can map text to IPA:

  $ espeak -q --ipa -v fr foie
   fwˈa
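
If you want to call that from code, a small sketch (assuming espeak is installed and on the PATH) can just shell out to it:

  import subprocess

  def to_ipa(word, voice="fr"):
      # -q: no audio, --ipa: print IPA, -v: voice/language (same flags as
      # the command above)
      result = subprocess.run(
          ["espeak", "-q", "--ipa", "-v", voice, word],
          capture_output=True, text=True, check=True,
      )
      return result.stdout.strip()

  print(to_ipa("foie"))  # fwˈa, per the example above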



After using this, Metaphone, and other interesting similarity metrics for data cleaning in OpenRefine, I was looking for a Python library that had these sorts of similarity metrics.

I found this quite interesting (and comprehensive) project, which includes regular string distance (fuzzy search) methods, fingerprinting methods, and a huge number of phonetic metrics, including seven different Soundex variants: https://github.com/chrislit/abydos (a rough usage sketch follows after the list below).

A few others worth mentioning:

Scala: https://github.com/rockymadden/stringmetric

Java/Scala: https://github.com/vickumar1981/stringdistance

Postgres: https://github.com/eulerto/pg_similarity

R: https://github.com/markvanderloo/stringdist

Clojure: https://github.com/Yomguithereal/clj-fuzzy
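
A rough sketch of what using abydos looks like, assuming its class-based phonetic API where each algorithm exposes an encode() method (check the project's docs for the exact class names):

  # pip install abydos (assumed); class names per its phonetic module
  from abydos.phonetic import Metaphone, Soundex

  name = "Washington"
  print(Soundex().encode(name))    # four-character Soundex code
  print(Metaphone().encode(name))  # Metaphone key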


During my graduate degree, I was looking into phonetic search, and a lot of the papers I stumbled across were using an expanded version of Soundex called Phonix[1] as the basis for new algorithms. However, I had a lot of trouble finding an implementation I could use, so I implemented it in Python.

After that I built a search engine programme that would look up swear words that were phonetically similar to input terms, as I thought that my fellow students might get a laugh looking up their own names. In the end, neither Phonix, Double Metaphone, nor Soundex really produced any funny results.

plugs:

- Blogpost: http://olsgaard.dk/phonixpy-phonetic-name-search-in-python.h...

- Github repo: https://github.com/olsgaard/phonetic_search

[1] Gadd, T. N. “‘Fisching Fore Werds’: Phonetic Retrieval of Written Text in Information Systems.” Program 22, no. 3 (1988): 222–37.


I learned about this when digging through the PHP documentation. They have the function built in: https://www.php.net/manual/en/function.soundex.php

In a similar vein, Levenshtein distance is also built in, which is very useful for fuzzy searching: https://www.php.net/manual/en/function.levenshtein.php


I once built a search engine for web agencies that you could search by the technologies they worked with. All I did to make it fast was index all the data for a company into a few text blobs and then fuzzy-search those blobs with Lucene, which used Levenshtein distances under the hood. It was lightning fast, stupid simple, and all hosted on the same VM.


PostgreSQL also comes with a “fuzzystrmatch” module, which includes both Soundex and Levenshtein functions, as well as Metaphone and Double Metaphone:

https://www.postgresql.org/docs/current/fuzzystrmatch.html


Clipper had a SOUNDEX() function, which was used to search flat-file databases for descriptive names. SEEK SOUNDEX("Smith") and SEEK SOUNDEX("Smythe") would return the same result. https://www.itlnet.net/programming/program/Reference/c53g01c...

But Harbour, an open-source implementation of Clipper, superseded it with Metaphone: https://harbour.github.io/doc/hbnf.html#ft_metaph


I once trained unsupervised character embeddings and used Levenshtein distance with embedding cosine as character replacement weights. It worked better as a similarity metric than Soundex.


Sounds interesting. Can you expand on what you mean by "with embedding cosine as character replacement weights"?


Edit distance allows for insertion, deletion, and substitution. Some implementations allow a custom cost for substitution (e.g. characters closer together on a keyboard get a lower substitution cost than others, for typo normalization). What I described in my comment is using the cosine similarity between learned character embeddings as substitution costs.
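
A minimal sketch of that idea, with made-up character embeddings standing in for the learned ones, and 1 minus cosine similarity used as the substitution cost so that similar characters are cheap to swap (numpy assumed):

  import numpy as np

  # Toy stand-ins for learned character embeddings; in practice these
  # would come from an unsupervised model trained on your corpus.
  emb = {c: np.random.default_rng(ord(c)).normal(size=16)
         for c in "abcdefghijklmnopqrstuvwxyz"}

  def sub_cost(a, b):
      # Cost = 1 - cosine similarity, so similar characters are cheap to swap
      if a == b:
          return 0.0
      va, vb = emb[a], emb[b]
      return 1.0 - va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

  def weighted_levenshtein(s, t):
      d = np.zeros((len(s) + 1, len(t) + 1))
      d[:, 0] = np.arange(len(s) + 1)  # deletions
      d[0, :] = np.arange(len(t) + 1)  # insertions
      for i in range(1, len(s) + 1):
          for j in range(1, len(t) + 1):
              d[i, j] = min(d[i - 1, j] + 1,      # delete
                            d[i, j - 1] + 1,      # insert
                            d[i - 1, j - 1] + sub_cost(s[i - 1], t[j - 1]))
      return d[len(s), len(t)]

  print(weighted_levenshtein("smith", "smythe"))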


I'm always annoyed that spellcheck/correction on iOS is flawed in that it doesn't try to take a guess at phonetic matching, so you can't just guess at how something is spelled phonetically; you often have to have the first X letters correct.


Here is a Python implementation of the same by me: https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cd...


And I have an implementation in Lua/LPEG: https://github.com/spc476/LPeg-Parsers/blob/master/soundex.l...


I've found phonetic similarity algorithms like this to be very useful for data cleaning work, especially when your data-generating process involves transliteration. That is, if your users know Language A but your data involves Language B, your users will record terms from Language B with highly varied spellings that nonetheless sound very similar when pronounced. Phonetic similarity algorithms help you cluster these spellings together by sound so you can map them to a canonical term from Language B.
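
As a sketch of that clustering step, here's the shape of it in Python, grouping variant spellings by a phonetic code (jellyfish's metaphone is an assumed stand-in for whichever algorithm fits the language pair; the spellings are the Tchaikovsky variants mentioned upthread):

  from collections import defaultdict
  import jellyfish  # assumed third-party dependency

  variants = ["Tchaikovsky", "Tschaikowski", "Czajkowski", "Tsjaikovski"]

  clusters = defaultdict(list)
  for spelling in variants:
      clusters[jellyfish.metaphone(spelling)].append(spelling)

  # Each key is a phonetic code; each value is a candidate cluster of
  # spellings to map onto a single canonical term.
  for code, group in clusters.items():
      print(code, group)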


When I implemented Soundex, an extension I made was to also return alternative hashes that skip common surname prefixes. I didn't end up using it in production, so I can't say whether it would've helped in practice.

https://www.neilvandyke.org/racket/soundex/#%28def._%28%28li...


Interestingly, `soundex` is bundled along with the standard functions in most commercial software. In one of the first search functions I wrote, I used `soundex` to run against previous search words and suggest a known search word as a 'did you mean?' to catch spelling errors. This was around 2008-2010, and it was a cheap approximation of what Google search was doing at that time. It was quite useful.
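
Something along those lines, sketched in Python with jellyfish's soundex standing in for whatever built-in the platform provided (the list of past search terms is made up):

  import jellyfish  # assumed; any Soundex implementation would do

  # Hypothetical log of previously seen, correctly spelled search terms
  known_terms = ["metaphone", "phonetic", "levenshtein", "soundex"]
  by_code = {jellyfish.soundex(t): t for t in known_terms}

  def did_you_mean(query):
      # Suggest a known term whose Soundex code matches the query's code
      suggestion = by_code.get(jellyfish.soundex(query))
      return suggestion if suggestion and suggestion != query else None

  print(did_you_mean("levenstien"))  # likely suggests "levenshtein"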


This is used by several states to assign driver license numbers. [1]

[1] https://www.community-insurance.com/learn-how-to-read-your-d...


I wonder why Wisconsin went to the trouble versus a simple incrementing ID. It seems it was understood that two people could end up with the same number, so the last two digits were added. That rules out the supposed benefit of getting unique IDs without requiring a single central system to ensure uniqueness.

It does make it harder to fake an ID; perhaps that was the motivation.

Edit: Some more interesting info on the states that implement this. Also, TIL Illinois drivers can appear to have the exact same DL number. http://www.highprogrammer.com/alan/numbers/dl_us_shared.html


It seems that something like this is at play in Hacker News search. The first results will be close matches, but the rest appear to use some Soundex-like algorithm.



Used this in my last project. If you're curious, a lot of databases support Soundex searches natively.


Why is this only for "names" and not words or even phrases in general?


Based on whose pronunciation?


It's the first sentence of the article:

> Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.


Yes, but whose pronunciation? Barbara, for instance, can be pronounced with two syllables or three.


It works by favoring false positives over false negatives. In the specific case of Barbara, it doesn't matter how many syllables, because non-initial vowels are ignored.
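
To make that concrete, here's a minimal, simplified Soundex sketch showing that vowels after the first letter never reach the code (it skips the H/W subtleties of the full algorithm):

  # Keep the first letter, map remaining consonants to digits, collapse
  # adjacent duplicates, drop vowels (and H/W/Y), pad/truncate to four chars.
  CODES = {c: str(d) for d, letters in enumerate(
      ["", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]) for c in letters}

  def soundex(name):
      name = name.upper()
      digits = [CODES.get(c, "0") for c in name]
      out = [name[0]]
      for prev, cur in zip(digits, digits[1:]):
          if cur != "0" and cur != prev:
              out.append(cur)
      return "".join(out)[:4].ljust(4, "0")

  print(soundex("Barbara"), soundex("Barbra"))  # same code either way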


You do realize that there are multiple dialects of English[0], some of which differ wildly from the common ones. So again, whose pronunciation?

[0] https://en.m.wikipedia.org/wiki/List_of_dialects_of_English



