Soundex – a phonetic algorithm for indexing names by sound (wikipedia.org)
106 points by jstrieb on Oct 20, 2019 | 41 comments



Metaphone is a much better alternative to Soundex for phonetic indexing of English words. Phonetic indexing is an interesting concept, and inspired by Metaphone, I was able to write algorithms for Malayalam[1] and Kannada[2], two Dravidian (Indic) languages. They work well for searches and suggestions.

[1] https://github.com/knadh/mlphone

[2] https://github.com/knadh/knphone


I've tried using Soundex in practice, and I found it to be too coarse-grained to be really useful. Metaphone is much better, even if it's more complex.


Thank you, I am a frequent user of your olam dictionary. Is there a JavaScript port of mlphone available?


Thanks :) There isn't one in JS, but it should be really easy to port.


Anyone interested in this should also look at the Metaphone and Double Metaphone algorithms. Soundex handles American English names decently well, but Metaphone and Double Metaphone work better for more general cases.
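
For a quick feel for the difference, here's a minimal sketch using the third-party jellyfish library (an assumption; any library exposing soundex/metaphone functions would do):

  import jellyfish  # pip install jellyfish (assumed available)

  names = ["Smith", "Smythe", "Catherine", "Kathryn", "Knight", "Night"]

  for name in names:
      # Soundex: coarse four-character code; Metaphone: finer-grained key
      print(name, jellyfish.soundex(name), jellyfish.metaphone(name))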


Also, Kölner Phonetik if you work in a German-language context.



It is important to distinguish two use cases of phonetic algorithms:

- Recovering English spelling from English pronunciation. Soundex and Metaphone are good at this.

- Undoing the torments of Eastern-European sibilants in various transliterations of surnames, e.g. Tchaikovsky, Tschaikowski, Tchaïkovski, Ciajkovskij, Tsjaikovski, Tjajkovskij, Tsjajkovskij, Csajkovszkij, and Czajkowski. Here, Double Metaphone or Daitch–Mokotoff Soundex do a better job.

[1] https://en.wikipedia.org/wiki/Metaphone [2] https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Sounde...


The second category essentially concerns spelling variations. Years and years ago I built a tool for searching old documents, and to deal with all the spelling variants these contain I implemented something called the Gloria Guts algorithm, which ranks words based on how much their spelling differs. As I recall, it worked much better than Soundex for our data set.


Or Beider-Morse.

Solr, for example, supports quite a few phonetic matching algorithms by default[1].

[1]: https://lucene.apache.org/solr/guide/7_4/phonetic-matching.h...


> as pronounced in English

I was curious about whether or not there was an IPA version (which would then work for any language) -- one of the first results I found was Eudex, which looks a little different (since it's a hash of the pronunciation) but more versatile.

https://github.com/ticki/eudex


But how do you map from written form to IPA phonemes? For instance, French is going to have a big problem there. Foie /fwa/ is a good example, as it has o, i and e written, but you pronounce a and some sort of u...


Why? All you need are the rules that govern French writing. They clearly exist, else humans couldn't read French either. It can't be harder than English, where the rules amount to "look it up in the dictionary".


eSpeak (http://espeak.sourceforge.net/) can map text to IPA:

  $ espeak -q --ipa -v fr foie
   fwˈa
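
If you want to call that from code, a small sketch (assuming espeak is installed and on the PATH) can just shell out to it:

  import subprocess

  def to_ipa(word, voice="fr"):
      # -q: no audio, --ipa: print IPA, -v: voice/language (same flags as
      # the command above)
      result = subprocess.run(
          ["espeak", "-q", "--ipa", "-v", voice, word],
          capture_output=True, text=True, check=True,
      )
      return result.stdout.strip()

  print(to_ipa("foie"))  # fwˈa, per the example above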



After using this, Metaphone, and other interesting similarity metrics for data cleaning in OpenRefine, I was looking for a Python library that had these sorts of similarity metrics.

I found this quite interesting (and comprehensive) project, which includes regular string distance (fuzzy search) methods, fingerprinting methods, and a huge number of phonetic metrics, including seven different Soundex variants: https://github.com/chrislit/abydos (a rough usage sketch follows after the list below).

A few others worth mentioning:

Scala: https://github.com/rockymadden/stringmetric

Java/Scala: https://github.com/vickumar1981/stringdistance

Postgres: https://github.com/eulerto/pg_similarity

R: https://github.com/markvanderloo/stringdist

Clojure: https://github.com/Yomguithereal/clj-fuzzy
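
A rough sketch of what using abydos looks like, assuming its class-based phonetic API where each algorithm exposes an encode() method (check the project's docs for the exact class names):

  # pip install abydos (assumed); class names per its phonetic module
  from abydos.phonetic import Metaphone, Soundex

  name = "Washington"
  print(Soundex().encode(name))    # four-character Soundex code
  print(Metaphone().encode(name))  # Metaphone key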


During my graduate degree, I was looking into phonetic search, and a lot of the papers I stumbled across were using an expanded version of Soundex called Phonix[1] as the basis for new algorithms. However, I had a lot of trouble finding an implementation I could use, so I implemented it in Python.

After that I built a search engine programme that would look up swear words that were phonetically similar to input terms, as I thought that my fellow students might get a laugh looking up their own names. In the end, neither Phonix, Double Metaphone, nor Soundex really produced any funny results.

plugs:

- Blogpost: http://olsgaard.dk/phonixpy-phonetic-name-search-in-python.h...

- Github repo: https://github.com/olsgaard/phonetic_search

[1] Gadd, T. N. “‘Fisching Fore Werds’: Phonetic Retrieval of Written Text in Information Systems.” Program 22, no. 3 (1988): 222–37.


I learned about this when digging through the PHP documentation. They have the function built in: https://www.php.net/manual/en/function.soundex.php

In a similar vein, Levenshtein distance is also built in, which is very useful for fuzzy searching: https://www.php.net/manual/en/function.levenshtein.php


I once built a search engine for web agencies that you could search by the technologies they worked with. All I did to make it fast was index all the data for a company into a few text blobs and then fuzzy-search those blobs with Lucene, which used Levenshtein distances under the hood. It was lightning fast, stupid simple, and all hosted on the same VM.


PostgreSQL also comes with a “fuzzystrmatch” module, which includes both Soundex and Levenshtein functions, as well as Metaphone and Double Metaphone:

https://www.postgresql.org/docs/current/fuzzystrmatch.html


Clipper had a SOUNDEX() function, which was used to search flat-file databases for descriptive names. SEEK SOUNDEX("Smith") and SEEK SOUNDEX("Smythe") would return the same result. https://www.itlnet.net/programming/program/Reference/c53g01c...

But Harbour, an open-source implementation of Clipper, superseded it with Metaphone: https://harbour.github.io/doc/hbnf.html#ft_metaph


I once trained unsupervised character embeddings and used Levenshtein distance with embedding cosine as character replacement weights. It worked better as a similarity metric than Soundex.


Sounds interesting. Can you expand on what you mean by "with embedding cosine as character replacement weights"?


Edit distance allows for insertion, deletion, and substitution. Some implementations allow a custom cost for substitution (e.g. characters closer together on a keyboard get a lower substitution cost than others, for typo normalization). What I described in my comment is using the cosine similarity between learned character embeddings as substitution costs.
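
A minimal sketch of that idea, with made-up character embeddings standing in for the learned ones, and 1 minus cosine similarity used as the substitution cost so that similar characters are cheap to swap (numpy assumed):

  import numpy as np

  # Toy stand-ins for learned character embeddings; in practice these
  # would come from an unsupervised model trained on your corpus.
  emb = {c: np.random.default_rng(ord(c)).normal(size=16)
         for c in "abcdefghijklmnopqrstuvwxyz"}

  def sub_cost(a, b):
      # Cost = 1 - cosine similarity, so similar characters are cheap to swap
      if a == b:
          return 0.0
      va, vb = emb[a], emb[b]
      return 1.0 - va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

  def weighted_levenshtein(s, t):
      d = np.zeros((len(s) + 1, len(t) + 1))
      d[:, 0] = np.arange(len(s) + 1)  # deletions
      d[0, :] = np.arange(len(t) + 1)  # insertions
      for i in range(1, len(s) + 1):
          for j in range(1, len(t) + 1):
              d[i, j] = min(d[i - 1, j] + 1,      # delete
                            d[i, j - 1] + 1,      # insert
                            d[i - 1, j - 1] + sub_cost(s[i - 1], t[j - 1]))
      return d[len(s), len(t)]

  print(weighted_levenshtein("smith", "smythe"))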


I'm always annoyed that spellcheck/correction on iOS is flawed in that it doesn't try to take a guess at phonetic matching, so you can't just guess at how something is spelled phonetically; you often have to have the first X letters correct.


Here is a Python implementation of the same by me: https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cd...


And I have an implementation in Lua/LPEG: https://github.com/spc476/LPeg-Parsers/blob/master/soundex.l...


I've found phonetic similarity algorithms like this to be very useful for data cleaning work, especially when your data-generating process involves transliteration. That is, if your users know Language A but your data involves Language B, your users will record terms from Language B with highly varied spellings that nonetheless sound very similar when pronounced. Phonetic similarity algorithms help you cluster these spellings together by sound so you can map them to a canonical term from Language B.
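
As a sketch of that clustering step, here's the shape of it in Python, grouping variant spellings by a phonetic code (jellyfish's metaphone is an assumed stand-in for whichever algorithm fits the language pair; the spellings are the Tchaikovsky variants mentioned upthread):

  from collections import defaultdict
  import jellyfish  # assumed third-party dependency

  variants = ["Tchaikovsky", "Tschaikowski", "Czajkowski", "Tsjaikovski"]

  clusters = defaultdict(list)
  for spelling in variants:
      clusters[jellyfish.metaphone(spelling)].append(spelling)

  # Each key is a phonetic code; each value is a candidate cluster of
  # spellings to map onto a single canonical term.
  for code, group in clusters.items():
      print(code, group)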


When I implemented Soundex, an extension I made was to also return alternative hashes that skip common surname prefixes. I didn't end up using it in production, so I can't say whether it would've helped in practice.

https://www.neilvandyke.org/racket/soundex/#%28def._%28%28li...


Interestingly, `soundex` is bundled along with the standard functions in most commercial software. In one of the first search functions I wrote, I used `soundex` to run against previous search words and suggest a known search word as a 'did you mean?' to catch spelling errors. This was around 2008-2010, and it was a cheap approximation of what Google search was doing at that time. It was quite useful.
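
Something along those lines, sketched in Python with jellyfish's soundex standing in for whatever built-in the platform provided (the list of past search terms is made up):

  import jellyfish  # assumed; any Soundex implementation would do

  # Hypothetical log of previously seen, correctly spelled search terms
  known_terms = ["metaphone", "phonetic", "levenshtein", "soundex"]
  by_code = {jellyfish.soundex(t): t for t in known_terms}

  def did_you_mean(query):
      # Suggest a known term whose Soundex code matches the query's code
      suggestion = by_code.get(jellyfish.soundex(query))
      return suggestion if suggestion and suggestion != query else None

  print(did_you_mean("levenstien"))  # likely suggests "levenshtein"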


This is used by several states to assign driver license numbers. [1]

[1] https://www.community-insurance.com/learn-how-to-read-your-d...


I wonder why Wisconsin went to the trouble versus a simple incrementing ID. It seems it was understood that two people could end up with the same number, so the last two digits were added. That rules out the supposed benefit of getting unique IDs without requiring a single central system to ensure uniqueness.

It does make it harder to fake an ID; perhaps that was the motivation.

Edit: Some more interesting info on the states that implement this. Also, TIL Illinois drivers can appear to have the exact same DL number. http://www.highprogrammer.com/alan/numbers/dl_us_shared.html


It seems that something like this is at play in Hacker News search. The first results will be close matches, but the rest appear to use some Soundex-like algorithm.



Used this in my last project. If you're curious, a lot of databases support Soundex searches natively.


Why is this only for "names" and not words or even phrases in general?


Based on whose pronunciation?


It's the first sentence of the article:

> Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.


Yes, but whose pronunciation? Barbara, for instance, can be pronounced with two syllables or three.


It works by favoring false positives over false negatives. In the specific case of Barbara, it doesn't matter how many syllables, because non-initial vowels are ignored.
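
To make that concrete, here's a minimal, simplified Soundex sketch showing that vowels after the first letter never reach the code (it skips the H/W subtleties of the full algorithm):

  # Keep the first letter, map remaining consonants to digits, collapse
  # adjacent duplicates, drop vowels (and H/W/Y), pad/truncate to four chars.
  CODES = {c: str(d) for d, letters in enumerate(
      ["", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]) for c in letters}

  def soundex(name):
      name = name.upper()
      digits = [CODES.get(c, "0") for c in name]
      out = [name[0]]
      for prev, cur in zip(digits, digits[1:]):
          if cur != "0" and cur != prev:
              out.append(cur)
      return "".join(out)[:4].ljust(4, "0")

  print(soundex("Barbara"), soundex("Barbra"))  # same code either way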


You do realize that there are multiple dialects of English[0], some of which differ wildly from the common ones. So again, whose pronunciation?

[0] https://en.m.wikipedia.org/wiki/List_of_dialects_of_English



