Metaphone is a much better alternative to Soundex for phonetic indexing of English words. Phonetic indexing is an interesting concept, and inspired by Metaphone, I was able to write algorithms for Malayalam[1] and Kannada[2], two Dravidian (Indic) languages. They work well for searches and suggestions.
I've tried using Soundex in practice, and I found it to be too coarse-grained to be really useful. Metaphone is much better, even if it's more complex.
Anyone interested in this should also look at the Metaphone and Double Metaphone algorithms. Soundex handles American English names decently well, but Metaphone and Double Metaphone work better for more general cases.
It is important to distinguish two use cases of phonetic algorithms:
- Recovering English spelling from English pronunciation. Soundex and Metaphone are good at this.
- Undoing the torments of Eastern-European sibilants in various transliterations of surnames, e.g. Tchaikovsky, Tschaikowski, Tchaïkovski, Ciajkovskij, Tsjaikovski, Tjajkovskij, Tsjajkovskij, Csajkovszkij, and Czajkowski. Here, Double Metaphone or Daitch–Mokotoff Soundex do a better job.
The second category essentially concerns spelling variations. Years and years ago I built a tool for searching old documents, and to deal with all the spelling variants these contain I implemented something called the Gloria Guts algorithm, which ranks words based on how much their spelling differs. As I recall, it worked much better than Soundex for our data set.
I was curious about whether or not there was an IPA version (which would then work for any language) -- one of the first results I found was Eudex, which looks a little different (since it's a hash of the pronunciation) but more versatile.
But how do you map from written form to IPA phonemes? For instance, French is going to have a big problem there. Foie /fwa/ is a good example, as it has o, i and e written, but you pronounce a and some sort of u...
Why? All you need is the rules that govern French writing. They clearly exist, else humans couldn't read French either. It can't be harder than English where the rules amount to "look it up in the dictionary".
After using this, Metaphone and other interesting similarity metrics for data cleaning in OpenRefine, I was looking for a Python library that had these sorts of similarity metrics.
I found this quite interesting (and comprehensive) project that includes regular string distance (fuzzy search) methods, fingerprinting methods and a huge number of phonetic metrics including 7 different soundex methods: https://github.com/chrislit/abydos
During my graduate degree, I was looking into phonetic search, and a lot of the papers I stumbled across were using an expanded version of soundex called Phonix[1] as their basis of new algorithms. However, I had a lot of trouble finding an implementation I could use, so I implemented it in Python.
After that I built a search engine programme that would look up swear words that were phonetically similar to input terms, as I thought that my fellow students might get a laugh looking up their own names. In the end, neither Phonix, Double Metaphone nor Soundex really produced any funny results.
I once built a search engine for web agencies that you could search through by the technologies they worked with. All I did to make it fast was index all the data for a company into a few blobs and then fuzzy search the text blobs with Lucene, which used Levenshtein distances under the hood. It was lightning fast, stupid simple and all hosted on the same VM.
I once trained unsupervised character embeddings and used Levenshtein distance with embedding cosine as character replacement weights. It worked better as a similarity metric than Soundex.
Edit distance allows for insertion, deletion and substitution. Some variants allow a custom cost for substitution (e.g. characters closer together on a keyboard have a lower substitution cost than others, for typo normalization). What I describe in my comment is using the cosine similarity between learned character embeddings as substitution costs.
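To make the idea concrete, here is a rough sketch of an edit distance with a pluggable substitution cost. The names `weighted_levenshtein` and `cosine_sub_cost` are made up for illustration, the tiny embedding table is hand-written rather than learned, and mapping cosine similarity to a cost as `1 - similarity` (clamped to 0..1) is my assumption, not necessarily what the original setup did.

```python
import math


def weighted_levenshtein(a, b, sub_cost=lambda x, y: 1.0, indel_cost=1.0):
    """Edit distance where substituting x by y costs sub_cost(x, y)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb))
            deletion = prev[j] + indel_cost
            insertion = curr[j - 1] + indel_cost
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cosine_sub_cost(embeddings):
    """Build a substitution cost from character vectors (assumed: 1 - cosine)."""
    def cost(x, y):
        u, v = embeddings.get(x), embeddings.get(y)
        if u is None or v is None:
            return 1.0  # unknown character: fall back to the plain cost
        dot = sum(p * q for p, q in zip(u, v))
        norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
        if norm == 0:
            return 1.0
        return min(1.0, max(0.0, 1.0 - dot / norm))
    return cost


# Hypothetical toy vectors: "c" and "k" point the same way, "t" does not.
toy_embeddings = {"c": (1.0, 0.1), "k": (1.0, 0.0), "t": (0.0, 1.0)}
cost = cosine_sub_cost(toy_embeddings)

print(weighted_levenshtein("cat", "kat", cost))  # small: c and k are "close"
print(weighted_levenshtein("cat", "tat", cost))  # close to 1: c and t are not
```

With the default `sub_cost` this is ordinary Levenshtein distance, so the same function can be used to compare the weighted and unweighted behaviour on your own data.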
I’m always annoyed that spellcheck/correction on iOS is really flawed in that it doesn’t try to take a guess at phonetic matching. You can’t just guess at how something is spelled phonetically; you often have to have the first X letters correct.
I’ve found phonetic similarity algorithms like this to be very useful for data cleaning work, especially when your data generating process involves transliteration. That is, if your users know Language A but your data involves Language B, your users will record terms from Language B with highly varied spellings that nonetheless sound very similar when pronounced. Phonetic similarity algorithms help you cluster these spellings together by sound so you can map them to a canonical term from Language B.
When I implemented Soundex, one extension I added was to also return alternative hashes that skip common surname prefixes. I didn't end up using it in production, so I can't say whether it would've helped in practice.
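For anyone curious what that kind of extension might look like, here is a minimal sketch, not the original code. The prefix list, the function names `name_variants` and `alternative_keys`, and the stand-in `toy_key` are all hypothetical; in practice `key` would be a real phonetic hash such as Soundex.

```python
# Hypothetical, deliberately short list of surname prefix words.
PREFIX_WORDS = {"van", "von", "der", "den", "de", "la", "le", "di", "da", "du"}


def name_variants(surname):
    """Return the surname plus versions with leading prefix words removed."""
    words = surname.split()
    variants = [surname]
    i = 0
    # Always keep at least one word, so a bare "De" is left untouched.
    while i < len(words) - 1 and words[i].lower() in PREFIX_WORDS:
        i += 1
        variants.append(" ".join(words[i:]))
    return variants


def alternative_keys(surname, key):
    """Hash every variant with `key`, dropping duplicates but keeping order."""
    keys = []
    for variant in name_variants(surname):
        k = key(variant)
        if k not in keys:
            keys.append(k)
    return keys


def toy_key(name):
    """Stand-in for a phonetic hash: first four letters, uppercased."""
    return "".join(c for c in name.upper() if c.isalpha())[:4]


print(alternative_keys("van der Berg", toy_key))
print(alternative_keys("Berg", toy_key))
```

Two records would then count as a candidate match if their key lists share any entry, which trades more false positives for fewer misses on names like "van der Berg" versus "Berg".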
Interestingly, `soundex` is bundled along with the standard functions in most commercial software. In one of the first search functions I wrote, I used `soundex` to run against previous search words and suggest a known search word as 'did you mean?' to catch spelling errors. This was around 2008-2010, and it was an easy shortcut to what Google search was doing at that time. It was quite useful.
I wonder why Wisconsin went through the trouble vs some incrementing id. It seems like it was understood 2 people could end up with the same number so the last two digits were added. That would rule out the benefit of unique ids without requiring a single central system to ensure uniqueness.
It does make it harder to fake an id, perhaps that was the motivation.
It seems that something like this is in play in the Hacker News search. The first results will be alike, but the rest are using some soundex-like algorithm.
It works by favoring false positives over false negatives. In the specific case of Barbara, it doesn't matter how many syllables, because non-initial vowels are ignored.
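A small sketch of classic American Soundex shows why. This is my own compact version for illustration (the function name `soundex` and the handling of non-ASCII letters, which are simply dropped, are my choices), so treat it as a sketch rather than a reference implementation.

```python
# Consonant groups of American Soundex; vowels, Y, H and W have no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]              # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")     # vowels and Y map to ""
        if code and code != prev:
            result += code
        prev = code                  # a vowel resets prev, so repeats count again
    return (result + "000")[:4]      # pad or truncate to letter + 3 digits


print(soundex("Barbara"), soundex("Barbra"))  # B616 B616
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Because non-initial vowels never produce a digit, "Barbara" and "Barbra" collapse to the same code, and so do plenty of unrelated names, which is the false-positive bias in action.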
[1] https://github.com/knadh/mlphone
[2] https://github.com/knadh/knphone