Did a similar thing in C++ way back when; routing short sentence fragments (questions typed into a bank website's search box) to the correct FAQs.
One of the things we quickly discovered was how hard it is to separate Spanish (Castilian) from Catalan (and, to a lesser extent, Basque). Similarly for Mandarin versus Cantonese.
At first we suspected the low precision could be attributed to the fact that users would typically only type short 2-5 word sentences (think Google searches). We later discovered, talking to linguists in Barcelona (and Chinese native speakers), that another major factor is that so many words, idioms, and even grammatical constructions are borrowed between these languages that it's hard to find a pure, "uncontaminated" training / testing corpus to start with.
Yes, from my previous experience there were common confusable language pairs, especially on short texts - Spanish / Catalan, often German / Dutch, Bulgarian / Ukrainian, Dutch / Afrikaans, and a few more…
it was my first revelation that with simple math and stats you can achieve pretty good results without resorting to "machine learning"
basically the idea was:
1. use Tatoeba to generate trigram counts
2. remove low-frequency trigrams
3. apply a bonus score for "unique" trigrams that don't appear in other languages (it helped to differentiate very similar languages like Spanish/French or Chinese dialects)
then you simply sum the matched trigrams, weighted by their frequency, and voilà
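The steps above can be sketched roughly as follows. This is a minimal illustration, not Tatodetect's actual code: the function names, the tiny inline corpora standing in for Tatoeba, and the exact bonus weighting are all my assumptions, and step 3 is read as a bonus for trigrams unique to a single language.

```python
# Hypothetical sketch of trigram-based language detection.
from collections import Counter

def trigrams(text):
    """Yield character trigrams, padded so word boundaries count."""
    t = " " + text.lower() + " "
    return [t[i:i+3] for i in range(len(t) - 2)]

def train(corpus, min_count=2):
    """Step 1-2: per-language trigram frequencies, low-frequency trigrams dropped."""
    models = {}
    for lang, sentences in corpus.items():
        counts = Counter()
        for s in sentences:
            counts.update(trigrams(s))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items() if c >= min_count}
    return models

def detect(text, models, uniq_bonus=2.0):
    """Step 3 + scoring: sum frequency-weighted matches, boosting
    trigrams that occur in only one language's model."""
    scores = {}
    for lang, model in models.items():
        score = 0.0
        for g in trigrams(text):
            if g in model:
                seen_elsewhere = any(g in m for l, m in models.items() if l != lang)
                score += model[g] * (1.0 if seen_elsewhere else uniq_bonus)
        scores[lang] = score
    return max(scores, key=scores.get)
```

The unique-trigram bonus is what separates close pairs: shared trigrams contribute to both scores roughly equally, so the decision ends up dominated by the few trigrams only one language produces.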
Tatodetect should differentiate 200+ languages and dialects, regardless of their script (alphabets, sinograms, etc.)