Did a similar thing in C++ way back when; routing short sentence fragments (questions typed into a bank website's search box) to the correct FAQs.
One of the things we quickly discovered was how hard it is to separate Spanish (Castilian) from Catalan (and, to a lesser extent, Basque). Similarly for Mandarin versus Cantonese.
At first we suspected the low precision could be attributed to the fact that users would typically only type short 2-5 word sentences (think Google searches). We later discovered, talking to linguists in Barcelona (and Chinese native speakers), that another major factor is that so many words, idioms, and even grammatical constructions are borrowed between these languages that it's hard to find a pure, "uncontaminated" training / testing corpus to start with.
Yes, from my previous experience there were common confusable language pairs, especially on short texts - Spanish / Catalan, often German / Dutch, Bulgarian / Ukrainian, Dutch / Afrikaans, and a few more…
it was my first revelation that with simple math and stats you can achieve pretty good results without resorting to "machine learning"
basically the idea was:
1. use Tatoeba to generate trigram counts
2. remove low-frequency trigrams
3. apply a bonus score for "unique" trigrams that don't appear in other languages (it helped to differentiate very similar languages like Spanish/French or Chinese dialects)
then you simply sum the matched trigrams, weighted by their frequency, and voilà
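The steps above can be sketched roughly as follows. This is a minimal illustration, not Tatodetect's actual code: the function names, the tiny inline corpora standing in for Tatoeba, and the exact bonus weighting are all my assumptions, and step 3 is read as a bonus for trigrams unique to a single language.

```python
# Hypothetical sketch of trigram-based language detection.
from collections import Counter

def trigrams(text):
    """Yield character trigrams, padded so word boundaries count."""
    t = " " + text.lower() + " "
    return [t[i:i+3] for i in range(len(t) - 2)]

def train(corpus, min_count=2):
    """Step 1-2: per-language trigram frequencies, low-frequency trigrams dropped."""
    models = {}
    for lang, sentences in corpus.items():
        counts = Counter()
        for s in sentences:
            counts.update(trigrams(s))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items() if c >= min_count}
    return models

def detect(text, models, uniq_bonus=2.0):
    """Step 3 + scoring: sum frequency-weighted matches, boosting
    trigrams that occur in only one language's model."""
    scores = {}
    for lang, model in models.items():
        score = 0.0
        for g in trigrams(text):
            if g in model:
                seen_elsewhere = any(g in m for l, m in models.items() if l != lang)
                score += model[g] * (1.0 if seen_elsewhere else uniq_bonus)
        scores[lang] = score
    return max(scores, key=scores.get)
```

The unique-trigram bonus is what separates close pairs: shared trigrams contribute to both scores roughly equally, so the decision ends up dominated by the few trigrams only one language produces.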
Tatodetect should differentiate 200+ languages and dialects, regardless of their script (alphabets, sinograms, etc.)