Very good explanation of the current state of the art in STT. I have also person...

ragebol · on April 6, 2020

I'm in no way an expert in any of this, but I'd expect a lot of features to be common across many languages, akin to the International Phonetic Alphabet [0]. Pre-training on all languages to learn those shared features could perhaps make it easier to fine-tune on, e.g., English afterwards. Or not, just pondering here.

[0] https://en.wikipedia.org/wiki/International_Phonetic_Alphabe...

tasogare · on April 6, 2020

Good luck finding recordings of the 6000+ existing languages...

nshm · on April 6, 2020

You can check https://github.com/festvox/datasets-CMU_Wilderness, it has recordings of 700 languages created from New Testaments from http://www.bible.is/

Eridrus · on April 6, 2020

I think it's worth pointing out that NLP's ImageNet moment is probably going to be seen as the ELMo/BERT papers, which showed we could get significant performance improvements by pretraining models on large amounts of unlabeled text.

Maybe this is too hard for speech because of its intricacies, but I wanted to point out that if the goal is transfer learning, the recipe doesn't have to be the same.