Very good explanation of the current state of the art in STT. I have also personally gone down this rabbit hole (especially the CTC bit) and agree 100% with what the author says. My personal viewpoints:
1. To have an ImageNet moment, we should have an ImageNet. LibriSpeech doesn't cut it. We should have an equivalent SpeechNet. The problem is that there is one visual language, while spoken languages are many. So should we have an ImageNet for every language? And train for every language?
2. What is an ImageNet moment? Personally, I think that although ImageNet has contributed a lot, vision applications, like speech and NLP, have promised more and delivered less in real-world scenarios. And just as in speech and NLP, only the big boys actually provide solutions in vision, which suggests that in vision, too, they have access to better data that they don't share.
I'm in no way an expert in any of this, but I'd expect there to be a lot of features common to many languages, akin to the International Phonetic Alphabet [0]. Pre-training on all languages to get those shared features could perhaps make it easier to fine-tune, e.g., English on top. Or not, just pondering here.
I think it's worth pointing out that NLP's ImageNet moment will probably be seen as the ELMo/BERT papers, which showed that we could get significant performance improvements by pretraining models on large amounts of unlabeled text.
Maybe this is too hard for speech because of its intricacies, but I wanted to point out that if the goal is transfer learning, the recipe doesn't have to be the same.