I work at a startup doing STT. We have reached SOTA (lowest CER in our language) in the industry. The main reason we are doing well is not that we have smart engineers tuning fancy models, but that we developed a novel method to collect a tremendous amount of usable data from the internet (crawling speech with text transcripts, using subtitles from movies, etc.). Implementing an interesting paper improves results by 1%; pouring in more data improves them by 10%. I guess this is why the big guys aren't exposing what data they've used.
It takes a fortune to collect even 100 hours of clean, labeled speech-to-text data, and 100 hours will never meet user expectations in the market.
Also, we have developed an internal framework that eases the pretraining-then-finetuning-on-subtasks pipeline. After months of usage it required a lot of refactoring to match the needs of all the forms of DNN models we run. I would like to hear from other ML engineers whether they have an internal framework that generalizes over at least one subfield of DNNs (NLP, vision, speech, etc.).
What's an example of an NLP application that "meets user expectations" in the wild? Google's natural language stuff just annoys me. Facebook's and Google's translation don't seem good even after multiple breakthrough announcements.
Edit: SOTA (state of the art) is taken incredibly glibly, as if that were enough to make a given application a success. This seems like a massive overstatement.
I agree with many points of the article. I do think we’re closer to an English STT ImageNet equivalent than you think. For most other languages I don’t think it’s possible until data is collected/released, or some kind of automatic data collection process becomes standard that can generate 10k hours of usable training data for each of a bunch of arbitrary languages.
Seeing better “state of the art” results on librispeech is far less interesting to me than two recent developments:
- Facebook’s streaming convnet wav2letter model/architecture, pretrained on 60k hours of audio, which I’ve been using for transfer learning with great results (a generic sketch of the fine-tuning pattern follows this list). It’s fast, not too huge, runs on CPU, and has excellent results without a language model.
- The librilight dataset. I aligned 30k hours of it (on CPU, on a single large server, in about a week) and my resulting models are generalizing very well.
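For reference, the fine-tuning pattern I mean is nothing exotic. Here is a minimal, hypothetical PyTorch sketch: a toy conv encoder stands in for the big pretrained model, and none of this is wav2letter's actual API.

```python
import torch
import torch.nn as nn

# Toy stand-in for a large pretrained acoustic encoder (hypothetical, not wav2letter).
class Encoder(nn.Module):
    def __init__(self, n_mels=80, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, n_mels, time)
        return self.conv(x)        # (batch, hidden, time)

encoder = Encoder()
# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))  # hypothetical checkpoint

for p in encoder.parameters():     # freeze the pretrained layers
    p.requires_grad = False

head = nn.Conv1d(512, 40, kernel_size=1)   # fresh output layer for the new token set
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.CTCLoss(blank=0)              # fine-tune only the head with CTC on new data
```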
> The librilight dataset. I aligned 30k hours of it (on CPU, on a single large server, in about a week) and my resulting models are generalizing very well.
Was this using the pretrained model from Facebook? If so, how much custom code did you need to write to get the alignment? I've been looking for a way to take an arbitrary position in a given text and look up where in the corresponding audio it appears, but I'd prefer not having to train a speech recognition model to do that.
DSAlign only generates roughly sentence level alignment, which still may be a good start for you. It works by segmenting the audio by pauses, transcribing the segments with a strict language model, then figuring out where in the text the segments are based on the transcript. wav2train then generates audio segments using the alignment, but you could edit it to stop there and just look at the tlog file that tells you where the sentences are in the original file.
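To illustrate the last step, here is a rough, hypothetical sketch of fuzzy-locating a transcribed chunk inside the known text. Assume a VAD has already split the audio and an acoustic model produced the hypothesis; this is not DSAlign's actual code, just the idea.

```python
from difflib import SequenceMatcher

def locate(hypothesis, transcript, window=200, step=50):
    """Return (start, end) character offsets of the best fuzzy match in the transcript."""
    best_span, best_score = (0, window), 0.0
    for start in range(0, max(1, len(transcript) - window), step):
        candidate = transcript[start:start + window]
        score = SequenceMatcher(None, hypothesis.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_span, best_score = (start, start + window), score
    return best_span

# e.g. locate("call me ishmael some years ago", moby_dick_text)
```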
I’ve also used the Gentle Forced Aligner and Montreal Forced Aligner. MFA wasn’t too hard to set up for English, but I don’t recommend running it on large batches of audio, as it was slow and unreliable at large scale for me.
As another practitioner, I find it odd that the author mentions frameworks such as OpenNMT but not Kaldi when comparing STT toolkits/frameworks. Overall, I get the feeling that the author hasn't worked with speech data for very long.
I agree with some of the general points the author makes, such as papers having tons of equations just for equations' sake, poor reproducibility, not enough details to reproduce (in a conference paper..), and chasing SOTA on some dataset, but these issues aren't limited to speech processing research. Large companies don't release their internal datasets for voice search (Google Assistant, Alexa, etc.), call center, or clinical applications, as there is no incentive to do so and also because they likely can't for licensing and data rights reasons. By the way, what's the situation for the author's OpenSTT corpus? Did the people speaking on the thousands of hours of phone calls release them under a free license?
Seems like the title should really be "criticism of STT groups in industry": work from academic groups is ignored.
The original article was kind of silly anyway; it was all engineering work, very little research. Engineering work is important, but you're not going to get an ImageNet moment in ASR by twiddling with hyperparameters or reducing training time. It was also a bizarre decision to pick DeepSpeech as a framework: of course your models will take forever to train and won't even be that good in the end. It's end-to-end and hasn't been SOTA for a while.
> It is hard to include the original Deep Speech paper in this list, mainly because of how many different things they tried, popularized, and pioneered.
Had to laugh at this. What exactly did they "popularize and pioneer" other than the practice of training and reporting results on very large private datasets? [This, among others, was a much more important end-to-end paper I think.](https://arxiv.org/pdf/1303.5778.pdf)
There are some good points about using private datasets and overfitting on a read-speech dataset, and using very large models etc. but I really wish they would have used their data to train a kaldi system. I can guess why they didn't, they have no background in speech and found it too hard to use. Still disappointing.
Anyway, I would argue the reason an "ImageNet moment" hasn't arrived for ASR is that vision is universal, but speech splits into different languages, making it much harder to build a single model that everyone else can use as a seed model. I believe multilingual models are the future.
> the provided scripts are useless to anyone without access to the academic datasets they use
The Kaldi team shares all datasets when possible on http://openslr.org, a major collection of speech datasets. Librispeech and Tedlium were major breakthroughs in their time. When everyone in research was training on 80 hours of WSJ and Google was training on a 2000-hour private dataset, the 1000 hours of Librispeech were a game changer.
> Kaldi is very cumbersome and overly complex to use and adapt (most of the toolchain relies on exact copies of entire directory trees)
On the other hand, the Kaldi API is very stable. You can still run 4-year-old recipes with a simple run.sh. TensorFlow software, by contrast, goes obsolete every 3 months when the TF API changes.
Kaldi recipe results are usually reproducible down to the last digit, and you can even see the tuning history (which features improved results and which didn't).
> Kaldi is a research tool by researchers for researchers and not in any way shape or form aimed at practitioners in search of a deployable solution
There are hundreds of companies all over the world building practical solutions with Kaldi. The only thing the Kaldi team should do is ask users to mention it.
> Kaldi is very poorly documented (from a user perspective) and focuses on recipes for your own experiments, not writing applications and roll out
There are also dozens of projects on GitHub that enable use of Kaldi with a simple pip install or docker pull: kaldi-gstreamer-server, kaldi-active-grammar, vosk, and many others.
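For example, transcribing a WAV file with vosk takes roughly this much code (adapted from its published example; you need `pip install vosk` and a downloaded model directory):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")                 # 16 kHz mono PCM works best
model = Model("model")                           # path to an unpacked vosk model
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)                     # feed the audio chunk by chunk

print(json.loads(rec.FinalResult())["text"])     # final transcript as plain text
```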
The provided scripts are available for a huge number of datasets. Many of the datasets are also useful for use in production. And it's very simple to adapt the provided scripts for further data.
Kaldi is complex and big, yes. But you don't have to use all of it. And it is that way because doing STT properly simply is complex. At least for the conventional standard hybrid HMM/NN models. You could surely go some of the end-to-end ways but that will lead to other problems in production. And once you need to deal with these problems, it might become even more complicated. (This is a whole topic on its own...)
Kaldi is used in research as well as in production. It can be easily deployed.
Kaldi is very well documented; just check its homepage. In addition, there is a huge community around it (both academia and industry).
Interesting (possible) additional piece of the puzzle on the industry:
As opposed to the explosion of CV training data, big players may find themselves thinking they can't expose the tagged raw data for audio the same way that tagged raw data for images was exposed to the public. The privacy backlash on the way some previous public-release datasets have been used was notable.
Has the ecosystem changed so the kind of mass-collaboration that led to ImageNet can't be repeated due to privacy concerns of how the audio could be (ab)used?
Very good explanation of the current state of the art in STT. I have also personally gone down this rabbit hole (especially the CTC bit) and agree 100% with what the author says. My personal viewpoints:
1. To have an ImageNet moment, we should have an ImageNet. LibriSpeech doesn't cut it; we should have an equivalent SpeechNet. The problem is that there is only one visual "language", while spoken languages are many. So should we have an ImageNet for every language? And train for every language?
2. What is an ImageNet moment? Personally, I think that although ImageNet has contributed a lot, vision applications, much like speech and NLP, have promised more and delivered less in real-world scenarios. And just like in speech and NLP, only the big boys actually provide solutions in vision, which suggests that in vision, too, they have access to better data that they don't share.
I'm in no way an expert in any of this, but I'd expect there to be a lot of features common across many languages, akin to the International Phonetic Alphabet [0]. Pre-training on all languages to learn those shared features could make it easier to fine-tune, e.g., English on top. Or not, just pondering here.
I think it's worth pointing out that NLP's Imagenet moment is probably going to be seen as the ELMo/BERT papers that showed we could get significant performance improvements by pretraining models on large amounts of unlabeled text.
Maybe this is too hard to pull off for speech due to its intricacies, but I wanted to point out that if the goal is transfer learning, the recipe doesn't have to be the same.
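For anyone who hasn't seen that recipe: it boils down to loading a model pretrained on unlabeled text and fine-tuning a small head on top. A minimal sketch with the Hugging Face transformers library (the 2-class head and the input sentence are just placeholders):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")     # pretrained on unlabeled text
classifier = nn.Linear(encoder.config.hidden_size, 2)        # tiny task-specific head

inputs = tokenizer("pretraining makes transfer cheap", return_tensors="pt")
with torch.no_grad():
    cls = encoder(**inputs).last_hidden_state[:, 0]          # [CLS] representation
logits = classifier(cls)                                     # fine-tune end-to-end on labeled data
```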
- the big boys (Google, Baidu, Facebook etc - basically FAANG) are getting SOTA using private data without explicitly stating it
- their toolkits are too complicated and not suited to non-FAANG-scale datasets/problems
- published results aren't reproducible, and academic papers often gloss over important implementation details (e.g. which version of CTC loss is used).
- SOTA/leaderboard chasing is bad, and there's a general over-reliance on large compute, so non-FAANG players inevitably end up overfitting on publicly available datasets (LibriSpeech).
I'm far from an expert, but having spent the last 6 months familiarising myself with the STT landscape, I would say I mostly agree.
First, the author wants more like Mozilla's Common Voice. Not sure if he/she is aware, but I think the M-AILABS speech dataset (https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) will go a long way to "democratizing" STT.
Second, agree that the toolkits are all complicated and weighted in favour of highly parameterized NNs. This works great if you have FAANG-style data (and hardware), but really isn't pushing the field forward in other respects (compute/training time, parameterization/etc). In all fairness, though, HMM-based frameworks like Kaldi are notoriously complex and take significant effort to wrap your head around. I think speech is just inherently more complex than image or text.
Third - and this applies to ML in general - agree that papers all too often gloss over the important details. I see no reason why publications shouldn't all be accompanied by code and dataset hashes.
I think we're seeing FAANG all converge on a non-optimal solution, simply because we've effectively removed resource constraints. When your company doesn't blink at spending a few million on creating its own datasets, and you have massive TPU farms available at your disposal, there's no incentive to focus on sample- or compute- efficient models.
What's more, they also have the resources to market their successes much more effectively. Think open source frameworks, PR pieces, new product releases, and so on.
This effect is two-fold. First, it creates a lot of noise that crowds out the slower, smaller contributors. Second, it also draws a lot of junior research attention - who would want to risk 4 years of their life on a pathway that might not work when NNs are here today and there's clear evidence that they work (albeit only for certain problems in certain environments with certain datasets).
If so, I haven't dug deeply yet, but I suspect the author's point stands - inference-time efficiency is one thing, but training compute (and separately, data-efficiency) is another.
I don't think the author is necessarily criticizing anyone, and I don't think he/she is saying that these tools/models/papers/etc aren't welcome. My read is that "pushing STT beyond the FAANG use-case means we should stop confining ourselves to FAANG-scale data and models".
On a somewhat related note, why does the Google STT engine have a grudge against the word "o'clock"?
If you say "start at four" it gives you the text "start at 4." So far so good, but if you say "start at four o'clock" you just get "start at 4" again, not "start at 4:00" like you wanted. In fact you can say the word o'clock over and over for several seconds and it will type absolutely nothing.
Having recently experienced a similar issue with "filler" words that mysteriously disappeared in a transliteration pipeline, I'd guess that this is due to an artifact in the training data. Most likely the STT engine was trained on transcriptions created by lazy humans who simply wrote "4" when they heard "four o'clock", so the engine learned that "o'clock" is a speech disfluency just like "umm" that usually doesn't appear in writing.
I just tried this on the google assistant and didn't get the behavior you describe. When you look at the streaming results, you can clearly see it produce the o'clock, but it gets normalized to 4:00 at the end.
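The final rewrite step is usually called inverse text normalization. As a toy illustration of what it does (real systems use weighted FSTs or learned rewriters, not a hand-written regex):

```python
import re

# Map spoken time expressions like "four o'clock" to the written form "4:00".
NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
PATTERN = re.compile(r"\b(" + "|".join(NUMBERS) + r")\s+o'clock\b", re.IGNORECASE)

def normalize_times(text):
    return PATTERN.sub(lambda m: f"{NUMBERS[m.group(1).lower()]}:00", text)

print(normalize_times("start at four o'clock"))  # -> start at 4:00
```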
As someone who tinkers with text-to-speech (TTS), I can say this applies to TTS as well. Good models such as Tacotron 2 rarely scale beyond clean (good text and speech alignment), large (> 12 hours) datasets.
I see the same issue in optical flow. Everyone and their grandma are overfitting Sintel.
But everyday phenomena, like changing light conditions or trees with many small branches cause the SOTA to fail miserably.
I even approached research teams with my new data set that they could use to measure generalization, but the response has been frosty or hostile. Nobody likes to confirm that their amazing new paper is worthless for real world usage.
This is one of the reasons why I never got myself to finish my master's thesis.
My advisor advised me to use synthetic data, because the work I was supposed to build upon did as well.
I insisted on real data, which I generated myself, and, unsurprisingly, it showed how undoable this whole thing (using stereo audio data to localise speakers) was with the suggested methods.
Speech-to-text is good for tapping people's phones. Voice user interfaces won't be good enough until they can hold a conversation, that is, ask for clarification when they don't understand what people say. Superhuman performance at transcription isn't good enough to replace a keyboard with a backspace button.
yep, that's the kind of mistake that a human can catch... remember some of those errors are inserted at the transmitter, some along the channel, and some at the receiver. If a receiver is going to get great accuracy, it has to catch errors introduced anywhere!
I feel like there is a project waiting to happen (or that I'm unaware of) to use movies/TV and captions to extract well-annotated audio.
While for copyright reasons this dataset likely can't be released publicly, the method to generate it from materials should be possible to release so that IF you have the source, you can get the annotated audio.
Audiobooks could be another resource, but they’ll only give you a narrator’s intonation. On the other hand, at some point it may become easier to train people to speak in a certain way to computers instead of training computers to understand the unclear mumblings of casual speech.
Yes, that's in fact the source of the commonly cited LibriSpeech dataset. It used publicly available recordings from the LibriVox project and applied some cleaning steps, as well as alignment of the transcript with the audio, to arrive at 1000 hours of cleaned-up audio.
The M-AILABS dataset uses LibriVox as well, among other sources, to arrive at 1000 hours of cleaned-up data in various European languages.
But you'll have to give up properties like gender balance in your dataset, which then has to be counteracted during the learning process, e.g. by gender-specific learning rates.
Facebook’s librilight taps this potential. They have a script for easily fetching an entire language from librivox, and I have a pipeline on top of that for book fetching and transcript alignment that I’ve successfully trained from.
I think the next step here needs to be auto-finding LibriVox source books that aren’t in e.g. Gutenberg. I only auto-found books for 75% of English, and only 20% of German. I don’t actually know where to look beyond Gutenberg for books that don’t have easily linked text. Common Crawl? The LibriVox forums?
(Or you can take Facebook’s semi-supervised approach and generate a transcript from thin air instead, but imo that will work better for English than languages that don’t have good baseline models yet)
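In case anyone wants to try the auto-finding: one crude approach is fuzzy title matching between the LibriVox catalog and whatever text catalog you're searching. A hypothetical sketch (the catalog lists are placeholders):

```python
from difflib import SequenceMatcher

def best_match(librivox_title, text_titles, threshold=0.85):
    """Pick the closest catalog title, or None if nothing is similar enough."""
    best, best_score = None, 0.0
    for candidate in text_titles:
        score = SequenceMatcher(None, librivox_title.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

# e.g. best_match("Moby Dick; or, The Whale", gutenberg_titles)
```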
I don't really see any issues with speech-to-text when I use google assistant. The real lack is in intelligence, even for basic things. I ask Google to navigate to Wegmans. It always insists on asking me which one. Yes it cleverly lets me say, "the closest one", but this becomes a whole lot less clever when you have to tell it "the closest one" every single time. There has never been a single case where I would prefer to drive further to a store that has exactly identical layouts between them. But you can't tell it "the closest one, always".
Lucky you, you're probably a native (US) English speaker.
I dare you to try any STT in more "niche" languages or dialects. US English should not be considered the be-all-end-all indicator for progress in this area of research.
A few years back the BBC did a documentary series on fishermen from roughly the same part of Scotland where I grew up. To aid understanding, it was subtitled!
What I found amusing was that it was pretty clear the people shown talking were trying really hard to talk "properly"; their real accents would probably be much broader.
Pocketsphinx[1]. I used it and pico2wave, along with some old-school interactive-fiction-style code, to create a "computer" for my open source space game[2], so I can drive the spaceship around by talking to it[3][4].
Hi, my name is Alexander, and I am one of the authors of both Gradient pieces, Open STT, and silero.ai.
We are planning on adding 2-3 new languages soon (English, German, maybe Spanish), so if you would like us to fully open source everything - please support us here - https://opencollective.com/open_stt
A lot of feedback here, very nice!
As with the Russian speech community, for some reason people are willing to share their feedback on third-party forums but do not bother to write to the author directly.
If you have something to say, please email me at aveysov@gmail.com or message me on Telegram (@snakers4 or @snakers41) if you would like a faster reply.
With all that said, to answer some common criticisms / points raised below:
> Kaldi cool, NNs not cool
Though it is true that production solutions CAN be built with Kaldi (I have even spoken with people who built them, as well as with the vosk author), we purposely omitted Kaldi because we believe it is a technological dead end. It also depends a lot on antiquated technology, is very difficult for newcomers, etc.
Another problem is that it requires a lot of specific knowledge to add languages, whereas our approach is just plug-and-play: just add more data!
For example, our best model in production is just 300-500 lines of PyTorch code.
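To give a feel for what "300-500 lines" means, here is a toy sketch of that kind of compact CTC model. This is NOT our production model, just an illustration of the general shape (conv frontend + recurrent layers + CTC loss), with made-up sizes.

```python
import torch
import torch.nn as nn

class SmallSTT(nn.Module):
    """Toy CTC acoustic model: conv frontend + BiGRU + linear output."""
    def __init__(self, n_mels=80, hidden=256, vocab=34):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, mels):                  # mels: (batch, n_mels, time)
        x = self.conv(mels).transpose(1, 2)   # (batch, time', hidden)
        x, _ = self.rnn(x)
        return self.out(x).log_softmax(-1)    # log-probs for CTC

model = SmallSTT()
log_probs = model(torch.randn(2, 80, 400))    # dummy batch of spectrograms
targets = torch.randint(1, 34, (2, 20))       # dummy label sequences
loss = nn.CTCLoss(blank=0)(
    log_probs.transpose(0, 1),                                # CTC wants (time, batch, vocab)
    targets,
    torch.full((2,), log_probs.size(1), dtype=torch.long),    # input lengths
    torch.full((2,), 20, dtype=torch.long),                   # target lengths
)
```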
With so many resources poured into tools like PyTorch, it just makes sense to use tools that are simple, robust, and offer a lot of future-proofing and cool features. Also, you can of course switch DL frameworks if you wish =)
Also, you can do CV, NLP, etc. with PyTorch, unlike with Kaldi.
As someone noted, it does not really matter what you are using, as long as it suits your compute.
Fully e2e approaches are still GAFA-scale only, and I just realized why Google even bothers with them - please see my post here: https://t.me/snakers4/2445
> ML APIs changing w/o backwards compatibility
TF does that, and it is a joke. Paid marketing says TF is cool, but the real practitioners I know personally use PyTorch.
It is a holy war, but I believe TF just has a lot of captive audience...
The PyTorch core API, on the other hand, has been mostly stable since 0.4 (it is now on 1.4)! No one speaks about this for some reason.
Check out Open STT!
We have not been able to open source everything, because Russian corporations would use it without even crediting us.
But you can support us and change that. The same goes for other languages.
caito.de is really cool, albeit very small
Common Voice is very cool, but on the smaller side
libri-light - you have to align it yourself, so no real difference; it's easier to make it from scratch
Also, I do not understand why people use FLAC when Opus exists.
Hi Alexander,
I appreciate you posting here; I didn't feel strongly enough about this to sign up for an account and comment directly on TheGradient.
> Kaldi cool, NNs not cool
It seems you're misrepresenting things. Kaldi models are NNs; they are trained using a sequence-level objective function (LF-MMI, not entirely unlike CTC), and they can also be trained completely end-to-end (without using HMM-GMM systems to generate alignments for CE regularization and to obtain lattices). The Kaldi authors are currently working on moving NN training to PyTorch.
The fact is that, if you need a complete easy-to-use off-the-shelf solution ("Just give me the transcription of that audio file!"), Kaldi is not it. You need to train models and you need to use other projects, such as vosk, kaldi-gstreamer-server, etc. or build your own solution. People have done so and use it in production successfully. Use the right tool for the job, a poor workman blames his tools, yadda yadda.
Perhaps it's just that speech recognition IS difficult, there is no one-size-fits-all-use-cases solution, and when people can't figure out Kaldi, they seem to think they need to throw out everything and pray that CTC will fix it somehow. Get sufficient data in your domain and things might work better. Don't use a couple of audiobooks for training, expecting that it will perfectly transcribe Indian accented English callcenter conversations. I really appreciate your effort on releasing OpenSTT, in particular that you provide data in different conditions and domains (phone calls, lectures, broadcast radio, etc.). This is useful no matter what particular tool (Kaldi, DeepSpeech, w2l, RNN-T, transformers) people are using.