This is really cool. I really enjoyed having this built into my Pixel a few years ago (super useful when you want to watch a video in public but don't have headphones). The implementation in Chrome doesn't work that well.
It would be great to support both OpenAI's Whisper model and a customized vocabulary (I find that a lot of transcription errors happen because the set of words I'm exposed to doesn't necessarily fall inside the most common 50k or 100k words).
Looks like this isn't based on OpenAI's Whisper, but on a model trained with RNN-T loss using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). As such, it should in principle support a customized vocabulary and an external LM for beam-search decoding.
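To illustrate what "customized vocabulary and LM for beam search decoding" could look like: one common technique is shallow fusion, where each hypothesis's model score is combined with an external LM or biasing score during decoding, so rare domain terms can out-compete acoustically similar common words. A minimal toy sketch (all names, weights, and words here are illustrative assumptions, not taken from the actual project):

```python
# Toy sketch of shallow-fusion-style vocabulary biasing: combine the
# model's log-probability with a bonus for words on a custom list.
# CUSTOM_VOCAB, LM_WEIGHT, and BIAS_BONUS are made-up illustrative values.

CUSTOM_VOCAB = {"icefall", "transducer"}  # domain terms to bias toward
LM_WEIGHT = 0.5    # weight on the external LM / biasing score
BIAS_BONUS = 2.0   # log-prob boost for custom-vocabulary words

def fused_score(model_logprob: float, word: str) -> float:
    """Combine the model's log-prob with a simple biasing 'LM' score."""
    lm_logprob = BIAS_BONUS if word in CUSTOM_VOCAB else 0.0
    return model_logprob + LM_WEIGHT * lm_logprob

def pick_word(candidates: dict) -> str:
    """Choose the candidate word with the best fused score."""
    return max(candidates, key=lambda w: fused_score(candidates[w], w))

# The model slightly prefers the common spelling "ice fall",
# but the biasing bonus flips the decision toward the domain term.
candidates = {"ice fall": -1.0, "icefall": -1.5}
print(pick_word(candidates))  # prints "icefall"
```

A real RNN-T decoder would apply this per token inside beam search rather than per whole word, but the scoring idea is the same.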
I tried it on a laptop with an 8th-gen i5 (no dGPU), and transcribing an hour-long interview took something like 2+ hours. I didn't time it precisely, but it was slower than real time; uploading to otter.ai took a few minutes by comparison, and the accuracy was pretty much the same. I have no clue how much faster it might run on a consumer device with a dedicated GPU, though.
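Speed comparisons like this are usually quoted as a real-time factor (processing time divided by audio duration; above 1.0 means slower than real time). A quick sketch using the rough numbers from the comment above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF > 1 is slower than real time."""
    return processing_seconds / audio_seconds

# Roughly the numbers above: a 1-hour interview taking ~2 hours to transcribe.
print(real_time_factor(2 * 3600, 3600))  # prints 2.0
```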
> Only the English language is supported currently.
Shame, but maybe not surprising since it can't (or doesn't) use the GPU. One of the most impressive things about the Whisper demo was its ability to do live translation, IMHO.
I long for the day I can get accurate live transcriptions streamed to AR glasses. Would really revolutionize communication, but also language learning.
Looking at the repo, it appears to be based on a model trained using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). It claims to use an LSTM Transducer model trained on the LibriSpeech dataset as the base, further "trained with some extra data". No further information is provided.
Also built into Windows 11 as of the 22H2 release, just for the record.
That said, I may have to try this out - I've wanted to give Linux another go on my desktop, but since I use captions rather heavily, that's been a disincentive. I'll have to see how this stacks up against other options: Apple's live captioning doesn't work as well as I'd like; Google's live captioning on the Pixel is great; Microsoft's live captions on Windows are pretty fantastic.
Unfortunately I'm tied to iOS for the longer-term since my hearing aid (a Resound model) integrates well with iPhones but not so much with Android, unless I want to buy an additional $400+ accessory.
I've been incredibly happy with the Win 11 captions as well; it's spoiled me on every other automatic captioning resource, honestly.
I saw a video (from Microsoft) about how their captions are best in class, especially for accents, and I can believe it. I work with people from around the world and the captions are almost always correct, except for contextually ambiguous words with a very heavy accent (like "job" for "shop").