This is really cool. I really enjoyed having this built into my Pixel a few years ago (super useful when you want to watch a video in public but don't have headphones). The implementation in Chrome doesn't work that well.
It would be great to support both OpenAI's Whisper model and a customized vocabulary (I find that a lot of transcription errors happen because the set of words I'm exposed to doesn't necessarily fall inside the most common 50k or 100k words).
Looks like this isn't based on OpenAI's Whisper, but on a model trained with RNN-T loss using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). As such, it should in principle support a customized vocabulary and an external LM for beam-search decoding.
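To illustrate what "customized vocabulary and LM for beam search decoding" could look like: one common technique is shallow fusion, where each hypothesis's model score is combined with an external LM or biasing score during decoding, so rare domain terms can out-compete acoustically similar common words. A minimal toy sketch (all names, weights, and words here are illustrative assumptions, not taken from the actual project):

```python
# Toy sketch of shallow-fusion-style vocabulary biasing: combine the
# model's log-probability with a bonus for words on a custom list.
# CUSTOM_VOCAB, LM_WEIGHT, and BIAS_BONUS are made-up illustrative values.

CUSTOM_VOCAB = {"icefall", "transducer"}  # domain terms to bias toward
LM_WEIGHT = 0.5    # weight on the external LM / biasing score
BIAS_BONUS = 2.0   # log-prob boost for custom-vocabulary words

def fused_score(model_logprob: float, word: str) -> float:
    """Combine the model's log-prob with a simple biasing 'LM' score."""
    lm_logprob = BIAS_BONUS if word in CUSTOM_VOCAB else 0.0
    return model_logprob + LM_WEIGHT * lm_logprob

def pick_word(candidates: dict) -> str:
    """Choose the candidate word with the best fused score."""
    return max(candidates, key=lambda w: fused_score(candidates[w], w))

# The model slightly prefers the common spelling "ice fall",
# but the biasing bonus flips the decision toward the domain term.
candidates = {"ice fall": -1.0, "icefall": -1.5}
print(pick_word(candidates))  # prints "icefall"
```

A real RNN-T decoder would apply this per token inside beam search rather than per whole word, but the scoring idea is the same.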
I tried it on a laptop with an 8th-gen i5 (no dGPU), and transcribing an hour-long interview took something like 2+ hours. I didn't time it precisely, but it was slower than real time; uploading to otter.ai took a few minutes by comparison, and the accuracy was pretty much the same. I have no clue how much faster it might run on a consumer device with a dedicated GPU, though.
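Speed comparisons like this are usually quoted as a real-time factor (processing time divided by audio duration; above 1.0 means slower than real time). A quick sketch using the rough numbers from the comment above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF > 1 is slower than real time."""
    return processing_seconds / audio_seconds

# Roughly the numbers above: a 1-hour interview taking ~2 hours to transcribe.
print(real_time_factor(2 * 3600, 3600))  # prints 2.0
```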
> Only the English language is supported currently.
Shame, but maybe not surprising since it can't (or doesn't) use the GPU. One of the most impressive things about the Whisper demo was its ability to do live translation, IMHO.
I long for the day I can get accurate live transcriptions streamed to AR glasses. Would really revolutionize communication, but also language learning.
Looking at the repo, it appears to be based on a model trained using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). It claims to use an LSTM Transducer model trained on the LibriSpeech dataset as the base, further "trained with some extra data". No further information is provided.
Also built into Windows 11 as of the 22H2 release, just for the record.
That said, I may have to try this out - I've wanted to give Linux another go on my desktop, but since I use captions rather heavily, that's been a disincentive. I'll have to see how this stacks up against other options: Apple's live captioning doesn't work as well as I'd like; Google's live captioning on the Pixel is great; Microsoft's live captions on Windows are pretty fantastic.
Unfortunately I'm tied to iOS for the longer-term since my hearing aid (a Resound model) integrates well with iPhones but not so much with Android, unless I want to buy an additional $400+ accessory.
I've been incredibly happy with the Win 11 captions as well; it's spoiled me on every other automatic captioning resource, honestly.
I saw a video (from Microsoft) about how their captions are best in class, especially for accents, and I can believe it. I work with people from around the world and the captions are almost always correct, except for contextually ambiguous words with a very heavy accent (like "job" for "shop").