Live Captions: an application that provides live captions for the Linux desktop (github.com/abb128)
122 points by marcodiego on Dec 18, 2022 | 20 comments



This is really cool. I really enjoyed having this built into my Pixel a few years ago (super useful when you want to watch a video in public but don't have headphones). The implementation in Chrome doesn't work that well.

It would be great to support both OpenAI's Whisper model and a customized vocabulary (I find that a lot of transcription errors happen because the set of words I'm exposed to doesn't necessarily fall inside the most common 50k or 100k words).


Looks like this isn't based on OpenAI's Whisper, but on a model trained with RNN-T loss using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). As such, it should in principle support a customized vocabulary and an LM for beam search decoding.
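
To make the custom-vocabulary idea concrete, here is a toy sketch of re-ranking an n-best list with a bonus for user-supplied words. This is not icefall's API: the function name `rescore`, the `bonus` value, and the example hypotheses and scores are all made up for illustration. Real systems usually apply this kind of biasing inside the beam search itself rather than afterwards.

```python
def rescore(hypotheses, custom_words, bonus=1.5):
    """Re-rank (text, log_prob) pairs, adding a bonus per custom-vocabulary word.

    Hypothetical helper for illustration only; the bonus value is arbitrary.
    """
    custom = {w.lower() for w in custom_words}
    rescored = []
    for text, log_prob in hypotheses:
        # Count how many words in this hypothesis are in the custom vocabulary.
        hits = sum(1 for w in text.lower().split() if w in custom)
        rescored.append((text, log_prob + bonus * hits))
    # Highest score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up n-best list: the acoustic model slightly prefers the wrong split.
nbest = [
    ("train the model with ice fall", -4.0),
    ("train the model with icefall", -4.6),
]
best_text, best_score = rescore(nbest, ["icefall", "Kaldi"])[0]
print(best_text)  # train the model with icefall
```

With the bonus applied, the hypothesis containing the custom word overtakes the one the model originally ranked first.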


Is Whisper fast enough on a local machine to give live captions? I tried the Python module and it takes a long time to process audio files.


There are various size options for the models. Smaller models run faster but are less accurate; larger models are more accurate but slower.

There is also a C++ re-implementation that performs well and can definitely transcribe in realtime on many machines: https://github.com/ggerganov/whisper.cpp


I tried it on a laptop with an 8th gen i5 (no dGPU) and transcribing an hour-long interview took something like 2+ hours? I didn't time it precisely, but it was slower than real time. Uploading to otter.ai took a few minutes by comparison. The accuracy was pretty much the same. I have no clue how much faster a consumer device with a dedicated GPU might be, though.
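
The usual way to express this is the real-time factor (RTF): processing time divided by audio duration. Anything below 1.0 can in principle keep up with live audio. A minimal sketch, plugging in the rough numbers above (about 2 hours of processing for 1 hour of audio, which is an approximation, not a measurement):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Approximate figures from the comment: ~2 h of processing for 1 h of audio.
rtf = real_time_factor(2 * 3600, 1 * 3600)
print(f"RTF = {rtf:.1f}, live-capable: {rtf < 1.0}")
# prints "RTF = 2.0, live-capable: False"
```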


Are you on Windows? Chrome has live captions now, not sure if it has it on other OSes. I believe you can play local files in Chrome too.


No, I'm on ZorinOS. It takes twice as long as the clip length on the small model.


> Only the English language is supported currently.

Shame, but maybe not surprising as it can't / doesn't use the GPU. One of the most impressive things about the Whisper demo was its ability to do live translation IMNHO.


What are live captions?


Text captions that are generated live for videos that don't already have a pre-existing subtitle/caption file.

Live Captions are a vital accessibility resource, especially for D/deaf and Hard of Hearing people.


Pixel phones have it and I love it


Every Android phone seems to have it, but it is also only available for English.


Live Caption works on:

• Pixel 2 and later phones in English

• Pixel 6 and Pixel 6 Pro phones in English, French, German, Italian, Japanese and Spanish

• On Pixel 6 and Pixel 6 Pro phones, you can also identify these languages and translate media captions automatically through auto-detection.

• Other select Android phones

Source: https://support.google.com/accessibility/android/answer/9350...


I long for the day I can get accurate live transcriptions streamed to AR glasses. Would really revolutionize communication, but also language learning.


Between https://www.spectacles.com/new-spectacles and Android's Live Caption, I bet you can do that today.


I wonder what data the model was trained on and what license the data and model are available under.


Looking at the repo, it appears to be based on a model trained using K2/icefall (https://github.com/k2-fsa/icefall; K2 is the next generation of Kaldi). It claims to be based on an LSTM Transducer model trained on the Librispeech dataset as the base, and further "trained with some extra data". No further information is provided.

Librispeech is CC-BY 4.0 licensed (see http://www.openslr.org/12/).


And FYI if you use Mac, it's built into the OS.


Also built into Windows 11 as of the 22H2 release, just for the record.

That said, I may have to try this out. I've wanted to give Linux another go on my desktop, but since I use captions rather heavily, that's been a disincentive. I'll have to see how this stacks up against other options: Apple's live captioning doesn't work as well as I'd like; Google's live captioning on the Pixel is great; Microsoft's live captions on Windows are pretty fantastic.

Unfortunately I'm tied to iOS for the longer term since my hearing aid (a Resound model) integrates well with iPhones but not so much with Android, unless I want to buy an additional $400+ accessory.


I've been incredibly happy with the Win 11 captions as well; it's spoiled me on every other automatic captioning resource, honestly.

I saw a video (from Microsoft) about how their captions are best in class, especially for accents, and I can believe it. I work with people from around the world and the captions are almost always correct, except for contextually ambiguous words with a very heavy accent (like "job" for "shop").



