Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above the tiny or base models. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and you then have to post-process the output with a good NLP pipeline to make it usable for whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet, I hope it gets really good. But it isn't competitive with top-of-the-line commercial offerings if you need to ship something today.
At the company I work for, we are currently transitioning from a German-only to an international company. Recently, we started using Whisper to live-transcribe and translate all of our (German) all-hands meetings into English. Yes, it required some tuning (e.g. choosing the size of the audio chunks you pass to Whisper), but overall it's been working incredibly well, almost flawlessly. I don't recall which model size we use, but it does run on a beefy GPU.
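For the curious, the shape of it is roughly this (a sketch, not our actual code; it assumes the Python openai-whisper package, and the model choice and chunk length here are illustrative; the chunk length is the knob we had to tune):

    import whisper

    CHUNK_SECONDS = 30  # tuning knob: too short loses context, too long adds latency

    model = whisper.load_model("medium")  # illustrative; pick whatever fits your GPU

    def translate_chunk(path: str) -> str:
        # task="translate" makes Whisper emit English text for non-English audio
        result = model.transcribe(path, language="de", task="translate")
        return result["text"]

The live pipeline just keeps writing fixed-length audio chunks to disk and feeding them through a function like this.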
What if you are operating in an internet-denied environment? A remote radio-sensing system with a low-bandwidth back channel, or an airplane cabin, just to name two.
Actively working on it. I haven't noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU; your use case may differ). It seems to do fairly well even with slightly noisy inputs, and it certainly offers better bang for the buck than other non-API solutions that support my native language.
While true real-time would definitely be nice, I can approximate it well enough with various audio-slicing techniques.
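For anyone wondering what "audio slicing" looks like in practice, here's a rough sketch of the idea (assuming the Python openai-whisper package plus sounddevice for capture; model choice and slice length are illustrative, not a definitive implementation):

    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono
    SLICE_SECONDS = 5      # shorter slices = lower latency but less context

    model = whisper.load_model("small")

    while True:
        # Record one fixed-length slice; blocking=True waits until the buffer fills.
        audio = sd.rec(int(SLICE_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                       blocking=True)
        # transcribe() accepts a float32 numpy array directly
        result = model.transcribe(audio.flatten(), fp16=False)
        print(result["text"])

In a real setup you'd capture continuously on a separate thread and overlap slices slightly so words aren't cut at the boundaries, but this is the basic shape of it.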
That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp the "small" model tends to run slightly faster than real time, and "base" and "tiny" much faster.
Doesn't even have to be that modern; my Ivy Bridge CPU already achieves faster-than-realtime performance, which makes me wonder if there is maybe some startup cost for the GPU-based solution, and whether it would outperform the CPU only on longer clips.
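One way to test the startup-cost hypothesis is to time the model load and the transcription separately and compute the realtime factor. A quick sketch, assuming the Python openai-whisper package and a local audio.wav (both illustrative):

    import time
    import whisper

    t0 = time.perf_counter()
    model = whisper.load_model("small")      # one-time startup cost (weights -> RAM/VRAM)
    load_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    result = model.transcribe("audio.wav")   # per-clip cost
    transcribe_s = time.perf_counter() - t0

    # Realtime factor: transcription time divided by audio duration.
    # Segments carry start/end timestamps, so the last "end" approximates the duration.
    audio_s = result["segments"][-1]["end"]
    print(f"load: {load_s:.1f}s  transcribe: {transcribe_s:.1f}s  "
          f"rtf: {transcribe_s / audio_s:.2f} (<1.0 is faster than realtime)")

If the GPU only wins once the clip is long enough to amortize the load time, this should make it obvious.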