Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above the tiny or base models. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and you then have to post-process the output with a good NLP pipeline to make it usable for whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet, I hope it gets really good. But it isn't competitive with top-of-the-line commercial offerings if you need to ship something today.
At the company I work for, we are currently transitioning from a German-only to an international company. Recently, we started using Whisper to live-transcribe and translate all of our (German) all-hands meetings into English. Yes, it required some tuning (e.g. choosing the size of the audio chunks you pass to Whisper), but overall it's been working incredibly well, almost flawlessly. I don't recall which model size we use, but it does run on a beefy GPU.
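For the curious, the shape of it is roughly this (a sketch, not our actual code; it assumes the Python openai-whisper package, and the model choice and chunk length here are illustrative; the chunk length is the knob we had to tune):

    import whisper

    CHUNK_SECONDS = 30  # tuning knob: too short loses context, too long adds latency

    model = whisper.load_model("medium")  # illustrative; pick whatever fits your GPU

    def translate_chunk(path: str) -> str:
        # task="translate" makes Whisper emit English text for non-English audio
        result = model.transcribe(path, language="de", task="translate")
        return result["text"]

The live pipeline just keeps writing fixed-length audio chunks to disk and feeding them through a function like this.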
What if you are operating in an internet-denied environment? A remote radio-sensing system with a low-bandwidth back channel, or an airplane cabin, just to name two.
Actively working on it. I haven't noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU; your use case may differ). It seems to do fairly well even with slightly noisy inputs, and it certainly offers better bang for the buck than other non-API solutions that support my native language.
While true real-time would definitely be nice, I can approximate it well enough with various audio-slicing techniques.
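For anyone wondering what "audio slicing" looks like in practice, here's a rough sketch of the idea (assuming the Python openai-whisper package plus sounddevice for capture; model choice and slice length are illustrative, not a definitive implementation):

    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono
    SLICE_SECONDS = 5      # shorter slices = lower latency but less context

    model = whisper.load_model("small")

    while True:
        # Record one fixed-length slice; blocking=True waits until the buffer fills.
        audio = sd.rec(int(SLICE_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                       blocking=True)
        # transcribe() accepts a float32 numpy array directly
        result = model.transcribe(audio.flatten(), fp16=False)
        print(result["text"])

In a real setup you'd capture continuously on a separate thread and overlap slices slightly so words aren't cut at the boundaries, but this is the basic shape of it.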
That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp the "small" model tends to run slightly faster than real time, and "base" and "tiny" much faster.
Doesn't even have to be that modern; my Ivy Bridge CPU already achieves faster-than-realtime performance, which makes me wonder if there is maybe some startup cost for the GPU-based solution, and whether it would outperform the CPU only on longer clips.
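One way to test the startup-cost hypothesis is to time the model load and the transcription separately and compute the realtime factor. A quick sketch, assuming the Python openai-whisper package and a local audio.wav (both illustrative):

    import time
    import whisper

    t0 = time.perf_counter()
    model = whisper.load_model("small")      # one-time startup cost (weights -> RAM/VRAM)
    load_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    result = model.transcribe("audio.wav")   # per-clip cost
    transcribe_s = time.perf_counter() - t0

    # Realtime factor: transcription time divided by audio duration.
    # Segments carry start/end timestamps, so the last "end" approximates the duration.
    audio_s = result["segments"][-1]["end"]
    print(f"load: {load_s:.1f}s  transcribe: {transcribe_s:.1f}s  "
          f"rtf: {transcribe_s / audio_s:.2f} (<1.0 is faster than realtime)")

If the GPU only wins once the clip is long enough to amortize the load time, this should make it obvious.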