Have you considered K2/Sherpa for ASR instead of ctranslate2/faster-whisper? It's much better suited for streaming ASR (Whisper transcribes 30-second chunks, no streaming). They're also working on adding context biasing using Aho-Corasick automata, to handle dynamic recognition of e.g. contact list entries or music library titles (https://github.com/k2-fsa/icefall/pull/1038).
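To make the context-biasing idea concrete: the icefall PR applies the biasing inside the decoder's beam search, but the underlying Aho-Corasick primitive is just multi-pattern string matching in a single pass. A minimal illustrative sketch using the pyahocorasick package (not the icefall implementation), with a hypothetical contact list:

```python
# Illustrative only: icefall biases scores during decoding, not via
# post-hoc string matching. This just shows the Aho-Corasick primitive.
# Requires: pip install pyahocorasick
import ahocorasick

# Hypothetical dynamic entries, e.g. a user's contact list.
contacts = ["anna karlsson", "joe walsh", "lena horne"]

automaton = ahocorasick.Automaton()
for name in contacts:
    automaton.add_word(name, name)
automaton.make_automaton()

hypothesis = "call joe walsh on his mobile"
# iter() yields (end_index, value) for every match in one pass,
# regardless of how many patterns were added.
for end_index, name in automaton.iter(hypothesis):
    print(f"biased entity matched: {name!r} ending at index {end_index}")
```

The appeal is that the automaton is built once per user context and then matches all entries simultaneously in linear time, which is what makes it practical for large, frequently changing lists.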
Whisper, by design, works on 30-second chunks and doesn't support streaming in the strict sense.
When we release our inference server implementation next week, you'll see it's close enough to realtime to fool nearly anyone, especially with an application like this where you aren't consuming model output as it's produced. The flow is: stream speech, buffer on the server, wait for voice activity detection to signal the end of speech, run Whisper, take the transcription, and do something with it. Other than a cool demo I'm not really sure what streaming ASR output provides, but that's probably a lack of imagination on my part :).
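A minimal sketch of that buffer-until-endpoint flow, assuming 16 kHz mono 16-bit PCM input, 30 ms frames, and a `read_frame()` source you'd replace with your real audio stream; the model size and silence threshold are placeholders, not Willow's actual server code:

```python
# Sketch: endpoint with webrtcvad, then transcribe the whole utterance
# with faster-whisper once the speaker stops talking.
# Requires: pip install webrtcvad faster-whisper numpy
import numpy as np
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each
END_OF_SPEECH_FRAMES = 20                          # ~600 ms of trailing silence

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)
model = WhisperModel("small", compute_type="int8")

def transcribe_utterance(read_frame):
    """Buffer frames until VAD reports sustained silence, then run Whisper."""
    buffered = bytearray()
    silent_frames = 0
    while silent_frames < END_OF_SPEECH_FRAMES:
        frame = read_frame(FRAME_BYTES)  # raw PCM bytes from the client stream
        if not frame:
            break
        buffered.extend(frame)
        if vad.is_speech(frame, SAMPLE_RATE):
            silent_frames = 0
        else:
            silent_frames += 1
    # faster-whisper accepts a float32 array in [-1, 1].
    audio = np.frombuffer(bytes(buffered), dtype=np.int16).astype(np.float32) / 32768.0
    segments, _info = model.transcribe(audio, language="en")
    return " ".join(segment.text.strip() for segment in segments)
```

The point of the design is that latency is dominated by the endpointing delay plus one batch transcription, which for short commands is hard to distinguish from true streaming.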
That said, these are great pointers, and we're certainly not opposed to it! At the end of the day Willow does the "work on the ground" of detecting the wake word, getting clean audio, and streaming that audio. Where it goes and what happens then is up to you! There's no reason at all we couldn't support streaming ASR output.