Willow supports the Espressif ESP-SR speech recognition framework to do completely on-device speech recognition for up to 400 commands. When configured, we pull light and switch entities from Home Assistant and build the grammar to turn them on and off. There's no reason it has to be limited to that; we just need to do some extra work on dynamic configuration and tighter integration with Home Assistant so users can define up to 400 commands to do whatever they want with their various entities.
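For illustration, here's a minimal sketch of that entity pull and grammar build against the Home Assistant REST API (not Willow's actual code; the URL, token, and command phrasing are placeholders):

    # Sketch only: fetch light/switch entities from Home Assistant and build
    # on/off command phrases suitable for an on-device grammar like ESP-SR's.
    import requests

    HASS_URL = "http://homeassistant.local:8123"     # placeholder
    HASS_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder
    MAX_COMMANDS = 400                               # ESP-SR command limit

    def build_commands():
        resp = requests.get(
            f"{HASS_URL}/api/states",
            headers={"Authorization": f"Bearer {HASS_TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()

        commands = []
        for state in resp.json():
            entity_id = state["entity_id"]
            if not entity_id.startswith(("light.", "switch.")):
                continue
            name = state["attributes"].get("friendly_name", entity_id)
            commands.append(f"TURN ON {name.upper()}")
            commands.append(f"TURN OFF {name.upper()}")
        return commands[:MAX_COMMANDS]

    if __name__ == "__main__":
        for phrase in build_commands():
            print(phrase)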
With Willow's local command recognition I can turn my Wemo switches on and off, completely on device, in roughly 300 ms. That's not a typo. I'm going to make another demo video showing that.
We also support live streaming of audio after wake to our highly optimized Whisper inference server implementation (open source, releasing next week). That's what our current demo video uses [0]. It's really more intended for pro/commercial applications: it supports CPU but really flies with CUDA, where even on a GTX 1060 3GB you can do 3-5 seconds of speech in roughly 500 ms.
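Our server isn't out yet, so purely as an illustration of the kind of latency involved, here's a sketch using faster-whisper (a CTranslate2-based Whisper implementation); the model size, device settings, and file name are example values:

    # Illustrative sketch, not our inference server: time a Whisper
    # transcription of a short utterance on GPU with faster-whisper.
    import time

    from faster_whisper import WhisperModel

    # Use device="cpu", compute_type="int8" if you don't have a CUDA card.
    model = WhisperModel("base.en", device="cuda", compute_type="float16")

    def transcribe_file(path: str) -> str:
        start = time.perf_counter()
        segments, _info = model.transcribe(path, beam_size=1, language="en")
        # segments is a generator; joining it forces the actual decode.
        text = " ".join(segment.text.strip() for segment in segments)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"transcribed in {elapsed_ms:.0f} ms: {text}")
        return text

    if __name__ == "__main__":
        transcribe_file("utterance.wav")  # a few seconds of 16 kHz speech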
We also plan to have a Willow Home Assistant component to support Willow "stuff" while enabling use of any of the STT/TTS modules in Home Assistant (including another component for our self-hostable inference server that does special Willow stuff).
I think controlling the odd device, setting a timer, and adding items to a shopping list covers about 90% of my Alexa use. The remaining bits are asking it to play music or dropping into another room. Seems like a good portion of these could be covered already.
Have you considered K2/Sherpa for ASR instead of ctranslate2/faster-whisper? It's much better suited for streaming ASR (Whisper transcribes 30-second chunks, no streaming). They're also working on adding context biasing using Aho-Corasick automata to handle dynamic recognition of e.g. contact list entries or music library titles (https://github.com/k2-fsa/icefall/pull/1038).
Whisper, by design, processes 30-second chunks and doesn't support "streaming" in the strict sense.
You'll be able to see when we release our inference server implementation next week that it's "realtime" enough to fool nearly anyone, especially in an application like this where you aren't consuming model output in real time. You're streaming speech, buffering on the server, waiting for the end of voice activity detection, running Whisper, taking the transcription, and doing something with it. Other than a cool demo, I'm not really sure what streaming ASR output provides, but that's probably a lack of imagination on my part :).
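As a rough sketch of that flow (assuming 16 kHz, 16-bit mono PCM frames from the client, webrtcvad for endpointing, and a placeholder transcribe() where Whisper would go; this is not Willow's actual server code):

    # Sketch of the buffer -> end-of-speech -> transcribe -> act loop.
    import collections

    import numpy as np
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30                        # webrtcvad accepts 10/20/30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2
    SILENCE_FRAMES_TO_END = 20           # ~600 ms of silence ends the utterance

    vad = webrtcvad.Vad(2)               # aggressiveness 0-3

    def transcribe(audio: np.ndarray) -> str:
        # Placeholder: plug Whisper in here (e.g. the faster-whisper sketch above).
        return "<transcript>"

    def handle_utterance(audio_bytes: bytes) -> None:
        # Convert buffered PCM to float32, transcribe, then act on the result
        # (intent matching, a Home Assistant service call, etc.).
        audio = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        print("transcript:", transcribe(audio))

    def run(frames):
        """Consume an iterator of FRAME_BYTES-sized PCM frames after wake."""
        buffered = bytearray()
        recent = collections.deque(maxlen=SILENCE_FRAMES_TO_END)
        for frame in frames:
            buffered.extend(frame)
            recent.append(vad.is_speech(frame, SAMPLE_RATE))
            # End of speech: the last N frames were all judged non-speech.
            if len(recent) == recent.maxlen and not any(recent):
                handle_utterance(bytes(buffered))
                buffered.clear()
                recent.clear()

The ASR only runs once per utterance, after end of speech, which is why streaming model output isn't strictly needed for this kind of application.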
That said, these are great pointers and we're certainly not opposed to it! At the end of the day Willow does the "work on the ground" of detecting wake word, getting clean audio, and streaming the audio. Where it goes and what happens then is up to you! There's no reason at all we couldn't support streaming ASR output.
[0] - https://www.youtube.com/watch?v=8ETQaLfoImc