Willow supports the Espressif ESP-SR speech recognition framework to do completely on-device speech recognition for up to 400 commands. When configured, we pull light and switch entities from Home Assistant and build the grammar to turn them on and off. There's no reason it has to be limited to that; we just need to do some extra work on dynamic configuration and tighter integration with Home Assistant so users can define up to 400 commands to do whatever they want with their various entities.
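For illustration, here's a minimal sketch of that entity pull and grammar build against the Home Assistant REST API (not Willow's actual code; the URL, token, and command phrasing are placeholders):

    # Sketch only: fetch light/switch entities from Home Assistant and build
    # on/off command phrases suitable for an on-device grammar like ESP-SR's.
    import requests

    HASS_URL = "http://homeassistant.local:8123"     # placeholder
    HASS_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder
    MAX_COMMANDS = 400                               # ESP-SR command limit

    def build_commands():
        resp = requests.get(
            f"{HASS_URL}/api/states",
            headers={"Authorization": f"Bearer {HASS_TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()

        commands = []
        for state in resp.json():
            entity_id = state["entity_id"]
            if not entity_id.startswith(("light.", "switch.")):
                continue
            name = state["attributes"].get("friendly_name", entity_id)
            commands.append(f"TURN ON {name.upper()}")
            commands.append(f"TURN OFF {name.upper()}")
        return commands[:MAX_COMMANDS]

    if __name__ == "__main__":
        for phrase in build_commands():
            print(phrase)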
With Willow's local command recognition I can turn my Wemo switches on and off, completely on device, in roughly 300 ms. That's not a typo. I'm going to make another demo video showing that.
We also support live streaming of audio after wake to our highly optimized Whisper inference server implementation (open source, releasing next week). That's what our current demo video uses [0]. It's really more intended for pro/commercial applications: it supports CPU but really flies with CUDA, where even on a GTX 1060 3GB you can do 3-5 seconds of speech in roughly 500 ms.
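Our server isn't out yet, so purely as an illustration of the kind of latency involved, here's a sketch using faster-whisper (a CTranslate2-based Whisper implementation); the model size, device settings, and file name are example values:

    # Illustrative sketch, not our inference server: time a Whisper
    # transcription of a short utterance on GPU with faster-whisper.
    import time

    from faster_whisper import WhisperModel

    # Use device="cpu", compute_type="int8" if you don't have a CUDA card.
    model = WhisperModel("base.en", device="cuda", compute_type="float16")

    def transcribe_file(path: str) -> str:
        start = time.perf_counter()
        segments, _info = model.transcribe(path, beam_size=1, language="en")
        # segments is a generator; joining it forces the actual decode.
        text = " ".join(segment.text.strip() for segment in segments)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"transcribed in {elapsed_ms:.0f} ms: {text}")
        return text

    if __name__ == "__main__":
        transcribe_file("utterance.wav")  # a few seconds of 16 kHz speech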
We also plan to have a Willow Home Assistant component to support Willow "stuff" while enabling use of any of the STT/TTS modules in Home Assistant (including another component for our self-hostable inference server that does special Willow stuff).
I think controlling the odd device, setting a timer, and adding items to a shopping list covers about 90% of my Alexa use. The remaining bits are asking it to play music or dropping into another room. Seems like a good portion of these could be covered already.
Have you considered K2/Sherpa for ASR instead of ctranslate2/faster-whisper? It's much better suited for streaming ASR (Whisper transcribes 30-second chunks, no streaming). They're also working on adding context biasing using Aho-Corasick automata to handle dynamic recognition of e.g. contact list entries or music library titles (https://github.com/k2-fsa/icefall/pull/1038).
Whisper, by design, processes 30-second chunks and doesn't support "streaming" in the strict sense.
You'll be able to see when we release our inference server implementation next week that it's "realtime" enough to fool nearly anyone, especially in an application like this where you aren't consuming model output in real time. You're streaming speech, buffering on the server, waiting for the end of voice activity detection, running Whisper, taking the transcription, and doing something with it. Other than a cool demo, I'm not really sure what streaming ASR output provides, but that's probably a lack of imagination on my part :).
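As a rough sketch of that flow (assuming 16 kHz, 16-bit mono PCM frames from the client, webrtcvad for endpointing, and a placeholder transcribe() where Whisper would go; this is not Willow's actual server code):

    # Sketch of the buffer -> end-of-speech -> transcribe -> act loop.
    import collections

    import numpy as np
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30                        # webrtcvad accepts 10/20/30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2
    SILENCE_FRAMES_TO_END = 20           # ~600 ms of silence ends the utterance

    vad = webrtcvad.Vad(2)               # aggressiveness 0-3

    def transcribe(audio: np.ndarray) -> str:
        # Placeholder: plug Whisper in here (e.g. the faster-whisper sketch above).
        return "<transcript>"

    def handle_utterance(audio_bytes: bytes) -> None:
        # Convert buffered PCM to float32, transcribe, then act on the result
        # (intent matching, a Home Assistant service call, etc.).
        audio = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        print("transcript:", transcribe(audio))

    def run(frames):
        """Consume an iterator of FRAME_BYTES-sized PCM frames after wake."""
        buffered = bytearray()
        recent = collections.deque(maxlen=SILENCE_FRAMES_TO_END)
        for frame in frames:
            buffered.extend(frame)
            recent.append(vad.is_speech(frame, SAMPLE_RATE))
            # End of speech: the last N frames were all judged non-speech.
            if len(recent) == recent.maxlen and not any(recent):
                handle_utterance(bytes(buffered))
                buffered.clear()
                recent.clear()

The ASR only runs once per utterance, after end of speech, which is why streaming model output isn't strictly needed for this kind of application.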
That said, these are great pointers and we're certainly not opposed to it! At the end of the day Willow does the "work on the ground" of detecting wake word, getting clean audio, and streaming the audio. Where it goes and what happens then is up to you! There's no reason at all we couldn't support streaming ASR output.
[0] - https://www.youtube.com/watch?v=8ETQaLfoImc