What are the advantages of Willow over a custom app running on a cheaper Android tablet? How much benefit is derived from the dual microphones, for instance? Are there hardware-level APIs that Willow offers which a custom Android app wouldn't be able to access?
I don't mean to nitpick, this looks really cool. I've just got a few old tablets lying around, and I'm trying to decide whether to spring for one of these instead of trying to make something work on what I've already got.
Willow uses a local ML model for wake word detection. Once wake is detected, the actual speech recognition runs in one of two user-configurable modes:
Local - Willow also includes the latest ESP SR Multinet 6 command recognition model. Willow will automatically pull entities from Home Assistant (when using Home Assistant) and define the speech grammar based on the friendly entity names. In this mode, the speech/audio never leaves the device and the speech recognition result is sent directly to Home Assistant.
Server - In this case, after wake is detected we immediately begin streaming audio to our highly optimized inference server implementation (release next week). Once the ESP BOX Voice Activity Detection signals end of speech, we send an end marker so the server executes Whisper, then we take the results and send them to Home Assistant (when Home Assistant is configured).
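To make the server mode flow concrete, here is a rough Python sketch of the sequence described above: stream audio after wake, stop when voice activity detection reports end of speech, send an end marker, and run recognition on what was buffered. This is not Willow's code or wire protocol. END_MARKER, InferenceSession, stream_after_wake, and the is_speech and transcribe callbacks are all hypothetical names made up for illustration, and transcribe merely stands in for running Whisper.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

# Hypothetical sentinel; the real end-of-speech marker is not described here.
END_MARKER = b"__END__"


@dataclass
class InferenceSession:
    """Server side: buffer streamed audio until the end marker arrives."""

    transcribe: Callable[[bytes], str]  # stand-in for running Whisper
    chunks: List[bytes] = field(default_factory=list)

    def receive(self, message: bytes) -> Optional[str]:
        # The end marker triggers recognition on everything buffered so far.
        if message == END_MARKER:
            audio = b"".join(self.chunks)
            self.chunks.clear()
            return self.transcribe(audio)
        self.chunks.append(message)
        return None


def stream_after_wake(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],  # stand-in for on-device VAD
    session: InferenceSession,
) -> Optional[str]:
    """Device side: call once wake is detected.

    Streams each frame as it is captured and stops at the first frame the
    VAD callback classifies as non-speech, then sends the end marker.
    """
    for frame in frames:
        session.receive(frame)
        if not is_speech(frame):
            break
    return session.receive(END_MARKER)
```

In a real setup the returned text would then be sent on to Home Assistant. A real VAD would also wait for a short run of silence rather than stopping at the first quiet frame, so treat the stopping rule here as a simplification.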
Doing reliable wake word detection and getting clean far-field speech (generally defined as 3 meters or more) in random/unknown environments that are typically less than ideal (background noise, acoustic echo, etc) is actually quite a challenge. Willow uses the dual microphones and the ESP SR AFE (audio front end) to do a variety of signal processing on the device to clean the speech.
The integration and engineering for anything resembling an Echo-like experience is very involved, down to physical attributes of the enclosure, microphone cavities, etc. There is an entire field and cottage industry of acoustic engineering on the hardware design for these applications.
The point is, providing an Echo/Willow-like experience is much, much, much more than putting a random microphone in a room. So with that, we don't plan to specifically support random devices because the outcome is almost certainly very poor and not something we're currently interested in supporting.
All of this keeps coming up and we will certainly document it.