If I'm being perfectly honest, I'm surprised we got it this far already. If I wanted to be really critical:
- Far-field speech is actually kind of hard. There are dozens of "knobs" we can tweak across the various component libraries, etc. to improve speech quality and reliability for more users in more environments. We've tested as much as we can considering there are only two of us, but we need more testing from more speakers in more environments.
- On the wire/protocol stuff. We're doing a pretty rudimentary "open new connection, stream voice, POST somewhere". This adds extra latency and CPU usage because of repeated TLS handshakes, etc. We have plans to use WebSockets and what-not to cut down on this.
- We don't really support audio playback yet. For a real "Amazon Echo" type experience you need to be able to ask it random things like "Hey what's the weather outside?" and it needs to "tell" you.
- Ecosystem support. Using the example above, something like Home Assistant needs to know where you are, get the weather, do text-to-speech, etc. for Willow to be able to play it back.
- Other integrations. Alexa has "skills" and stuff and we need to be able to talk to more things.
- UI/UX work. We support the touch display, but we did just enough to show colors, print status, add a button, and make a touch cursor that follows your finger around. We also only give audio feedback with a kind of annoying tone that beeps once for success and twice for failure.
- Speaking of failure, we don't do a great job of telling you what went wrong and where.
- Configuration and flashing. It's very static and has multiple steps. There are all kinds of things that need to get done to make Willow easy enough for less-technical users to deploy and actually use daily without any hassle.
- Local command recognition. It's very early, but as noted in the README, wiki, etc., the ESP BOX itself can recognize up to 400 commands directly on the device. In testing it works surprisingly well, but we have a lot of work to do to make it actually practical for most people.
- Open sourcing our inference server. We plan to do this next week!
> the ESP BOX itself can recognize up to 400 commands directly on the device.
That's really cool! Does this mean 400 specific commands, e.g. "turn on the living room lights", or 400 commands that can be applied to different targets, e.g. "turn on the X lights" where X is some light? (400 actually feels like it would be enough to speed up the vast majority of interactions either way, but I'm curious :)
400 commands where "turn on X" is one and "turn off X" is two.
With Home Assistant this means turning on and off two hundred entities. We currently pull light and switch entities from Home Assistant and build the local Multinet speech grammar.
We have goals for better dynamic and adaptive configuration of Willow. Part of that is a Willow Home Assistant component with user configuration in the HA dashboard, etc. to easily select entities, define commands, and dynamically update all associated Willow devices.
We feel that, with this, 400 commands is enough to be practical and useful. Additionally, because the Multinet model returns a probability for each command match, "fuzzy matching" actually works quite well: "light", "lights", and slightly mis-worded commands still match correctly.
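To make the entity-to-grammar step concrete, here's a rough sketch of the idea (not Willow's actual build code): pull light and switch entities from the standard Home Assistant REST API and emit a "turn on"/"turn off" pair per entity, capped at the 400-command Multinet limit. The URL and token below are placeholders.

```python
# Sketch: build a "turn on X"/"turn off X" command list from Home Assistant
# entities. Illustration only; the HA REST API calls are standard, but the
# Multinet output handling here is a stand-in.
import requests

HA_URL = "http://homeassistant.local:8123"   # assumption: your HA instance
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # assumption: a long-lived token

def fetch_states():
    """Return all entity states from Home Assistant."""
    resp = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def build_commands(states, domains=("light", "switch")):
    """Generate two commands per light/switch entity, capped at 400 total."""
    commands = []
    for state in states:
        entity_id = state["entity_id"]
        if entity_id.split(".")[0] not in domains:
            continue
        name = state["attributes"].get("friendly_name", entity_id)
        commands.append(f"turn on {name}")
        commands.append(f"turn off {name}")
    return commands[:400]  # Multinet recognizes up to 400 commands on device

if __name__ == "__main__":
    for i, cmd in enumerate(build_commands(fetch_states())):
        print(i, cmd)
```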
> - On the wire/protocol stuff. We're doing a pretty rudimentary "open new connection, stream voice, POST somewhere". This adds extra latency and CPU usage because of repeated TLS handshakes, etc. We have plans to use WebSockets and what-not to cut down on this.
I've recently used the Noise protocol[1] to do some encrypted communication between two services I control that are separated by the internet.
Thanks for mentioning Noise! I've certainly looked at it before, but our challenge is the sheer scope of what we're doing. Not to mention (similar to WebRTC, which people have asked about) I don't completely understand the fit and benefit for our use case and application.
I talk about WebSockets because they achieve our mission and goal (in this case shaving milliseconds off command -> action -> confirmation) with robust, battle-tested client implementations already available in the ESP framework libraries. Same thing for MQTT. Both are supported by Home Assistant (and almost everything else in the space) today.
Because of this existing framework support, we'll have WebSockets done today-ish. Then we can (for now) move on to all of the other things people have asked for :). Hah, priorities!
Not saying Noise won't/can't ever happen - just that this is a very ambitious project as it stands and we have plenty of work to do all over the place :)!
Want to write a Noise implementation for ESP-IDF :)?
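For the curious, the shape of the change is roughly this: instead of a fresh HTTPS POST (and TLS handshake) per utterance, the device keeps one WebSocket open and streams audio frames over it. Below is a minimal server-side sketch in Python with the `websockets` package; it's an illustration of the approach, not our actual inference server or protocol, and the "end" control message and placeholder STT call are made up for the example.

```python
# Sketch: accept a persistent WebSocket from the device and treat binary
# frames as streamed audio and text frames as control messages.
import asyncio
import websockets  # assumption: `pip install websockets` (v10+)

async def handle_device(websocket):
    audio = bytearray()
    async for frame in websocket:
        if isinstance(frame, (bytes, bytearray)):
            # One TLS/WebSocket handshake per session; subsequent frames are cheap.
            audio.extend(frame)
        elif frame == "end":
            # End of utterance: hand the audio to STT (placeholder) and reply.
            text = f"<transcript of {len(audio)} bytes>"  # placeholder STT call
            await websocket.send(text)
            audio.clear()

async def main():
    async with websockets.serve(handle_device, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```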
Faster than Alexa (and only going to get faster)[0].
Between the far-field speech optimizations provided by the ESP BOX and the Espressif frameworks, our inference server (open sourcing next week) using Whisper, and our unique streaming format, we've found it to be comparable in quality to Alexa/Echo, even with background noise and at distances of up to 30 feet.
We're not only working on improving performance with the inference server; local on-device command recognition is extremely fast. Like "did that really just happen?" fast.
In my local setup, using locally controlled Wemo switches, I swear the latency with local devices is around 300 ms.
I'm curious whether this is lightweight enough that it might be possible to run as a Home Assistant add-on on relatively low-powered hardware such as an RPi.
I talk about this a bit on the wiki[0], but our goal is to have a Willow Home Assistant component do the Willow-specific stuff and enable users to use any of the STT/TTS modules provided by Home Assistant.
We'll also (likely) be creating our own TTS/STT HA component for our inference server that does some special/unique things to support Willow.