Show HN: Willow – Open-source privacy-focused voice assistant hardware (github.com/toverainc)
581 points by kkielhofner on May 15, 2023 | 138 comments
As the Home Assistant project says, it's the year of voice!

I love Home Assistant and I've always thought the ESP BOX[0] hardware is cool. I finally got around to starting a project to use the ESP BOX hardware with Home Assistant and other platforms. Why?

- It's actually "Alexa/Echo competitive". Wake word detection, voice activity detection, echo cancellation, automatic gain control, and high quality audio for $50 means with Willow and the support of Home Assistant there are no compromises on looks, quality, accuracy, speed, and cost.

- It's cheap. With a touch LCD display, dual microphones, speaker, enclosure, buttons, etc it can be bought today for $50 all-in.

- It's ready to go. Take it out of the box, flash with Willow, put it somewhere.

- It's not creepy. Voice is either sent to a self-hosted inference server or commands are recognized locally on the ESP BOX.

- It doesn't hassle or try to sell you. If I hear "Did you know?" one more time from Alexa I think I'm going to lose it.

- It's open source.

- It's capable. This is the first "release" of Willow and I don't think we've even begun scratching the surface of what the hardware and software components are capable of.

- It can integrate with anything. Simple on-the-wire format - speech output text is sent via HTTP POST to whatever URI you configure. Send it anywhere, and do anything! (A minimal receiver sketch follows this list.)

- It still does cool maker stuff. With 16 GPIOs exposed on the back of the enclosure there are all kinds of interesting possibilities.
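
For illustration, here's a minimal sketch of a receiver for that POST, assuming the recognized text arrives as a plain UTF-8 request body (the exact payload format may differ):

    # hypothetical receiver for Willow's HTTP POST output
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class WillowHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            text = self.rfile.read(length).decode("utf-8")  # recognized speech (assumed plain text body)
            print("heard:", text)                           # route it to anything you want from here
            self.send_response(200)
            self.end_headers()

    HTTPServer(("0.0.0.0", 8000), WillowHandler).serve_forever()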

This is the first (and VERY early) release but we're really interested to hear what HN thinks!

[0] - https://github.com/espressif/esp-box




Some feedback to make your project easier to install and integrate better with Home Assistant (I'm the founder):

Home Assistant is building a voice assistant as part of our Year of the Voice theme. https://www.home-assistant.io/blog/2023/04/27/year-of-the-vo...

As part of our recent chapter 2 milestone, we introduced new Assist Pipelines. This allows users to configure multiple voice assistants. Your project is using the old "conversation" API. Instead it should use our new assist pipelines API. Docs: https://developers.home-assistant.io/docs/voice/pipelines/

You can even off-load the STT and TTS fully to Home Assistant and only focus on wake words.

You will see a much higher adoption rate if users can just buy the ESP BOX and install the software on it without installing/compiling stuff. That's exactly why we created ESP Web Tools. It allows projects to offer browser-based installation directly from their website. https://esphome.github.io/esp-web-tools/

If you're going the ESP Web Tools route (and you should!), we've also created Improv Wi-Fi, a small protocol to configure Wi-Fi on the ESP device. This will allow ESP Web Tools to offer an onboarding wizard in the browser once the software has been installed. More info at https://www.improv-wifi.com/

Good luck!


Home Assistant would be a lot more convincing if every upgrade did not completely break my install.

Flashed this on ESP I had laying around and did NOT have to upgrade HA (which would have made me not try the project).


HA would be a lot more convincing if the basic layout itself, alongside config, wasn't YAML hell. Every time I want to create some new layout or add something new to my home screen, I dread it.

I hate using it. Yet, I have no viable OSS alternatives.


openHAB is very nice and completely OSS.


Hm, I forgot about openHAB. Does it have a comparable number of integrations to HA?


Can you share more details about what's breaking? Is it a specific integration? Is it in general? What breaks? This is not consistent with most users' experience, but it's hard to know without more specifics.


Some of the things that happened to me during the last 18 months:

- The changeover to the new Bluetooth subsystem broke many integrations. My Bluetooth TRVs still don't work right (again).

- ONVIF support recently broke for an (admittedly shitty old) IP webcam. PTZ never worked / was never exposed.

- My USB-connected Android devices can't be controlled by the ADB integration anymore. There was some integration renaming/rescoping recently.

Home Assistant is still (imho) the best solution in this space for most combinations of metrics. I'd still recommend it to anyone.

(I tinker a lot with my HA-install/network, so maybe some of the above are issues on my end)


> Flashed this on ESP I had laying around

So the question is - what do you think :)?


Hey there!

First of all, everyone involved in this project has been big fans and users of HA for many years (in my case at least a decade). THANK YOU! For now Willow wouldn't do anything other than light up a display and sit there without Home Assistant.

We will support the pipelines API and make it a configuration option (eventually default). HA has very rapid release cycles and as you note this is very new. At least for the time being we like the option of people being able to point Willow at older installs and have it "do something" today without requiring an HA upgrade that may or may not include breaking changes - hence the conversation API.

One of our devs is a contributor for esphome and we're heading somewhere in that direction, and he's a big fan of improv :).

We have plans for a Willow HA component and we'd love to run some ideas past the team. Conceptually, in my mind, we'll get to:

- Flashing and initial configuration from HA like esphome (possibly using esphome, but the Espressif ADF/SR/LCD/etc frameworks appear to be quite a ways out for esphome).

- Configuration for all Willow parameters from wifi to local speech commands in the HA dashboard, with dynamic and automatic updates for everything including local speech commands.

- OTA update support.

- TTS and STT components for our inference server implementation. These will (essentially) be very thin proxies for Willow but also enable use of TTS and STT functionality throughout HA.

- Various latency improvements. As the somewhat hasty and lame demo video illustrates[0] we're already "faster" than Alexa while maintaining Alexa competitive wake word, voice activity detection, noise suppression, far-field speech quality, accuracy, etc. With local command recognition on the Willow device and my HA install using Wemo switches (completely local) it's almost "you can't really believe it" fast and accurate.

I should be absolutely clear on something for all - our goal is to be the best hardware voice interface in the world (open source or otherwise) that happens to work very well with Home Assistant. Our goal is not to be a Home Assistant Voice Assistant. I hope that distinction makes at least a little sense.

You and the team are doing incredible work on that goal and while there is certainly some overlap we intend to maintain broad usability and compatibility with just about any platform (home automation, open source, closed source, commercial, whatever) someone may want to use Willow with.

In fact, our "monetization strategy" (to the extent we have one) is based on the various commercial opportunities I've been approached with over the years. Turns out no one wants to see an Amazon Echo in a doctor's office but healthcare is excited about voice (as one example) :).

Essentially, Home Assistant support in Willow will be one of the many integration modules we support, with Willow using as many bog-standard common denominator compliant protocols and transports that don't compromise our goals, while maintaining broad compatibility with just about any integration someone wants to use with Willow.

This is the very early initial release of Willow. We're happy for "end-users" to use it but we don't see the one-time configuration and build step being a huge blocker for our current target user - more technical early adopters who can stand a little pain ;).

[0] - https://www.youtube.com/watch?v=8ETQaLfoImc


Thanks for all your work!


So I was just looking at the installation process for this device's dev environment (ESP-IDF from Espressif) and it seems kind of...insane.

The manual install method in the directions is not manual at all. It's a script that calls several python scripts. One has 2660 LOC and installs a root certificate (hard coded in the script itself) because of course, even though you just cloned the whole repo, it still has to download stuff from the internet. According to the code, "This works around the issue with outdated certificate stores in some installations".

Does anyone familiar with Espressif have an actual manual method of installing a dev environment for this device that doesn't involve pwning myself?


yes, do it in a container or VM. Welcome to the wonderful world of hardware manufacturer SDKs.


Indeed. This is exactly the reason why we standardized on building in a container.


If you are open to Nix, you can try https://github.com/mirrexagon/nixpkgs-esp-dev. I used it for a small project a while ago and the experience was pretty good.


Nice!

For anyone who would try to use this with Willow (I like the effort and CERTAINLY don't love the ESP dev environment as-is):

- ESP ADF is actually the root of the project. ESP-IDF and all other sub-components are components themselves to ADF.

- We use bleeding edge ESP SR[0] that we also include as an ADF component.

- Plus LVGL, ESP-DSP, esp-lcd-touch, and likely others I'm forgetting ATM.

[0] - https://github.com/espressif/esp-sr


Congratulations! This is great news!

I do not see anything posted on the Home Assistant (HA) Community forums.

> Configuring and building Willow for the ESP BOX is a multi-step process. We're working on improving that but for now...

This is crucial as your "competitors" are ready out of the box. I believe HA can be a Google/Alexa alternative to the masses only if the "out-of-the-box" experience is comparable to the commercial solutions.

Good luck, and keep us updated!


Thanks!

HN was my first stop (of course) - I'll be heading over there shortly to post.

Oh yeah, we're well aware of how much of a "pain" getting Willow going can be. I don't like it (at all).

That said, you configure and build once for your environment and then get a .bin that can be flashed to the ESP BOX with anything that does ESP flashing (like various web interfaces, etc) or you can re-run the flash command across X devices. So even now, in this early stage, it's at least only painful once ;).

Down the road we want to have a Willow Home Assistant component that does everything inside of the HA dashboard so users (like esphome, maybe even using esphome) can point-click-configure-flash entirely from the HA dashboard. Not to mention ongoing dynamic configuration, over the air updates, etc.

I talk about all of this on our wiki[0].

[0] - https://github.com/toverainc/willow/wiki/Home-Assistant


IMHO better to release early like this to a group of hackers than to wait until you have a nice out of the box setup going. This way you're going to get a lot of great feedback and hopefully some help. Awesome project!


Bingo, thanks!


This looks like something I've been wanting to see for a while.

I currently have a Google Home and I'm getting increasingly fed up with it. Besides the privacy concerns, it seems like it's getting worse at being an assistant. I'll want my light turned on by saying "light 100" (for light to 100 percent) and it works about 80% of the time, but the other times it starts playing a song with a similar name.

It'd be great if this allows limiting/customizing what words and actions you want.


Personally, I plugged a Jabra conference speaker into a Raspberry Pi and if it hears something interesting, it sends the audio to my local GPU computer for decoding (with Whisper) + answer-getting + the response sent back to the Raspberry Pi as audio (with a model from coqui-ai/TTS but using more plain PyTorch). Works really nicely for having very local weather, calendar, ...


Neat!

If you don't mind my asking, what do you mean "if it hears something interesting"? Is that based on wake word, or always listen/process?


Both:

A long while ago, I wrote a little tutorial[0] on quantizing a speech commands network to the Raspberry. I used that to control lights directly and also for wake word detection.

More recently, I found that I can just use more classic VAD because my uses typically don't suffer if I turn on/off the microphone. My main goal is to not get out the mobile phone for information. That reduces the processing when I turn on the radio...

Not as high-end as your solution, but nice enough for my purposes.

[0]. https://devblog.pytorchlightning.ai/applying-quantization-to...


Totally get it!

There are at least two ways to deal with this frustrating issue with Willow:

- With local command recognition via ESP SR, recognition runs completely on the device and the accepted command syntax is predefined. It essentially does "fuzzy" matching to handle your light command ("light 100") but there's no way it's going to send some random match to play music.

- When using the inference server -or- local recognition, we send the speech-to-text output to the Home Assistant conversation/intents[0] API and you can define valid actions/matches there (see the sketch below).

[0] - https://developers.home-assistant.io/docs/intent_index/
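
For reference, sending recognized text to that conversation API is a single authenticated POST. A minimal sketch, assuming a local HA instance and a long-lived access token (both placeholders):

    import requests

    HA_URL = "http://homeassistant.local:8123"       # placeholder HA instance
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"           # placeholder token

    def send_to_conversation(text):
        # Home Assistant matches the text against its defined intents
        resp = requests.post(f"{HA_URL}/api/conversation/process",
                             headers={"Authorization": f"Bearer {TOKEN}"},
                             json={"text": text})
        resp.raise_for_status()
        return resp.json()

    print(send_to_conversation("turn on the kitchen light"))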


This drives me nuts and happens all the time as well. To be honest, I unplugged my Google Home device a while back and haven't missed it. It mostly ended up being a clock for me, because I'd try to change the color of my lights to a color it apparently wasn't capable of, and then I'd have to sit there for minutes listening to it list stores in the area that might sell those colored lights or something. It wouldn't stop. This is just one of many frustrating experiences I'd had with that thing.


THIS. It's hilarious and infuriating that our digital assistants struggle to understand variants of "set lights at X% intensity".

However, if I spend the time to configure a "scene" with the right presets, Google has no issue figuring it out.

If only it could notice regular patterns about light settings and offer suggestions that I could approve/deny.


I love seeing lots of practical refutations of the "we have to do the voice processing in the cloud for performance" rationales peddled by the various home 1984 surveillance box vendors.

It's actually faster to do it locally. They want it tethered to the cloud for surveillance.


We can do either.

For "basic" command recognition the ESP SR (speech recognition) library supports up to 400 defined speech commands that run completely on the device. For most people this is plenty to control devices around the home, etc. Because it is all local it's extremely fast - as I said in another comment pushing "Did that really just happen?" fast.

However, for cases where someone wants to be able to throw any kind of random speech at it "Hey Willow what is the weather in Sofia, Bulgaria?" that's probably beyond the fundamental capabilities of a device with enclosure, display, mics, etc that sells for $50.

That's why we plan to support any of the STT/TTS modules provided by Home Assistant to run on local Raspberry Pis or wherever they host HA. Additionally, we're open sourcing our extremely fast highly optimized Whisper/LLM/TTS inference server next week so people can self host that wherever they want.


first, good initiative! thanks for sharing. i think you gotta be more diligent and careful with the problem statement.

checking the weather in Sofia, Bulgaria requires the cloud for current information. it's not "random speech". ESP SR capability issues don't mean that you cannot process it locally.

the comment was on "voice processing" i.e. sending speech to the cloud, not sending a call request to get the weather information.

besides local intent detection, beyond 400 commands, there are great local STT options that work better than most cloud STTs for "random speech":

https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/


Thanks!

There are at least two things here:

1) The ability to do speech to text on random speech. I'm going to stick by that description :). If you've ever watched a little kid play with Alexa it's definitely what you would call "random speech" haha!

2) The ability to satisfy the request (intent) of the text output. Up to and including current information via API, etc.

Our soon to be released highly optimized open source inference server uses Whisper and is ridiculously fast and accurate. Based on our testing with nieces and nephews we have "random speech" covered :). Our inference server also supports LLaMA, Vicuna, etc and can chain together STT -> LLM/API/etc -> TTS - with the output simply played over the Willow speaker and/or displayed on the LCD.

Our goal is to make a Willow Home Assistant component that assists with #2. There are plenty of HA integrations and components to do things like get weather in real time, in addition to satisfying user intent recognition. They have an entire platform for it[0]. Additionally, we will make our inference server implementation (that does truly unique things for Willow) available as just another TTS/STT integration option on top of the implementations they already support so you can use whatever you want, or send the audio output after wake to whatever you want like Vosk, Cheetah, etc, etc.

[0] - https://developers.home-assistant.io/docs/intent_index/


Ordered a box, can't wait to try this out! I've really been looking for something like this. My dream would be to have an LLM "agent" running locally, that knows who I am, etc, that can also double as a smart assistant for HA.


That might be closer than you think, I guess. The thing is that the new Assist pipeline is fully customisable and can use other models as well. They already have a ChatGPT integration which isn't able to control entities in HA, but at least you can have a conversation with ChatGPT in speech through HA.

So if you somehow spin up an LLM locally, create an HA Assist pipeline with it, and then use Willow (a future release should be able to leverage the new Assist feature) as a physical interface, then you are golden.

It may be hard or impossible today but I think within months HA and Willow will mature into a state where the biggest problem will be training and running a good enough LLM locally. But I bet a good amount of hackers are already working hard on that part anyway.


Starting with this post:

https://community.home-assistant.io/t/using-gpt3-and-shorcut...

I've been trying to adapt it to an offline LLM model, probably a LLaMA-like one using the llm package for Rust, or a ggml-based C implementation like llama.c.

It could even be fine-tuned or trained to perform better and always output only the JSON.

This could be a good fit with the open-sourced Tovera inference server when that is released.

I like the idea of supporting natural language commands that feel more conversational and don't have to follow a specific syntax.

It can also process general LLM requests, possibly using a third-party LLM like Bard for more up-to-date responses.
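
As a rough sketch of that idea (names, prompt, and entity IDs are all hypothetical): the LLM is asked to emit only JSON describing a service call, which is then posted to Home Assistant's standard REST API:

    import json
    import requests

    HA_URL = "http://homeassistant.local:8123"       # placeholder HA instance
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"           # placeholder token

    def call_service(llm_output: str):
        # llm_output is the model's raw completion; we expect pure JSON like
        # {"domain": "light", "service": "turn_on", "entity_id": "light.kitchen"}
        intent = json.loads(llm_output)
        url = f"{HA_URL}/api/services/{intent['domain']}/{intent['service']}"
        requests.post(url,
                      headers={"Authorization": f"Bearer {TOKEN}"},
                      json={"entity_id": intent["entity_id"]}).raise_for_status()

    call_service('{"domain": "light", "service": "turn_on", "entity_id": "light.kitchen"}')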


I never really considered getting a home assistant doodad because of the privacy issues around them. This sounds like a cool project


What kind of privacy issues are you referring to? Legit question btw, I'm not aware of any but would like to read about it.


Thanks!


What's the story for multiple devices being triggered by a single utterance of the wake word?

I have an Alexa or Google device in nearly every room so that '[wake word] lights [on|off]' or whatever does the right thing for that space. Alexa devices are pretty good about processing from the 'right' device when multiple are triggered. Google, not so much.

(Also a gap in both platforms is that they don't pass along the triggering device information)


I get really excited about this one!

Right now we don't do anything about it. BUT - I get excited because our wake word detection and speech rec is so good I have to go around my house and unplug all of my other devices when I'm doing development because otherwise a bunch of them wake. So it's good and bad right now :).

My thinking hasn't completely formed but I believe I have a few potential solutions to this issue in mind.

I've been replying to comments for 12 hours, can you let me slide on this one ;)? I promise we'll start discussing/working on it publicly fairly soon.


Sounds interesting - one question I have is about the mic array... Isn't this one of the supposed benefits of a physical Alexa device, and rumored to be sold at a loss because of the quality?

How does the ESP BOX compare? E.g. in a noisy environment, TV in the background, kids and dogs running around?


The ESP BOX has an acoustically optimized enclosure with dual microphones for noise cancelation, separation, etc.

Between that and the Espressif AFE (audio frontend interface) doing a bunch of DSP "stuff" in our testing it does remarkably well in noisy environments and far-field (25-30 feet) use cases.

Our inference server implementation (open source, releasing next week) uses a highly performance optimized Whisper which does famously well with less-than-ideal speech quality.

All in, even though it's all very early, it's very competitive with Echo, etc.


What's the latency on inference on a Raspberry Pi (I assume it's not running directly on the device)? I think I read previously that it was up to 7 secs, and if you wanted sub-second you'd need an i5.


Willow supports the Espressif ESP SR speech recognition framework to do completely on device speech recognition for up to 400 commands. When configured, we pull light and switch entities from home assistant and build the grammar to turn them on and off. There's no reason it has to be limited to that, we just need to do some extra work for better dynamic configuration and tighter integration with Home Assistant to allow users to define up to 400 commands to do whatever they want with their various entities.

With local command recognition with Willow I can turn my wemo switches on and off, completely on device, in roughly 300ms. That's not a typo. I'm going to make another demo video showing that.

We also support live streaming of audio after wake to our highly optimized Whisper inference server implementation (open source, releasing next week). That's what our current demo video uses[0]. It's really more intended for pro/commercial applications as it supports CPU but really flies with CUDA - where even on a GTX 1060 3GB you can do 3-5 seconds of speech in ~500ms or so.

We also plan to have a Willow Home Assistant component to support Willow "stuff" while enabling use of any of the STT/TTS modules in Home Assistant (including another component for our inference server you can self-host that does special Willow stuff).

[0] - https://www.youtube.com/watch?v=8ETQaLfoImc


Really interesting, thanks for replying.

I think controlling the odd device, setting a timer, adding items to a shopping list covers about 90% of my Alexa use. The remaining bits are asking it to play music, or dropping into another room. Seems like a good portion of these could be covered already.


Have you considered K2/Sherpa for ASR instead of ctranslate2/faster-whisper? It’s much better suited for streaming ASR (whisper transcribes 30 sec chunks, no streaming). They’re also working on adding context biasing using Aho-Corasick automata, to handle dynamic recognition of eg. contact list entries or music library titles (https://github.com/k2-fsa/icefall/pull/1038).


Whisper, per the model, does 30 second chunks and doesn't support "streaming" by the strict definition.

You'll be able to see when we release our inference server implementation next week that it's more than a version of "realtime" enough to fool nearly anyone, especially with an application like this where you aren't looking for model output in real time. You're streaming speech, buffering on the server, waiting for the end of voice activity detection, running Whisper, taking the transcription, and doing something with it. Other than a cool demo I'm not really sure what streaming ASR output provides but that's probably lack of imagination on my part :).

That said, these are great pointers and we're certainly not opposed to it! At the end of the day Willow does the "work on the ground" of detecting wake word, getting clean audio, and streaming the audio. Where it goes and what happens then is up to you! There's no reason at all we couldn't support streaming ASR output.


Where can I get one? I can't find it on Ali :(


You never know if people are going to love your pet project as much as you do. We had a hunch the community would appreciate Willow but like I said, you just never know.

My suspicion is Espressif (until now, hah) hasn't sold a lot of ESP Boxes. We were concerned that if Willow takes off they will sell out. That already appears to be happening.

Espressif has tremendous manufacturing capacity and we hope they will scale up ESP BOX production to meet demand now that (with Willow) it exists. The only gating item for them is probably the plastic enclosure and they should be able to figure out how to produce that en masse :).


I really hope so, I've been waiting for good audio assistant hardware forever. I hope this is finally the time where I ditch Alexa once and for all, thanks for releasing Willow!


fwiw I found them in stock on adafruit.com


Thanks, but I'm not in the US :( Good idea to check Pimoroni, though, thanks!

EDIT: Found one with a direct link from Ali from Espressif's site, even though it doesn't show up in a search:

https://www.aliexpress.com/item/1005003980216150.html?spm=a2...


Thanks for the link. ~$66 with shipping, worth it to test out this project though.


I’ve been living in a house for the past few months with a google assistant. I only use it to put on music, but I have noticed I play more music due to the ease of putting it on.

But I hate the privacy invasion aspect. I’m definitely in the market for something like this. And this one looks great.

Additionally, I’ve noticed that the google voice assistant (connected to Spotify) doesn’t keep playing the albums I ask for.

It states it’s playing the album. But after 4/5 songs it starts playing different songs, or different artists.


Music output is "on the list".

The biggest fundamental issue is that the speaker built into the ESP BOX is optimized for speech and isn't going to impress anyone playing music.

That said, the ESP BOX (of course) supports bluetooth so we can definitely pair with a speaker you bring.

Willow is the first of its kind that I'm aware of to enable this kind of functionality at anything close to this price point in the open source ecosystem. Either we or someone else will likely manufacture an improved ESP BOX with market-competitive speakers built in for music playback.

Then it's "just" a matter of actually getting the music audio but we'll figure that out ;).


Could we not use Willow to cast music, say, via Spotify or some other network streamer, through HA, to my pre-existing sound system?


The approach there would be to ignore Willow for music output and just do what it does today:

- Wake

- Get command

- Send to Home Assistant conversation/intents API[0]

- Home Assistant does whatever you define, including what you describe just like it does today

So unless I'm missing something your use case should "just work".

[0] - https://developers.home-assistant.io/docs/intent_index/


Nice. I guess I don’t expect willow to cover the speaker element. I’d rather connect with my existing hifi / Bluetooth speakers.

But with google I’m stuck with their integration to Spotify. It’s that component I’d like control over, and that’s why I’d use willow.

That and not being spied on in my own home.

Definitely keen for one.


The ESP Box with the ESP32 S3 has robust bluetooth support and I don't see A2DP/BT/pairing management/etc being that big of a lift. In full transparency it's probably towards the bottom on the priorities list ATM but the important thing is it's on the list already and it happens to be something I'm personally interested in :).


It also, at least in my case, frequently won't stop playing when you tell it to. And, if you want a song that has a title that isn't family friendly, it'll completely ignore that title and just play whatever the heck it wants.


Very interesting, I would buy an "off the shelf" version if it worked out of the box with Vicuna13 or similar LLM.


Our inference server (open source - releasing next week) has support for loading LLaMA and derivative models complete with 4-bit quantization, etc. I like Vicuna 13B myself :). Not to mention extremely fast and memory optimized Whisper via ctranslate2 and a bunch of our own tweaks.

Our inference server also supports long-lived sessions via WebRTC for transcription, etc applications ;).

You can chain speech to text -> LLM -> text to speech completely in the inference server and input/output through Willow, along with other APIs or whatever you want.
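
To make the chaining concrete, here's a minimal sketch of the STT -> LLM part using faster-whisper and a transformers text-generation pipeline (model names are stand-ins, not what the unreleased inference server actually uses; the TTS step is left to whatever engine you prefer):

    from faster_whisper import WhisperModel
    from transformers import pipeline

    stt = WhisperModel("medium", device="cuda", compute_type="float16")
    llm = pipeline("text-generation", model="gpt2")   # stand-in for LLaMA/Vicuna

    def answer(audio_path: str) -> str:
        # 1) speech to text
        segments, _info = stt.transcribe(audio_path, beam_size=1)
        question = "".join(seg.text for seg in segments).strip()
        # 2) feed the transcription to the LLM
        reply = llm(f"Q: {question}\nA:", max_new_tokens=64)[0]["generated_text"]
        # 3) hand `reply` to any TTS engine and play it back on the device
        return reply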


Awesome work! May I ask what are you using for text-to-speech?


Thanks, of course!

For wake word and voice activity detection, audio processing, etc we use the ESP SR (speech recognition) framework from Espressif[0]. For speech to text there are two options and more to come:

1) Completely on device command recognition using the ESP SR Multinet 6 model. Willow will (currently) pull your light and switch entities from Home Assistant and generate the grammar and command definition required by Multinet. We want to develop a Willow Home Assistant component that will provide tighter Willow integration with HA and allow users to do this point and click with dynamic updates for new/changed entities, different kinds of entities, etc all in the HA dashboard/config.

The only "issue" with Multinet is that it only supports 400 defined commands. You're not going to get something like "What's the weather like in $CITY?" out of it.

For that we have:

2-?) Our own highly optimized inference server using Whisper, LLamA/Vicuna, and Speecht5 from transformers (more to come soon). We're open sourcing it next week. Willow streams audio after wake in realtime, gets the STT output, and sends it wherever you want. With the Willow Home Assistant component (doesn't exist yet) it will sit in between our inference server implementation doing STT/TTS or any other STT/TTS implementation supported by Home Assistant and handle all of this for you - including chaining together other HA components, APIs, etc.

[0] - https://github.com/espressif/esp-sr


Wow this looks beyond epic. I've been looking for something like this.

Going to try to hack this into something my mom can use (who has trouble with confusion and memory). Could potentially be very great.

Thank you


Thanks!

We are really, truly, and seriously committed to building a device that with support from Home Assistant and other integrations doesn't leave any reason whatsoever to buy an Echo or similar creepy commercial device. No compromises on cost, performance, accuracy, speed, usability, functionality, etc.

We're really looking forward to getting additional testing and feedback from the community on speech recognition results, other integrations, etc. It's just two of us working on this part time over the last month or so - this is VERY early but I think we're off to a good start!


Wow yeah I think you're really onto something here. No one actually wants the creepiness from Echo or Alexa etc. That's what prevented me from trying any Home Assistant thing before, but I know it could be very useful if actually sensitive to privacy-concerns.

Best of luck with the development! I'll definitely be following closely. Do you sell the pre-built hardware yourself?


Thanks!

When you're releasing a pet project of love like this you never really know if other people are going to appreciate it as much as you do. Looking here on HN it seems like people appreciate it.

We don't sell the hardware currently because:

1) Espressif has well established sales channels and distribution worldwide.

2) It's not our "business model". In my capacity as advisor to a few startups in the space I've been approached by various commercial entities that want a hardware voice interface they fully control. In healthcare, for example, there are all kinds of interesting audio and speech applications but NO ONE, and I mean NO ONE is going to be ok with seeing an Echo in their doctor's office. That's where an ESP BOX or custom manufactured hardware and Willow come in.

Our business model is to combine our soon to be released very high performance inference and API server with Willow to support these commercial applications (and home users with HA, of course). In all but a few identified and very limited cases all work will come back to the open source projects like our inference server and Willow.


We're doing healthcare and this is exactly right. Perhaps I'll be contacting you in the future!


Let me know! Contact in profile.


This project reminds me of MyCroft https://github.com/MycroftAI/mycroft-core.


I think Mycroft is dead at this point. The project had some intertwined relationship with other projects and Neon AI. The suggestion from Mycroft at this point is to use NeonAI's OS for the Mycroft device:

https://neon.ai/NeonAIforMycroftMarkII


details from the mycroft forums:

https://community.mycroft.ai/t/faq-ovos-neon-and-the-future-...

Although MycroftAI, Inc. has ceased development, the Assistant survives.

A few years ago, some of MycroftAI’s partners started using @JarbasAl’s code (more information below) which eventually became a fork of the assistant. Now that MycroftAI is unable to continue development, the fork’s devs - The OVOS Foundation - have decided to take up leadership of the Assistant’s development and its open source ecosystem.

MycroftAI has signed over Mark II software support, as well as these forums, to one of those partners, a company called NeonAI. Between the OVOS Foundation and NeonAI, the voice assistant and the smart speaker project are getting a new lease on life.

The OVOS Assistant - it’ll get a better name soon, we promise - started out as a drop-in replacement for Mycroft. It should be compatible with all your classic Mycroft skills. It will even accept existing configuration files! Because we have been operating at a much smaller scale for the past three years, things will seem rough around the edges for a little while. However, we are scaling up. Read on.


Does it also work with the lite version of ESP box?


It does for everything but the touch display because it doesn't have one[0]. We don't support the three buttons at the bottom but I have two ESP BOX Lites and we should be able to make it happen pretty easily.

We haven't been focused on the ESP BOX Lite because it seems kind of limited. However, Espressif hasn't sold many of these things since release and judging from people looking for stock, etc in this thread I think that's about to change.

Espressif has incredible manufacturing capacity and our hope is they will ramp up manufacture of the ESP BOX family now because (to my knowledge) Willow is the first project that actually makes meaningful use of them.

The only gating component of the ESP BOX family that I can see is the plastic enclosure. I'm sure Espressif can figure out how to crank these things out ;).

[0] - https://github.com/toverainc/willow/wiki/Hardware


How is a Lite limited? My understanding is it's an ESP Box without touch/stand, but otherwise the same. Not quite sure why the Box retails for 28% more

https://www.espressif.com/en/news/ESP32-S3-BOX_video


It should. I personally haven't tested it as I don't own the lite version, but everything should work besides touch. It complains about not being able to initialize touch during boot, but that's not fatal. Shouldn't be too hard to ifdef away that error.


I think your post suggests the answer is yes, but do you think the ESP BOX hardware is a good long-term bet? That is, do you see Willow as working with the ESP BOX for the foreseeable future with whatever improvements are planned? Just wondering if it's worthwhile investing in the hardware now even if it doesn't currently quite do what I want.


Absolutely.

I cannot stress enough what a gift the ESP BOX and Espressif component libraries are to Willow. As anyone who's dealt with it can tell you wake word detection (while minimizing false wake) and getting clean speech with acoustic echo, background noise, etc from 30ft away is still a fairly hard problem. I've been pretty deep in this space for a while and I'm not aware of anything even close to approaching open source that is even remotely competitive with the Espressif SR+AFE implementations. The ESP BOX has also been acoustically tuned by Espressif to address all of the weird enclosure issues with audio. Their AFE+SR interface has been tested and certified by Amazon themselves for use in Alexa ecosystem devices. It's truly excellent.

Espressif has an excellent track record for VERY long term hardware and software support and if anything we're on the very early bleeding edge of what the hardware and software components are capable of. As one example, we're using the ESP SR wake and local command support released by Espressif last week!


This is wonderful. I would love to replace my stupid Google Home Minis with this if I can actually get the hardware for $50. The Mycroft device is like $400 so I didn't even consider it, and I never understood why it had to be so expensive. I don't even need a screen - just a microphone. Will definitely give this a shot!


Thank you!

Yes, this is why we went through the pain of doing what we're able to do with this hardware.

Even in this initial release it's competitive with Echo, etc even on cost.


I love the privacy-focused aspect but playing devil's advocate: how could a device like this be hijacked and used for anti-privacy purposes? Does this require physical access or has it been subjected to the likes of the Black Hat conference to see if it can be owned from the street outside someone's home?


It's Wi-Fi client only and supports WPA3, protected management frames, etc, etc. It doesn't listen on any network sockets. Even Bluetooth is currently disabled.

Other than low level issues in the Espressif wifi stack (which is very robust, mature, and has been beat on heavily) I don't see any potential security issues.

That said the old expression "it's easy for someone to design a lock they can't pick" certainly applies.

We'd welcome someone owning it and bringing any issues to our attention!


Nice. Siri is completely unusable with an accuracy of less than 10%. I'm guessing Whisper on CPU is probably the same, so I wouldn't risk wasting time on trying the inference server if it only runs on CPU, but once that runs on GPU it would be cool to try this out.


GPU (currently CUDA only) is our primary target for our inference server implementation. It "runs" on CPU but our goal is to enable an ecosystem that is competitive with Alexa in every possible way and even with the amazing work of whisper.cpp and other efforts it's just not happening (yet).

We're aware that's controversial and not really applicable to many home users - that's why we want to support any TTS/STT engine on any hardware supported by Home Assistant (or elsewhere) in addition to ESP BOX on device local command recognition.

But for the people such as yourself, and other commercial/power/whatever users our inference server that we're releasing next week that works with Willow provides impressive results - on anything from a GTX 1060 to an H100 (we've tested and optimized for anything in between the two).

We use ctranslate2 (like faster-whisper) and some other optimizations for performance improvements and conservative VRAM usage. We can simultaneously load large-v2, medium, and base on a GTX 1060 3GB and handle requests without issue.

Again, it's controversial, but the fact remains that a $100 Tesla P4 from eBay (idles at 5 watts, max TDP of 60 watts) running our inference server implementation does the following:

large-v2, beam 5 - 3.8s of speech, inference time 1.1s

medium, beam 1 (suitable for Willow tasks) - 3.8s of speech, inference time 588ms

medium, beam 1 (suitable for Willow tasks), 29.2s of speech, inference time 1.6s

An RTX 4090 with large-v2, beam 5 does 3.8s of speech in 140ms and 29.2s of speech with medium beam 1 (greedy) in 84ms.


You've convinced me. Just ordered an ESP-BOX :p

Got a Home Assistant Yellow not long ago, so would be nice to get some decent voice control for it.


> Siri is completely unusable with an accuracy of less than 10%.

That seems unusual. I've been using both for the last few weeks while replacing my Homebridge setup, and Siri has been as accurate as Alexa — good enough that I've decided that I can now leave the Alexa ecosystem. To be more specific, both are (conservatively) 95%+ accurate for my home control scenarios.


I've never tried any voice recognition system that works well. Maybe my accent is too different from typical training data or something. I had a voice recognition program on my computer in 1994 that had about the same accuracy for me as any modern voice recognition system that I have tried.


Nice. Any bits of Mycroft in here? That project just imploded and I’m still sad about it.


Thanks!

None.

The ESP BOX and ESP SR speech recognition library from Espressif handles the low-level audio stuff like wake word detection, DSP work for quality voice, voice activity detection, etc to get usable far-field audio. The wake word engine uses models from Espressif with wake words like "Alexa", "Hi ESP", "Hi Lexin", etc. If we get traction Espressif can make us a wake engine model for whatever we want (we're thinking "Hi Willow") but open to better ideas!

We currently stream audio after wake in realtime to our very high performance (optimized for "realtime" speech) Whisper inference server implementation. We plan to open source this next week.

We also patched in support for the most recent ESP SR version that has their actually amazingly good Multinet 6 speech command model that does recognition of up to 400 commands completely on device after wake activation. We currently try to pull light and switch entities from your configured Home Assistant instance to build the speech commands but it's really janky. We're working on this.

The default currently is to use our best-effort hosted inference server implementation but like I say in the README, etc we're open sourcing that next week so anyone can stand it up and do all of this completely locally/inside your walls.


The "TTS Output" and "Audio on device" sections make it seem like there is no spoken output, only status beeps.

A former Mycroft dev, Michael Hansen[1], is still building several year-of-the-voice projects after he was let go. I'm especially excited about Piper[2], which is a C++/py alternative to Mimic3.

[1] https://github.com/synesthesiam [2] https://github.com/rhasspy/piper


We plan to make a Home Assistant Willow component to use any of their supported TTS modules to play speech output on device. We just didn't get to it yet.

Our inference server (open source, releasing next week) has highly optimized Whisper, LLaMA/Vicuna/etc, text to speech, etc implementations as well.

It's actually not that hard on the device - if the response from the HA component has audio, play it.

We just don't have the HA component yet :).


Imploded?



Can I replace google voice assistant on a pixel 7 with it? How about a rooted pixel 7?


I've never really used my Echo much, but they just went on sale, and I bought a bunch of them to wire up my place to play spotify across different rooms.

I'm amazed at how buggy it is, with spotify not playing regularly, etc.

I'm hoping there is a replacement that I can flash onto an Echo speaker. I'd love something that just does the basics. I don't order things from Amazon or do anything other than media with the speaker.


It has only gotten buggy recently. It used to work flawlessly for me, but in the past few months it's stopped playing songs, frequently announces it will play and then does nothing, or just forgets it can play music.

The other day it also forgot half my smart lights and has refused to rediscover them since. It's a dumpster fire now.


Looks like you made an impression and got featured on Ars...congrats!

https://arstechnica.com/gadgets/2023/05/willow-is-a-faster-s...


I wonder, would the ESP-S3-KORVO also work?

That might be a really nice option with more mics and a light ring.

https://es.aliexpress.com/item/1005003980436945.html

PS: It seems to have 3 mics but has solder pads for 6.. Weird


> I've always thought the ESP BOX[0] hardware is cool. I finally got around to starting a project to use the ESP BOX hardware

Is it actually for sale anywhere? Every single store I could find was sold out, and I checked within 5 minutes of this hitting the home page.


It looks great, but also like it might need to mature a bit before it's usable for less advanced users like myself. Do you have an RSS feed or newsletter I can subscribe to so I'm reminded some time from now to check it out again?


Thanks!

You're spot on - we're happy for anyone to test and use Willow but the intended users at this moment are early adopters that can build and flash as development is moving very, very quickly.

Unfortunately we do not, maybe try "watching" the repo on github?


Cool! What software is used for the wake word detection, speech to text and text to speech?


For wake word and voice activity detection, audio processing, etc we use the ESP SR (speech recognition) framework from Espressif[0].

For speech to text there are two options and more to come:

1) Completely on device command recognition using the ESP SR Multinet 6 model. Willow will (currently) pull your light and switch entities from Home Assistant and generate the grammar and command definition required by Multinet. We want to develop a Willow Home Assistant component that will provide tighter Willow integration with HA and allow users to do this point and click with dynamic updates for new/changed entities, different kinds of entities, etc all in the HA dashboard/config.

The only "issue" with Multinet is that it only supports 400 defined commands. You're not going to get something like "What's the weather like in $CITY?" out of it.

For that we have:

2-?) Our own highly optimized inference server using Whisper, LLamA/Vicuna, and Speecht5 from transformers (more to come soon). We're open sourcing it next week. Willow streams audio after wake in realtime, gets the STT output, and sends it wherever you want. With the Willow Home Assistant component (doesn't exist yet) it will sit in between our inference server implementation doing STT/TTS or any other STT/TTS implementation supported by Home Assistant and handle all of this for you.

[0] - https://github.com/espressif/esp-sr


Thanks so much for doing this, that sounds very exciting and I can't wait to try it out!


What are the biggest challenges that you see for improving it even further?

Looks really promising!


Thanks!

If I'm being perfectly honest I'm surprised we got it this far already. If I wanted to be really critical:

- Far-field speech is actually kind of hard. There are at least dozens of "knobs" we can tweak between the various component libraries, etc to improve speech quality and reliability for more users in more environments. We've tested as much as we can considering there's only two of us but we need more testing from more speakers in more environments.

- On the wire/protocol stuff. We're doing pretty rudimentary "open new connection, stream voice, POST somewhere". This adds extra latency and CPU usage because of repeated TLS handshakes, etc. We have plans to use Websockets and what-not to cut down on this (see the sketch after this list).

- We don't really support audio playback yet. For a real "Amazon Echo" type experience you need to be able to ask it random things like "Hey what's the weather outside?" and it needs to "tell" you.

- Ecosystem support. Using the example above, something like Home Assistant or similar needs to know where you are, get the weather, do text to speech, etc for Willow to be able to play it back.

- Other integrations. Alexa has "skills" and stuff and we need to be able to talk to more things.

- UI/UX work. We support the touch display but we did just enough to show colors, print status, add a button, and make a touch cursor that follows your finger around. We also only give audio feedback with a kind-of annoying tone that beeps once for success and twice for failure.

- Speaking of failure, we don't do a great job of telling you what went wrong and where.

- Configuration and flashing. It's very static and has multiple steps. There are all kinds of things that need to get done to make Willow easy enough for less-technical users to deploy and actually use daily without any hassle.

- Local command recognition. It's very early but as noted in the README, wiki, etc the ESP BOX itself can recognize up to 400 commands directly on the device. In testing it works surprisingly well but we have a lot of work to do to make it actually practical for most people.

- Open sourcing our inference server. We plan to do this next week!
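
For what it's worth, here's a minimal sketch of that persistent-connection idea using the Python websockets library; the endpoint, frame format, and end-of-utterance marker are assumptions, not Willow's actual protocol:

    import asyncio
    import websockets

    async def stream_utterance(uri, frames):
        # one long-lived connection: the TLS/WebSocket handshake happens once,
        # then each wake event just pushes audio frames over the open socket
        async with websockets.connect(uri) as ws:
            for frame in frames:
                await ws.send(frame)        # binary PCM chunk
            await ws.send(b"")              # hypothetical end-of-utterance marker
            print(await ws.recv())          # transcription / status from the server

    asyncio.run(stream_utterance("wss://example.local/stream", [b"\x00" * 640]))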


> the ESP BOX itself can recognize up to 400 commands directly on the device.

That's really cool! Does this mean 400 specific commands, e.g. "turn on the living room lights" or 400 commands that can be applied to different targets, e.g. "turn on the X lights" where X is some light. (400 actually feels like it would be enough to speed up the vast majority of interactions either way, but I'm curious :)


400 commands where "turn on X" is one and "turn off X" is two.

With Home Assistant this means turning on and off two hundred entities. We currently pull light and switch entities from Home Assistant and build the local Multinet speech grammar.

We have goals for better dynamic and adaptive configuration of Willow and part of that is using a Willow Home Assistant component with user configuration in the HA dashboard, etc to easily select entities, define commands, etc and dynamically update all associated Willow devices.

We feel that with this 400 commands is enough to be practical and useful. Additionally, because the Multinet model returns probability on match to command "fuzzy matching" actually works quite well where "light", "lights", and slightly mis-worded commands still match correctly.
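
Conceptually, pulling those entities and generating the command list looks something like this sketch, using Home Assistant's standard /api/states REST endpoint (URL and token are placeholders, and the real Willow build scripts may do it differently):

    import requests

    HA_URL = "http://homeassistant.local:8123"       # placeholder HA instance
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"           # placeholder token

    def build_commands(max_commands=400):
        # fetch all entity states from Home Assistant's REST API
        resp = requests.get(f"{HA_URL}/api/states",
                            headers={"Authorization": f"Bearer {TOKEN}"})
        resp.raise_for_status()
        commands = []
        for state in resp.json():
            domain = state["entity_id"].split(".")[0]
            if domain not in ("light", "switch"):
                continue
            name = state["attributes"].get("friendly_name", state["entity_id"])
            commands += [f"turn on {name}", f"turn off {name}"]
        return commands[:max_commands]    # Multinet 6 tops out at 400 commands

    print(build_commands())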


With regard to this:

> - On the wire/protocol stuff. We're doing pretty rudimentary "open new connection, stream voice, POST somewhere". This adds extra latency and CPU usage because of repeated TLS handshakes, etc. We have plans to use Websockets and what-not to cut down on this.

I've recently used the Noise protocol[1] to do some encrypted communication between two services I control but separated by the internet.

It was surprisingly easy!

[1]: https://noiseprotocol.org/


Thanks for mentioning noise! I've certainly looked at it before but our challenge is the sheer scope of what we're doing. Not to mention (similar to WebRTC that people have asked about) I'm not completely understanding the fit and benefit for our use case and application.

I talk about websockets because they achieve our mission and goal (in this case shaving milliseconds off command -> action -> confirmation) with robust, battle-tested client implementations already available in the ESP framework libraries. Same thing for MQTT. Both are supported by Home Assistant (and almost everything else in the space) today.

Because of this existing framework support, we'll have websockets done today-ish. Then we can (for now) move on to all of the other things people have asked for :). Hah, priorities!

Not saying Noise won't/can't ever happen - just that this is a very ambitious project as it stands and we have plenty of work to do all over the place :)!

Want to write a noise implementation for ESP IDF :)?


Nice! How's the speech recognition accuracy and response latency?


Thanks!

Faster than Alexa (and only going to get faster)[0].

Between the far-field speech optimizations provided by the ESP BOX and Espressif frameworks and our inference server (open sourcing next week) using Whisper, and our unique streaming format we've found it to be comparable in terms of quality to Alexa/Echo even with background noise and at distances of up to 30 feet.

[0] - https://www.youtube.com/watch?v=8ETQaLfoImc


That's really nice - and thanks for including the demo link too, impressive!


Thanks again!

Not only are we working on improving performance with the inference server, local on device command recognition is extremely fast. Like "did that really just happen?" fast.

In my local setup when using locally-controlled Wemo switches I swear the latency with local devices is around 300ms or so.

I should make another demo video with that...


> Open sourcing our inference server

I'm curious if this is something lightweight enough that might be possible to run as a Home Assistant add-on on relatively low-powered hardware such as an RPi.


I talk about this a bit on the wiki[0] but our goal is to have a Willow Home Assistant component do the Willow specific stuff and enable users to use any of the STT/TTS modules provided by Home Assistant.

We'll also (likely) be creating our own TTS/STT HA component for our inference server that does some special/unique things to support Willow.

[0] - https://github.com/toverainc/willow/wiki/Home-Assistant


Is there an analogous thing for RPi? I've got some old ones and a USB mic array from seeed etc that I've still not put to use. Also got an ESP32 (vanilla with a small oled) if I can use that?


The Home Assistant project (as part of the "year of voice") is working on wake word, etc for Raspberry Pi from what I understand. However, as someone who's tried to do exactly this on a Raspberry Pi before, supporting wake word and getting clean audio from 25 feet away with background noise, acoustic echo, etc with a random collection of software and hardware is very challenging. I have an entire graveyard of mic arrays from seeed and others myself :).

Espressif really did us all a solid with this hardware and their ADF and SR frameworks.

Whether it's cost, being fully assembled and ready to go, and even wake word, AEC, AGC, BSS, NS, etc at least as of now the ESP BOX is essentially impossible to compete with in terms of hardware in the open ecosystem.

I talk about this and more on our wiki pages[0] (check out "Hardware" and "Home Assistant"). In short, the Espressif frameworks we use /technically/ support the "regular" ESP32 but it's so limited (and the ESP BOX/ESP S3 is so cheap) we're not super interested in supporting it.

We're aiming for an end-user experience that's competitive with Echo, Google Home, etc in every possible way - speed, quality, reliability, functionality, and cost.

In fact, we want to crush them on all points to where there's no reason left to buy one of them.

[0] - https://github.com/toverainc/willow/wiki/


Thanks, yes that's perfectly reasonable. Cheers for the reply.


Any way to tie this into Home Assistant? Siri is such a POS that I've been debating for a while replacing it with something Whisper-based.

Edit: should have read the README more carefully….


I'm really looking forward to adding something like this to my home.

Are there any similar devices, without a screen, available?

Preferably with a nice design, so I don't have to hide them?


The only real hardware requirement of Willow (currently) is something based on the ESP32 S3 that has a microphone.

I'm aware of various such devices out there but we've been really focusing on the ESP BOX. If you or anyone else in the community is interested in other devices we'll certainly look at supporting them!


Perhaps I am missing something obvious. Where can I buy an ESP BOX? The GitHub makes no mention. I can't find any on AliExpress or Amazon.



We mention the names of a bunch of vendors but we don't have direct links. We probably should but I didn't want to wander into the waters of appearing to suggest/recommend one vendor over another.


You can't. They were sold out everywhere the moment this article hit hackernews.


adafruit


This looks amazing.


Thanks!


What sort of hardware is the inference server at infer.tovera.io currently running on? Tnx!


It looks like ESP32-S3-BOX is out of stock everywhere but AliExpress. Does this work on ESP32-S3-BOX-Lite?



Great. And it's good that this is autonomous from Home Assistant. Why?

Don't get me wrong: I've been running HA for 2.5 years now. It's stable, most of the time. But changing scope and size will do what it does to all the other projects: increase complexity and change focus.

And if one part gets changed too much, some others - intertwined - might not work, or worse, it breaks all the scripts.

This is why I still have Zigbee running as a backup with deCONZ, have the Wi-Fi power plugs on Tasmota (instead of a custom integration with HA), etc. The smart home ecosystem is already like this https://xkcd.com/1810/ comic. But there is one difference: the more independently the systems function, the more I can rely on the rest to keep working when one that is not intertwined fails.


Is there a kit to be purchased?


That's the best part - the ESP BOX is not a kit. You take it out of the box, flash it, and put it on your kitchen counter or wherever you want.

The only challenge at this point has been interest in Willow. I've checked stock on ESP BOXes across various vendors and they are selling out.

However - Espressif has tremendous manufacturing capacity. From a review of the bill of materials for the ESP BOX as far as I can tell they don't make a lot of them because they haven't sold a lot of them. Until Willow, that is :).

We anticipate Espressif will ramp up manufacturing and crank these things out like hot cakes. No one wants another Raspberry Pi supply chain issue!


Well, that's nice to know. Thanks for taking the time to respond. In the interest of repairability, is there a parts list? Will replacement parts be available to the end user?


Sure!

That's actually another great thing about the ESP BOX - the hardware is open as well. Schematics, BOM, Gerbers, etc have been made available by Espressif[0].

We don't currently have plans to get too involved on the hardware side but that could certainly change down the road.

[0] - https://github.com/espressif/esp-box/tree/master/hardware/es...


What are the advantages of Whisper over a custom app running on a cheaper Android tablet? How much benefit is derived from the dual microphones, for instance? Are there hardware-level APIs that Whisper offers which a custom Android app wouldn't be able to access?

I don't mean to nitpick, this looks really cool. I've just got a few old tablets lying around, and I'm trying to decide whether to spring for one of these instead of trying to make something work on what I've already got.


Willow uses a local ML model for wake word detection. Once wake is detected, the actual speech recognition has one of two user configurable modes:

Local - Willow also includes the latest ESP SR Multinet 6 command recognition model. Willow will automatically pull entities from Home Assistant (when using Home Assistant) and define the speech grammar based on the friendly entity names. In this mode, the speech/audio never leaves the device and the speech recognition result is sent directly to Home Assistant.

Server - In this case, after wake is detected we immediately begin streaming audio to our highly optimized inference server implementation (release next week). Once end of speech is triggered using the ESP BOX Voice Activity Detection we send an end marker to execute Whisper on the server side and then take the results and send them to Home Assistant (when Home Assistant is configured).
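
To illustrate the server-mode flow, here's a minimal sketch of the buffer-then-transcribe step using faster-whisper; the callback names, sample format, and model choice are assumptions (the actual inference server is unreleased at this point):

    import numpy as np
    from faster_whisper import WhisperModel

    model = WhisperModel("medium", device="cuda", compute_type="float16")
    buffer = bytearray()

    def on_audio_chunk(chunk: bytes):
        # audio streamed from the device after wake; assumed 16 kHz 16-bit mono PCM
        buffer.extend(chunk)

    def on_end_of_speech() -> str:
        # the device's VAD signals end of speech; only now do we run Whisper
        audio = np.frombuffer(bytes(buffer), dtype=np.int16).astype(np.float32) / 32768.0
        buffer.clear()
        segments, _info = model.transcribe(audio, beam_size=1)
        return "".join(seg.text for seg in segments).strip()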

Doing reliable wake word detection and getting clean far-field speech (generally defined as 3 meters or more) in random/unknown environments that are typically less than ideal (background noise, acoustic echo, etc) is actually quite a challenge. Willow uses the dual microphones and the ESP SR AFE (audio front end) to do a variety of signal processing on the device to clean the speech.

The integration and engineering for anything resembling an Echo-like experience is very involved, down to physical attributes of the enclosure, microphone cavities, etc. There is an entire field and cottage industry of acoustic engineering on the hardware design for these applications.

The point is, providing an Echo/Willow-like experience is much, much, much more than putting a random microphone in a room. So with that, we don't plan to specifically support random devices because the outcome is almost certainly very poor and not something we're currently interested in supporting.

All of this keeps coming up and we will certainly document it.



