I love seeing lots of practical refutations of the "we have to do the voice proc...

kkielhofner · on May 15, 2023

We can do either.

For "basic" command recognition the ESP SR (speech recognition) library supports up to 400 defined speech commands that run completely on the device. For most people this is plenty to control devices around the home, etc. Because it is all local it's extremely fast - as I said in another comment pushing "Did that really just happen?" fast.

However, for cases where someone wants to be able to throw any kind of random speech at it "Hey Willow what is the weather in Sofia, Bulgaria?" that's probably beyond the fundamental capabilities of a device with enclosure, display, mics, etc that sells for $50.

That's why we plan to support any of the STT/TTS modules provided by Home Assistant to run on local Raspberry Pis or wherever they host HA. Additionally, we're open sourcing our extremely fast highly optimized Whisper/LLM/TTS inference server next week so people can self host that wherever they want.

java_beyb · on May 15, 2023

first, good initiative! thanks for sharing. i think you gotta be more diligent and careful with the problem statement.

checking the weather in Sofia, Bulgaria requires cloud, current information. it's not "random speech". ESP SR capability issues don't mean that you cannot process it locally.

the comment was on "voice processing" i.e. sending speech to the cloud, not sending a call request to get the weather information.

besides, local intent detection, beyond 400 commands, there are great local STT options, working better than most cloud STTs for "random speech"

https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/

kkielhofner · on May 15, 2023

Thanks!

There are at least two things here:

1) The ability to do speech to text on random speech. I'm going to stick by that description :). If you've ever watched a little kid play with Alexa it's definitely what you would call "random speech" haha!

2) The ability to satisfy the request (intent) of the text output. Up to and including current information via API, etc.

Our soon to be released highly optimized open source inference server uses Whisper and is ridiculously fast and accurate. Based on our testing with nieces and nephews we have "random speech" covered :). Our inference server also supports LLaMA, Vicuna, etc and can chain together STT -> LLM/API/etc -> TTS - with the output simply played over the Willow speaker and/or displayed on the LCD.

Our goal is to make a Willow Home Assistant component that assists with #2. There are plenty of HA integrations and components to do things like get weather in real time, in addition to satisfying user intent recognition. They have an entire platform for it[0]. Additionally, we will make our inference server implementation (that does truly unique things for Willow) available as just another TTS/STT integration option on top of the implementations they already support so you can use whatever you want, or send the audio output after wake to whatever you want like Vosk, Cheetah, etc, etc.

[0] - https://developers.home-assistant.io/docs/intent_index/