
Now that we have high-quality open-source speech-to-text (OpenAI Whisper) available, I expect to see multiple open-source voice assistants emerge.



The issue is not speech-to-text but rather "now that you've got the text, what do you do with it?"

How do you make "play my favorite songs" and "play some music that I like" do similar (if not identical) things? What infrastructure behind the voice assistant will need to be made accessible, and how do you hook that up?

This is where Siri and Alexa work well. They've got operating-system-level access to the rest of their ecosystem. Alexa has access to all your Amazon stuff. Siri has access to all your Apple stuff.


I'm pretty sure that speech-to-text is still a major issue. Most solutions available right now are too heavy to run on-device; there's a reason only wake-word detection is done on the device while the audio is actually transcribed in the cloud. Even the one the parent linked, OpenAI Whisper, seems to require a beefy machine to run.

When we get models/APIs that can run on a Raspberry Pi in real time, then we can say the issue is ecosystem access. Right now I'm unable to build my own custom assistant, simply because I have not yet found a speech-to-text model that can run on my low-power devices. Even the one on my top-of-the-line Android phone needs to call out to Google for recognition!
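For completeness, Whisper does ship smaller checkpoints, and the code side is trivial; it's the throughput that kills you. A minimal sketch with the openai-whisper package (assumes `pip install openai-whisper` and ffmpeg on the PATH; audio.wav is a placeholder file name) runs fully offline, but even "tiny" may struggle to keep up in real time on Pi-class hardware:

```python
# Minimal local transcription sketch using the openai-whisper package.
# Assumes: pip install openai-whisper, ffmpeg on PATH, and a recording
# at audio.wav (placeholder name). "tiny" (~39M parameters) is the
# smallest checkpoint, but may still be too slow for real time on a Pi.
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")  # runs fully on-device
print(result["text"])
```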


If you've got an iPhone, kick it into airplane mode, open up Notes, and tap the microphone to transcribe.

It even transcribes "one third plus one half is five sixths" as `1/3 + 1/2 is 5/6`, and "four feet two inches" becomes `4'2"`, all without a network connection.

So I believe it's doable... I also believe that Apple spent a bit to make it that way. Note that on-device voice-to-text cuts down on data use (I believe; I haven't sniffed the traffic to verify) and on cloud-side processing (the server works on text rather than audio, which is a few pennies less per request).

But beyond that... every home voice assistant I've seen (other than Apple's, Amazon's, and Google's) has worked on pre-programmed phrases that need to be spoken exactly, rather than the more free-form way people actually speak... and again, that's even without trying to tie it into infrastructure beyond the device itself.

I'd honestly be happy/impressed to have an RPi-based system that can take "create a reminder for next Wednesday at 9:30 am to thaw the turkey" and feed it into other useful systems.
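The date/time half of that utterance is actually fairly tractable with off-the-shelf parsing. A rough sketch using the dateparser library, with the caveat that the split on " to " is a made-up stand-in for real intent/slot parsing, and that coverage of phrases like "next Wednesday" can vary by dateparser version:

```python
# Sketch: pull the "when" out of a reminder utterance with dateparser.
# Assumes: pip install dateparser. The crude split on " to " is a
# hypothetical stand-in for a real intent/slot parser.
import dateparser

utterance = "create a reminder for next Wednesday at 9:30 am to thaw the turkey"
when_text, task = utterance.split(" to ", 1)
when = dateparser.parse(
    when_text.replace("create a reminder for", "").strip(),
    settings={"PREFER_DATES_FROM": "future"},
)
print(when, "->", task)  # datetime for the upcoming Wednesday, "thaw the turkey"
```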

I'd be very impressed if I could do a "create a reminder" (what is the reminder for) "go to parents" (when should I remind you) "day after Thanksgiving" (what time on Friday should I remind you) "half past one" (is that one thirty in the morning or in the afternoon) "afternoon" -- this works with Alexa. Note the recognition of holidays, the "wizard"-style entry, and the alternate time format.


Simple: the open-source assistant would play songs from your Jellyfin/Plex/mpd server, hooked up to the last.fm/Discogs APIs to find similar artists. It's not there yet, but it is certainly possible.
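The last.fm half of that really is just one API call. A sketch with requests (YOUR_API_KEY is a placeholder for a real last.fm key; the hard part, wiring the results into Jellyfin/Plex/mpd playback, is left out):

```python
# Sketch: fetch similar artists from the last.fm API with requests.
# Assumes: pip install requests, and a last.fm API key (YOUR_API_KEY
# is a placeholder). Feeding the results into actual playback on a
# Jellyfin/Plex/mpd server is the part that still has to be built.
import requests

resp = requests.get(
    "https://ws.audioscrobbler.com/2.0/",
    params={
        "method": "artist.getsimilar",
        "artist": "Radiohead",
        "api_key": "YOUR_API_KEY",
        "format": "json",
        "limit": 5,
    },
    timeout=10,
)
for artist in resp.json()["similarartists"]["artist"]:
    print(artist["name"])
```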


If it's programmed so that the phrase "play my favorite songs", with exactly those words, triggers this sequence of calls - yes, that's quite doable.

The difficulty is making it so that "play my favorite songs" along with "play some music I like" or any of the other possible variations do the same thing - that's where it gets difficult... at least quite a bit more difficult on an RPi.
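The usual trick for handling the variations is embedding similarity rather than exact phrases, and it looks simple enough on paper. A sketch with sentence-transformers (the intent names here are made up for illustration), but note that running even this small model alongside speech-to-text is exactly where an RPi starts to strain:

```python
# Sketch: map free-form phrases to intents via embedding similarity.
# Assumes: pip install sentence-transformers. Intent names are made up
# for illustration; even this small model may be slow on an RPi.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

intents = {
    "play_favorites": "play my favorite songs",
    "set_reminder": "create a reminder",
}
intent_embs = model.encode(list(intents.values()))

utterance = "play some music I like"
query_emb = model.encode(utterance)
scores = util.cos_sim(query_emb, intent_embs)[0]
best = list(intents)[int(scores.argmax())]
print(best)  # -> play_favorites (hopefully)
```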

Even beyond that, the "you need to spin up a Plex server locally on your home intranet, create an account on last.fm, and enter these values into this application" part isn't there yet. It's doable for a person with a moderate home lab and familiarity with these systems, but it's not a product you can hand to an elderly relative who falls somewhere in the range from "not computer literate" to "OK with using Windows but uncomfortable writing something along the lines of `Get-Date | ConvertTo-Json` in PowerShell."

Possible? With sufficient knowledge, acceptance of specific incantations rather than intents, and troubleshooting skill - certainly.

Productized for regular use outside of the techie circles at a reasonable price? A long way from it.



