How would you make it self-hosted without making it suck? High-quality voice recognition in a small box doesn't seem to be even remotely possible today, let alone the query processing and knowledge database that come with it.
You could build this on a Pi with a mic, speakers, some FOSS STT and TTS engines, and some basic training data. But it'll suck.
Ten years ago I played with the Microsoft Speech API - which was completely off-line and trained on your voice. In restricted grammar mode, it worked flawlessly - I built a music control application on it and used it like you would use an Amazon Echo: I just said "computer, volume, three quarters" from anywhere in the room, and the loud music turned down a notch. Etc. That was ten years ago, with a crappy electret microphone I soldered to a cable myself and stuck to my wardrobe with a bit of insulating tape.
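For flavor: that kind of restricted command grammar takes only a few lines to specify. Here's what the volume example could look like in JSGF (the grammar format CMU Sphinx understands; SAPI used its own format, but the idea is the same) - all the phrases below are made up for illustration:

```jsgf
#JSGF V1.0;
grammar music;

// Every utterance must start with the wake word "computer".
public <command> = computer <action>;
<action> = volume <level> | play | stop | next track;
<level>  = one quarter | half | three quarters | full;
```

Because the recognizer only has to pick among a handful of fixed phrases instead of transcribing open-ended speech, even modest hardware can be near-flawless at it.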
I'm not buying that you couldn't make a decent, self-contained, off-line speech recognition system. Sure, it may not be as good as Echo or Google Now (though the latter does suck hard at times - it's nowhere near reliable enough to use, and it doesn't understand shit even over a quite good and expensive Bluetooth headset). But it would be hackable and customizable. You could make it do some actual work for you.
Oh, and it wouldn't lag as terribly as Google Now does. Realtime applications and data over mobile networks don't mix.
But we're getting close to the point where you can do some of this. For example - http://arxiv.org/pdf/1603.03185.pdf - LSTM speech recognition running on a Nexus 5.
The more serious problem with this is that it's going to be expensive -- and somewhat wasteful. There's a lot of pressure to keep consumer devices as cheap as possible, and the cloud is an awesome way to do that. Having shared cloud-based infrastructure for the speech recognition, as opposed to putting it into every device (even though it's only used for ~5 minutes every day), is probably a lot cheaper. Consider the hardware in an Amazon Echo vs. a Nexus 5 (2GB DRAM, 4-core 2.2GHz Krait 400) -- the N5 has roughly 8x the DRAM and compute of the Echo.
Would you pay an extra $150 for a LocalEcho that still had to send most of your queries to a search engine for resolution, or to a cloud music service for music? (You & I might, but most consumers wouldn't.)
> I'm not buying you couldn't make a decent, self-contained, off-line speech recognition system.
I agree. It's not a problem of technology, it's a problem of incentive. There's no money in developing a self-contained, off-line speech recognition system, unfortunately.
> There's no money in developing a self-contained, off-line speech recognition system
Nonsense. Self-hosting is highly valued in the enterprise sector. But we're not talking about the sort of products that could be sold to consumers for a few hundred dollars here.
A desktop PC is more than capable of good speech recognition, as long as it's able to train the model on individual voices. Getting good results without training the model for the user beforehand is harder, and it would probably never be quite as good as a cloud-based system.
A Pi, though, couldn't do well at all, just like you said. If I wanted to build a system like this for myself, I would target an HTPC form factor.
edit: Another possibility, which was explored elsewhere in this thread, would be to keep the listening device "thin", but have the ability to offload the processing to a machine on my LAN instead of one in the "cloud".
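To make the "thin listener + LAN recognizer" idea concrete, here's a toy sketch in Python. The wire format (4-byte length prefix + raw bytes) and the stand-in `recognize()` function are both invented for illustration - a real build would plug an actual STT engine in on the server side:

```python
# Toy sketch: a "thin" listening device ships captured audio to a
# recognizer box on the LAN and gets text back. No cloud involved.
import socket
import struct
import threading

def recognize(audio: bytes) -> str:
    # Stand-in for a real STT engine running on the LAN machine.
    return f"heard {len(audio)} bytes"

def _recv_exact(conn: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        buf += conn.recv(n - len(buf))
    return buf

def serve_once(sock: socket.socket) -> None:
    conn, _ = sock.accept()
    with conn:
        (length,) = struct.unpack("!I", _recv_exact(conn, 4))
        audio = _recv_exact(conn, length)
        text = recognize(audio).encode()
        conn.sendall(struct.pack("!I", len(text)) + text)

def ask(host: str, port: int, audio: bytes) -> str:
    # What the thin device runs: send audio, read the transcript back.
    with socket.create_connection((host, port)) as conn:
        conn.sendall(struct.pack("!I", len(audio)) + audio)
        (length,) = struct.unpack("!I", _recv_exact(conn, 4))
        return _recv_exact(conn, length).decode()
```

The listening device then needs only a mic, a network stack, and maybe local wake-word detection; all the heavy lifting stays on a box you own.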
Hey, people with experience in speech recognition, please chime in!
Just the other day I was looking at CMU's Sphinx project for speech recognition. It seems quite capable, even for building something like this Google thing, but I haven't actually tried to use it.
Large-vocabulary recognition probably needs something better than a Raspberry Pi... so, just use a more powerful CPU.
Yes, Google has an incomprehensibly enormous database of proprietary knowledge and information. Good for them! If we want to build a home assistant that doesn't depend on Google, we'll have to make tradeoffs. That doesn't mean it has to suck.
I have "Offline speech recognition" with Google Voice Typing that seems to work perfectly well in airplane mode. The downloaded language pack (English) is 39 MB.
Here's the problem: not all of the devices you could use with it are self-hosted, and many don't allow anything but cloud interactions. Now, if you're talking about Home's dependence on a cloud for local interactions, then I get you.
But, on the other hand, if it's not open and you can't use any device with it... I'm going to be really upset on a personal level.
The reasons consumer IoT isn't huge yet are:
1) Disparate connection types (e.g., I could buy Z-Wave, Wi-Fi, BLE, etc., and they all onboard differently)
2) I can't choose which device I want to use with which platform because of politics.
Some of these devices (thermostats or security systems for instance) aren't impulse buys. If I have a Honeywell thermostat, and Home doesn't support it, I either buy a new thermostat or don't buy Home.
I rather suspect that the knowledge graph it uses is a hefty dataset. Probably not suitable for a home installation. And how would you keep it up to date without the cloud? Would you have it scrape websites and consume feeds itself?
A knowledge graph could be a separate service. It handles only a subset of requests anyway; there's no reason the request itself couldn't make a "pit stop" under my control before being sent off to fetch data. You could also use more than one knowledge graph provider in this case.
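The "pit stop" could be as simple as a local dispatcher that tries providers in order, with your own sources getting first crack. A minimal sketch (the provider names and matching rules here are made up):

```python
# Toy sketch of the "pit stop": every query passes through a local
# dispatcher that decides which knowledge provider handles it.
from typing import Callable, Optional

Provider = Callable[[str], Optional[str]]

def local_timetables(query: str) -> Optional[str]:
    # A user-controlled source, e.g. backed by my own scraper.
    if "bus" in query or "tram" in query:
        return "next tram: 12:34"
    return None

def cloud_knowledge_graph(query: str) -> Optional[str]:
    # Fallback: forward to whichever external provider I've configured.
    return f"(forwarded to cloud provider: {query!r})"

# Order matters: local sources are consulted before any cloud one.
PROVIDERS: list[Provider] = [local_timetables, cloud_knowledge_graph]

def dispatch(query: str) -> str:
    for provider in PROVIDERS:
        answer = provider(query)
        if answer is not None:
            return answer
    return "no provider could answer"
```

The point isn't the code, it's the control: because the dispatcher runs on my hardware, I decide which queries ever leave the house.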
The more important aspect of it is fixing the problems with said knowledge graph. For instance, Google doesn't have data on public transportation in my city. I could easily write a scraper that would fetch the bus/tram timetables for me - but there's no way to integrate that data source with Google Now. That's one example, but in practice Google's knowledge graph is pretty much useless for me. At best, it can sometimes answer some trivia questions.
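Such a scraper really is a small amount of code. Here's a toy version using only the standard library - the HTML layout below is invented, and a real city transit page would obviously need its own parsing rules:

```python
# Toy timetable scraper: pull (line, departure time) pairs out of a
# hypothetical transit page's HTML table using only the stdlib.
from html.parser import HTMLParser

SAMPLE = """
<table class="timetable">
  <tr><td>Tram 4</td><td>12:05</td></tr>
  <tr><td>Bus 18</td><td>12:11</td></tr>
</table>
"""

class TimetableParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.in_cell = False
        self.cells: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

def departures(html: str) -> list[tuple[str, str]]:
    parser = TimetableParser()
    parser.feed(html)
    # Cells alternate line / time, so pair them up row by row.
    return list(zip(parser.cells[::2], parser.cells[1::2]))
```

Feeding `SAMPLE` through `departures()` yields `[("Tram 4", "12:05"), ("Bus 18", "12:11")]` - exactly the kind of local data source an open assistant could let you register, and a closed one won't.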
> Please, please, please be a completely open, extensible platform...
That's one. The second one is, please make it self-hosted. No cloud bullshit.
I know I'll probably never live to see the second one come true.