While most of the posts are, and probably should be, concerned with the privacy implications of this device if/when it reaches people's homes, I also wonder about the plain old "will this flop?" question. Adding voice recognition to things isn't a new idea, and the threshold at which it becomes "good enough" for the general populace for any given use case is poorly understood and poorly quantified. Is this a use case people would be interested in transitioning to? Is this much better than just having a really good smartphone with voice recognition that's connected to speakers in the house? Will this see some success this holiday season? (If it won't come out of invite-only mode until after the holidays, will it see some success afterwards?)
Hard to say, but I can understand why Amazon wanted to try this out. In the worst case, it'll go the way of the Fire Phone and the Facebook phone, and in a year we'll have forgotten it existed. At best, it finds its way into millions of homes and Amazon will have some epic access to people's lives.
i think you're underestimating how deeply entrenched 'surveillance as a business model' has become amongst leading american tech companies in recent years. this has much less to do with latent consumer demand than companies wanting to leverage that information for advertising and related purposes and desperately trying to craft a value proposition that justifies and normalizes more intrusive forms of data collection.
Can/will this lead to a stifling of true innovation? If this had existed and there had been one in the famed Apple garage, or in the house rented by Zuck and his friends, would IBM have let Apple happen, or would Google have let Facebook grow? How many prescient individuals ("the future is already here, it's just not evenly distributed," as Gibson said) do you need to spy on to "manage" innovation in a way that prevents disruption? Could this surveillance era be the beginning of a technological dark age? What really disruptive things have happened since the iPhone (2007)?
This is the plot of the movie Antitrust. The idea of spying on potential competitors using surveillance didn't make sense in the movie and doesn't make sense in real life. If this were being done on a scale large enough to stifle innovation we would have heard about it a long time ago.
Really, the post-Snowden paranoia is getting out of hand.
why would any company large enough to fund R&D take risks to develop something truly innovative when it could just combine incremental innovation rolled up into an advertising/commerce-linked platform?
especially for anyone trying to develop new hardware, patent barriers have made it much more risky and difficult--especially for small companies--to build things that are truly innovative or disruptive. and to the extent anyone does, they're likely to get bought out by a major company.
It seems to me that the main innovation here is the quality of the microphone array.
As a professional sound recordist, the #1 challenge of recording from a fixed point is that the ambient noise and reflections within the room rapidly swamp the original signal. You can hear someone talk from the far side of a room in person very easily, because your brain constantly compensates for the acoustic environment it is currently in. But when you hear a recording made in a different acoustic environment (e.g. a scene in a movie), your tolerance for background noise is far lower, because you become acutely aware that the acoustics are not responsive to positional adjustments - in much the same way that the image on a screen is limited to a plane.
So when recording sound for film or video, we tend to use special microphones with long barrels (which are highly directional), or we fit actors and/or sets with very small microphones that only pick up sounds in close proximity and then transmit them by radio or wire. There are also parabolic microphones, but they're unwieldy and hard to focus, and they still pick up a lot of ambience, so they're better suited to things like sporting events where players repeatedly stand in predictable positions. The aim in recording sound this way is to capture the actor's vocal performance with as little ambient noise as possible; this is then supplemented in post-production with additional recordings of background elements that can be layered in a controlled fashion.

When recording on location rather than on a sound stage, a large percentage of retakes are made for sound reasons; you would not believe how noisy the world is until you start trying to make quiet recordings of it. On almost every film project I have to argue with the producers at an early stage to be allowed (and paid) to come on location scouts, because most people are incapable of assessing the noise level of a location - their brains are so good at filtering out ambient noise and focusing on the conversations they're having about how the place looks that they're oblivious to how it sounds! I've been taken to what I was told was a quiet location only to discover that it was in the flight path of an airport 8-o
Anyway, the nice thing about this machine is the differential microphone array at the top. As well as providing a more accurate signal by simple differencing, recording the device's own output and measuring what comes back in allows it to acoustically model the space it's in and then subtract that model from the input stream, isolating commands spoken from across the room. I'd guess that most of this signal processing takes place on a DSP and that the actual speech recognition is done in the cloud - though maybe not, as cheap CPUs pack so much punch nowadays. If you could hear the input to the speech recognition subsystem, it would sound oddly attenuated, stripped of any acoustic cues whatsoever.
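For anyone curious what that "subtract the model" step can look like, here's a rough sketch of adaptive echo cancellation using a normalized LMS filter - purely illustrative, with made-up signals and parameters; I have no idea what Echo actually runs:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-6):
    """Subtract an estimate of the device's own output (ref) from the
    microphone signal (mic) using a normalized LMS adaptive filter."""
    w = np.zeros(taps)           # adaptive filter coefficients
    out = np.zeros(len(mic))     # echo-cancelled signal
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]          # most recent reference samples
        echo_est = w @ x                   # current estimate of the echo
        e = mic[n] - echo_est              # residual = speech + noise
        out[n] = e
        w += (mu / (x @ x + eps)) * e * x  # NLMS weight update
    return out

# Toy demo: the mic hears a delayed, attenuated copy of the playback
# plus a weak "voice" tone; the filter learns and removes the playback.
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)                   # device's own playback
echo = 0.8 * np.concatenate([[0.0] * 5, ref[:-5]])  # delayed room echo
voice = 0.1 * np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
mic = echo + voice
clean = nlms_echo_cancel(mic, ref)
```

After the filter converges, the residual is dominated by the voice tone rather than the playback, which is the whole point of doing this before speech recognition.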
I think the device will succeed or fail based on how semantically responsive it is - although different people will have different expectations and tolerances. For example:
You: Echo, I want to hear some new music!
Echo: How about the new album from XYZ?
You: Sure, I'll give that a try.
(music plays)
You: Echo, this music sucks.
(music keeps playing)
If Amazon (or anyone) can get a leg up on this sort of responsive conversation rather than just requiring the user to dictate commands all the time, they'll have a winner, even if it's little more than an Eliza front-end to a search engine.
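To illustrate what even a bare-bones "Eliza front-end" might look like, here's a toy sketch - the patterns and action names are entirely made up, and a real system would need far more than regex matching:

```python
import re

# Hypothetical pattern -> action table; names are illustrative only.
INTENTS = [
    (re.compile(r"\bplay (some )?(?P<genre>\w+) music\b", re.I), "play:{genre}"),
    (re.compile(r"\b(sucks|hate this|skip)\b", re.I),            "skip_track"),
    (re.compile(r"\bnew music\b", re.I),                         "recommend"),
]

def interpret(utterance):
    """Map a spoken command to an action string, Eliza-style:
    first matching pattern wins, otherwise fall through to search."""
    for pattern, action in INTENTS:
        m = pattern.search(utterance)
        if m:
            return action.format(**m.groupdict())
    return "search:" + utterance

print(interpret("Echo, this music sucks"))  # -> skip_track
```

The interesting part is the fall-through: anything the device doesn't understand becomes a search query, so the conversation never dead-ends - which is roughly the bar I'd set for "responsive" here.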
You make it sound like nobody has a "leg up" on semantic voice commands. But we do, and it's perfectly usable, at least on Android devices, where you can ask Google Now questions like these:
Q: How tall is the Empire State Building?
Q: When was it built?
Q: Show me Italian restaurants nearby.
...
(I am not familiar with how Siri or Cortana handle similar queries.)
This is cool - I didn't see any detail of the mic array on the site; where did you find it?
Also, do you think it's feasible for Amazon to keep a voice profile on the speakers? I'm thinking if they are going to tout perfect voice recognition they'll have to make it person-specific at some point.
Well, it says: "Tucked under Echo's light ring is an array of seven microphones. These sensors use beam-forming technology to hear you from any direction. With enhanced noise cancellation, Echo can hear you ask a question even while it's playing music."
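For intuition about what "beam-forming" means, here's a minimal delay-and-sum sketch for a hypothetical 4-mic linear array - Echo's array is circular and its actual processing is unknown, so this is just the basic idea with simulated signals:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals, mic_x, angle, fs):
    """Steer a linear array toward `angle` (radians from broadside)
    by delaying each channel so the target direction adds coherently.
    signals: (n_mics, n_samples); mic_x: mic positions in meters."""
    n_mics, n = signals.shape
    out = np.zeros(n)
    for i in range(n_mics):
        delay = mic_x[i] * np.sin(angle) / SPEED_OF_SOUND  # seconds
        shift = int(np.round(delay * fs))                  # samples
        out += np.roll(signals[i], -shift)                 # undo the delay
    return out / n_mics

# Toy demo: a noise source arrives from 30 degrees; steering at the
# source adds coherently, steering away from it partially cancels.
fs = 16000
mic_x = np.array([0.0, 0.05, 0.10, 0.15])  # made-up 5 cm spacing
rng = np.random.default_rng(1)
base = rng.standard_normal(1024)
theta = np.deg2rad(30)
sig = np.stack([
    np.roll(base, int(np.round(x * np.sin(theta) / SPEED_OF_SOUND * fs)))
    for x in mic_x
])
on_target = delay_and_sum(sig, mic_x, theta, fs)
off_target = delay_and_sum(sig, mic_x, -theta, fs)
```

Steering "listens" in one direction purely in software; with seven mics in a ring you can do this in any direction at once, which is presumably what the marketing copy is describing.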
I've been working professionally with digital audio for nearly 20 years now, so I know a fair amount about DSP, acoustics and so on. Very basically, you can measure the acoustical properties of a space by playing a test sound known as an impulse and recording the response, then extracting the acoustical information with a mathematical technique known as deconvolution. This is used in various commercial products to let you simulate, say, the reverberant space of the Sydney Opera House on a recording made in a vocal booth, or reproduce the signature tone of a hideously expensive guitar amplifier in a cheap DSP-powered device.
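The deconvolution step itself is essentially a division in the frequency domain. A toy sketch with a made-up two-reflection "room" (illustrative only, not any particular product's method):

```python
import numpy as np

def deconvolve(response, test_signal, eps=1e-8):
    """Recover a room's impulse response by frequency-domain division:
    recorded = test_signal convolved with room, so
    room ~= IFFT(FFT(recorded) / FFT(test_signal))."""
    n = len(response)
    R = np.fft.rfft(response, n)
    S = np.fft.rfft(test_signal, n)
    return np.fft.irfft(R / (S + eps), n)  # eps guards near-zero bins

# Toy demo: a "room" with a direct path plus two echoes, excited
# by a noise burst played through the speaker.
rng = np.random.default_rng(0)
test_signal = rng.standard_normal(256)   # what the speaker plays
room = np.zeros(64)
room[0], room[20], room[45] = 1.0, 0.5, 0.25   # direct + two reflections
recorded = np.convolve(test_signal, room)      # what the mic hears
estimated = deconvolve(recorded, test_signal)[:64]
```

The recovered impulse response shows the direct path and both reflections at the right delays and amplitudes, which is exactly the acoustic model you'd then subtract from live input.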
When you have hardware where the speaker and microphones exist in a fixed physical configuration relative to each other, as here, the math gets that much simpler, because a lot of your coefficients become fixed quantities. With multiple microphones at fixed distances from each other, you can use small discrepancies in the phase of the input audio to infer information about the spatial characteristics of the environment. I don't know the exact dimensions of this thing, but just eyeballing it, I'd guess you could hack it to produce a reflectance map with a resolution of maybe under an inch.
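Here's a toy illustration of inferring direction from the inter-mic time delay via cross-correlation - simulated signals, an assumed 10 cm spacing, nothing Echo-specific:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def estimate_angle(sig_a, sig_b, spacing, fs):
    """Estimate direction of arrival (radians from broadside) from the
    time delay between two mics, found via cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)  # samples sig_a lags sig_b
    tdoa = lag / fs                           # time difference of arrival
    s = np.clip(tdoa * SPEED_OF_SOUND / spacing, -1.0, 1.0)
    return np.arcsin(s)

# Toy demo: simulate a source at 30 degrees by delaying one channel
# by the corresponding number of samples, then recover the angle.
fs, spacing = 48000, 0.1   # assumed sample rate and mic spacing
rng = np.random.default_rng(2)
base = rng.standard_normal(4096)
delay = int(np.round(spacing * np.sin(np.deg2rad(30)) / SPEED_OF_SOUND * fs))
angle = estimate_angle(np.roll(base, delay), base, spacing, fs)
```

With only two mics you get a single angle (and a front/back ambiguity); seven mics in a ring resolve direction unambiguously, and the same phase information is what would let you build up that reflectance map.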
Wow, thinking about it I hope it is hackable. Even if you were only able to get the raw input stream from the microphones and had to import the audio to another machine for all the DSP, a perfectly-calibrated speaker + phased microphone array for $200 is a steal.
I think Google's voice search already does this (feeding into audio captchas), but obviously there's a smaller crowd from which to source free audio recognition, assuming that people who are both sighted and hearing typically prefer visual captchas due to environmental constraints.
Most of the privacy concerns voiced in this thread don't sound like anything new to me, but this one made me take notice. You're suggesting that Google takes random queries from individuals and serves them as captchas to other random people? That sounds like a privacy disaster. Most of the time the queries would be anonymous, but there's certainly no guarantee they wouldn't contain identifying information.
>Is this much better than just having a really good smartphone with voice recognition that's connected to speakers in the house?
It doesn't look like it. Maybe it has a better microphone? In any case, this seems like a function that could be just as easily accomplished by a smartphone. (Maybe this is a wasted Fire-phone opportunity?)
I think what makes it better is that it's completely hands free and accessible to everyone in the vicinity.
I already have a smart phone with Google Now, and I have a Sonos, but I'd still consider getting this to solve this common use case in our household:
Every morning my wife or daughter asks what the weather is going to be like. My wife could ask Siri, but she doesn't always have her iPhone at hand. I always have my phone, so I ask Google Now. I think it would be fun to have an Echo in the kitchen so my wife or daughter could just ask and get an answer. And it goes way beyond that. My daughter loves taking my phone and asking Google Now silly questions. It is high entertainment for her. Echo would be a device that she could interact with without having to co-opt my (or my wife's) phone.
Worst case, it's a music streaming speaker that doesn't require you to stream via bluetooth. I wonder how the audio stacks up against something like the Bose SoundLink.
I've been using a Chromecast + an HDMI audio splitter as a cheap way to stream. It works really well for apps that support Chromecast, such as Pandora.