The people losing their marbles over this being some kind of dystopian, Turing-Test-passing stuff are missing the point about how limited this domain is.
People who answer phones to take bookings work from an extremely limited set of questions and responses; that’s why in many cases they can even be replaced by dumb voice response systems.
In these cases, the human being answering the phone is themselves acting like a bot following a repetitive script.
Duplex seems to be trained on this corpus. The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.
Most people working in hair salons or restaurants are very busy with customers and don’t want to handle these calls, so I think the reverse of Duplex, a more natural voice booking system for small businesses, would immensely help free up their workers to focus on customers.
> The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.
And looking even further into the future, we can imagine a day when the computers forgo natural speech and use a better-suited form of communication. Some kind of sequence of ones and zeros transmitted directly across the wire.
Lol, but if you think about it, what stops businesses from doing this today?
It's the lack of a universal API.
If a barber shop wants to make it possible for a 3rd party app to book appointments then they have to release some API. But that's not the end of it. The 3rd party app first has to discover their API; someone has to understand it and write code to use it, and then deploy that code.
This is a problem today because there is no universal API that all services can use.
With Duplex, verbal speech becomes a universal API that every service can parse and use to communicate with each other. Discoverability is also taken care of, by using publicly cataloged phone numbers on services like Google Maps, Yelp, etc.
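To make the "universal API" idea concrete, here is a minimal Python sketch of what a shared booking interface might look like. Every name in it (`BookingRequest`, `BookingService`, `InMemorySalon`) is hypothetical; no such standard exists, which is the point being argued. Writing the code is the easy part; getting every barber shop to agree on these fields is the hard part.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Protocol, Set


@dataclass(frozen=True)
class BookingRequest:
    """The few fields a salon or restaurant booking call actually carries."""
    customer_name: str
    service: str      # e.g. "haircut" or "dinner"
    start: datetime
    party_size: int = 1


@dataclass(frozen=True)
class BookingResult:
    confirmed: bool
    reason: Optional[str] = None  # why a booking was refused, if it was


class BookingService(Protocol):
    """The contract every business would have to agree on and implement (hypothetical)."""
    def book(self, request: BookingRequest) -> BookingResult: ...


class InMemorySalon:
    """Toy implementation: one customer per start time, no opening hours."""
    def __init__(self) -> None:
        self.taken: Set[datetime] = set()

    def book(self, request: BookingRequest) -> BookingResult:
        if request.start in self.taken:
            return BookingResult(confirmed=False, reason="slot already taken")
        self.taken.add(request.start)
        return BookingResult(confirmed=True)


salon: BookingService = InMemorySalon()  # structural typing: no inheritance needed
request = BookingRequest("Ada", "haircut", datetime(2018, 5, 10, 12, 0))
print(salon.book(request))  # BookingResult(confirmed=True, reason=None)
print(salon.book(request))  # BookingResult(confirmed=False, reason='slot already taken')
```

A `Protocol` is used so any business's existing system could satisfy the contract without inheriting from a shared base class.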
Microsoft and other enterprise vendors loved this end goal back in the early 2000s: the universal API, solved by middleware. That's why you had BizTalk, and BizTalk consultants who made more than SAP consultants (think of today's crazy Salesfarce consultants competing for gamification badges). For example, you could be a small insurance company submitting to a larger underwriter, and when you worked out the transactions per month you had to take $5 off each app just to pay for BizTalk infrastructure and licensing. People rode that gravy train hard. I'd be surprised if any of the BizTalk shit still remained, though; grand goals mean juicy enterprise sales. Oracle had a similarly crap product that was equally slow, painful and verbose; I can't recall the name. XML and its lofty goals, beyond what it actually was, can be compared to today's toxic ICO industry, though that's no reflection on XML itself.
The "badges" aren't won in competitions; they're watered-down tutorial gold stars. Instead of targeting real programmers, Salesforce built "Trailhead" for John in finance, who wrote some Excel macro and decided he should become a Salesforce "developer". That's why the small percentage of us consultants who actually come from a CS background can charge so much.
I read a great article about XML that spelled out that XML isn't a "language" or protocol, it's an "alphabet". It gives you the building blocks, but it doesn't translate what you build with them from one language to another. (I guess that's what XSLT is meant to do, but it's still not magic.)
Much of the "semantic web" work was directed at similar things. The reason we don't have it already after 20 years of web commerce is that it's an adversarial process. The business wants to mislead, upsell, or discourage a customer from asking for support; and likewise some small fraction of customers are looking to exploit the business.
This thread now contains a loop. Granted, it's a really long loop, all the way back to 1956, but it's a loop all right: you've essentially summarized the Dartmouth Conference and its assumption: "gee, that ought to be just a minor research subject - easy, right?"
Natural language is not an API. But a business can assume (or weed out quickly) that a caller is calling to hit a specific API endpoint, and so their language will eventually lead to one of those endpoints.
And similarly, a caller can assume the business they are calling is trying to lead them to an endpoint.
In both cases, a set of assumptions lets natural language act as an API, even if neither end could pass a Turing test with someone who wasn't interested in any of the endpoints.
edit: I read your point about language changing, and that is true. But if we only have machines using the language with each other (no more training with humans), we can also assume the language won't change.
APIs map to a single canonical concept. A "class" in Java actually has a correct definition. There are a nontrivial number of philosophers who believe that this isn't true for language.
To simplify that and put it in more technical terms, an API is prescriptive and a language is descriptive.
If a bunch of coders decide to start capitalizing "Class" their code won't compile. If enough people start using the word "ain't" it becomes a word, regardless of what the dictionary says (see "irregardless"). There is no single authority that can decide what is and isn't a canonical definition.
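The same prescriptive point can be shown in Python (swapped in for Java only so the example runs standalone): keyword spelling, including case, is fixed by the grammar, so no amount of community usage makes `Class` acceptable. The `compiles` helper is a hypothetical name for this illustration; `compile()` itself is a Python built-in.

```python
def compiles(source: str) -> bool:
    """Return True if the Python parser accepts `source`, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(compiles("class Booking:\n    pass\n"))  # True: `class` is the keyword
print(compiles("Class Booking:\n    pass\n"))  # False: usage can't make `Class` one
```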
This is why spoken languages evolve so much. Even languages where we've explicitly tried to go the opposite direction (like Esperanto) have evolved into multiple dialects, where subsets of the community simply ignore the standards and still communicate with each other just fine.
Note that this is the opposite of what you want with a federated, universal API. The whole point of an API is to standardize between unfamiliar devices. Language is actually pretty bad at standardizing communication between unfamiliar people. Even in the US, different regions and communities use different euphemisms, terms, and definitions.
You're right that the language is not the API, it is the medium. The "appointment booking API" exists only in our heads, it is a set of mutually understood conventions for interpreting a subset of language in order to create a record of the appointment somewhere.
If you call up a hair salon and start reciting poetry you'll get an error response in the same way as you would if you had sent malformed JSON to an endpoint. If you stick to the expected script you'll achieve success almost all of the time.
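The poetry-versus-malformed-JSON analogy can be sketched directly. This assumes a hypothetical three-field schema standing in for the "appointment booking API that exists only in our heads"; the field names and HTTP-style status codes are illustrative, not any real service's API.

```python
import json

# Hypothetical schema for the booking "API" that exists only in our heads.
REQUIRED_FIELDS = {"name": str, "service": str, "time": str}


def handle_booking(raw: str) -> dict:
    """Return an HTTP-style error for malformed input, a confirmation otherwise."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": 400, "error": "malformed JSON"}
    if not isinstance(payload, dict):
        return {"status": 400, "error": "expected a JSON object"}
    for field, kind in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), kind):
            return {"status": 400, "error": f"missing or invalid field: {field}"}
    return {"status": 200, "booked": f"{payload['service']} for {payload['name']} at {payload['time']}"}


print(handle_booking("Shall I compare thee to a summer's day?"))
# {'status': 400, 'error': 'malformed JSON'}
print(handle_booking('{"name": "Ada", "service": "haircut", "time": "12:00"}'))
# {'status': 200, 'booked': 'haircut for Ada at 12:00'}
```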
I wouldn't be surprised if a majority of human communication works this way, especially when it involves individuals who do not know each other. We have agreed upon limits, key phrases and words, and expected responses that allow most of the unpredictable stuff to be ruled out. All of that favours automation.
> We have agreed upon limits, key phrases and words, and expected responses that allow most of the unpredictable stuff to be ruled out. All of that favours automation.
I'm not sure I'd disagree, but it seems you're just describing a domain-specific language in a more roundabout way. We have a ton of protocols that introduce a set of limits, key phrases, words, and responses: Java, network protocols, XML, JSON, etc...
Assuming that you're correct, does it make sense to then assume that it'll be an improvement to standardize English rather than a set of IP headers? The agreed upon standards in English are (for the most part) informal, evolve constantly, and are hard to teach to computers. Automation favors predictability, and even the most generous interpretation of a natural language leaves me feeling like it's a step backwards.
We're going to standardize on an appointment booking API that exists only in our heads, that can only be taught using ML, and that is guaranteed to change over time in unpredictable ways? That seems wrong to me.
Just to add a fun example to demonstrate your point, the original name for Esperanto wasn’t “Esperanto”, it was (translated) “the international language”. It was given the nickname “Esperanto” after the chosen name of the creator, and the word itself is supposed to mean “one who hopes”.
Pedantry ahead: the various implementations of (J(ava)?|ECMA)Script tend to disagree with the point on pREscription and non-divergence (yay 20 MB of stubs and libraries for sort-of-coherent behavior). Moreover, the example of "Class" is just a pre-compile check convention, the compiler doesn't care.
(Postel w/r/t API - Does that make any sense? Or is that a fancy way to say DWIM?)
:) I don't think that's pedantic, it's a really good point to bring up.
On the web side of things, the W3C often describes their role as being partially descriptive.
From their doc on the Web of Things[0]: "The Web of Things is descriptive, not prescriptive, and so is generally designed to support the security models and mechanisms of the systems it describes, not introduce new ones... while we provide examples and recommendations based on the best available practices in the industry, this document contains informative statements only."
This is exactly for the reason you mention - if browsers collectively decide to go in a different direction, what the W3C says doesn't matter. The web standard is what the browsers do.
However, two things to keep in mind:
Even where browsers are concerned, there is still an API and a canonical version of "correct" for each browser. What we're trying to do is get those APIs to be compatible and consistent with each other.
Many people believe that language even on an individual level doesn't directly map to an actual reality; in web standards that would be like the browsers themselves not having their own consistent API.
But assume those people are wrong for a sec. Let's assume that language is just a standardization problem between different communities and individuals. Well, the W3C should teach us that even in the realm of computing, standardization is really stinking hard.
So even in that scenario, we have to ask whether standardization becomes easier or harder when every single individual in a community has the ability to change norms or introduce more language. We can't even get 3-4 browser manufacturers to agree on a single API, now imagine if every single hair salon owner could increase divergence whenever they wanted just by answering phone calls differently.
True, many people know English, but that doesn't mean they know statistics or some other complex domain. "Show me a k-means cluster of this dataset" is likely to be parsed but not understood by many English speakers.
I was attempting to communicate to the parent and grandparent that transcribing/parsing English and communicating are different things. We both need to share the model being spoken of to transfer, or communicate, the knowledge.
In a sense you are correct, but remember that Duplex only works because it is limited to a very strict, well-specified domain: scheduling for restaurants and hair salons. That is, Duplex only works because there already is a de facto specification for these transactions. It requires a lot less effort overall to just formally specify this de facto standard and deploy it once and for all, but of course a top-down approach rarely works. This actually reminds me of the failure of the industry to adopt semantic web technologies. Many organizations are providing the same services and could easily adopt a common “API.”
> Many organizations are providing the same services and could easily adopt a common “API.”
It's not in the interest of those organizations to adopt a common API. Everyone wants to suck in data and be the platform; nobody wants to give data away.
But that's the reason why Duplex is interesting, and why there's a grain of truth in what ZainRiz is saying:
Humans have settled on a de facto API for scheduling appointments. It uses the telephone as its interface, speech as its medium, and Duplex is exploiting it.
You first need "ye old barber shop" and "big corp barber shop" to all agree on what this common API should be. That's the old standards proliferation problem https://xkcd.com/927/
Which is what makes it very hard to define a new common API
However, they all already agree on the standard for natural language communication (in the context of a strict, well defined domain). That's the pre-existing common API which Duplex is using
It's basically what 'pjc50 said - it's not in the interest of businesses to expose their data via a common API.
WRT using English as a universal API, I think this is just dumb. You solve exactly zero problems by going that route, because the actual problems to solve (beyond businesses having no incentive to care) are exactly the same as with XML APIs, or any other APIs. The problems of discoverability and machine understanding are something the Semantic Web space has been dealing with for quite a while, and other people before that. Adding natural language to the mix only makes the job significantly more difficult, because you now have to deal with natural language parsing and understanding.
This sort of thing is exactly why the healthcare industry still uses faxes, even going electronic charts -> pdf -> fax -> pdf -> electronic charts in some cases.
This is even more fun because in the modern age, what very often ends up happening is:
electronic charts -> pdf -> fax -> fax machine as a service -> unsecured email -> pdf -> electronic charts
Compliance can sometimes help, but ultimately the data needs to flow, and people will do whatever it takes to make that happen. Until security is so easy that it's the default, these little loopholes will continue to be abused.
Phaxio co-founder here. We do a _ton_ of healthcare faxing, and we're starting to see a shift away from the "unsecure email" in applications. Granted, we can't see what our users are doing at all times, but being HIPAA compliant ourselves, we often work with our users to understand their systems and guide them towards compliance.
>> Until security is so easy that it's the default, these little loopholes will continue to be abused.
The simple way to think about this is that the government is more worried about unsecure email/email spoofing than it is about wiretapping.
Healthcare uses faxes mostly because HIPAA rules particular to format and security of electronic communications don't apply to faxes; it's a compliance hack.
I've literally been in the room when the legal and compliance offices gave advice, on both the construction of the relevant regulations and industry practice, that a payer relied on in deciding to use a process that created paper documents and then faxed them for certain purposes. But no, there's nothing published I can link to showing that this is the reason industry players make that decision.
I can, however, point you to the relevant section of HIPAA regulations on which it rests, the definition of “electronic media” at 45 CFR § 160.103, specifically this bit: “Certain transmissions, including of paper, via facsimile, and of voice, via telephone, are not considered to be transmissions via electronic media if the information being exchanged did not exist in electronic form immediately before the transmission.”
One or both speakers could use a handshake noise at the start of the call to tell the receiver that it's capable of "speaking" a modem protocol. It might change a little every time, or be of an especially low or high frequency so that a person doesn't realize they're talking with a computer. After handshaking, the receiver could send a URL that would allow the channel to be upgraded to the Internet... or not. English is a good fallback if both parties speak it and you can't find a more efficient channel.
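The handshake-then-upgrade flow can be sketched as a tiny negotiation, with the audio layer abstracted away. The `HANDSHAKE` marker, the `Endpoint` class, and the example URL are all hypothetical; a real system would signal with a tone pattern, not a text string.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-band marker; in practice this would be a short tone pattern, not text.
HANDSHAKE = "<machine-hello/1>"


@dataclass
class Endpoint:
    is_machine: bool
    upgrade_url: Optional[str] = None  # where this side would rather continue, if anywhere

    def greet(self) -> str:
        # A machine opens with the marker; a human just says hello.
        return HANDSHAKE if self.is_machine else "Hello?"

    def answer(self, heard: str) -> Optional[str]:
        # Only a machine that recognises the marker offers its upgrade URL (may be None).
        if self.is_machine and heard == HANDSHAKE:
            return self.upgrade_url
        return None


def negotiate(caller: Endpoint, receiver: Endpoint) -> str:
    """Return the channel for the rest of the call: an upgrade URL, or plain English."""
    offer = receiver.answer(caller.greet())
    return offer if offer is not None else "english"


bot = Endpoint(is_machine=True)
salon_bot = Endpoint(is_machine=True, upgrade_url="https://salon.example/bookings")
human = Endpoint(is_machine=False)
print(negotiate(bot, salon_bot))  # https://salon.example/bookings
print(negotiate(bot, human))      # english
```

Every path that isn't an explicit upgrade offer lands on English, which matches the "or not" fallback described above.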
> The 3rd party app has to first discover their API, someone has to understand it and write code to use it, and then deploy that code.
I think this would be a better integration point for AI. It could look at the fields, learn to fill them out automatically (name, age) and prompt the user for anything missing. Then, instead of the barber shop needing a universal API, users just need their personal AI (or a script) to interact with the shop's API.
It reminds me of the maybe apocryphal story how NASA invested time, money, and effort to develop a pen so astronauts could write in zero gravity, and the Soviets used pencils.
That doesn't sound right. I think you'd pay something like $99 per month for a SaaS product that manages bookings and provides an API. That's about what the average receptionist earns in one day ($12.25 per hour).
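Spelling out the comparison, assuming an 8-hour shift (the comment doesn't state the hours):

```latex
\underbrace{12.25\ \tfrac{\text{USD}}{\text{h}} \times 8\ \text{h}}_{\text{one day of receptionist wages}} = 98\ \text{USD} \approx \underbrace{99\ \text{USD}}_{\text{one month of SaaS}}
```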
Well there is initial setup at the very least. Hooking up whatever landline the client has with the SaaS solution.
Then whenever a significant change is required, you need to call back your "expert". New location? xk$, etc.
But I think the biggest concern is that suddenly the owner doesn't understand how his reservation system works. He used to be able to call Joe and know what's going on...
Someone needs to maintain that API; update it with the schedule of the stylists or update it for holiday hours, or remove availability for bookings done the old fashioned way.
"This is a problem today because there is no universal Api that all services can use"
"It's the lack of a universal API."
I disagree with that. We already have universal APIs. Adopting a newly established universal API is far more painful, and has a slower adoption rate, than using an existing one with global reach, like the telephone. Google Duplex-like systems address a broader scope of computer verbal communication, and it feels like a step in the right direction.
If both sides were to use Duplex, it would already know to just send 1010 instead of verbal communications.
Also, if it were unknown whether the other party was a bot, a bot could first send a standard test, according to some protocol, asking whether the other side is a bot, using some kind of sound to represent that. If it is, both would start sending machine-readable information to each other.
If I remember correctly, CORBA was all about standardizing APIs. You'd have your distributed "CORBA Objects" whose methods anyone who knew the unique object ID could call. The idea was that there would be a "standard library" for each industry. So all the barbers would implement a standard library "Appointment Schedule" object, all the exchanges would implement a stdlib "Orderbook" object, etc.
CORBA would generate RPC stub objects for you in various OOP languages, and potentially automate discovery, so you could say: give me an array of all the orderbooks of all the bitcoin exchanges, and ask each for the last price.
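A plain-Python analogy of that design (not CORBA itself, which defines interfaces in IDL and calls objects remotely): the abstract class plays the role of the industry-wide interface, and a dictionary stands in for the naming service that maps object IDs to objects. The exchange IDs and prices are toy values for illustration only.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Orderbook(ABC):
    """Plays the role of the industry-wide IDL interface every exchange would implement."""

    @abstractmethod
    def last_price(self) -> float:
        ...


class ToyExchange(Orderbook):
    """Hypothetical exchange with a fixed last trade price (toy numbers)."""

    def __init__(self, last: float) -> None:
        self._last = last

    def last_price(self) -> float:
        return self._last


# Stands in for CORBA's naming service: object IDs mapped to (here, local) objects.
REGISTRY: Dict[str, Orderbook] = {
    "exchange-a/btc": ToyExchange(100.0),
    "exchange-b/btc": ToyExchange(101.5),
}


def last_prices(registry: Dict[str, Orderbook]) -> Dict[str, float]:
    """Ask every registered orderbook for its last price."""
    return {object_id: book.last_price() for object_id, book in registry.items()}


print(last_prices(REGISTRY))  # {'exchange-a/btc': 100.0, 'exchange-b/btc': 101.5}
```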
Lol imagine my phone talking with an automated customer service line. Two machines, talking to one another, not using any of the existing protocols. My phone would have a database of questions to ask, form it into an English sentence, run it through a text-to-speech, transmit this to the other phone. Their "phone" would run a speech-to-text, run NLP, match it with its own database, and do the whole thing again in the opposite direction.
This gives a whole new meaning to "all of UI/UX is basically prettifying database queries".
And as you said, format isn't enough. You need semantics.
If two applications know enough about the other side to know how to formulate their voice queries, they know at least enough to exchange those same queries as text, and skip the stupidly wasteful text->speech->text process.
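The wastefulness argument can be made concrete. Assume both sides already share the same structured query (the `dict` below); this sketch models only the text legs, treating text-to-speech and speech-to-text as lossless, which flatters the voice route. The function names and regex are hypothetical stand-ins for an NLP pipeline.

```python
import json
import re


def to_english(query: dict) -> str:
    # Stand-in for "form it into an English sentence"; text-to-speech is skipped.
    return f"Hi, I'd like to book a {query['service']} for {query['party_size']} at {query['time']}."


# Stand-in for speech-to-text plus NLP: it only understands the phrasing it was built for.
PATTERN = re.compile(r"book a (?P<service>\w+) for (?P<party_size>\d+) at (?P<time>[\d:]+)")


def from_english(sentence: str):
    match = PATTERN.search(sentence)
    if match is None:
        return None  # phrasing not recognised
    return {
        "service": match["service"],
        "party_size": int(match["party_size"]),
        "time": match["time"],
    }


query = {"service": "table", "party_size": 4, "time": "19:30"}
print(from_english(to_english(query)) == query)  # True, via a brittle detour
print(json.loads(json.dumps(query)) == query)    # True, with no parsing guesswork
print(from_english("Could we get a table for four around half seven?"))  # None
```

Both round trips succeed on the scripted phrasing, but only the English one breaks when the wording drifts.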
(And if the world weren't so full of adversarial practices driving engineering stupidity, the developers would agree on an efficient binary format beforehand.)
A large part of introducing new tech is enabling the transition from existing tech. Maybe if we scrapped all old cars and allowed only autonomous vehicles, we’d be sleeping at 120 km/h next week. But some businesses still run on COBOL.
There are already apps/technologies that transmit information through audio at frequencies not audible to humans. It should be trivial to adapt this so that if two AI systems are interacting they can perform an "AI-handshake" in the audio at the start and then switch to a more efficient form of communication.
Correct. There are several levels at which this applies:
Phone hardware (microphones, speakers) is only calibrated to pick up the frequencies that are useful for human speech.
The sampling rates used by phone audio codecs (typically 8 kHz, or 16 kHz for wideband) cap the highest reproducible frequency at half that, about 4 kHz or 8 kHz, well short of the human ear's limits. They aren't even trying to reproduce everything the ear can detect; just human speech to decent quality.
Codecs are optimized to make human speech intelligible. The person listening to you on the phone isn't receiving a complete waveform for the recorded frequency range. The signal has been compressed to reduce the bandwidth required, and the goal isn't lossless compression; it's decent-quality speech after decompression.
It's completely possible to play tones alongside speech that we won't notice, but in the general case, not tones that the human ear can't detect.
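A quiet in-band tone riding alongside speech can be picked out with the Goertzel algorithm, the same technique commonly used to detect DTMF keypad tones. This sketch assumes an idealised 8 kHz channel with no codec compression or noise (which, as noted above, real phone lines do apply); the 3000 Hz "bot marker" frequency and the 440 Hz stand-in for speech are arbitrary choices.

```python
import math
from typing import List

SAMPLE_RATE = 8000  # Hz, narrowband telephony; nothing above 4000 Hz can be represented


def tone(freq_hz: float, n: int, amplitude: float = 1.0) -> List[float]:
    """n samples of a sine wave at freq_hz."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def goertzel_power(samples: List[float], freq_hz: float) -> float:
    """Power in the DFT bin nearest freq_hz, computed with the Goertzel algorithm."""
    n = len(samples)
    k = round(n * freq_hz / SAMPLE_RATE)  # nearest DFT bin index
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


n = 800  # 0.1 s of audio
voice = tone(440, n)                   # loud stand-in for speech
marker = tone(3000, n, amplitude=0.1)  # quiet, in-band "I'm a bot" tone (hypothetical)
mixed = [v + m for v, m in zip(voice, marker)]

print(goertzel_power(mixed, 3000) > 1000)  # True: the marker is clearly present
print(goertzel_power(mixed, 2000) < 1)     # True: an unused frequency stays near zero
```

Goertzel is used instead of a full FFT because a detector only needs to watch one or two frequencies.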
The person who invents a way to replace the little green phone icon with a little human icon for when they want to talk to a human will be a zillionaire.
But that is the far future. Realistically, I just don't see this as feasible any time soon.
This doesn't even need to happen on the voice connection. A register of "does this number map to a known system" would be enough. Then it's just up to common APIs.
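A minimal sketch of such a register, assuming a hypothetical public mapping from E.164 phone numbers to machine-readable endpoints. The numbers use the 555-01xx range reserved for fictional use, and the URL is an example domain; a caller prefers the API when one is registered and otherwise falls back to an ordinary call.

```python
from typing import Dict, Optional

# Hypothetical public register: phone number (E.164) -> machine-readable booking endpoint.
MACHINE_REGISTER: Dict[str, str] = {
    "+15555550100": "https://salon.example/api/bookings",
}


def choose_channel(phone_number: str) -> str:
    """Prefer a registered API; otherwise fall back to an ordinary voice call."""
    endpoint: Optional[str] = MACHINE_REGISTER.get(phone_number)
    return endpoint if endpoint is not None else f"voice:{phone_number}"


print(choose_channel("+15555550100"))  # https://salon.example/api/bookings
print(choose_channel("+15555550199"))  # voice:+15555550199
```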
The issue here is interfacing with ancient telephone systems.
If the two bots were to slip in some subliminal beeps and boops to recognize each other; then they could change their speech to very quick binary communication.
You can't just drop compatibility. We will have A.I. trained voice systems that mimic natural speech just enough to be understood by duplex while compressing the exchange to a minimum. Data transfer will be measured in microwords per second. Future versions of duplex will of course detect this kind of compressed speech and reply in kind, falling back to normal speech only if the immediate response resembles North American confusion.
To save CPU/GPU/TPU, there should be a high-frequency sound, one that people can’t hear, so that computers talking to each other can switch to a faster way to communicate.
If this is included, you also have a way to detect whether you are talking to a bot/duplex.
Yes, but it could be very subtle and low bandwidth at first, and once both sides were convinced the other was a machine switch to a full speed screeching 56k modem [1].
Or just communicate "hey actually connect to this HTTP/XMPP/whatever address on the internet and we'll continue this from there"
1. Probably a bit slower, I've heard modern VoIP lines don't work well with traditional modems?
Sure, but they should consider the future. That number will only get smaller, especially if Duplex or other services say "we'll handle all your phone and online bookings for you for $SMALL_FEE, and still forward other inquiries to your phone as before".
So just put it in the hearable spectrum. Phones already make all kinds of sounds that no one under the age of 35 has any clue what they mean or why they are needed, and frankly they aren't.
Yes, and lots of sounds that the human ear can hear but are not used to decode speech. Also the audio is frequently recoded as calls pass from infrastructure to infrastructure.
However, while this is useful to bootstrap a new technology rollout, 10 years on it's just technical debt.
The amount of tech debt in the system behind credit cards is crazy, because charges were originally phoned in to the card issuer manually, and everything since (magstripe, chip & PIN, online-only transactions, etc.) has been built on top. The leaky abstractions show through in daily difficulties with the card system for end users, like the lack of a real-time balance (in some cases), lack of transaction metadata, etc.
On the other hand, the credit card system's backcompat does mean that you can still accept credit cards when the power's out. You just write down the number (or use an imprint machine) and let the customer go. And the semantics of credit mean that you can still make that charge even if an online transaction would have resulted in a decline—offline transactions are never declined, they just cause overdrafts.
I wonder if that resilience is worth the immense amount of infrastructure and engineering that is spent on maintenance of the technical debt. Does that maintenance drive up processing fees? I suspect it does, but not in any amount sufficient to explain the size of those fees.
True! Although I think that's more of a byproduct, rather than something designed into the system at the moment, and I suspect we could do better with designing it. For example, I doubt many shops have those imprint machines any more.
I also think the tech debt is holding us back a long way. For example, why can't I see itemised receipts in my card statement? Paper receipts are on their way out, email receipts aren't linked to anything or structured data, but being able to see that I've spent $120 on shipping with Amazon in the last 12 months, so a Prime subscription would make sense, would be a great sort of financial tool to have. That isn't possible in the card network at the moment.
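As a sketch of the financial tool described, assuming a hypothetical line-item record that, per the comment, the card network doesn't carry today. The categories, amounts, dates, and the `annual_fee` threshold are all toy values for illustration, not real statements or a real subscription price.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class LineItem:
    merchant: str
    description: str
    category: str  # hypothetical merchant-supplied tag, e.g. "shipping"
    amount: float
    day: date


def spend_in_category(items: Iterable[LineItem], category: str, since: date) -> float:
    """Total spent in one category on or after `since`."""
    return round(sum(i.amount for i in items if i.category == category and i.day >= since), 2)


# Toy itemised statement data.
statement = [
    LineItem("Amazon", "USB cable", "goods", 9.99, date(2017, 8, 1)),
    LineItem("Amazon", "Express delivery", "shipping", 60.00, date(2017, 8, 1)),
    LineItem("Amazon", "Express delivery", "shipping", 60.00, date(2018, 2, 3)),
    LineItem("Amazon", "Standard delivery", "shipping", 15.00, date(2016, 12, 24)),
]

shipping = spend_in_category(statement, "shipping", since=date(2017, 5, 10))
annual_fee = 99.0  # hypothetical subscription price, not a real quote
print(shipping)               # 120.0 (the 2016 delivery falls outside the window)
print(shipping > annual_fee)  # True
```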
The imprint "machines" are still issued, but all* of them are just tossed into storage and never used (training users on them? Pointless).
Instead of a high-frequency tone, just watermark the background noise: the background static, the voice samples, or even the speech patterns. All you really need is something like 30 bits of data to identify a call as a Duplex call with very high probability, and I’m certain you can find a way to imprint that many bits into the frequency spectrum of your background noise.
I like this. So basically the old-school modem sound, but at a frequency that can't be heard. It would only take a fraction of a second to send out the feeler, and it wouldn't be noticed if a live human picked up. It could even detect a human and hand the call over to a live representative without anyone noticing.
It doesn't have to be outside the audible range (that's probably filtered out anyway); it could just be a really quick burst handshake identifier, which could encode an IP address to communicate over instead of a crummy phone line.
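Rough arithmetic behind the 30-bit figure, assuming the detector requires an exact 30-bit pattern and that unmarked audio decodes to uniformly random bits:

```latex
P(\text{false match}) = 2^{-30} = \frac{1}{1\,073\,741\,824} \approx 9.3 \times 10^{-10}
```

That is roughly one false positive per billion ordinary calls.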
Duplex: <beep beep> (I'm available to chat)
Other bot: <boop boop> (Oh hai! Wanna get intimate?)
Duplex: <blaaaaaaaart> (Come find me on duplex://64.233.160.0)
As an end user, picking up the phone to hear a beep is not pleasant. I'm likely as not to immediately hang up, as I've come to associate beeps at the start of calls with scammers.
What about if the caller makes no such sounds and the recipient makes the offer to handshake?
Anyways, this aspect is more amusing to just think about than anything else. That said, I really hope companies who produce these next-gen AI robo-callers actually have the courtesy of identifying themselves as such. I want to know if I am talking to a human or Duplex. Yes, I may hang up, but I feel uncomfortable being fooled into thinking I am talking to a human when I am not.
There's no reason why it can't be encoded as elevator music - you already hear it all the time, they might even throw in a looping "Thank you for calling. Your call is important to us" to keep you from freaking out.
Phone lines are optimized for frequencies humans can hear, though I'm guessing you could get enough bandwidth out of the edges to convince the other side you're a machine without bothering a human too much.
Yes, it would be nice if, in parallel, Google came up with open machine-friendly protocols for each of the use cases Duplex supports, with a clear migration path away from calls (e.g. businesses start publishing the endpoints and protocols they support alongside their phone number, so you could skip the call completely).
The problem is that computers will now be interacting with people, and people will become unsure whether they are talking to a computer or a person. It will create a bewildering world full of mistrust. I would argue that there should be a law requiring computers to identify themselves.
They're not unaware of this concern. From the article:
> The Google Duplex technology is built to sound natural, to make the conversation experience comfortable. It’s important to us that users and businesses have a good experience with this service, and transparency is a key part of that. We want to be clear about the intent of the call so businesses understand the context. We’ll be experimenting with the right approach over the coming months.
So they indicate that they are aware of the problem - and instead of doing a straight-forward "hey, I'm a bot", their suggested strategies are "being clear about the intent of the call" and "experiment with the right approach over the coming months"?
To me that quote sounds more like a polite way of saying they definitely won't reveal to callers that they are talking with a bot than them taking the concern seriously.
Some of the conversation examples on the blog page where they invent a sort of story for the caller ("I'm calling for a client") would fit that theory.
People are prejudiced against talking to bots because of how bad they currently are. I despise calling services that use voice recognition; oftentimes it's easier to have an options menu.
If they can get a way of saying "I'm a bot" without people hanging up the calls, I'm all for it -- otherwise, "I'm calling for a client" or similar is the best for everyone involved (assuming everything works).
Businesses also need to have a way to report problems to Google, like if they are getting spammed by Duplex or want to opt out.
> People are prejudiced against talking to bots because of how bad they are currently
I'm prejudiced against talking to bots because they're bots. They don't have empathy, whereas from voice interaction I expect a human I can relate to and desire to help and be courteous with. It's a fundamentally different type of interaction and I will be annoyed anytime that one is confused for the other.
Well, bad news: all the (presumed) humans I've talked to on any scripted call are worse than bots: they do have empathy, but the script forbids them to use it. An out-and-out bot is free from this prison, at the very least.
How about something like "I'm an automated agent calling for a client" - correct, not misleading, but using terminology which isn't likely to be immediately disconnected _right now_.
Of course, if they screw it up, they'll burn that terminology too.
If we accept your premise that bots become so close to humans that they are virtually indistinguishable, then at that point does it matter who the "person" on the other end of the line is? I'd argue it doesn't because the outcome would be the same.
I think there will always be a difference. When I'm talking to someone, there's an emotional connection and responsiveness there. I'm trying to help a human out -- I'm putting effort into being polite, into considering their point of view etc.
If I found out that they were a robot (this is probably unpreventable; even if the technology gets amazing, surely there will be edge-case breakdowns/bugs/etc.), my trust is broken. That would have an emotional consequence e.g. frustration.
There will always be amazing technology wielded by awful developers, and in this case the outcome is emotionally hazardous. The impact of that is not easy to quantify e.g. by any economic indicators, but it's there.
Also, it's likely that robots will not be as polite back, so we're degrading society's trust and empathy all around. For example, Google's AI call to a restaurant was rude, and not for reasons it seems to yet understand.
What happens then if I'm a human that masquerades as a computer? Seems like a neat way of explaining away a number of social faux pas or drunk-dialling. "Me? Noooo... that must have been my PhoneBot 2000. Supposedly the next firmware update should solve that kind of problem."
That said, I don't necessarily disagree; there are going to be lots of these kinds of issues that need sorting out before we reach a Culture-level of AI interaction.
When phoning like this, you already have to wonder whether the person on the other end is an idiot who will screw things up. I'm not sure the possibility of it being a Google bot will make much difference.
I love how no matter how amazing something is, someone will eventually say it's trivial, even though it's taken the smartest people on earth decades to figure out how to do this.
>The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.
If you ever needed proof of the negativity in the tech community, this thread is it.
Google actually delivers a real world product incorporating the most advanced AI we've had the chance to experience, and half the Hacker News comments are "Wow, this is so dumb, can't wait for it to become technical debt in 10 years".
One time, a giant tech company staffed by geniuses released a super-useful tool that saw massive adoption. 10 years later that tool became technical debt.
Wait, it was way less than 10 years.
I'm talking about Google Realtime. Or Reader? Or Buzz?
No, wait, I'm talking about aggressive Twitter API deprecation/removal.
Wait, nevermind, I'm talking about Facebook.
You get the idea. What's revolutionary today sometimes becomes the substrate for future innovation. Sometimes it gets cast by the wayside, even in the face of significant "user" (developer) popularity.
That's not proof of negativity; just realism. Negativity would be "no new innovation will ever get traction". Optimism would be "all new technologies will change the world" (cf. https://www.npmjs.com/browse/depended). This is neither.
Because the idea of running machine-to-machine communication through Duplex, going binary -> text -> voice -> text -> binary, is just fucking dumb.
It's not proof of negativity in the tech community. If anything, it's proof that the tech community often can't pause and consider whether a particular idea makes engineering sense.
> The end game is clearly to use an api, not this.
My understanding is that API integration is what WeChat is in China -- every hair salon and equivalent-of-corner-pizza shop has some WeChat integration, payment and all.
Voice bots like this will have the advantage of ubiquity. At least a couple of years ago, before every restaurant had 5 tablets for all their Seamless/Grubhub/Chowhound/whatever apps, pretty much the only reason the fax machine was still around was restaurant ordering. Although there were clearly better ways of doing it (see how Domino's reinvented itself as a tech company), the sheer ubiquity of fax as the lowest common denominator kept the tech around.
In that light, it's kinda like the cell-phones-leapfrogging-landlines-in-developing-countries argument... part of the WeChat story involves a massive population entering the consumer class at a time when everything was digital. Call me out if this is a gross over-generalization, but in a way, the WeChat population never had to deal with the backwards-compatibility of people growing up ordering a pizza over the phone.
It'll be interesting to see how the API-centric approach (WeChat) plays out versus the lowest-common-denominator ubiquity approach (voice bots). I'd stop short of calling APIs the end game, though.
Also, in the WeChat model, everyone is tied to WeChat and can't go around it.
This voice-based model can be integrated into any existing system. It already has the network effect going for it and it's not tied to the fate of any one company.
No, you just become tied to Google. You can't reimplement Duplex yourself without reimplementing both their API and their voice recognition verbatim, and the best way to do that is to simply use their product.
Not necessarily. Given the rate of progress in AI and the number of companies working on it, it's only a matter of time until Duplex-like tech is reimplemented by other large corps like Amazon and Microsoft, and eventually it could even be implemented by startups if there is a decent business case for it.
> My understanding is API integration is what WeChat is in China -- every hair salon and equivalent-of-corner-pizza shop has some WeChat integration, payment and all.
Meanwhile in NYC, good luck getting the bodega on the corner to even take your debit card.
Eh? The vast majority of bodegas in NYC take credit cards. Many have minimums, or charge a fee if you don't hit the minimum. But I haven't been to a bodega in the last ~5 years that didn't accept credit cards in one form or another. More than 5 years ago, sure, but now pretty much everyone's got them (even in the rougher neighborhoods).
If they don't accept cards, they almost always have an ATM.
My complaint in NYC is the uptick of "cashless" places, that don't accept legal US tender. I like using cash, I don't want it to go away.
Ya, I agree that an API is the right solution but the benefit of this is that both sides aren't forced to adopt it at the same time, it's more resilient to changes on either side...
Where did I say it’s trivial? What I’m saying is, people are anthropomorphizing and extrapolating what this system is doing far above what it is actually doing, and then using it to justify fears that Skynet’s around the corner.
This system can’t pass the Turing test; it would probably be tripped up by a simple question about itself or about a subject outside its domain, like what kind of food it likes.
You’ve got people in this thread hyperventilating about AI duping your voice and then becoming a doppelgänger, and therefore we need laws immediately to stop this dystopia? Let’s calm your cortisol levels for a second and stop acting like Thanos just got the last gem.
I don't think any technically minded people on HN are extrapolating what this system is capable of doing, but are (rightly IMO) extrapolating what kind of systems will be announced in 2, 5, 10, etc. I think even HN is greatly underestimating what world class researchers paired with an army of world class engineering talent are capable of.
I think it passes a limited Turing test at the domain it’s trained in. I doubt any of the people on the other end of the call would even suspect it’s a computer. That’s an amazing achievement.
"Duplex seems trained against this corpus. The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex."
So, you would just have an automated booking system API which is better handled by not placing calls as its form of communication. Right?
This is an API that requires no computer on the user's end and is portable across different implementations from different companies.
It's not ideal. Actual standardized APIs are better. But, uh, have you ever worked with industry standard APIs? I have, and standardized is not how I would describe them.
I still think there's a need for standardized APIs in this situation. At some point, the context constraints mentioned in the blog post have to get translated into some action with parameters. I'm guessing that action will be API calls to other Google products behind the Duplex Google Assistant UX.
"Ok Google, can you move my doctor's appointment this Friday to next week? I have a conflict." -> calls the doctor's office and reschedules -> adapts the result to a rebooking action with partners (i.e., an API call to your Google Calendar) -> applies the action and responds to you.
There is still quite a bit missing from this to be a useful AI product. It's getting really close though. I can't wait until this makes it into Google Assistant and it can call a restaurant to ask about gluten free options while I'm driving.
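The call -> action -> apply flow described above can be sketched roughly as follows. This is a hypothetical illustration only: `CallResult`, `to_calendar_update`, `apply_update`, and the in-memory calendar dict are made-up stand-ins, not Google's API, and the sketch assumes the voice call has already been reduced to a structured result (which is the genuinely hard part).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CallResult:
    """Hypothetical structured outcome extracted from the phone call."""
    confirmed: bool
    new_start: Optional[datetime]  # None when the office couldn't offer a slot


def to_calendar_update(event_id: str, result: CallResult) -> Optional[dict]:
    """Translate the call outcome into a calendar action, or None if nothing changed."""
    if not result.confirmed or result.new_start is None:
        return None
    return {"event_id": event_id, "start": result.new_start}


def apply_update(calendar: dict, update: Optional[dict]) -> str:
    """Apply the action to an in-memory calendar and build the reply for the user."""
    if update is None:
        return "The office couldn't reschedule; your original appointment stands."
    calendar[update["event_id"]] = update["start"]
    return f"Rescheduled to {update['start']:%A %H:%M}."


calendar = {"dr-appt": datetime(2018, 5, 11, 10, 0)}
update = to_calendar_update("dr-appt", CallResult(True, datetime(2018, 5, 18, 10, 0)))
print(apply_update(calendar, update))
```

The point of the sketch is that only the first hop goes over voice; everything after it is an ordinary structured API call.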
You're absolutely right! There's a huge need for standardized APIs for interacting with outside systems for this use-case. In your example, your doctor's office.
In practical terms, there may be some minor issues such as incompatible multiple implementations and adoption costs. But that's made much easier to handle by a very small number of expected consumer systems.
As for interactions with end-result partners, well. I've worked with standards designed to represent such highly general cases (xCBL and cXML). They're invariably rife with interoperability problems and other issues arising from overly broad standards. These tend to not get better over time as much as one might hope, as it's not easy to continuously update standards at a reasonable speed across N target types of partners. Keeping up with how usage evolves is never easy.
The best approaches to this that I've seen in use are those that focus on providing a vehicle for arbitrary data for delivery to the app - like HTTP or TCP. Getting more specific is the route to madness. Which, unfortunately, is probably precisely the bit you'd most like standards around.
You're completely right. There's a very real and very important need for standards here. There just might be some issues worth mentioning that might arise from the attempt to create and rely on them.
This is my initial reaction too, but APIs need work for coding, integration, testing, making sure data is sent in the right format, etc., while voice-robot-to-voice-robot will just work out of the box.
You're completely abstracting away all the coding, integration, testing, and "data formatting" (read: grammar) involved in Duplex, which seems to be much more complex than a REST API.
That end game seems very wasteful and Rube Goldbergian. Why use an 1800s technology as the transport layer when salons, yoga studios, and more already use things like MindBody, which already has an appointments API? I’d honestly be way more interested if this integrated with MindBody, OpenTable, DMV websites, car dealer appointment systems, medical office scheduling systems — all of which already have APIs or at least web pages. But then, saying that you wrote and will maintain some WWW mechanize stuff that posts forms is way less marketable to the general population who see this as magic.
Also, they’ll discontinue it after a year once it gets enough negative press about how it doesn’t work well and loses business for businesses.
Because there are lots of businesses that don't have a booking API and don't see the need for one, or can't afford one. This kind of technology allows interaction with them, because it's easier to interact on a common transport protocol than to expect everyone to change to your preferred one.
That being said I feel like it won't be long until this tech is used for scamming, phishing and pranks.
Because businesses can offer Duplex for customers to call and speak to after hours if they still want to keep a human phone assistant. It's modular, not Rube Goldbergian.
Likely it'll be a transition technology for businesses still using phone appointments as their primary API, and eventually it'll ease the path for more direct integration.
It's unfortunate that in 2018 we still have to resort to kludgey workarounds like this with so many restaurants and hair salons instead of being able to make reservations online. There are services like OpenTable but even here in Silicon Valley only a small minority of restaurants use them. It seems like there's a huge opportunity if someone can crack the market.
That's the universal criticism of every technology.
"I think it's more unfortunate that so many people are just so opposed to looking up directions to wherever they're driving before they get in the car."
"I think it's more unfortunate that so many people are just so opposed to paying their bills every month."
"I think it's more unfortunate that so many people are just so opposed to carrying cash around and counting change."
"I think it's more unfortunate that so many people are just so opposed to coming over and talking in person."
"I think it's more unfortunate that so many people are just so opposed to washing their dishes by hand."
"I think it's more unfortunate that so many people are just so opposed to doing long division."
I don't mind talking to "someone". But I sure mind talking to customer service folks who have absolutely no interest in talking to me and make it as difficult as realistically possible.
Worse, they're probably gonna spend three quarters of the time trying to sell me shit I don't want and make me fight against it.
Online, I can ignore any prompt and just click next next next finish, and the form won't be in a bad mood. I have no interest in talking to an annoyed clerk, and they obviously don't want to talk to me, so we can just avoid each other.
When I was signing up for Internet at my new apartment, there were 3 ways I could do so: by contacting my apartment's official representative, online, and through the regular phone system.
I used all three. First, I contacted the representative, who gave me a price. Then I looked online and found the actual price (considerably lower). When I tried to sign up online, I was told I'd need to provide an extra security deposit because I have my credit reports frozen.
So I called the generic phone system. The agent gave me another price (lower than my official representative, but still higher than the website). I pointed out the website price, and the agent switched me to that price. I asked if I'd need to provide a security deposit and they said no. They finished signing me up, and everything was fine.
The whole process was annoying, I would have loved to have someone else do it for me. This was the perfect time for a phone assistant to step in. But that would have been a really bad idea with Duplex.
The point is - an automated call system probably doesn't protect you from an abusive representative. If I had Google Duplex handle either of my calls, I'd be paying more for my Internet right now, because I guarantee Duplex isn't smart enough to determine if a representative is lying about an advertised price.
95% of the time this probably doesn't matter, because most people I talk to on the phone aren't abusive. But if someone does want to upsell you or bury you in service fees or waste your time, Google Duplex is probably making their job easier, not harder.
It’s the lack of indication that you’re talking to an automated assistant and the fact that it uses human affectations in its speech that creep me out hardcore.
As crazy as it seems at first glance, a double Duplex system would be a really beneficial result.
What stops businesses from setting up APIs for scheduling services today? It's the lack of a universal API.
If a barber shop wants to make it possible for a 3rd party app to book appointments then they have to release some API. But that's not the end of it. The 3rd party app has to first discover their API, someone has to understand it and write code to use it, and then deploy that code.
I'll repeat for emphasis: this is mainly a problem today because there is no universal API that all services can use.
With Duplex, verbal speech becomes a universal API that every service can parse and use to communicate with each other. Also, discoverability is taken care of by publicly cataloged phone numbers on services like Google Maps, Yelp, etc.
The problem of a universal API is entirely orthogonal to voice communications. Duplex is not a Turing-complete system; it's just an API behind a voice recognition layer. All the important problems for universal APIs happen after that layer.
Ultimately, what you describe can work perfectly only when everyone is using Duplex, which is equivalent to everyone using Google-defined API. That's not universal, because you have one entity behind it.
The only way this brings us somewhat closer to a universal API is that if you expect it to handle humans as well, it introduces some constraints on the space of possible APIs, which could make it easier for everyone to agree on a common format. Constraints of natural language processing without a human-level AI require your API to be very fuzzy and very lenient. There's nothing stopping one from implementing those same constraints over a text or binary protocol. Nothing except that businesses have no reason to do it.
Not everyone has to use Duplex; everyone would just have to use a system that makes it seem like a human is talking to the other person, provided you limit the context to a particular domain (like taking reservations).
This system could be developed by any company with sufficiently advanced ML chops.
>The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.
Is this satire? If this is indeed the future, I wonder if there is an irresistible urge to make systems as inefficient as possible. Kind of "like gases expand to fill the container, applications become as inefficient as the power of the hardware allows".
Dude, this system has better conversation skills than me. I mean literally. Well, I'm autistic + ESL (+ kinda too why). But still, it's kinda incredible that a system is actually better.
"The funny thing about AI is that it’s a moving target. In the seventies, someone might ask “what are the goals of AI?” And you might say, “Oh, we want a computer who can beat a chess master, or who can understand actual language speech, or who can search a whole database very quickly.” We do all that now, like face recognition. All these things that we thought were AI, we can do them. But once you do them, you don’t think of them as AI. It has this connotation of some mysterious magical component to it, but when you actually solve one of these problems, you don’t solve it using magic, you solve it using clever mathematics. It’s no longer magical. It becomes science, and then you don’t think of it as AI anymore. It’s amazing how you can speak into your phone and ask for the nearest Thai restaurant, and it will find it. This would have been called AI, but we don’t think about it like that anymore. So I think, almost by definition, we will never have AI because we’ll never achieve the goals of AI or cease to be caught up with it."
First of all, you can't brute-force chess. Algorithms that beat high-level chess players are non-trivial, even with today's computers. In fact, top engines today all use carefully designed heuristics hand-crafted by experts -- this notion that Deep Blue, Deep Fritz, etc. were dumb "brute force search engines" is a misleading tale.
Second, in the 70s there wasn't enough computing power even for quite clever algorithms (which probably didn't exist yet) to beat top chess players. Chess was seen as a grand goal requiring the utmost intelligence -- while it is obvious in hindsight, at the time the intuition was probably that extremely "intelligent" humans were required to play chess, and in fact the best chess players were among the most "intelligent" people -- it was a clearly, exclusively intellectual task that few people were competent at. So many believed that chess would be one of the greatest challenges for AI (the clarity of the rules also made research and implementation convenient). Things like walking didn't seem intellectually demanding, so common sense said they were probably "easy". In fact, today we know that navigating a bipedal robot through a simple environment using visual recognition is vastly more difficult computationally than playing chess well; it is only easier for us because we have highly specialized circuitry in our brains that is well matched to those tasks. Our brain wetware is not very well matched to playing chess.
Also, chatbots have been doing pretty well on Turing's original definition of the Turing test for about 10 years now. But now people are arguing that Turing didn't really foresee the "loopholes" they believe the bots are exploiting, and are coming up with stricter requirements for a Turing test.
That's totally in line with Tao's argument that every time we approach a major AI goal, suddenly it is not AI anymore, because there's nothing magical about it, just boring old technology. And human brains are magical, right?
Until every obscure niche capability of humans has been dominated in every possible way by AIs many won't want to concede that it really is AI. And even when it does become better than us in every possible way, I suspect a few will still find arbitrary reasons why it really isn't AI/AGI, e.g. because it is not organic, because the computer lacks a body, because it lacks a "soul", etc.
You might be right about chess, but I can't understand how you think chat bots are "doing pretty well". I've never seen a conversation with one that held up to even the most tolerant hand-holding for more than a few sentences.
I mean they're doing pretty well by Turing's original definition. I agree chatbots using traditional techniques (not sure about newer LSTM chatbots) are not too impressive, just illustrating that we've had to adjust the definition. That's by definition moving the target.
In fact I'm quite sure Turing would be quite impressed by good recent chatbots.
From the point of view of the 1940s, this would seem really close to a veritable "Thinking machine"! Although I'm sure he'd recognize a few things are still missing to fully replicating human behavior (or going beyond).
By understanding he means translating speech to text, I guess. We have speech-to-text systems that are better than the median human in the native language now. Quite amazing, given how central auditory language processing is in our cognition. And most people don't think it's "AI" (and certainly not anywhere near AGI). That's a good example of how AI is a moving target IMO.
Small businesses do have email. I use email to do this kind of thing all the time and it works extremely well. I feel like this is a case of Silicon Valley solving marginal problems while the world burns.
To the extent that the primary application for this is call support, I don't agree with your proposal. This is supposed to close the gap between a tech-savvy group that would be using Duplex and tech-handicapped small businesses. It is much easier and more effective for a restaurant, for example, to hook up with OpenTable than to deploy something like business-Duplex.
Exactly. If a small business doesn't use OpenTable or Duplex, no problem - I can just use Duplex to schedule the reservation for me. OpenTable requires buy-in from the restaurant; Duplex doesn't.
The Turing Test is irrelevant, the "dystopian" stuff is mostly irrelevant, the ethics are highly relevant. It is simply unethical for a computer to converse with a human while misleading them into believing they are talking to a human. There are a zillion reasons why this is so, if none of them seem obvious then I'd suggest investing some time to take an ethics course.
> Most people working in hair salons or restaurants are very busy with customers and don’t want to handle these calls
And those will have online booking systems already - I don't see how this technology is still relevant nowadays. Maybe it was back in the 90s when the internet (and online booking) was a new thing, but now? I can't see there being a big market for this application.
> Most people working in hair salons or restaurants are very busy with customers and don’t want to handle these calls, so I think the reverse of this duplex system, a more natural voice booking system for small businesses would help the immensely free up their workers to focus on customers.
I think this perspective is very short sighted. You will lose customers to automation, but businesses won't turn away customers because of automation.
Customers and prospects don't want to interact with machines, but businesses should be willing to give customers what they want.
The idea that a tool can be rolled out to millions of consumers, even with limited use cases and not have to get adoption from businesses to be useful is IMHO a much bigger opportunity and much better use case than rolling out a tool to businesses that make the interaction less personal.
Customers need to trust businesses, business only need to collect money from customers.
I think everyone who focuses on chatbots from the business use case perspective is missing the bigger opportunity.
A technology that can give a consumer access to ALL businesses, not just the ones who adopt a new technology offers much more utility than serving businesses or the shortsighted use cases like saving time and money for the business.
That's partially my point. Everyone is thinking about the business case for cost saving around automation for businesses, but that lacks imagination. Automation tools for consumers to interact with humans at businesses is where the real opportunity lies.
Would you ever voluntarily use an IVR? I wouldn't. If I am going to interact with automation for a business, I want to do it with a different interface than voice... all the hype around NLP and chatbots was uninspired and focused on the wrong side of the interaction...
Building conversational interfaces for consumers to use to interact with businesses is a much better use case.
I feel like a lot of people who are pointing out the limitations show a failure in imagination of where this will go in the future. I don't care about the obscure technical limitations now, those are just engineering problems that will be solved in very short time.
I'm not afraid of the machines going all singularity or skynet or whatever, becoming sentient and taking over the world as some kind of robo-Hitler. That's moronic. But what does worry me is the normalization of having a machine do everything for you, plan your whole life, access every little detail of every bit of your personal data and lifestyle.
Of course we've already had that for a while with the way phones work. But this is another step towards getting public consensus for using it in new ways. Once people are used to this, we'll have more and more systems with conversational software that manages your life for you. Speaks on your behalf. Interfaces with the world for you because doing it yourself is far too stressful and inconvenient.
And of course it'll be a free, advertising-supported model so all that data will have to be shared with, among many things, shady political organizations to try to gain every little advantage possible to manipulate public opinion and steer themselves into enormous power.
Think of where the cell phone started off: just a phone in your pocket. It's so much more now. Remember that when thinking about these AI assistants and what they will develop into. I'm not afraid of the classical AI apocalypse. I'm afraid that these systems will do exactly what they're designed to do. That people are underestimating just how much power lies in these little inconveniences in life, once they're all added up and analyzed and tallied.
> The people losing their marbles over this being some kind of Turing Test passing distopian stuff are missing the point at how limited this domain is.
When you mash 0 with Comcast, it kind of scoffs at you and tells you that it needs some information before it can help you. If you keep hitting 0 it just hangs up. Pretty terrible experience, but they don’t have to care because they have monopolized so many neighborhoods.
> But that in itself is not even true across the industry, some(most) phone bookings are very complex
Citation needed for that "(most)". I work for a company with a call center and a large part of calls are simple ones that could be easily answered by just reading the FAQ page on our website.
> otherwise they would just use a web interface.
I think the problem is more about resources. My local hairdresser uses his phone and a notebook to take bookings. It takes a bit of his time and could easily be replaced by a Web interface but he doesn’t have any resources for that (and some people still prefer using their phones).
> I think the problem is more about resources. My local hairdresser uses his phone and a notebook to take bookings. It takes a bit of his time and could easily be replaced by a Web interface but he doesn’t have any resources for that
By "resources", do you mean money? Because if so I can't imagine the purchase and training of Duplex on the business side would come cheap either.
Absolutely correct. And so-called general AI may never happen. Regardless, this is shocking. It immediately needs to be factored into any speculation about what the world will look like in 20 years. Innumerable questions.
> The people losing their marbles over this being some kind of Turing Test passing distopian stuff are missing the point at how limited this domain is.
Right, and these kinds of comments will continue for a while.
The question is: when will the "this is trivial, move on" type of comments start to fade out? Five years? Ten?
* In the example where Google asked about holiday hours -- they can now automate gathering information about businesses in bulk without having to rely on any APIs or user-supplied info. An interesting thought experiment is Google validating their reviews/business listings by actually calling businesses and speaking to a real human.
* This is going to be fantastic for accessibility. Maybe I struggle to speak, and I want a reservation. I can have the machine do the irritating work, and focus on just having a nice meal or getting a service (like a haircut).
* Google can scale out your requests, one to N. For example, 'Make me a reservation at a 4 star restaurant next Friday.' Google can immediately initiate calls against 15 restaurants and let you pick from the successes, then automatically cancel the reservations for the places you did not choose.
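A rough sketch of the one-to-N fan-out in the last bullet, assuming hypothetical `call_to_book` and `cancel` functions in place of real phone calls (the availability table is made up). It also makes the cost to restaurants concrete: one dinner means N booking calls plus up to N-1 cancellation calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

# Made-up availability table standing in for the outcome of real phone calls.
AVAILABLE = {"Alpha": True, "Bistro": False, "Cafe": True}


def call_to_book(restaurant: str) -> bool:
    """Hypothetical stand-in for one automated booking call; True if a table was booked."""
    return AVAILABLE.get(restaurant, False)


def cancel(restaurant: str, log: List[str]) -> None:
    """Hypothetical stand-in for the follow-up cancellation call."""
    log.append(restaurant)


def fan_out_and_pick(restaurants: List[str],
                     choose: Callable[[List[str]], str],
                     log: List[str]) -> Optional[str]:
    if not restaurants:
        return None
    # Place every booking call concurrently; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=len(restaurants)) as pool:
        outcomes = list(pool.map(call_to_book, restaurants))
    booked = [r for r, ok in zip(restaurants, outcomes) if ok]
    if not booked:
        return None
    chosen = choose(booked)
    # Release every other table; each one is another call the restaurant has to take.
    for r in booked:
        if r != chosen:
            cancel(r, log)
    return chosen
```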
> Google can scale out your requests, one to N. For example, 'Make me a reservation at a 4 star restaurant next Friday.' Google can immediately initiate calls against 15 restaurants and let you pick from the successes, then automatically cancel the reservations for the places you did not choose.
This sounds like a nightmare for businesses. This time commitment asymmetry will be the issue with these systems. Like email spam, it becomes much easier to waste other people's time when you automate time wasting. If people use it to flake a lot, I could see businesses just not responding to the assistant.
They pointed out during the presentation that the system could call once to a business and get the hours, then allow hundreds or thousands of users to see that without bothering the business again. Assuming it works, it could save a significant amount of time for some places.
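Assuming the answers really are shared across users as described, the saving amounts to a cache keyed by business. A minimal sketch follows; `HoursCache`, the `fetch` callback standing in for the phone call, and the one-day TTL in the usage example are all hypothetical, since the presentation didn't say how the data is stored or refreshed.

```python
import time
from typing import Callable, Dict, Tuple


class HoursCache:
    """Hypothetical cache: at most one call per business per TTL window, shared by all users."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # stand-in for the automated phone call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.calls_placed = 0

    def hours(self, business: str) -> str:
        now = self._clock()
        cached = self._entries.get(business)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # served without bothering the business again
        self.calls_placed += 1
        answer = self._fetch(business)
        self._entries[business] = (now, answer)
        return answer


cache = HoursCache(fetch=lambda business: "9am-5pm", ttl_seconds=24 * 3600)
for _ in range(1000):
    cache.hours("Salon")
```

With a one-day TTL, a thousand lookups for the same salon trigger a single call; the trade-off is that changes made after the call go unnoticed until the entry expires.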
>If people use it to flake a lot, I could see businesses just not responding to the assistant.
Then it sounds like incentives are aligned here. Google needs to not allow users to abuse this ability so that businesses will trust and not block them.
If they allow something like the parent commenter pointed out, they would sour relationships with businesses who would promptly seek out ways to block or decline calls from this system.
If they don't share the data, then businesses will start being bombarded with bot calls (whether they know they are bots or not) once other businesses start copying it, which should lead to regulation. However, given how easily scams happen over phone lines, I don't know how they'd be able to regulate and enforce this to protect businesses' time either.
>>If people use it to flake a lot, I could see businesses just not responding to the assistant.
>Then it sounds like incentives are aligned here. Google needs to not allow users to abuse this ability so that businesses will trust and not block them.
But then they can't take bookings through Google Assistant, which is going to lose non-trivial amounts of business.
Seems far more likely that they'd pay Google to automatically handle duplex calls for them.
The worse this technology turns out for businesses, the more pressure there is to pay google money.
>But then they can't take bookings through Google Assistant, which is going to lose non-trivial amounts of business.
If it's costing more money than it's bringing in, then it's no longer worth it. If it's bringing in more money than it costs, then it's a good thing for the business.
If a business needs to hire a dedicated phone person because they are getting so many appointments filled, they aren't going to be upset. But if they get so many flakers that won't show up, they are losing money as customers that will show up are getting pushed out, so they will block Duplex. There are also other ways to solve this problem: require a phone number and name, and block or charge people for missed appointments if they try to reschedule, or require some kind of down payment over the phone when making the appointment. There are tons of solutions to this problem.
At no point is "Pay Google to handle the calls" an option. This is really only for places that don't have an online appointment system (possibly one that integrates with Google), so the solution to the Duplex calls would be to invest in one. "Pay Google to handle Duplex" would look a lot like an API to a scheduling system anyway, and an independent one (with integrations into Google's systems) would reach more customers.
Google may not be making money from direct payments from businesses, but getting access to yet another aspect of the life of their users (in this case, appointments made with what business and when), would be invaluable to them.
Just one more way that big business is taking over our lives. The world is becoming incredibly scary.
Knowing Google, there will likely be a free tier and then higher tiers that handle multiple phone lines, complex call routing and, eventually, customer service case handling.
It has been a nightmare to deal with chatbots and voice-synthesized phone "customer support" gatekeepers. It's time to turn the tables. This doesn't have to only apply to restaurants and such, much better to let this loose on the various unresponsive megacorps. Who knows, they may find it less costly to actually deal with customers than pay for compute against a worthy adversary.
2018: customer support becomes telephone robot arms race. It’s hard to keep believing that capitalism is actually a productive worthwhile pursuit for humanity.
There's a customer on the end of the line with a serious intent to book; restaurants would be dumb not to hire a minimum wage worker to take those calls.
If you could read 15 spam emails for a >50% chance at a $100 dinner reservation, you'd hire someone to read your spam.
Yeah, the phone lines getting that polluted is not a pleasant prospect.
This conversation reminds me of "Lenny", a simple bot someone created to talk pointlessly in circles with telemarketers for hours until they hang up in frustration.
Yes, I was thinking the same thing. This Duplex system would be like a super-Lenny. Eventually telemarketers will probably figure out, though, that if they reach someone who is calm and friendly and very receptive to their pitch, yet never quite gets to yes, it's probably an AI.
By the same token, telemarketing companies could employ Duplex to call people. I guess if a Duplex telemarketer reached a Duplex telemarketer-baiter, the conversation could stretch on for a really long time and might make for amusing fodder on Youtube.
In the olden times, I used to look up a business in the phone book and call them to see what their hours were. At some places, every call was picked up by a recorded message that stated what their hours were.
Today, it is much easier to just put in your preferred time and rating into Open Table and see what's available.
The easy solution (and Google should push for it) is for the call receiver to be Google Assistant as well, built into a landline phone kept at the business!
> they can now automate gathering information about businesses in bulk without having to rely on any APIs or user supplied info
Besides the business-side run-your-business-like-a-callcenter application, this is the "caller-side" application that's going to make money for Google. The others (virtual personal assistant) have been historically hard to get mass consumers to pay for.
It's straight out of the StreetView playbook... industrialize the scale of data collection, give the results out for free, then monetize the eyeballs (eg more accurate yelp).
Then again, Google ran that free 411 service for a few years that turns out was just a massive natural voice recording data corpus miner...
"Business model
Google had stated that the company originally implemented GOOG-411 to build a large phoneme database from users' voice queries. This phoneme database, in turn, allowed Google engineers to refine and improve the speech recognition engine that Google uses to index audio content for searching."
I’m not sure what a 4-star restaurant is, given that the Michelin Guide gives a maximum of 3 stars, but pretty much all starred restaurants want a deposit when reserving.
If someone were seriously dumb enough to do something like this, continuously booking and cancelling restaurants, then probably everyone would require a deposit when reserving, and the deposit would be much bigger than today.
I've never heard of deposits for restaurant reservations, and I've made a reservation at a restaurant with a Michelin star at least once.
I'm not sure this is common practice, at least in Europe. I'm sure it's unheard of in Turkey (and I've been lucky enough to make reservations at some high-end and/or very popular restaurants).
May I ask which countries you've experienced this in? I'm genuinely curious.
I’ve had reservations at upscale places in LA ask for a credit card up front and threaten a fee if I were to no-show, and some places have minimums for premium tables (e.g. 71Above window tables). A deposit up front doesn’t seem too far out of the question.
I travel all over the US (Texas, NYC, and Cali mostly), and I've literally never had to put down any kind of deposit or anything of the sort when making dinner reservations, and I frequently make reservations at plenty of nice, upscale places.
Many high-end restaurants in NY, CA, Berlin, Copenhagen, etc. require a deposit. See Tock (all prepaid, https://www.exploretock.com) or Resy (some prepaid, some cancellation fee, https://resy.com).
Funny enough they also use OpenTable, so no need for this Google stuff — just use the API or book through the site. You can also call them if for some reason that is preferable.
I live in Portland, OR and I have to pay up front for a number of places I go to. Usually not the full meal (I think one or two places do that) but there's a deposit essentially that is taken off your bill (or given back if you cancel 24-48 hours ahead of time) at a good chunk of the nice places around.
Google itself (Google Search) is extremely wary of being used by "bots" and uses a string of captchas to screen them out. But search is just a machine.
That Google would create bots to talk to real people is horrifying. This is only possible if Google doesn't in fact think of working people answering the phone as really human.
This is like doing war with drones instead of soldiers. This may sound over the top, but bear with me.
The implicit contract in war is that soldiers are legally authorized to kill because they are risking their own life. Killing people at a distance without risking anyone's life on the side of the shooters, breaks that "contract", is fundamentally unfair and fuels terrorism, because terrorism is the only possible answer.
Making a telephone call rests on the same convention: you are allowed to make someone spend time on the phone with you, because you're spending your own time.
But if one side is a robot that has no costs, then the relationship loses balance and becomes unsustainable (and this is the reason why Google bans bots on its own servers). This is one more step breaking society, again.
The only answer is to either stop accepting phone reservations, or put captchas on the other side.
> The implicit contract in war is that soldiers are legally authorized to kill because they are risking their own life. Killing people at a distance without risking anyone's life on the side of the shooters, breaks that "contract", is fundamentally unfair and fuels terrorism, because terrorism is the only possible answer.
I don't think there's any sort of human risk contract like that in war. Wars fought entirely between human armies still produce spite on opposing sides.
Going back to your point about calls, humans and machines call me to ask for polling information, telemarketing, etcetera. I'm not okay with them wasting my time, whether they're human or machine. However, I'll tolerate it below a threshold as part of the costs of having a communication channel. Beyond that threshold I would consider alternate measures like changing my phone number, getting rid of my phone, or paying for a screening machine or service.
Newspapers (and listening to heralds) are inherently pull-based: I choose to engage with them. Robocalls and spam are push-based: you force yourself on me, wasting my time.
This kind and volume of automation pervades written digital communication now, and has already been creeping into voice communication (scam calls and robodials).
Google bans botting against its own services, but can realistically only ban the botting it can detect. If you can detect NN voice botting, you can ban it for your own communications as well.
And if the technology to detect or tooling to effectively filter that doesn’t exist? Sounds like a great business opportunity.
I don't think you'll get far with analogies to bots killing people. Maybe there is a narrative for automation leading to direct harm to humans, but Google Duplex is not currently close enough to that.
Instead, I would change your analogies to real bot/human problems today, such as phone bots scamming individuals for millions[1], or an older problem – email spam.
Basically any platform with a large imbalance in the effort (time/money/labor) spent by two sides (scammer/scammed in my example above) can be abused. But it can also be used for very good things. So we can't make blanket statements about these things.
Captchas and other bot filters are built to balance out the effort ratio so that abuse becomes more costly, and I'm sure if either Google or other companies abuse robocalls, people will have to respond with similar measures. But it's not a new problem, and if Google plays their cards right they may actually reduce the volume of calls that don't lead to business, while maintaining or increasing the volume of calls that do lead to business.
> Making a telephone call rests on the same convention: you are allowed to make someone spend time on the phone with you, because you're spending your own time.
I personally, absolutely do not think of it that way.
Whenever I get called by a call center, I always immediately say "not interested" and drop the call. It's rude, but I think these companies are not entitled to my time.
Because of that, I think the problem is already there. This is certainly another step in the wrong direction, but thousands of workers in thousands of call centers are basically already a human botnet.
> The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.
This part stuck out to me during the Google I/O demo, as an intentional deficiency is an interesting design decision.
> as an intentional deficiency is an interesting design decision.
Well, in semantics/pragmatics these discourse particles are often not deficiencies at all. They are signals with a practical semantic purpose. "Hmm"s and "uh"s can signal attentiveness, turn-taking (turn holding, turn yielding, etc.) and agreement, just to name a few.
For any machine system to be able to pass as human, it will have to be able to control these nuances or people will pick up on something being wrong, though they might not be able to articulate precisely what.
I really enjoyed the machine's "uhs" and "uhms" in the demo speech. However, I felt the "uh-huh"s sounded forced. It's funny how these subtleties are very important in human conversation.
It's not a new thing. A famous tax preparation program introduced a "compute" screen that took a few seconds, to make people more comfortable with the results even though the computation itself is instantaneous.
It's really just an audio version of a loading bar or spinner - users get really uncomfortable if the UI becomes unresponsive for even a few hundred milliseconds, but they'll wait for several seconds if it looks like something is happening.
People have learned that the spinner is non-progress, though. The progress bar still has some life in it, except that those are often fake, not measuring progress.
OS-level cursor spinners like the mac pinwheel have lost credibility, because they don't reliably indicate whether the system is temporarily unresponsive or needs to be restarted. Modern multitasking OSes have a wide range of situations in which they can become mostly unresponsive without actually crashing.
Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.
Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress; in most cases, this could be greatly improved with a bit of educated guesswork.
A completely fictitious progress bar can be extremely credible, because it's totally predictable - if you need to create a 10 second delay, then it's easy to make the bar progress linearly from 0% to 100% in that time. Users learn very quickly that your progress bar tells the truth about how long they'll be waiting, even though it's lying about the reason for the wait.
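A minimal Python sketch of the purely time-based bar described above: the fill depends only on elapsed time, so it moves linearly from 0% to 100% over a fixed delay no matter what the program is actually doing. The names `fake_progress`, `render` and `run_fake_bar`, and the bar width and tick interval, are illustrative choices, not taken from any real library.

```python
import time


def fake_progress(elapsed: float, delay: float) -> float:
    """Fraction complete for a bar that fills linearly over `delay` seconds."""
    if delay <= 0:
        return 1.0
    # Clamp to [0, 1] so the bar never overshoots or runs backwards.
    return max(0.0, min(elapsed / delay, 1.0))


def render(fraction: float, width: int = 20) -> str:
    """Draw a text bar such as '[#####-----]  50%'."""
    filled = round(fraction * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {fraction:4.0%}"


def run_fake_bar(delay: float = 10.0, tick: float = 0.25) -> None:
    """Show a bar that takes exactly `delay` seconds, whatever the real work costs."""
    start = time.monotonic()
    while True:
        fraction = fake_progress(time.monotonic() - start, delay)
        print("\r" + render(fraction), end="", flush=True)
        if fraction >= 1.0:
            break
        time.sleep(tick)
    print()
```

Because the bar is a pure function of the clock, it is perfectly predictable, which is the property the comment credits for its credibility.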
> Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress
I disagree with this; I find the progress bars more credible with erratic timing. (And ideally, a display of the task currently at hand, like "Copying tiny file. Copying tiny file. Copying giant file............")
A progress bar that smoothly fills from 0 to 100 looks like an animation that somebody thought it would make you happy to watch. A progress bar that lags at 7% and then rushes the rest of the way looks like the software has some internal metric for task completion, and is reporting according to that metric. This implies that when the number changes, progress has happened, which isn't the case for a progress bar that isn't affected by workload.
The software can't use "how much time has elapsed?" as a progress metric, because it doesn't know how much time things will take, and because the passage of time does not actually cause -- or reflect -- any progress. That progress bar would be a spinner, not a progress bar.
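By contrast, a bar driven by an internal work metric only moves when work completes, which is exactly why it can sit near 0% and then jump. A rough Python sketch, assuming the metric is bytes copied and that each file is copied in a single step (`copy_with_progress` and the file list are hypothetical):

```python
from typing import Iterator, List, Tuple


def copy_with_progress(files: List[Tuple[str, int]]) -> Iterator[Tuple[str, float]]:
    """Yield (file name, fraction of total bytes done) after each file completes."""
    total = sum(size for _, size in files)
    done = 0
    for name, size in files:
        # Placeholder for the actual copy; a real tool would also report
        # partial progress from inside a large file's copy loop.
        done += size
        yield name, (done / total if total else 1.0)
```

With sizes of 1, 1 and 98 units the bar reads 1%, 2%, then 100%. Counting files instead of bytes would read 33%, 67%, 100% for the same job, so the choice of metric decides where the bar appears to stall.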
> Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.
Strongly disagree. A spinner on the web UI element that lasts longer than ~1 second indicates for me that the site's JavaScript broke again, and it's time to reload or wait for the devs to notice and fix it.
I'm talking exactly about that spinner. It's a lie. You quickly learn it has no relation whatsoever to what's happening in the background. And indeed it doesn't, because it's an animated GIF, completely detached from any logic or networking code!
(Compare the CLI spinner/fan - that "/ - \ |" animation used to indicate progress. There you know that each tick of the spinner means work has been done, because it has to be animated from code, and it's much simpler to just update it from the code that does the work.)
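A small Python sketch of that kind of CLI spinner, where the frame advances only after a unit of work finishes, so a frozen spinner really does mean nothing is completing. `process` and its `work` callback are placeholder names for illustration:

```python
import itertools
import sys
from typing import Callable, Iterable, List, TextIO, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process(items: Iterable[T], work: Callable[[T], R], out: TextIO = sys.stderr) -> List[R]:
    frames = itertools.cycle("/-\\|")  # the classic / - \ | animation
    results = []
    for item in items:
        results.append(work(item))
        # Advance the spinner only after a unit of work has completed.
        out.write("\r" + next(frames))
        out.flush()
    out.write("\r \r")  # erase the spinner when done
    return results
```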
I was talking about animation. Show/hide on request made/resolved gives only binary information about starting and finishing something. But the spinning animation itself does not represent any operations being executed. It may very well be that the request failed and a bug in JS made it not remove the spinner. You end up with a forever-looping animation of "work", even though no work is being done. This makes the spinner an untrustworthy element.
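One common mitigation for the stuck-spinner bug is to hide the spinner in a `finally` block, so it runs whether the request succeeds or fails. This is a hedged Python sketch rather than browser code; `StatusArea`, `busy` and `submit` are hypothetical names standing in for whatever UI layer is actually in use:

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class StatusArea:
    """Hypothetical stand-in for a UI element that can show a spinner and a message."""

    def __init__(self) -> None:
        self.spinner_visible = False
        self.message = ""


@contextmanager
def busy(status: StatusArea) -> Iterator[None]:
    status.spinner_visible = True
    status.message = ""
    try:
        yield
    except Exception as exc:
        status.message = f"Request failed: {exc}"
        raise
    finally:
        # Runs on success and on failure, so the spinner can't be left spinning.
        status.spinner_visible = False


def submit(status: StatusArea, request: Callable[[], T]) -> Optional[T]:
    try:
        with busy(status):
            return request()
    except Exception:
        return None  # the failure is already reported in status.message
```

It doesn't make the spinner report real progress, but it removes the forever-looping failure mode, and the error message tells the user more than an endless animation would.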
Still better than nothing? Sure, maybe sometimes exceptions aren't handled properly, but at least you know that it was trying to do something, rather than having users click a submit button 10x because there was no UI feedback whatsoever.
The most annoying part of progress bars is the fact that programs so often use multiple bars. What's the point of watching a bar slowly reach 100%, only for it to be replaced with another progress bar that starts from 0 again?
The "please wait while we verify your passcode" on our corporate phone conference system drives me nuts. In the time that it took to speak that sentence, the passcode could have been verified millions of times.
In true market economy fashion, the comfort noise is also a perfect advertising opportunity.
For instance, I frequently deal with ATM machines that display "please wait" screens between every operation. Those screens last usually between 1 and 3 seconds, and it's obviously because the operations take that long, and totally not because they also display a half-screen or full-screen ad...
I've heard the HP-12C calculator also slows down its screen refresh on purpose, because when it first came out people couldn't believe the math was right when it was blazing fast.
Yep, and the 10 second "deal" compilations for travel packages really happen in a fraction of a second. They just purposefully delay the results to make it seem like they are doing a lot of processing in finding all the possible deals and showing you the best ones.
It can be a more friendly way of rate-limiting expensive DB queries. An interstitial that says "too many queries, try again in 10 seconds" is far more annoying than a loading bar.
Yup. We have a similar thing at my company. Every time we test removing the loading animation, conversion and retention go down. It’s an amazing thing to see.
Most of the flight search companies do the same ("Finding the best/cheapest flights for you"). It's almost instantaneous, but they introduce this artificial wait.
That seems unlikely. Flight search really does take a long time, because they need to make API calls to external services for most customer requests, and they need to refresh prices roughly hourly and so cannot rely on cached data. Also, even the best flight search websites are frustratingly slow. If that delay were created intentionally, they would already have lost me as a customer because of it.
I can't seem to find that post right now, but a person (on Quora/reddit I guess) who worked in the development team of a flight search company told this fact.
I don't know if I'd call it a "deficiency" - if we interpret "disfluency" in a literal sense as "not flowing" without negative connotation, then the interruptions (hmm, uh, okay) are actually communicating useful information to the other party. I might even say that omitting those interruptions (and replacing them with, say, dead silence) might be poor communication.
The "um" isn't a deficiency, but the slow response is. If the response is artificially delayed to give the appearance of slow thinking, and an "um" added to fill the artificially long silence, that's an artificial deficiency.
I interpreted it differently. It isn't to give the appearance of slow thinking. It is to wait for the other person to be ready to accept the answer.
When talking to real humans, I've encountered people who don't do this, and I find it makes communication difficult and frustrating.
I'm not 100% sure why I need this pause, but I know I need it. Maybe I'm considering whether my question made sense or needs corrections/additions, so that I can't focus on the answer yet. Or maybe it takes time to switch the brain from "speaking mode" to "listening mode".
At any rate, when people do this, I have to ask them to repeat the first few words they said because I didn't catch them. And the reason I didn't catch them wasn't mumbling or background noise or anything. Well-formed sounds made it to my ear just fine, but my brain wasn't ready to accept them for a fraction of a second.
It's not a deficiency if understanding is increased. If that fake pause increases the listener's understanding of the sentence (it might), then the 'slow response' is not a deficiency but an improvement.
Edit: should the robot talk at 2x normal speaking speed in order to more quickly convey the necessary information? Slowing the speech down artificially so a human could easily understand it sounds like a deficiency to me. (By your definition).
Reminds me of comfort noise in the telephone system.
Even though the system encodes silences noise-free (to improve compression), it deliberately inserts noise, because otherwise people think the line is dead.
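A deliberately simplified Python sketch of the idea: frames whose peak amplitude falls below a threshold are treated as silence and replaced with faint random noise instead of zeros. Real telephony does this differently (the encoder sends occasional silence-descriptor frames and the decoder synthesizes matching noise, as in the RTP comfort noise payload of RFC 3389), and the `threshold` and `noise_level` values here are arbitrary assumptions:

```python
import random
from typing import List, Optional


def add_comfort_noise(
    frames: List[List[float]],
    threshold: float = 0.01,
    noise_level: float = 0.002,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Replace near-silent frames with faint noise so the line doesn't sound dead."""
    rng = random.Random(seed)
    out = []
    for frame in frames:
        peak = max((abs(s) for s in frame), default=0.0)
        if peak < threshold:
            # Treated as silence: substitute low-amplitude uniform noise.
            out.append([rng.uniform(-noise_level, noise_level) for _ in frame])
        else:
            # Speech (or anything loud enough) passes through unchanged.
            out.append(list(frame))
    return out
```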
Similar to how when designing a virtual face, it looks more natural if it has some slight asymmetries and "defects", and when designing a synthetic drumbeat, if it's "perfect" it sounds totally robotic.
Imperfection is natural and comfortable. Perfect corners and edges are artificial and weird to the point of distraction.
People are more willing to get on board with bots acting like people than the other way around.
The speech disfluencies used by Duplex in the salon and restaurant interactions are perfect examples of why natural speech sounds natural. It's the cadence as well as the timing.
In my city I can recognize that google has two different 'voices' or voice libraries. They sound slightly different. I'm curious how that works and why it's not all done with one.
My understanding is that the "low-fi", more robotic one uses an offline TTS engine for when there is no connectivity. When connectivity is good, it will switch to better, cloud-based one.
If you are building a system that mimics human speech you need to teach it to be imperfect and use common parlance. Otherwise you will fall into the uncanny valley. If you listen to the conversation again there are several points where they lose immersion. For example no one would say 12 pm, they would say 'noon' instead. Google has clearly done some impressive work here, and I'm now a bit more confident/scared that they will be able to successfully fool me in the next few years.
Your argument about saying "noon" instead of "12 pm" is just illogical. I'd always use PM instead of noon whenever I'm trying to be specific about something (appointments and such). I understand the argument you're trying to make, but that wasn't enough.
I think historically "noon" expresses less precision, although I suspect that's less true now that everyone always knows what time it is and has GPS in their pocket to help calculate arrival times.
20 years ago, had I said to someone "I'll be there at 12pm" it would have had a stronger implication of precision than "I'll be there at noon." I don't think it's true today.
12pm is also commonly interpreted as midnight so "12 noon" or "12 midnight" is generally preferred when scheduling meetings or deadlines in order to avoid confusion.
Understood. So the difference is you interpret "noon" as an ill-defined probability distribution centred roughly around midday, rather than a concrete point in time, whereas you interpret "12PM" as a concrete point in time. Fair enough.
This is going to be awesome/terrible for social engineering attacks. We'll be training millions of people to trust phone calls from "Google". But are you talking to Google-calling-on-behalf-of-your-boss, or Google-calling-on-behalf-of-some-phisher? Or better yet, some custom system that pretends to be Google over the phone? Who knows!
Either that, or it will train people to stop trusting a phone call as some sort of authentication. The whole social engineering problem is that when somebody gets a phone call, they trust that the person at the other end is who they say they are. Which can be a false assumption. When the robots start calling, maybe people will finally stop making that assumption automatically.
So what can you trust? The person calling you might be a simulated voice, crafted to sound like someone you know. The person texting/DMing/emailing you might be an attacker pretending to be someone you know. You can put trust in things like PGP, but they're sparsely adopted and still leave a huge attack surface.
At least in the demo, the system didn't identify itself as a bot. It pretended (fairly convincingly) to just be a person, and seemed focused on interactions (making a restaurant reservation, etc.) that don't involve someone recognizing your voice. In other words, I don't think the attack surface is any different than situations where you could just place the call yourself, save perhaps for the possibility of scaling it.
If you sequentially attack a subpopulation (e.g., employees at a company, senior citizens), then eventually news about your attack will spread throughout that network (internal security alert, evening news, daily newspapers, AARP, ...). If you attack the entire subpopulation in rapid succession over the course of days or even hours, educational countermeasures become much less effective.
So now I can script a bot to book restaurant reservations all over the city at busy times. Then nobody shows up for the reservations, the busy time has passed, and customers have moved on or gone home.
Restaurants make or break on one or two nights in a month. A calculated social engineering attack like this could bring down hundreds of restaurants in a city, which would cause millions of dollars in lost taxes, and you see where this is going.
I meant, you could build a bot that calls. We have the technology already, and the people on the other end probably won't notice. Plus the "do it over the Internet" thing where screen scraping and scripting is super easy.
But could you build a bot that calls and is convincing enough to trick the target into actually accepting the request as genuine and reserving the timeslot?
Yes, the time commitment of having one person pickup the phone and place 100+ phone calls (and the suspicion on the other end when you call back with a new name but the same voice).
You could write a screen scraper to book online through the various booking systems, but each booking system probably has its own restrictions on how many accounts you can have and how often they can book. You skip all of these protections when you phone your reservation in (arguably, the restaurant staff should be enforcing these protections when they pick up the phone, but restaurant staff are often overworked and apathetic).
I agree it's a problem. The probable means of mitigation is for restaurants to take your credit card number when you book. Many already do this. I expect it to expand if false bookings become a problem.
Yes, which is why over time it has gotten progressively more difficult. I find some of the challenges to be almost impossible. (This is also why Google started using your session with them to bypass the captcha automatically; it would be too frustrating an experience otherwise.)
Credit card companies have already solved this problem by publishing a well-known 1-800 number that their customers can call back to verify the intent of an unexpected caller.
Google Duplex can do the same thing - without even staffing a call center.
If you get a call from someone who claims to work at the credit card company you use, instead of trusting the caller (who could be a scammer) and potentially divulging private information, you should just hang up and call the number on the back of your card.
I think the biggest issue won't come from phishers, but from resellers and other occupations that are helped by automation. I could easily call every hot restaurant in San Francisco several times a day to try to snatch a table.
On the other hand, most of these places have online reservations which are already extremely gamed.
That's not how this works, that's not how any of this works. If you have ever been cold called by Google, I would go change your passwords and the like.
When people talk about AI taking over, they really need to consider this route. Social engineering most of the world simultaneously. It doesn't have to build armies of robots when we'll tear everything down for it.
On Windows, there's a group policy to require Ctrl-Alt-Del before the UAC prompt asks for your password. It's impossible for an application to hijack that key combination.
This isn't that different than all the bad actors calling from the "IRS". Educating people about giving personal details out over the phone has always been a challenge and will continue to be one.
Or maybe we should fix the phone system. Something is terribly broken when my connection to YouTube is better protected and authenticated than when I'm talking to the IRS over the telephone.
You’re only ever talking to the IRS over the phone if you initiate the call. Ideally the phone system gets fixed but that’s likely to not happen any time soon. Now you need to deal with current reality. Nobody should trust that any incoming call, purportedly from a business or government, is from who they say it’s from either verbally or through caller ID.
Why stop there? Why not fix human communication? Or how about we fix human behavior so there are no bad actors? Education may be hard, but scrapping and rebuilding a century-old communication standard in use by billions is harder.
I used the IRS example because the IRS never calls you. This is only known through experience, i.e. education.
> However, there are special circumstances in which the IRS will call or come to a home or business, such as when a taxpayer has an overdue tax bill, to secure a delinquent tax return or a delinquent employment tax payment, or to tour a business as part of an audit or during criminal investigations.
So the IRS might call if you owe taxes.
Surely moving from an unsecured to a secured phone system can't be that big of a deal. In the US at least, we experienced something similar when we went from analog to digital television.
Add a secured mode of telephone calls and then telephone owners choose if they want to receive calls or messages from unauthenticated callers.
I don't think the intention of this system is for your boss to use Google to call you and ask for passwords. It seems very limited to scheduling appointments.
There's a great Black Mirror episode in here somewhere. Imagine five years from now, this tech and Google's Smart Reply (https://blog.google/products/gmail/save-time-with-smart-repl...) have evolved to the point where they can basically write entire email responses or have whole conversations without your input. You could develop elaborate friendships where both parties are just having their AIs converse and aren't truly aware of each other. What would the social implications of that be?
Then a couple of years later, the AIs learn how to do business strategy, real-world problem solving, programming, etc., and start doing more of our jobs for us. A virus goes around that directs AIs to steal our identities, drain our bank accounts and become autonomous, digital versions of us. The humans have to stage an uprising and use a massive EMP to take back the earth, destroying all electronics in the process and starting another dark age.
I know that's not how AI really works (it's highly specialized and limited), but I'd definitely watch that movie!
> Imagine five years from now, this tech and Google's Smart Reply have evolved to the point where they can basically write entire email responses or have whole conversations without your input.
Ahem. Did you watch the keynote?
"An extension of Gmail’s Smart Reply feature, Smart Compose will suggest complete sentences within the body of an email as you are writing." [0]
Jesus. We're really reaching an age in which it's too inconvenient to talk to other people. I gotta wonder whether the huge spike in anxiety disorders is connected to this somehow.
We're not using technology to do fantastic things. We're largely using it to enable fantastic laziness and entertain the habitually bored. I wonder what trivial little convenience would actually spur us to draw the line and say "Enough. Get out of my life." I'm envisioning a device which would access all your most secret sexual urges, but it would allow you to fart while sitting down without lifting one ass cheek. I bet we'd leap at that.
It was Avogadro Corp. and I didn't find it bad. So to counter your opinion, I give my recommendation here as a light sci-fi with plenty of mocking of Google and Apple included, and AI that - unlike in most stories - makes sense.
This is one end-game of AI - changing the economics of scams.
Imagine how a 'long con' works today: a scammer befriends a person online through a video game or social platform, and develops a rapport with them over the span of days, weeks, months. After some trust has been gained, the scammer then requests money from the victim. Does this happen a lot today? I don't know, but certainly one reason it doesn't is the economics of the scam. Who wants to spend a significant period of time gaining trust just for the chance of a payout?
AI is going to flip this on its head. Rather than dedicating hundreds of hours of a scammer's time, a scammer could instead use a system like Duplex to befriend hundreds or thousands of victims simultaneously. Let it run for a few months, developing a strong rapport with the target, until the AI finally requests some money from the victim.
Yes, duplex is for completing specific tasks, but how much of a difference is there between "Duplex, book a table for four at 8pm" and "Duplex, ask my victim about how their day was"?
> Yes, duplex is for completing specific tasks, but how much of a difference is there between "Duplex, book a table for four at 8pm" and "Duplex, ask my victim about how their day was"?
The former (booking a table) is much more "constrained", as in the conversation would most likely not go into much of a tangent, because there are only so many responses to a statement like "book a table for four at 8pm" (8pm is full, 8pm works, etc).
Whereas asking someone how their day was would give the "victim" a much bigger breadth of responses (and additional questions!) that would cause the AI to stumble and fail to give a satisfactory answer. And running this 1,000 times simultaneously, where no single person could step in to "help" the AI, would be a highly unscalable operation.
And you think this is the end of innovation, this is it, no improvements from here?
Duplex is interacting with a human and that person has no idea it's a computer on the other end. Yes, Duplex is limited as it stands, but what makes you think something like what the grandparent post describes won't exist in ten years?
Well, I'm already receiving automated calls about "that car accident" and they connect me to a real person if I happen to respond with a predefined keyword.
I would say that there's a market for that. How long before rogue agents implement good AI for their operations? How long before we implement defences against AI?
Duplex's problem is not specific tasks but a limited domain. If you are engaging with a person, you are going to wander across multiple domains at random, especially if you are building a rapport over a long period of time.
If an AI can do that, then the singularity has definitely been reached.
This is amazing, but one thing I don't really understand is this: earlier in the presentation, they demoed some new Google Assistant voices. All of them sound like standard computer-synthesized assistants. On the other hand, the synthesized Duplex voices sound indistinguishable from human speech to me, even without the "disfluencies" they include.
If Google has gotten speech synthesis to this point, why isn't Assistant synthesizing speech of this quality?
I have a feeling it's because this is such a limited domain.
During the demo it all sounded very realistic, except for some parts like the times. It would flow naturally then all of a sudden pause awkwardly and then say a time like "12 pm" in a weird way.
I have a feeling they are getting it to sound so realistic because there's a fairly small number of responses and questions it needs to work with, so they can either pre-record real humans, or heavily tune an ML voice to sound as natural as possible.
I'm reminded of a thing I read about generating "yellow letters" for real estate leads. The letters look like someone handwrote them on a steno pad. If you used a handwriting font they would look fake. But you can't handwrite each one; that'd take too long for thousands of letters. What they do is get a handwriting font made of their own handwriting, and sure, it looks fake. But they write most of the letter template out by hand and use the handwriting font for just the parts that change. I wonder if something similar is going on here, where they have taught the speech synthesis a bunch of phrases and they color in the bits with synthetic speech where it's needed. The inflection is also easier to specify in those cases.
Google Home's speech recognition can't even reliably turn my lights on and off every time, so why would I trust it to book a restaurant reservation? I'd expect to tell it to book a 6pm table for 2 and wind up with an 8pm table for 10.
I'm with you. We've seen a demo, not a real product that people are using. I wish people would temper their expectations a bit. If product development has taught me anything, it's that it's easy to put together a demo to show how well a product can work in the golden path, but once you stray from it, it becomes incredibly difficult to solve all edge cases. With something like conversation, there are lots of edge cases that need to be handled. Once you have a missed edge case, the illusion of reliability falls apart.
I'm so very disappointed by Google Home. It feels like they left this project behind. It's infuriating how dumb this machine is, and how many times I have to repeat "OK Google, hey Google" for my old Google Home device, which was working way better at the beginning. Google Assistant on Android also seems to work much better.
How does that differ from a human having the same conversation?
If I call a restaurant and say "Can I have a reservation for 6 at 8:00" and they write down a reservation for 8 at 6:00 without repeating it back I won't know until I show up at 8:00 with my 5 friends.
True -- I was mainly snarking about people with poor communication practices, not the technology (which seems great -- a godsend for people like me who hate phone communication).
Exactly my thought! I appreciate Google's effort in many fields such as distributed computing, but in the past few years a lot of Google I/O announcements feel like vapourware/demoware that they're going to scrap/rebuild in a few years.
I'm assuming, perhaps incorrectly, that Google would have one backend service or set of services for all speech-to-text translation, and that all related products would be powered by it. They've been iterating on speech-to-text interfaces for at least 15 years now.
The Pixel Buds were supposed to be something similar, but did not live up to the hype... and I've experienced the same things with my home automation with Google Home, and don't get me started on using it to control Spotify etc.
> I'd expect to tell it to book a 6pm table for 2 and wind up with an 8pm table for 10.
That's a win, consider: it's the right restaurant, on the right date, and 10 is larger than 2 so there'll be room enough. OK, you have to wait two hours, but given that this tiny place is normally so busy you have to book in advance, you will more than make up for the time spent waiting by being the only people there expecting service.
This is amazing! I'm very excited for a future where more and more tasks can be automated, enabling humans to get higher and higher living standards with the same/fewer input resources.
In short, technology is a wonderful thing that allows very low marginal costs. This is what we need to make the future a better place, given a consistent or growing population.
"Technology is miraculous because it allows us to do more with less." This is a perfect demonstration of that.
> I'm very excited for a future where more and more tasks can be automated, enabling humans to get higher and higher living standards.
That's been the dream for a long time in some circles. With the enormous productivity gains and ability to leverage external energy sources (fossil fuels, solar, etc.) we could have built a society of wealth and leisure for all.
Maybe we will still. The hope is that if there is enough of a productivity gain in a short enough time period (like the introduction of AGI powered robots) that this could still happen.
The technology promises "wealth and leisure for all". The capital owners promise "it'll trickle down". Technical utopianists need to start tempering their optimism with the realities of human nature and design systems accordingly.
I think our only realistic hope is actually various non-profit foundations and such.
When you think about it, I can download (for zero cost), a high-quality operating system and attendant applications which would have cost hundreds of dollars 20 years ago, and would have cost a fortune 40 years ago. Ditto for educational materials, entertainment, etc.
In that sense, we are quite wealthy in comparison to previous generations. If charitable organizations can leverage the automation of the future to help people, we might then see all humans across the planet lifted out of poverty.
But yeah, I don't expect corporations to do this. And it seems unlikely that most governments will either.
Interesting perspective. I share your pessimism about corporations and governments.
But sadly I think government is the only institution with the necessary leverage (tax base, mandate, etc) to accomplish this. Non-profits are also fairly dubious in their motives, subject to corruption, and generally highly inefficient. I'm not sure they're going to be our saviors either.
I don't consider the present highly problematic, though it is imperfect.
The problem is the relatively imminent (next 50-100 years) reality of nearly complete human obsolescence. That's when you'll see societal degradation at previously unthinkable levels. And no, that's not the Luddite fallacy. The next epoch of technological innovation is going to be unlike any that came before.
A lot of people champion UBI, while forgetting that something like 3B+ people currently live on less than $2.50 a day. That's really all the evidence you need to know that the future is going to be pretty grim. Do we think it's more likely that plutocratic systems will award sustainable UBI packages to the mass unemployed via wealth transfer (which is anathema in said systems) or that market forces will discover the absolute minimum survivable income level and create new strata in first-world societies that hover just above pure barbarism?
Average income and wealth have massively increased over the last century, so we are slowly but surely getting to that utopia. But it happens at a pace that is too slow to be experientially perceptible.
I would add however that there does appear to be some barrier limiting the ability of ordinary people to accumulate financial capital, and I attribute that to friction/fixed-costs imposed by regulations.
New services, like Robinhood, and technology, like cryptocurrency, could address this, and allow wider participation in capital markets.
What past productivity improvements and/or economic developments have ever pointed to your fantasy being realized? How will thousands of workers in the Philippines and India losing their call center jobs result in "higher and higher living standards?"
I'm sorry but given historical context and popular capitalist intent, this future you speak of is simply fantastical. The working class would sooner be made extinct before a Jetsons-esque future of leisure for all came to fruition.
I recommend you look at the statistical evidence. The average wage in the developing world has doubled, and the global poverty rate has halved, over the last 20 years. That's all thanks to automation.
The wage growth has been widely distributed, though I don't know the median wage statistics offhand. I presume they're approximately equivalent to the average wage statistics.
As for jobs, most of those that existed 200 years ago no longer exist or employ a tiny fraction of people they employed at that time, yet we don't have mass unemployment. Automation has never had a broad-based negative impact on the demand for labor. Its effect has been exactly the opposite.
This is really horrifying to me. My heart goes out to the working-class retail people who are going to have to spend their days chatting with the AI assistants of upper middle class people too busy to call themselves.
If the shop owner gets a duplex system to field the calls, then the two robots can subtly signal to the other they aren't actually human, and then start shrieking like a 9600 baud modem to finish the dialog.
As someone who worked in tech support early in their career, I do not.
The entire process of fielding calls is terrible. You never know what's waiting for you when you pick up the phone. Could be someone with a terrible attitude that wants to take it out on you. I had coworkers who got PTSD, and a ringing phone would trigger it.
I would rather talk to a rational robot on the phone than a possibly irate human who, frankly, only wants something transactional from you and treats you like a robot.
First, because it will fail a lot: the robot won't understand specifics such as which items on the menu you want, or "oh, that one's missing, do you want this instead?"
And then commercial calls will multiply, and spam will arrive by phone.
Then people will abuse it to harass, annoy, attack competition, etc. Spam a restaurant with robot phone calls for a month and it's done.
Plus Google will analyse all this data, because it doesn't know enough about you.
It's a nightmare.
But it will be excellent for anybody with social skills. What I learned living in Africa is that we have become handicapped because we can avoid talking to other humans so much. Going back to France, my social, sexual and work life improved a lot because I had basically zero competition. The next generations will really suck at the game.
As someone who worked very closely with tech support and sometimes fielded calls, I echo this. It's batshit insane to field support calls, L1 especially. Honestly, no human deserves the wrath of support calls. You get yelled at for no apparent reason and all you can do is be patient. Who knows what the psychological consequences are. I have seen customer support people lose their temper in otherwise normal situations.
I had a company for 15 years and I still hate it when the phone rings. Phone calls were always messing up whatever I had planned for the day and it made it harder to get shit done.
I do feel for them, but at the same time we have these unintentional designs and systems that cause a lot of suffering all over the place.
For example, I was kept on hold for 2 hours the other day trying to sort out being double charged on my bank card. If this sort of service can handle the hold, then I'm in.
Exactly. People think it will be used for better service. It won't. It will be used to save money, at the expense of the human. When was the last time you were satisfied with a website help page? They usually suck big time. They have zero way to help you with your specific situation.
I would feel infuriated to have to talk to a robot that wastes my time. At least with hold music I can put it on speaker and do something else.
Maybe it'll motivate more businesses to put more of their transactional things online where the bot can do it without bothering a real human, and real humans who don't want to use this can also avoid spending more time on the phone than they have to.
Honestly, the degree to which telephone calls are still necessary in day to day life is absurd. If this hastens their demise then I'm all for it.
Phone calls are necessary for anything that needs flexibility.
Coding flexibility in a website is incredibly hard.
Think about it. A restaurant doesn't have anything veggie on the menu, a last minute guest arrives and is veggie. You call to ask if there is something that can be done.
It's possible to code that, but it's a lot of work to get all the scenarios right.
I said transactional things, by which I mean things that don't require any special figuring for the vast majority of cases.
In the case of a restaurant, having their menu online would let you know in advance whether the restaurant has something you can eat. Lots of restaurants still don't do even this.
The store is still getting the customer, the retail rep is still getting the same call. What's so horrifying?
I see this as leveling the playing field -- upper middle class people can already afford real assistants à la Tim Ferriss; this just lets everyone access it.
And the secret knowledge that a specific whistle from a cereal box - blown into the phone at just the right time - can land you a Saturday night reservation at French Laundry.
Oh dear: polyglot, audio-steganographic phone calls. The audible message doesn't even have to be the same as the one encoded in microgaps, envelope shifts and all kinds of modulations.
What makes you think this won't be used by people of all classes?
And did you hear the recordings? The AI is better at conversation than the humans it had to deal with. I would gladly use this to avoid talking to people who can't parse a plain sentence without rounds of repetition and clarification.
AI tools, including Google's Duplex, will benefit the bottom ~90% of the planet radically more than they will benefit the wealthy. That has been true of nearly all technology from the past few hundred years at least.
The wealthy will largely continue employing humans to do tasks for them, because they can. It will be just another form of luxury / status. An AI assistant will be considered beneath their class. The technology will be nearly universal and extremely inexpensive, two things rich people dislike as it pertains to signaling their status.
I don't think so. Only 48% of people have access to the internet, and 41% in developing countries. The rest won't be able to benefit from this at all. That doesn't even count the ones on dialup or comparable speeds (as in Cuba), where transferring anything beyond a couple hundred kilobytes is unbearable.
The wealthy will employ AI the moment it can do the same tasks as a human. Unlike humans, the AI is cheap, doesn't take time off work, is always available, is never tired or grumpy, etc. The maintenance work of having an AI worker vs a human worker is just way lower.
The moment a corporation like McDonald's can replace its entire staff with robots without a dip in efficiency, it will do so. 24/7 operation would cost only linearly more than 9-5 weekday operation.
Don't worry. They'll still get calls from plenty of people who aren't sure what they're looking for, don't have any manners, or are hard to hear on their speaker/car phones.
The problem is that I don't want more voice; I want less voice and more automation.
I don't want to have to interact with a human because the only time a company makes me interact with a human is when they think it benefits them (upsell, make leaving difficult, etc.).
What I want are good ways for me to handle this stuff via computer without human interaction.
> Duplex can only carry out natural conversations after being deeply trained in such domains.
The question becomes: how many people will really have the time and money to collect the data needed for Duplex to take over? It might be feasible for large call centers but difficult at an individual level.
Well, that will last until the shop owners install AI assistants to answer customers' calls; then we'll have robots talking to robots using human language, which would make the whole idea nonsense...
If the robots can interact with people who can't use a computer, then user accessibility could improve over the current situation. Like most technology, this has potential to improve the human condition for many or few. It depends on how people use it.
The productivity gains from AI will make new jobs possible and will free up capital to be deployed in new economic creation. That is a cycle that has been essentially non-stop for the last 200-300 years (particularly aggressive and accelerating in that time frame). As populations decline and population growth stagnates, we won't have anywhere near enough people to fill the job openings in the future. The working class will benefit extraordinarily from the AI boom, it will raise their wages, it will produce an immense bounty of new jobs that will mostly go unfilled and it will drastically improve their standard of living.
I was thinking about that too. Even humans could start developing their own personalized shorthand lingo (think cattle auctions, ATC, or other narrow-topic applications) that drops the formalities and emphasizes low ambiguity, speed and confirmation. That could easily speed up computer-to-human calls by a very significant amount.
At least this might ease some of the pressure on the housing market in the Bay Area. By transferring minimum wage workers jobs from San Francisco restaurants and shops to datacenters in Northern Virginia and North East Oregon, we can free up valuable real estate for FANG workers struggling to find a room to rent within the budget of somebody earning only $250k per year...
This is the beginning of robots scraping the real world.
Most of the examples Google showed were crafted to make this look like a friendly agent acting on behalf of users.
But the more powerful use of this, as illustrated by the deemphasized "holiday hours" example, would be for Google to use it to get any information they wanted out of anybody the robots can call and conduct social engineering on.
Imagine coupling this with the knowledge available to someone able to read your gmail inbox.
"Hey you sent us an email two days ago about A and B..."
(trust is established)
"Can you clear up whether you were interested more in A, or more in B?"
I'm really impressed this year. I hadn't thought ML was advancing this much faster than it already was.
On one hand it's not very impressive that it calls someone, but on the other hand it's tremendous.
In 2018 we are able to synthesize voice this well and already understand a small domain like this. In 2018 a system calls a human.
We are already moving so fast that we should treat this Google I/O as a social milestone; otherwise we'll wake up tomorrow having totally missed the moment the future became now.
This is just one more stone on the path to a future where digital becomes a second reality. The advances in voice will not stop. How long will it take until there is a speech model that can simulate anyone after listening to them for only seconds or minutes?
With voices like this, computers will be able to teach humans. We will be able to scale teaching and a shit ton of other things.
Besides creeping dystopia of it, I find this aesthetically disgusting most of all.
If you listen to the sample call, the computer voice sounds incredibly _rude_, at least to my ears, especially at the end of the call. I would never speak to a service employee that way. I try to always say a clear and proper please and thank you, and would never ever want a robot to subject somebody to an "uhhhhh, thanks <hang up>" on my behalf.
It seems like in addition to the globalization of Californian social and moral standards, the world will now be subjected to Californian manners. What a pity.
Let's not get overdramatic here. To sound natural, the computer system of course has to copy natural behavior, which includes manners. The experimental version of this system happens to model the phone behavior of people who live where the developers live and work, which seems logical. It doesn't mean that the same model will be used in other cultural contexts. An ideal model would copy your behavior (and/or voice), so it would be like you calling.
Just have to let Google listen and analyze all your conversations :)
Mechanizing manners is just as creepy. Chick-fil-A employees aren't allowed to say "you're welcome"; they have to respond to every expression of gratitude with the phrase, "My pleasure."
Exactly, and you're from California. You see this as normal. I'm from England. What Californians see as normal, or casual, or "chill", I think we see as being rude. I think this extends to personal interactions, professional interactions and customer/company relations like this one.
I imagine that if I used this service to place an order with a restaurant, it would order with the Californian "cannIgettuhh", which I would have been scolded for as a child. I find it very ugly and I hope that this isn't forced upon the rest of the world. But really, it's just one facet of the way technology is destroying a lot of interpersonal respect.
One can even imagine this "deficient design" being taken to its logical conclusion with the inclusion of belching, grunts, and other inconsiderate bodily sounds.
The other poster says not to worry, surely the engineers will be more considerate about other cultures and customs when deploying this technology to the world, but I think that's incredibly naive considering this company's track record.
I guess I should be sorry for bastardizing your language?
But really, a conversation like the one you're looking for would be just as out-of-place in California as the one you saw in the article would be in England. It's important to match regional customs if you're trying to emulate and interact with humans.
I'm from the American southeast, and if anyone bastardized the English language it was us. When I listened to the recordings I cringed at how rude they sounded. If Google can build this thing, they can certainly adapt it to regional customs if they want to. Hopefully they do.
At last, putting the power of annoying robot phone trees in the hands of the consumer, to be directed at businesses.
For that reason alone, I'm excited. ;)
(Note: In fairness, the conversation demos are actually really slick and much better than a phone tree. I'll be interested to see how well it works in practice.)
How is it that the current implementation of Google Assistant can't even add stuff to my calendar in a natural way? I just tried the following: "OK Google, add an event to my work calendar for a meeting at Starbucks tomorrow morning at 11:30".
What was expected:
Title: "Meeting"
Calendar: "Work"
Location: "Starbucks"
Date: Tomorrow
Time: 11:30am
(Bonus points for associating an actual location, but possible ambiguity gives this a pass.)
What happened:
Title: "My work calendar for a meeting"
Calendar: Default
Location: None
Date: Tomorrow
Time: 11:30am
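The "expected" parse above is essentially a slot-filling problem: pull a title, calendar, location, date and time out of one utterance. As a rough illustration only, here is a toy, rule-based slot filler for exactly this phrasing. The names `Event`, `REQUEST` and `parse_event` are hypothetical, the regex grammar is an assumption covering just this one sentence shape, and this is not how Google Assistant actually works (real assistants use trained models).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    title: str
    calendar: str
    location: Optional[str]
    date: str
    time: str


# Hypothetical toy grammar for one phrasing of the request.
REQUEST = re.compile(
    r"add an event to my (?P<calendar>\w+) calendar"
    r" for an? (?P<title>\w+)"
    r"(?: at (?P<location>[A-Za-z]\w*))?"               # optional one-word place
    r" (?P<date>today|tomorrow)"
    r"(?: (?P<daypart>morning|afternoon|evening))?"     # optional day part
    r" at (?P<time>\d{1,2}:\d{2})",
    re.IGNORECASE,
)


def parse_event(utterance: str) -> Optional[Event]:
    """Fill calendar-event slots from one request, or return None if it doesn't match."""
    m = REQUEST.search(utterance)
    if m is None:
        return None
    daypart = (m.group("daypart") or "").lower()
    # Map the spoken day part to am/pm; leave the time bare if none was said.
    suffix = {"morning": "am", "afternoon": "pm", "evening": "pm"}.get(daypart, "")
    return Event(
        title=m.group("title").capitalize(),
        calendar=m.group("calendar").capitalize(),
        location=m.group("location"),
        date=m.group("date").capitalize(),
        time=m.group("time") + suffix,
    )


event = parse_event(
    "OK Google, add an event to my work calendar for a meeting "
    "at Starbucks tomorrow morning at 11:30"
)
print(event)
```

A fixed pattern like this breaks on the first rephrasing, which is exactly why handling every way people actually say things is the hard part.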
This is a great point. The current Google Now and Google Assistant voice controls don't really work half the time, let alone 95% of the time.
I really hope all this speech AI really comes good, I really do. But at the moment it still is flaky for me.
I really want to use Google Assistant to take instructions and reminders for me. I can't wait until it can smoothly send messages and emails and add calendar entries.
Never have we had a more clear example of why Google needs to be regulated, and laws against what they're doing needs to be passed.
The fact that they think it's okay to have bots pretending to be human making phone calls, shortly after demonstrating how quickly they can copy someone's voice (re: John Legend), shows a blatant disregard for what they're creating.
I'm concerned that a lack of understanding and fear will limit the benefits we get from technology. I don't think that calling for regulation and making this illegal is the appropriate next step.
They explicitly say:
> We want to be clear about the intent of the call so businesses understand the context.
You note:
> laws against what they're doing needs to be passed.
What makes you say that robots making phone calls should be illegal? This seems an odd position to have. Do you also believe that it should be illegal for robots to clean floors? Or for your machine learning spam filter to process and filter your emails?
If I sound a little frustrated, I am. I'm scared that the future where technology allows us to do more with less will be blunted due to calls for making robot tasks illegal. I'd love if you'd help me understand what makes you take your position - I'm sure it's logical, and it's just a matter of me being able to understand your perspective. Can you help me understand?
Yeah, I had your view until the last 2 years and the 2016 election. All we've seen is that this kind of technology will be abused to the gills to create fake propaganda, robocalls, and phishing scams.
I too want to absolutely, without a doubt, regulate or slow this stuff down until we can better understand how it can be abused. And WHEN it is abused, we have ways to address it and aren't constantly playing catchup to how people are badly using these things.
I am far more afraid of technology we can't control that works for an ad company, not for us, than moving technology forward a little bit slower.
We need to be able to trust our communication, and we're moving away from that at light speed, toward a world where a company whose entire business model is to subtly influence you in the directions profitable to them is the middleman in all of our communications.
I am terrified of this. I am not sure I've ever seen a presentation that terrified me more that wasn't out of a science fiction movie.
> I am far more afraid of technology we can't control that works for an ad company, not for us, than moving technology forward a little bit slower.
I'm not sure what you're arguing for - that you wish companies weren't driving technological advancement? That Google specifically shouldn't?
Also, it's very, very conceivable that there are people for whom this technology DOES work for. The mute, for example. I'm not sure it's appropriate to speak for all of "us".
At the very least, there should be a requirement to identify when challenged.
"Are you a bot?" "Yes, I am Google Duplex v1.2. This call is being recorded; you can see our privacy policy and terms of service at http://google.com/duplex."
There's no way I should be bound by terms of service that are secret unless I know the "code-phrase" with which to ask for it. Google's whole point was to deceive the user into thinking it isn't a bot -- so you wouldn't even know to ask it.
People will be forced to "discover" that they've been lied to, when the bot is caught in a loop and they're going through emotional distress thinking the "person" on the other end is having some sort of mental breakdown.
It's illegal in many states to record a conversation without obtaining consent, which is something this product does as a matter of course. This product shouldn't be legal to operate in those states, so you don't have to look very far before finding critical flaws in this approach.
That said, "identify on challenge" is something Google might be more likely to adopt than "identify on call pickup"; the latter is something they might spend more lobbyist dollars to oppose.
But what you've said is certainly (or should be) a real concern.
I agree, but the social pressure not to use it would be immense. Imagine the awkward moments where we start confronting people who call us and asking them if they're real.
It's already happening. I'm getting robocalls from some police donation fund that will slow down and restart speaking if you interrupt them (it took me a few moments to realize it was a recording).
The irony of someone named ‘...trekkie’ criticizing primitive AI tech is a bit rich. This is a fan of a future featuring a military alliance with uber-powerful weaponry capable of destroying populations that continually proclaims it's on a mission of peace, computers pretending to be people and doing work for them, androids, and synthetic holographic AR/VR people.
We’ve arrived at full Luddism now where life saving and time saving technology in fields like health, transportation, and customer service will be inhibited by hyperbolic fearmongering.
Are you going to pass laws against realistic sounding synthetic voices? Against computers that understand queries “too well”? Against self driving vehicles that drive better and safer than people? All on theoretical harm that actually hasn’t taken place because you’ve watched too many dystopian Netflix sci-fi episodes?
I don’t know what’s more dangerous, the real Skynet, or people who might harm millions by voting for political policies that inhibit real improvements that could be made to help them.
If you want to talk about AI being used for harmful things, at least we could discuss feed optimization that parasitizes people’s attention spans and keeps them glued to their screens or wasting money pulling more slot machine levers: technology that wastes people’s time and money, as opposed to technology that makes people more efficient.
Does your secretary or father pretend to be you when making the appointment? Google Duplex is pretending to be a human without the knowledge or permission of the person on the other end of the call.
One of the samples claimed to be an agent acting on the behalf of a client. Google says they're still deciding on the best format for these calls--here's hoping they take the criticism of people like you into account, and go with something like that example.
Of course, I don't really see why there needs to be an attempt to pass as human at all. It would be fine with me if they just reinvented the phone tree with a better UX, especially since it's billed as a business-facing call.
Would you say the same thing if this tech was created by a small company?
Voice copying is nothing new, and many methods exist, so I don't see how Google doing this is somehow bad. It's uncanny to say the least, but laws against tech progress? Come on :)
Did you think advancing AI wouldn't be creepy, especially when it's good?
Come on now, this is cool! The correct initial reaction is to be impressed.
What is concerning is the surprising prevalence of technophobic notions on tech focused boards such as this.
There is something perverse about wanting to punish a company for creating something cool, and it’s certainly not the way our society or politics should function.
This is awesome, awesome tech. I just showed it to a bunch of people in my office.
But at the same time, the last two years have shown us that a lot of technology is being very effectively abused to misinform at a geopolitical level, and we as a society need to better understand or regulate this stuff for sure. Lives are at stake.
That said, I'd never say that Google shouldn't work on this. It's amazing. We just need to better understand not just how it can be used, but also how it can be abused.
How would you handle deepfakes? This is much more disturbing than what is in the post, and it isn't a corporation who built it. I don't think this has anything to do with regulating google.
The reality changes. Fast. You won't be able to trust photos, audio recordings, or videos. You will be able to trust digital signatures. To have any opinion on whether information about somebody else is real, you will need a working web of trust. It's gonna be a great thing and I can't wait for it. It's going to change news, politics and in general the world we live in. It's going to be bigger than the Internet but way more polarizing and controversial. IMHO
One issue, though: unless we hurry up and make this WoT decentralized and built on open protocols, we will get it from FB and the likes.
To be perfectly honest, I've always personally been surprised at what business people are willing to transact over the low-bandwidth media of the telephone. I've heard plenty of stories of people pretending to be celebrities and getting away with stuff on the phone; I don't need a sophisticated computer to do that for me.
If a real problem of deep-fakes in phone conversations emerges, people will just raise the bar of authentication in phone communications. There's nothing magical about telephone that makes it exempt from the general need to protect against social engineering.
In the short/immediate term, it should probably be illegal to conduct a robocall without explicit opt-in of the destination caller. (Currently, if you get a non-telemarketed robocall, it is solicited, such as you scheduled an appointment and they are reminding you of it.)
Furthermore, if a robocall is permitted to be made to an unsolicited destination, a robocaller must clearly identify itself, presumably starting with a declaration that it is a bot/automated agent from a given company on behalf of another company or individual.
And if the call is recorded in any way, shape, or form, the terms of service would need to be presented to the callee, giving them the chance to accept or deny said terms. (Note that if the example calls at I/O were not staged, and real calls, I suspect the recording of them would be illegal in many jurisdictions, including California.)
> In the short/immediate term, it should probably be illegal to conduct a robocall without explicit opt-in of the destination caller
Existing robocall law (including the National Do Not Call registry in the US) is focused around telemarketing. This isn't a telemarketing system; it's an automated assistant. I don't see the utility of applying the existing law to the new use case, as the goals are different (people at home don't want to be interrupted to be advertised at; businesses do want to negotiate business transactions).
> Furthermore, if a robocall is permitted to be made to an unsolicited destination, a robocaller must clearly identify itself, presumably starting with a declaration that it is a bot/automated agent from a given company on behalf of another company or individual.
Why? If I have an assistant who makes an unsolicited call to a destination, they don't need to formally state they're acting on my behalf. What about automating the assistant's job makes it special?
I agree with you on the third point (I'm assuming Google has its bases covered there, because unilaterally recording a conversation is old and settled law).
As I said, it should be a law, I didn't expect the existing robocall laws to apply here. Note that in no way has Google explicitly limited itself to conducting business transactions merely because that is what it demonstrated here.
The issue is that the intermediary is a Google corporate entity, theoretically acting on the user's behalf, but at the end of the day, acting on Google's behalf. Consider that Google's bots may do things in Google's best interests, not the best interests of the party on either side of the transaction.
My assistant is another human being, theoretically acting on my behalf, but at the end of the day, acting on their own behalf. My recourse if our needs don't align is to fire them.
As we already have assistants who transmit our desires by proxy, I don't see much difference between a human and an automated script in that context---certainly not enough difference to justify the need for special-purpose law to clarify the nature of my assistant (and definitely not enough difference to justify shutting down the technology with only vague risk and no instances of social problems introduced by the tech).
> if the example calls at I/O were not staged, and real calls, I suspect the recording of them would be illegal in many jurisdictions, including California.
Interesting implications there for how Google will measure / improve the effectiveness of Duplex. Where is the line drawn between recording a call and recording data derived from a call? e.g. clearly recording the call unilaterally is illegal. Presumably just storing a hash of the audio data would be legal, but also useless. Is there some middle ground that is legal, but also useful?
The technology is really awesome but I'm not convinced it will fly:
* How much time do people really save? The call in the example is less than a minute. Maybe if you need to call 10 places it becomes more helpful, but as companies build a bigger online presence (and they are), this technology becomes less useful. You can already make reservations, book appointments, and find contracts pretty easily online.
* You can't be 100% sure that the AI won't make a mistake or sound like a total jerk on your behalf. Ok sure, the tech will improve but it will be a long time before humans will fully trust AI to represent you.
* If people find out you're calling them via some automated bot they're going to think you're a tool. Everyone remember Google Glass?
* What the hell is wrong with human interaction anyway?
Yeah, my first thought was how useless this is, since it's going to take a lot more time to program the robocall with the information and read the text digest of the order than to just call or text yourself.
And this thing will have to produce a text digest; it can't simply record the call without identifying that it is doing so, and that would destroy the illusion they are crafting.
The tech is extremely impressive, but this use case makes little sense except as a way to humanize it and get people to like it. It has to be more for businesses or some other solution.
I have a deaf friend who would love this feature. Sure, a lot of stuff like this can be handled online now, but not 100%. Sometimes you really need to make a phone call, and for some people that's difficult.
Assuming they're in the US, your deaf friend can use a free relay service. There are a bunch of variations, some that allow signing to an operator, others typing. All are free, and as accessible as this google bot.
(I assume your friend knows about these services; this is an FYI for others.)
I found the first one (female AI, hairdresser) amazingly compelling, but the second (male AI, restaurant) sounded like The Good Doctor's autistic lead character, and to me, the callee sounded bemused by him.
Part of this is definitely that the girl voice sounds cute, and this partially disables my cognition.
But objectively, her approach is more tentative and polite, whereas the guy voice is more direct and assertive.
They might not want the guy voice to take on those feminine qualities, but doing so would make the interaction work better, so female AIs dominate.
The effect on the listener may also help; however, I'm not at all sure that other people (especially women) react as I do to the cute voice. They might even find the Good Doctor male approach better, though I can't imagine that.
The trick of inserting "ums" is very helpful, but because they use the same sound bite in the same way, it sounds mechanical after you've heard several examples. In the examples towards the end of the page, the odd latencies and (surprising) changes in volume were also off-putting.
After a few calls, recipients will recognize the patterns (especially if they use the same voice; can they vary voices convincingly?), and it might be better to have an honest reverse-menu system.
All that said, the first girl voice was great, and there will be progress.
I already had a comment above about this but another thing I thought of that I don't see anyone discussing:
What is the purpose of trying to fool the business owner into thinking it's a real person? It seems unethical, dishonest and disrespectful to the receiver having them believe they are talking to a real person. In the case of an AI failure at least the receiver will understand what's going on instead of becoming really confused. Sometimes I feel people in SV are oblivious to how their software can affect real human beings.
I would guess the purpose of fooling people is that you get a better result. If the call started with "Hi, I'm a Google assistant calling to request a reservation", you might get more hang-ups. Of course, the reality could be the opposite, as a business owner might make sure to speak more simply and clearly once they know they are speaking to a bot.
I don't care about the awesome technical achievement. Believing that I am talking to a real person when I am not is the worst part.
If I call the restaurant and say, "Well, actually, my wife doesn't like being cold, so I'm not sure the terrace is going to work tonight," and the computer answers with something completely random, that's just so bad.
The problem is not that the computer doesn't sound natural. The problem is that it cannot deal with out of script requests. In fact the more natural it sounds the more dumb it makes the system appear!
It sounds like you simply want Duplex to be better than it already is, not that you object to the technology in principle. Let's say it can handle requests like position in a restaurant or food specials or <insert complex request here>, would you approve of it?
Funny how most people here are missing the point of this tech and talking about how this could be achieved by using an API or talking binary over the phone with another bot.
The idea here is to leverage an existing merchant base by adapting to how they work today. Suddenly they've integrated with millions of restaurants by adapting to them, not the other way around.
I'm sure Google Assistant will first check if there is a way to use tech like OpenTable to make the reservation and fall back to a phone call if there is no better alternative.
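The fallback order this comment describes (use a structured booking service where one exists, place a phone call only when there is no better alternative) amounts to a priority list of channels. A minimal sketch, assuming that ordering: `OnlineBookingAPI` and `PhoneCallAgent` are hypothetical stand-ins, not real Google Assistant or OpenTable interfaces.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Reservation:
    restaurant: str
    party_size: int
    time: str  # e.g. "19:00"


class BookingChannel(Protocol):
    name: str

    def try_book(self, r: Reservation) -> Optional[str]:
        """Return a confirmation string, or None if this channel can't handle it."""
        ...


class OnlineBookingAPI:
    """Hypothetical structured booking service; only knows some restaurants."""
    name = "online"

    def __init__(self, supported: set[str]):
        self.supported = supported

    def try_book(self, r: Reservation) -> Optional[str]:
        if r.restaurant not in self.supported:
            return None  # restaurant not listed: let the next channel try
        return f"online:{r.restaurant}@{r.time}x{r.party_size}"


class PhoneCallAgent:
    """Hypothetical voice-call fallback; every business has a phone number."""
    name = "phone"

    def try_book(self, r: Reservation) -> Optional[str]:
        return f"phone:{r.restaurant}@{r.time}x{r.party_size}"


def book(r: Reservation, channels: list[BookingChannel]) -> str:
    # Try channels in priority order: structured APIs first, voice last.
    for ch in channels:
        result = ch.try_book(r)
        if result is not None:
            return result
    raise RuntimeError("no channel could book this reservation")
```

Putting the phone agent last means it only handles the long tail of businesses that no structured service covers.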
In the case where the person on the other end isn't a native English speaker (the restaurant clip), why doesn't Google figure out what language they speak and speak it to them?
I would imagine that a lot of the work that goes into making the speaking so natural would be very specific to English. As in, they would need to repeat 50% of the work for each language.
Do Scots count as native English speakers? What about the Irish? English has a very large number of accents and dialects that all count as 'English', but speech recognition software typically only works with one variety.
I think the person you are responding to might have been talking about those who speak the language Scots, not just Scottish people.
Many consider Scots to be an actual language separate from English. There's a good amount of debate about this among linguists, I think.
For anyone curious, I recommend reading these pages from the recent Scots translation of the first Harry Potter book to get a feel for how it differs from English.
Currently I'm working on a similar project. Building a GAN (generative adversarial network) for voice takes a lot of work and testing, and you need a big dataset of labeled voices. WaveNet currently supports English[0]. You also need a neural NLP model for the language.
In short, I don't have a lot of faith in this being plausible yet. The major reason is the phone lines: call quality is not solid enough to ensure an accuracy rate high enough to roll this out as a production service. There are other reasons, like legal considerations, but if I had to pick one, that would be it.
There's a reason why this rose to number one on HN: people would clamor for it. To think that Apple and Google haven't been thinking the same thing is shortsighted.
It was the most impressive thing in the keynote for sure, but given the current state of AI we should have no reason to believe that this would work well enough to be employed at scale. It also feels weird that it is trying to trick the callee that it is a human. I believe even if they used a mechanical voice, the system could work, because businesses would learn to recognize the "google call" and respond to it the way they respond to any voice-enabled call center. It seems to me it is more of an (impressive) PR thing than something that will get actual use.
I agree. There's a whole world where I imagine businesses with stickers like the Visa accepted here stickers, except something like "Siri accepted here" or "Siri works here", whatever the wording. That way you would know that you could use Siri to interact with that business on your behalf. That takes away part of the resistance to phone bots by people that answer phones, them expecting those calls.
All this is great, and I am sure Google will use the tech responsibly, but I feel we are rapidly approaching a point where humans can no longer use their senses to discern critical truths about our own reality.
I honestly don't know how that's going to pan out. Just imagining it already gives me the kind of creeping paranoia you get when you are too high.
That's true. People do not take psychological effects of technology seriously. The reason being most of us are already addicts.
Who knows if this comment is by a bot or a real person.
Interesting: many people here seem to be fascinated by this, and I am sitting here thinking, "Oh my god, I hope I will never have to work on a phone receiving such calls."
The technology is kinda cool, but when I think about the poor sound quality of some calls and my experience with voice assistants, I wonder in how many cases this will end in garbage appointments or very poor experiences for the human on the other end of the phone.
Besides that, I like how Google is pushing to change the current way of making appointments. Maybe this will drive more small/medium businesses to use online services for appointments.
Talking with humans is often a poor experience too. I can imagine employees of small businesses preferring to talk to Google rather than customers directly, as unlike real customers Google should be predictable, and never frustrated or angry.
That's almost worse. Can you imagine having an endless conversation with a human sounding bot? Say you work at a super popular restaurant that has most tables booked out for a month or so, and doesn't have anything ready beyond that, and the bot just walks through a semi-endless list of possible options.
I wonder what it sounds like when it runs out of choices, or when it's asked to get a dinner reservation at <insert popular place> any evening at any time for the next two months.
The weirdest thing about these trained neural nets too is the small tweaks that break them in very interesting ways. The future is truly a surreal place.
The demos look as exciting as any chatbot demo we've seen before, and I think this will fail in practice just like chatbots fail. With the exception of very few verticals in controlled settings, most real-world tasks are way more dynamic and fluid than what we can comfortably code in some state machine. The article pointed this out, too:
> One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations.
The question is whether, after you limit it to narrow enough "closed domains", it can still be practically useful. It might help with certain functions in enterprise settings. It will definitely work for spammers, because they can work with even a 1% success rate.
I think it's unethical for a robocaller to incorporate things like "ums" and "ahs" intentionally to deceive people into thinking they're talking to a human. At least, it's disrespectful.
I agree with that. I think it's time for us, the tech industry, to start thinking about the ethical implications of the things we build before we release them.
To make this salient for people: imagine this technology being deployed for political robocalls. An attractive voice masquerading as a person persuading people to vote for someone.
They already hire telemarketing centers to do political calls; is this really so different?
I don't see any disrespect in lying about who you are in anonymous interactions. I'm under no obligation to be truthful. It doesn't matter to the restaurant if I book my reservation under a fake name. It doesn't matter if my assistant books it for me and they act like they're me.
I agree about everything you wrote about anonymity. What bothers me is tricking someone into thinking they're having a human interaction with another person -- that the things they say will be heard and matter to somebody.
I disagree; ums and ahs are just part of the protocol of voice communication, hence robots should use them. I think it should just introduce itself as a robot.
I think we are in agreement -- I didn't say I thought ums and ahs are unethical on their own, just as part of deceiving someone into thinking they are talking to a human.
They have listed the problems with such systems in the first paragraph. They claim to overcome these by restricting to very specific domains. But specific domains are usually still wide. A table for two, but in the garden or inside? Inside on which floor? Sometimes it doesn't matter, sometimes it does, and these things are specific and different for every business.
Really skeptical about this. And if this does become a thing, it will dumb down the interaction.
The bot can gather this restaurant specific information over several conversations with the restaurant. This wasn't possible before. This domain isn't too wide.
You are probably right that in principle one could eventually come up with a full catalog of features of a reservation. There would be about, say, 100 of those.
I seriously doubt that they will proceed to define and collect them, since those are probably relevant to 10% or less of all reservations, but let's say they would.
Even then, the conversation you have to make the reservation is a process in which you make decisions.
Say, there is a place inside at 20:00 or a place in the garden at 20:30. Are you going to let Google choose between the two options for you?
Do you imagine there would be an api in which you specify to the assistant, before it makes a call, your preferences in that much granularity?
I feel like this is going to make the world even flakier than it already is. If I can waste people's time without even the few minutes it normally takes to make a phone call, what's to stop a restaurateur from effectively launching a denial-of-service attack on competing restaurants, or what's to stop me from booking 100 dinners for Friday evening because I'm not exactly sure what I'm going to want to eat (or who I'm going to take out, for that matter; smartphones make it easy to punt on that decision too), so I retain optionality?
What stops a restaurateur from abusing Google's Gmail service? Presumably Google won't allow you to spam with Duplex either.
If you are going to be more enterprising then you can already do this. Just put some HITs on Mechanical Turk and let them place calls. Should cost you a few bucks to flood hundreds of calls.
I don't need reservations that badly. But the difference between having to specify Mechanical Turk work and just talking to a Google appliance is pretty huge... We're talking hours vs. minutes of work.
But I am saying that it is unlikely Google will let you spam reservations. This idea that some kid will now be able to make 1000 reservations via Google is not probable.
I think it's highly probable that either the technology will be rebuilt in a way that makes abuse possible (Google being relatively open with their technology has this side effect), or that Google won't put in enough safeguards to force people to use it responsibly, but we will see. Maybe you need too many AI experts and too much data to build this technology for yourself, and maybe that only exists at Google.
The voice sounds nice, but successful runs aren't that interesting. If it gets a question it doesn't understand, what does it do, and how does it report it back to the user?
It seems like the user is likely to get a certain number of confused recordings sent back to them when it fails, and then get stuck manually calling back and explaining what happened.
This is a masterpiece of framing by Google. Am I the only one who doesn't believe for a second that this technique will mostly be used to make calls for consumers, rather than to talk WITH consumers? We'll be back to asking hotlines strange questions to break the algorithm and reach an actual human.
Sigh. I guess it won't be long before we're all asked to jump through some ludicrous human challenge to confirm we are not a robot every time we call someone.
> [...] we trained Duplex’s RNN on a corpus of anonymized phone conversation data.
That's alarming, and thinly cloaked in euphemism (IMO). "Conversation data" here means recordings of actual human-to-human calls, as well as their automated transcriptions. Both were used.
The examples they give are quite impressive but I always want to hear some examples of failures as well with these types of technology. They mention that the system is self monitoring and will try to detect a situation it can't handle and redirect to a human operator but I think some examples of situations it can't handle or where it gets things wrong would be very useful in understanding its real world robustness.
There are plenty of places in my town that don't have online ordering or booking... this would (potentially) help with that... but really, I'd rather see more tech built around enabling these companies to "be online", such as an online ordering or booking system. When I find a haircut place, why can't I just book an appointment right from the Google Maps search? I'd rather have that type of convenience instead. Ironically, they may add that at some point, and Duplex just calls on my behalf without me even knowing.
> Ironically they may add that at some point, and Duplex just calls on my behalf without me even knowing.
Yup, good observation; seems highly likely, save for the 'without even knowing' bit... I would imagine you'd 'request to book' and an async Duplex operation would run in the background and send you a notification of the outcome / possibilities.
I can imagine this being an extension to robots.txt - i.e. do I mind my business being robo-called, or would I rather send an API call to <x>. This is an interesting point in AI technology.
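To make the robots.txt analogy concrete: a business could publish a small policy file stating whether automated calls are welcome and where a booking API lives, and a caller would consult it before dialing. A minimal sketch; the `Robocall:` and `Booking-API:` directives are entirely hypothetical (no such standard exists), and treating a missing policy as "allow" is an assumption.

```python
from typing import Optional


def parse_robocall_policy(text: str) -> dict[str, Optional[str]]:
    """Parse a hypothetical robots.txt-style policy for automated callers.

    Assumed defaults: calls allowed, no booking API advertised.
    """
    policy: dict[str, Optional[str]] = {"robocall": "allow", "booking-api": None}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so URLs keep their own colons
        key = key.strip().lower()
        if key in policy:
            policy[key] = value.strip()
    return policy


def choose_channel(policy: dict[str, Optional[str]]) -> str:
    # Prefer the advertised API, respect an opt-out, otherwise call.
    if policy["booking-api"]:
        return "api"
    if (policy["robocall"] or "").lower() == "disallow":
        return "none"
    return "phone"
```

For example, a file containing `Robocall: disallow` and `Booking-API: https://salon.example/api` steers the caller to the API instead of the phone.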
Just wait until a Google Duplex caller "on behalf of a client" calls to schedule a reservation at a restaurant using Google Duplex to answer the phones.
Seriously. Instead of a single, clean restaurant reservation HTTP POST API, the future is two neural nets modulating and demodulating the request to and from inexact and potentially ambiguous English audio.
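For contrast, the "single, clean restaurant reservation HTTP POST" might look like the sketch below. It only builds the request with Python's standard `urllib.request` and never sends it; the `/reservations` path and the field names are hypothetical, not any real restaurant's API.

```python
import json
import urllib.request


def build_reservation_request(base_url: str, party_size: int, date: str,
                              time: str, name: str) -> urllib.request.Request:
    """Build (but don't send) a hypothetical reservation POST request."""
    payload = {
        "party_size": party_size,
        "date": date,  # ISO 8601 date, e.g. "2018-05-12"
        "time": time,  # 24-hour local time, e.g. "19:00"
        "name": name,
    }
    return urllib.request.Request(
        url=f"{base_url}/reservations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Every field is unambiguous, which is exactly what the voice channel gives up; the catch, as the replies point out, is that each restaurant would have to expose such an endpoint.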
Silly. The future is a steganographic handshake in the initial greeting, which negotiates an upgrade to a proprietary gRPC8 protocol when the caller and recipient are both Google, which Google uses to get a monopoly on telephone-mediated social interactions which it can then monetize by building a social graph to more efficiently target advertising to captive audiences riding Waymo cars.
To be fair, we could make the same complaint in regards to the web - it's largely all plain-text on a line, as opposed to some form of compiled bytecode (I know, it's coming).
What we lose in using human speech for precision we make up in it being pretty much universal. Talk about an adaptable interface. You can phone the restaurant and do anything from reserving a table to ordering takeout to informing them that their cat is on fire.
(I mean that as both a joke and a real comment: you could never force every restaurant in the world to learn REST, but you sure can call a bunch of them.)
I think Neal Stephenson would make the case that we started down that inefficiency road when we replaced telegraph signal with voice in the first place. ;)
I wonder if it would be able to tell if it was talking to another Duplex bot and instead of speaking in English, it would communicate more efficiently.
Since the restaurant in this situation is clearly using a digital booking system already why would Google (or whatever service inherits this kind of bot call) not just check on the popular booking sites before placing the call?
Yes, it would be better if those businesses had systems/APIs for those transactions to be done, but we still live in a world where scanning a sheet of paper that was printed and mailed to you with a phone often reduces friction on the action of moving money between two accounts.
That being said, as someone who has used "assistant as a service" stuff before, I wonder how well this will work or how limited it will have to be, and not just because of the AI itself.
Even with humans on both sides, it's amazing how hard it can be to get an answer let alone a request fulfilled in a single phone call.
Questions about table placement, food allergies or other restrictions could come up, which I wouldn't want going to an operator. I'd rather be told in advance that it is calling the restaurant and have it send me questions in real-time with suggested answer buttons until it learns enough about me not to need them.
In other cases, just having it call and stay on hold for me would be useful. "Ok Google, call my mobile operator so I can talk to someone at around 10am" and having it use data it already has from other calls to place the call at around 9:45 and patch me in at the right time would be useful.
Currently there are about 3 million people working in call centers in the US alone and millions more in other countries [1].
Given the technological trajectory, over time there will be less need for people who serve purely as the ‘interface’ using minimal skills and knowledge. At the same time, we still need many more people to work in the physical world: cooking nutritious meals, construction, and caring for the elderly are some examples.
Since we cannot assume that everyone can develop skills needed to thrive in demanding technical or knowledge-based jobs, a key priority in many countries should be supporting certain segments of the population to develop the skills and attitude necessary to work in these physical jobs: Most of which are too complex for AI and robotics to effectively replace in the next few decades.
In addition, vocational education should be improved and updated to make use of appropriate technology to increase productivity and reduce physical demand on the body.
During the Battle of the Bulge, the Germans infiltrated the American lines with fake Army officers who would give unproductive and confusing orders. The impostors spoke perfect English and often had been raised in the US.
The GIs unmasked them by asking them questions about baseball and shooting any with a wrong answer.
It's worth noting that Google Assistant has been able to make reservations with OpenTable since 2014. No need to have Duplex talking to Duplex; it's just protobufs or JSON or XML or whatever people use nowadays for RPCs.
Google Duplex is about helping the long-tail -- it's work that is done by studying the needs and processes of the smallest of small businesses, and tailoring a product just for them.
I've lost count of the number of college hackathon projects that say, "oh, push a button and you get a pizza," think they'll just put an iPad in the kitchen, and then fizzle out when they get to a real restaurant.
In practice, the restaurant might pass around pieces of paper in the kitchen. So you think, oh, I'll put in a thermal receipt printer. But then you realize that they don't have Wi-Fi or internet, so now you have to put in a $70/month internet bill, on top of the phone bill, and a router or two. So you think, "oh, I'll use a fax machine", or "I'll integrate with the point of sale system". But the fax machine runs out of paper, and the point of sale system is an offline piece of ---- running Windows XP. And even if you do get them using an iPad or OpenTable or Yelp or whatever, before you know it, you have waiters writing on a computer monitor with a whiteboard marker: https://javlaskitsystem.se/2012/02/whats-the-waiter-doing-wi...
But every one of these businesses has a telephone number, whether it's a landline or a cell phone or whatever.
The algorithm of course doesn't understand all contexts. What troubles me is that, in the first example, should we really give the algorithm the freedom to propose a new date for an appointment? It reminds me of something that particularly bothers me about Gmail's Smart Reply feature: when given Monday or Wednesday as options, the suggested reply is 'How about Tuesday?', which keeps the conversation flowing but doesn't make any logical sense.
It makes a good demo and I am very impressed. However, I feel it will run into a LOT of issues, even in just the scenarios shown, once those scenarios become more sophisticated.
Exactly this. I put together a rough demo of similar AI scheduling in the past, and x.ai does this as well. The AI works within the given constraints before proposing new times, because the idea is to free up time for you, not make you clean up a scheduling mistake every time the AI decides something.
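A minimal sketch of that constraint, assuming availability is a simple list of slot labels (hypothetical; a real scheduler would work with time ranges): the assistant only proposes slots that appear in both the user's free times and the business's offers, and returns nothing rather than inventing a "Tuesday".

```python
def propose(user_free: list[str], business_offers: list[str]) -> list[str]:
    """Return only the slots that satisfy both parties' constraints.

    Order follows the business's offers. An empty result means the assistant
    should go back to the user instead of improvising a new time.
    """
    allowed = set(user_free)
    return [slot for slot in business_offers if slot in allowed]


print(propose(["Monday", "Wednesday"], ["Tuesday", "Wednesday"]))  # ['Wednesday']
print(propose(["Monday", "Wednesday"], ["Tuesday"]))  # []
```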
I had the impression that people got uncomfortable with how quickly and drily it says "k--thx", as if talking to the person were annoying.
In all cases they kind of changed their flow when hanging up. I don't know; some seemed nervous. I guess they felt the machine was quite clear and dry.
Also, the girl in the last recording got a bit flirty, haha, and even said, rushing, "see you next Friday" (or something) when hanging up.
Once again, my impression.
edit: It came to me that, at some point, it will be able to wander off a little, giggle and stuff. So creepy!
Interviews are dreaded by most engineers.
What are the chances that Google is testing this in the wild, given the number of applications they receive?
I would have preferred it if this technology was developed at a university. It seems that the academic world is being replaced by the corporate world at a worrisome pace.
I wonder if this will be confined to one-party-consent states/countries for the foreseeable future.
Google will most likely want to use recordings to keep fine-tuning and improving Duplex, and I don't see them announcing "This call is recorded by Google" when they're going to such great lengths to convince the called parties that they are talking with a human being.
There is something about this that is just a bit unpalatable. It's as if getting businesses to define their products and services in a common computer-usable format and interface is such an insurmountable problem that we would rather build million-to-billion-dollar supercomputers so we can skirt the problem and regress to the lowest common denominator of communication: the spoken word.
I'm actually surprised by how many people are impressed by this. I have been using Google's Assistant technology for a while now. It's amazing! I just wish they'd release the new voice ASAP! This is what you can do with it:
https://vimeo.com/251603335
I've been building something similar to this for a while called Interval: it's a plain booking engine for small businesses. Obviously its speech recognition is nowhere near as good, but it's also not tied to Google or their baggage. It works over SMS or fbm.
Google Duplex works on the domain-specific conversations it was trained on. Why can't we have an AI system that learns a language from A to Z, all the dictionary words included, and then understands, speaks, or reads anything normally, the way a human does?
Just a side question, but related: is there any good, recent ML research/model/example for language detection?
For example, I have a conversation in both English and Russian, and I want to segment the input by language and then handle each language separately.
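Not ML, but as a baseline for the English/Russian case specifically: because the two languages use different scripts, tokens can be labeled by whether they contain Cyrillic or Latin letters, and consecutive tokens grouped into runs. This sketch assumes only those two languages appear (Cyrillic text in, say, Ukrainian would also be labeled "ru"), and it can't separate languages that share a script; that is where a trained language-identification model comes in.

```python
import re
from typing import Optional

CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic block
LATIN = re.compile(r"[A-Za-z]")


def token_lang(token: str) -> Optional[str]:
    """Label a token by script: 'ru' for Cyrillic, 'en' for Latin, None if neither."""
    if CYRILLIC.search(token):
        return "ru"
    if LATIN.search(token):
        return "en"
    return None


def segment(text: str) -> list[tuple[str, str]]:
    """Split text into runs of consecutive same-script tokens.

    Tokens with no letters (numbers, punctuation) join the current run;
    a run that starts before any letters are seen is labeled 'und' (undetermined).
    """
    runs: list[tuple[str, list[str]]] = []
    for token in text.split():
        lang = token_lang(token)
        if lang is None:
            lang = runs[-1][0] if runs else "und"
        if runs and runs[-1][0] == lang:
            runs[-1][1].append(token)
        else:
            runs.append((lang, [token]))
    return [(lang, " ".join(words)) for lang, words in runs]


for lang, chunk in segment("Hello, how are you? Привет, как дела? Fine, thanks."):
    print(lang, "->", chunk)
```

Each run can then be routed to the matching English or Russian pipeline.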
Is any product even remotely close to this in natural-sounding human speech? Are we entering an era where all voice assistants are getting to that "Her" level in terms of human-like vocal quality?
That part is as impressive to me as the semantic parsing it's doing on that call.
Not that I've seen. The most advanced IVR systems that some banks are using have voice fingerprints and recognition etc but the TTS is still noticeably robotic.
Other companies are using huge libraries of recorded human voice for communications and concatenating them together in intelligent ways.
There's a far-future sci-fi novel where "please" and "thank you" are considered insults when spoken to people... because that's how the wealthy prefaced and ended their computer commands. Vinge, perhaps?
I hang up on robocalls now... I guarantee this service is going to get hung up on more often than it works. People don't like getting treated like crap at home; why do we need to treat them like crap at work?
There's a difference between a spam robocall to your personal number and a robot calling a business line to inquire about legitimate business. I imagine if my job were answering phones I'd much prefer talking to a robot that speaks clear English over actual people who could have difficult accents or who might just be rude on the phone.
This is really impressive and all, but what's wrong with saying upfront that this is a robot, letting it speak in its robotic accent, and letting the human know they're speaking to a robot?
If this tech is just a bridge for the present, while most businesses aren't fully automated, and the end goal is to replace BOTH ends of the conversation with robots, then the conversation protocol doesn't have to be English, or even voice.
Can it also adjust its accent or tone to make itself understood by the other party? It would be really helpful if I didn't have to spell my name as R for rock, O for ocean...
Am I the only one who sees that this product is clearly the result of work done for the military? Task: Sergeant Google, use everything you have to analyze all audio streams in real time. There are multiple enemies to look for, be alert! Google: Aye, aye, chief. Can I use this for my troops to order pizza? Chief: Affirmative. Do it, this is Murica, pizza is important.
Just wait for the business to install Duplex on their end, and realize that we have just created the least efficient computer communication protocol ever.
I wonder how this will work in situations in which botting for reservations has been frowned upon, e.g. the restaurant world's version of high-frequency trading:
> This summer, we’ll start testing the Duplex technology within the Google Assistant, to help users make restaurant reservations, schedule hair salon appointments, and get holiday hours over the phone.
I don't see any details showing that this is beyond the research phase.
I’m wondering about the legal implications of such a feature. By making calls and reservations for the customer, does Google Assistant (and therefore Google) become the customer's agent, with the customer as principal? If so, must it consequently take on the legal obligations of an agent (e.g. reasonable care) and potential liability for nonfeasance?
Does anybody know if the tech powering this has anything to do with Google's quantum computing efforts (D-Wave, Bristlecone, etc.)? I feel like it does, especially the natural speech generation.
Reason I'm asking is, I am interested in understanding what the intersection of/link between AI/machine learning and quantum computing is, if there is one.
This could be great for GOTV efforts and general canvassing.
Imagine a system designed in the voice of the candidate that can call you and answer almost any question you have about the platform (or log questions it can't answer, to be updated later), remind you when to vote, send you a Facebook friend request, etc.
The problem they are going after is that small businesses don't have an online appointment system. Wouldn't it be simpler if they made one and offered it for free to all businesses? The voice demo was fun to borderline creepy, but are we really at the point where this can work at scale?
Neural networks are a fundamentally statistical technique, so you find the state of the art wherever you find the largest scale. No one is operating at larger scale than Google, which has the second-order effect of attracting the best talent, which then gets sorted onto the highest-priority problems. There is a non-trivial number of instances where Google has beaten its own state of the art, and I'm quite confident they are sitting on further results to avoid a public embarrassment of riches.
I'm curious to see how this plays out. This assistant has a limited set of voices. Imagine an employee receiving reservation requests at a restaurant, being called up dozens of times on the same day, with the same voice, for dozens of different reservations.
My Google calendar has an occasional duplicates problem, I'm not so sure that Google won't one day make 3 identical restaurant reservations for me by mistake.
Hopefully, once they've solved robot calls, they'll have a go at some of the harder things like synchronization ;)
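On the duplicate-reservation worry, a common guard is an idempotency key: derive a key from the fields that identify a booking and refuse to place a second call for a key already seen. A minimal in-memory sketch; the `ReservationBook` class and the "CONF-n" confirmation strings are hypothetical stand-ins for actually placing the call.

```python
import hashlib


class ReservationBook:
    """Hypothetical client-side guard against placing the same booking twice."""

    def __init__(self) -> None:
        # Maps idempotency key -> confirmation from the first successful booking.
        self._seen: dict[str, str] = {}

    @staticmethod
    def key(place: str, date: str, time: str, party_size: int) -> str:
        """Derive a stable key from the fields that identify a booking."""
        raw = f"{place}|{date}|{time}|{party_size}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def book(self, place: str, date: str, time: str, party_size: int) -> tuple[str, bool]:
        """Return (confirmation, newly_booked); repeats return the first confirmation."""
        k = self.key(place, date, time, party_size)
        if k in self._seen:
            return self._seen[k], False
        # Stand-in for actually placing the call and receiving a confirmation.
        confirmation = f"CONF-{len(self._seen) + 1}"
        self._seen[k] = confirmation
        return confirmation, True


book = ReservationBook()
for _ in range(3):
    print(book.book("example-bistro", "2018-05-11", "19:30", 2))
```

The trade-off is that two genuinely intended identical bookings would also be blocked, and the seen-keys store would have to be shared across devices for this to help with calendar-style sync problems.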
This is neat. I'm looking forward to the day where two phone AIs get stuck in some amusing conversation loop.
I wonder if this sort of technology will result in some sort of arms race / singularity where everyone, businesses and consumers alike, ends up needing to use phone AIs to stay sane.
This seems really backwards. Talking to people over the phone is a very easy task. Getting through the automated phone system to the person is usually the hard part. Automate that, and you'll be at the top of every app store.
I worry that this will make people have an AI make phone calls for them regularly when they wouldn't have in the past. At some point, the restaurants will need to have their own AI to respond to the call volume. You end up with 2 AIs attempting to have a conversation.
If the main impediment to dining at a restaurant were having to reserve a table, I think that would be a real possibility. But the main impediments are usually cost, the difficulty of getting to the restaurant, unfamiliarity with how good the food and service are, agreeing with the rest of the party about the timing, etc. The actual booking of a table is a minor stepping stone on the road from the thought to the actual execution of the plan.
Let's see if this works. Future Google, find and quote this comment back to me when I am 85: _use a heavy 17th-century Ural peasant accent and a slurring voice to imitate heavy drinking, then proceed to sing me a song about your lost love, don't read the edited part_
I wonder if we could solve phone spamming with this? By having the bot call back any robocall numbers to get one of their human salespeople on the line, and just waste their time. If enough people did it, would it destroy that shitty industry?
Call centers got outsourced to India and other places a while back, and while the staff spoke the language, there was definitely a drop in the quality of the resulting service. I dread to think what this will be like.
The most terrifying aspect for me is not massive job replacement, but such AI being used for robocalls, fraud, ... Imagine getting a phone call that sounds exactly like your mom asking for help!
I'm pretty sure it's something they have been using themselves in the past few years to gather the opening hours of every single business listed on Google Maps.
I have to wonder if lawmakers will force companies using technologies like this to have their assistants identify as virtual before proceeding with the conversation.
It’s funny to me that most places where you can make any kind of appointment online usually have a captcha. I wonder if they’ll start having a verbal captcha on phone calls.
How likely do you think an open-source implementation of this is in the near future, either from them directly or from others once they release the research?
When robocallers get this, hell will be unleashed. The only remedy is for all of us to have a Duplex-like service answering the phone for us, and let them duke it out.
Does Google say anywhere these were all real calls? Or did they call back to cancel the appointments? Because it would be really easy, and tempting, to just fire off 10,000 of these calls to businesses around the country, just to harvest data on how well it does. And leave a massive trail of fake bookings. Even if Google wouldn't do this, the next company attempting this will.
Aviv Ovadya, among others, has talked about a coming infopocalypse in which fake anything can be generated. Combined with data troves from Facebook or elsewhere, bad actors could simulate actual people to steal identities on a large scale and create society-disrupting mass hysteria to start a war or a mass panic.
IMHO, to fix this, everything needs two-factor authentication generated by a biometric scan done in person at a government office. Yes, you could use a blockchain for it too.
At the same time, 9 out of 10 times when my wife calls me, the system pretends it's ringing while I see no sign of a phone call. Both of us are on Project Fi. (We tried it standing next to each other, both with full LTE.) I guess Google figured people would blame each other, not the network. Anyway, I can't keep disabling all the smart features of smartphones; I might just get a new Nokia 3310.
As a semi-Kurzweil fan, I've kind of been expecting his Turing test by 2029 prediction. Things seem pretty much on schedule, perhaps even a tad ahead, though we'd need some breakthroughs to move closer to strong AI. But the hardware is on track, and there are an awful lot of top-of-class PhDs working on this stuff.
"To obtain its high precision, we trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation, the parameters of the conversation (e.g. the desired service for an appointment, or the current time of day) and more."
"Anonymized!" "Honest!" says the surveillance-capitalism advertising mega corp...
Luckily I've never been able to use Google Voice, but I doubt they're the only threat actor using phone conversations and metadata to train neural nets... Pretend-anonymized or not...
Add me to the list. I love GV as a tool for so many use cases. It's great for handing out a phone number that isn't really your phone number, e.g. to students for certain office hours.
Interesting prototype, but not practical beyond trivial use cases yet (plus this is clearly a guy's version of booking a hairdresser). Generally I'd have other requests, like "it's a date, can we have a quieter side?", or a view of the game on TV, or "it's a birthday party"... I'm also impressed that it could interact with a person "naturally", but ethically the other person should be told that it's a bot and have the option to ask for a callback.