Hacker News
Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone (googleblog.com)
1875 points by ivank on May 8, 2018 | 750 comments



The people losing their marbles over this being some kind of Turing-Test-passing dystopian stuff are missing the point of how limited this domain is.

People who answer phones to take bookings handle an extremely limited set of questions and responses; that's why they can even be replaced by dumb voice response systems in many cases.

In these cases, the human being answering the phone is themselves acting like a bot following a repetitive script.

Duplex seems trained against this corpus. The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.

Most people working in hair salons or restaurants are very busy with customers and don't want to handle these calls, so I think the reverse of this Duplex system, a more natural voice booking system for small businesses, would immensely help free up their workers to focus on customers.


> The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.

And looking even further into the future, we can imagine a day when the computers forgo natural speech and use a better-suited form of communication. Some kind of sequence of ones and zeros transmitted directly across the wire.


Lol, but if you think about it, what stops businesses from doing this today?

It's the lack of a universal API.

If a barber shop wants to make it possible for a 3rd party app to book appointments then they have to release some API. But that's not the end of it. The 3rd party app has to first discover their API, someone has to understand it and write code to use it, and then deploy that code.

This is a problem today because there is no universal API that all services can use.

With Duplex, verbal speech becomes a universal API that every service can parse and use to communicate with each other. Also, discoverability is taken care of by using publicly cataloged phone numbers on services like Google Maps, Yelp, etc.
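To make the pre-Duplex integration burden concrete, here's a minimal sketch of what hand-coding one shop's bespoke API looks like; every endpoint and field name below is made up for illustration, and a different shop's API would need different code:

```python
import json

# Hypothetical endpoint and schema for ONE barber shop's bespoke API.
# A third-party app has to discover and hand-code this for every single business.
def build_booking_request(name, service, when_iso):
    """Serialize a booking for the (made-up) /v1/appointments endpoint."""
    return {
        "url": "https://example-barber.test/v1/appointments",
        "method": "POST",
        "body": json.dumps({"customer": name,
                            "service": service,
                            "time": when_iso}),
    }

req = build_booking_request("Alice", "haircut", "2018-05-10T14:00:00")
print(req["method"], req["url"])
```

Multiply that by every salon and restaurant in town and the appeal of a single pre-existing "API" (the phone line) becomes obvious.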


May 2001: “XML: the universal language?” https://www.computerweekly.com/feature/XML-the-universal-lan...

I recall a Wired article from the same era. “XML means your doctor’s system can just talk to the hospital system even though they’re different!”

Hasn’t happened yet... will it? Can it?


> Hasn’t happened yet... will it? Can it?

Nope. XML (or JSON, etc.) is just a "human-readable" presentation of data. It does not provide any semantics whatsoever.

So you need some semantics on top of these data. And a general-purpose, universal API is yet to be invented (hint: it is probably not feasible).


Microsoft and other enterprise vendors loved that end goal back in the early 2000s: the universal API solved by middleware. That's why you had BizTalk and BizTalk consultants who made more than SAP consultants (think today's crazy Salesfarce consultants who compete for gamification badges). For example, you could be a small insurance company submitting to a larger underwriter, and when you work out the transactions per month you have to take $5 off each app just to pay for BizTalk infrastructure and licensing. People rode that gravy train hard. I'd be surprised if any of the BizTalk shit still remained, though; grand goals mean juicy enterprise sales. Oracle had a similarly crap product that was equally slow, painful and verbose; can't recall the name. XML and its lofty goals beyond what it actually was can be compared to today's toxic ICO industry, though that's no reflection on XML itself.


In the 80's, it was EDI -- electronic data interchange, a set of schemes for sending binary formatted business data, like invoices and POs.


Don't forget HL7


The "badges" are not won in competitions, but are instead watered-down tutorial gold stars. Instead of targeting real programmers, Salesforce built "Trailhead" for John in finance who wrote some Excel macros and decided he should become a Salesforce "developer". That's why the small percentage of us consultants who actually come from a CS background can charge so much.


I think I agree with your sentiment, but I'd think you'd be surprised how much a click next salesfarce developer charges...


Of course it's feasible. It's called English!


Quoi? ("...fetchez la vache!!!")


I read a great article about XML that spelled out that XML isn't a "language" or protocol, it's an "alphabet". It gives you the building blocks, but what you build with them is up to you; it doesn't translate from one language to another. (I guess that's what XSLT is meant to do, but it's still not magic.)


JSON? And no doubt in another ten years, something else!


Luckily, the English language never changes. Wait.


But compared to the rate of change in technology, it's practically written in stone!


Much of the "semantic web" work was directed at similar things. The reason we don't have it already after 20 years of web commerce is that it's an adversarial process. The business wants to mislead, upsell, or discourage a customer from asking for support; and likewise some small fraction of customers are looking to exploit the business.

Bots that automate UI tend to get banned.


English, an API with over 1.5 billion clients in the wild


If you think about it, conversation is just a loosely defined API with an extremely high degree of tolerance for poorly formed input.


I mean we are all just emotional processing silos that trade information and process it right :) ?


That’s a very Wittgensteinian view.


This thread now contains a loop. Granted, it's a really long loop, all the way back to 1956, but it's a loop all right: you've essentially summarized the Dartmouth Conference and its assumption of "gee, that ought to be just a minor research subject - easy, right?"

https://en.wikipedia.org/wiki/History_of_artificial_intellig...


If it was that straightforward, we’d already have bots that pass the Turing Test.

Whatever natural language is, it’s not an API. Might have some overlap, but it’s different.


Natural language is not an API. But a business can assume (or weed out quickly) that a caller is calling to hit a specific API endpoint, and as such their language will eventually lead to one of those endpoints.

And similarly, a caller can assume the business they are calling is trying to lead them to an endpoint.

In both cases, a set of assumptions lets natural language act as an API, even if neither end could pass a Turing test with someone who wasn't interested in any of the endpoints.

edit: I read your point about language changing, and that is true. But if we only have machines using the language with each other (no more training with humans), we can also assume the language won't change.
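That "language funnels to an endpoint" assumption can be sketched very crudely as keyword routing; the endpoints and keywords here are invented for illustration, and a real system would obviously need much more than this:

```python
# Sketch: a limited-domain phone line can treat free-form speech as a
# router to a handful of "endpoints". Keywords are illustrative only.
ENDPOINTS = {
    "book":   ("reserve", "appointment", "book", "table"),
    "hours":  ("open", "close", "hours"),
    "cancel": ("cancel", "reschedule"),
}

def route(utterance):
    words = utterance.lower().split()
    for endpoint, keywords in ENDPOINTS.items():
        if any(k in words for k in keywords):
            return endpoint
    return "fallback_to_human"   # neither end needs to pass a Turing test

print(route("Hi, I'd like to book a table for four"))  # -> book
print(route("What are your hours on Sunday?"))         # -> hours
print(route("Do you like poetry?"))                    # -> fallback_to_human
```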


Nobody said it's easy to implement :)

I'm curious - what differences do you have in mind?


APIs map to a single canonical concept. A "class" in Java actually has a correct definition. There are a nontrivial number of philosophers who believe that this isn't true for language.

To simplify that and put it in more technical terms: an API is prescriptive and a language is descriptive.

If a bunch of coders decide to start capitalizing "Class", their code won't compile. If enough people start using the word "ain't", it becomes a word, regardless of what the dictionary says (see "irregardless"). There is no single authority that can decide what is and isn't a canonical definition.

This is why spoken languages evolve so much. Even languages where we've explicitly tried to go the opposite direction (like Esperanto) have evolved into multiple dialects, where subsets of the community simply ignore the standards and still communicate with each other just fine.

Note that this is the opposite of what you want with a federated, universal API. The whole point of an API is to standardize between unfamiliar devices. Language is actually pretty bad at standardizing communication between unfamiliar people. Even in the US, different regions and communities use different euphemisms, terms, and definitions.


You're right that the language is not the API, it is the medium. The "appointment booking API" exists only in our heads, it is a set of mutually understood conventions for interpreting a subset of language in order to create a record of the appointment somewhere.

If you call up a hair salon and start reciting poetry you'll get an error response in the same way as you would if you had sent malformed JSON to an endpoint. If you stick to the expected script you'll achieve success almost all of the time.

I wouldn't be surprised if a majority of human communication works this way, especially when it involves individuals who do not know each other. We have agreed upon limits, key phrases and words, and expected responses that allow most of the unpredictable stuff to be ruled out. All of that favours automation.
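A toy sketch of that idea, with an invented "expected script" encoded as a regex; anything off-script (poetry included) gets the conversational equivalent of an HTTP 400:

```python
import re

# The "appointment booking API that exists only in our heads": input that
# follows the expected script parses into a record; anything else errors.
SCRIPT = re.compile(
    r"table for (\d+) (?:people )?at (\d{1,2}(?::\d{2})?\s*(?:am|pm))",
    re.I,
)

def take_booking(utterance):
    m = SCRIPT.search(utterance)
    if not m:
        return {"error": "Sorry, could you repeat that?"}  # malformed input
    return {"party_size": int(m.group(1)), "time": m.group(2)}

print(take_booking("Hi, a table for 4 at 7pm please"))
print(take_booking("Shall I compare thee to a summer's day?"))
```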


> We have agreed upon limits, key phrases and words, and expected responses that allow most of the unpredictable stuff to be ruled out. All of that favours automation.

I'm not sure I'd disagree, but it seems you're just describing a domain specific language in a more roundabout way. We have a ton of protocols that introduce a set of limits, key phrases, words, and responses. Java, Network protocols, XML, JSON, etc...

Assuming that you're correct, does it make sense to then assume that it'll be an improvement to standardize English rather than a set of IP headers? The agreed upon standards in English are (for the most part) informal, evolve constantly, and are hard to teach to computers. Automation favors predictability, and even the most generous interpretation of a natural language leaves me feeling like it's a step backwards.

We're going to standardize on an appointment booking API that exists only in our heads, that can only be taught using ML, and that is guaranteed to change over time in unpredictable ways? That seems wrong to me.


Just to add a fun example to demonstrate your point: the original name for Esperanto wasn't "Esperanto", it was (translated) "the international language". It was given the nickname "Esperanto" after the chosen name of the creator, and the word itself is supposed to mean "one who hopes".


Pedantry ahead: the various implementations of (J(ava)?|ECMA)Script tend to disagree with the point on pREscription and non-divergence (yay 20 MB of stubs and libraries for sort-of-coherent behavior). Moreover, the example of "Class" is just a pre-compile check convention, the compiler doesn't care.

(Postel w/r/t API - Does that make any sense? Or is that a fancy way to say DWIM?)


:) I don't think that's pedantic, it's a really good point to bring up.

On the web side of things, the W3C often describes their role as being partially descriptive.

From their doc on the Web of Things[0]: "The Web of Things is descriptive, not prescriptive, and so is generally designed to support the security models and mechanisms of the systems it describes, not introduce new ones... while we provide examples and recommendations based on the best available practices in the industry, this document contains informative statements only."

This is exactly for the reason you mention - if browsers collectively decide to go in a different direction, what the W3C says doesn't matter. The web standard is what the browsers do.

However, two things to keep in mind:

Even where browsers are concerned, there is still an API and a canonical version of "correct" for each browser. What we're trying to do is get those APIs to be compatible and consistent with each other.

Many people believe that language even on an individual level doesn't directly map to an actual reality; in web standards that would be like the browsers themselves not having their own consistent API.

But assume those people are wrong for a sec. Let's assume that language is just a standardization problem between different communities and individuals. Well, the W3C should teach us that even in the realm of computing, standardization is really stinking hard.

So even in that scenario, we have to ask whether standardization becomes easier or harder when every single individual in a community has the ability to change norms or introduce more language. We can't even get 3-4 browser manufacturers to agree on a single API, now imagine if every single hair salon owner could increase divergence whenever they wanted just by answering phone calls differently.

[0]: https://www.w3.org/TR/wot-security/


True, many people know English, but that doesn't mean they know statistics or some other complex domain. "Show me a k-means clustering of this dataset" is likely to be parsed but not understood by many English speakers.


Fortunately, Duplex's purpose is booking salon appointments, not to provide a conversation partner about statistical theory.


I was attempting to communicate to the parent and grandparent that transcribing/parsing English and communicating are different. We both need to share the model being spoken of to transfer or communicate the knowledge.


Yep, English is the protocol (like HTTP), and the domain in question (for example, booking an appointment) defines the (very loose) API.


My god no one in this thread has a clue about the complexities of natural language. English is not a protocol like HTTP, it's a natural language.


I think you may be confused. I suspect everyone in the thread knows that English is a natural language.


Optional functionality :P


word


In a sense you are correct, but remember that Duplex only works because it is limited to a very strict, well-specified domain: scheduling for restaurants and hair salons. That is, Duplex only works because there already is a de facto specification for these transactions. It would require a lot less effort overall to just formally specify this de facto standard and deploy it once and for all, but of course a top-down approach rarely works. This actually reminds me of the failure of the industry to adopt semantic web technologies. Many organizations are providing the same services and could easily adopt a common “API.”


> Many organizations are providing the same services and could easily adopt a common “API.”

It's not in the interest of those organizations to adopt a common API. Everyone wants to suck in data and be the platform; nobody wants to give data away.


But that's the reason why Duplex is interesting, and why there's a grain of truth in what ZainRiz is saying:

Humans have settled on a de facto API for scheduling appointments. It uses the telephone as its interface, speech as its medium, and Duplex is exploiting it.


...but Google is one of the giants that could pull that off and become The Platform.


You first need "ye old barber shop" and "big corp barber shop" to all agree on what this common API should be. That's the old standards proliferation problem https://xkcd.com/927/

Which is what makes it very hard to define a new common API

However, they all already agree on the standard for natural language communication (in the context of a strict, well defined domain). That's the pre-existing common API which Duplex is using


It's basically what 'pjc50 said - it's not in the interest of businesses to expose their data via a common API.

WRT using English as a universal API, I think this is just dumb. You solve exactly zero problems by going that route, because the actual problems to solve (beyond no incentive for businesses to care) are exactly the same as you have with XML APIs, or any other APIs. The problems of discoverability and machine understanding are something the Semantic Web space has been dealing with for quite a while, and other people before that. Adding natural language to the mix only makes the job significantly more difficult, because you now have to deal with natural language parsing/understanding.


> It's the lack of a universal API.

This sort of thing is exactly why the healthcare industry still uses faxes, even going electronic charts -> pdf -> fax -> pdf -> electronic charts in some cases.


This is even more fun because in the modern age, what very often ends up happening is:

electronic charts -> pdf -> fax -> fax machine as a service -> unsecured email -> pdf -> electronic charts

Compliance can sometimes help, but ultimately the data needs to flow, and people will do whatever it takes to make that happen. Until security is so easy that it's the default, these little loopholes will continue to be abused.


Phaxio co-founder here. We do a _ton_ of healthcare faxing and we're starting to see a shift away from the "unsecure email" in applications. Granted, we can't see what our users are doing at all times, but being HIPAA compliant ourselves, we often work with our users to understand their systems and guide them towards compliance.

>> Until security is so easy that it's the default, these little loopholes will continue to be abused.

The simple way to think about this is that the government is more worried about unsecure email/email spoofing than it is about wiretapping.


To be fair, you’ll notice if 150 million faxes start going off, rather than someone abusing your API.


Healthcare uses faxes mostly because HIPAA rules particular to format and security of electronic communications don't apply to faxes; it's a compliance hack.


That sounds a bit too juicy to be true. Any citations?


I've literally been in the room when legal and compliance offices gave the advice on both the construction of the relevant regulations and the industry practices on which a payer relied in deciding to use a process that created paper documents and then faxed them for certain purposes, but, no, there's nothing published I can link to as to that being the reason industry players make that decision.

I can, however, point you to the relevant section of HIPAA regulations on which it rests, the definition of “electronic media” at 45 CFR § 160.103, specifically this bit: “Certain transmissions, including of paper, via facsimile, and of voice, via telephone, are not considered to be transmissions via electronic media if the information being exchanged did not exist in electronic form immediately before the transmission.”


And in a way it is justified: you need a warrant to wiretap a phone line, but there's no such constraint on eavesdropping on TCP/IP communication.


> if the information being exchanged did not exist in electronic form immediately before the transmission

So you need to print them out before faxing? PDF->Fax wouldn't work with that definition.


One or both speakers could use a handshake noise at the start of the call to tell the receiver that it's capable of "speaking" a modem protocol. It might change a little every time, or be of an especially low or high frequency, so that a person doesn't realize they're talking with a computer. After handshaking, the receiver could send a URL that would allow the channel to be upgraded to the Internet... or not. English is a good fallback if both parties speak it and you can't find a more efficient channel.
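A toy model of that negotiation; the signature frequencies here are purely illustrative, not any real standard:

```python
# Sketch of an in-band handshake: the caller opens with a short tone burst,
# and if the callee recognizes the (made-up) signature frequencies, both
# sides abandon speech for a data channel; otherwise fall back to English.
HANDSHAKE_SIGNATURE = (2100, 1300, 2100)  # Hz, illustrative values only

def negotiate(observed_tones):
    if tuple(observed_tones[:3]) == HANDSHAKE_SIGNATURE:
        return "upgrade: send a URL, continue over the Internet"
    return "fallback: continue in spoken English"

print(negotiate([2100, 1300, 2100, 800]))  # a machine answered
print(negotiate([220, 310, 180]))          # probably a human voice
```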


Revenge of Cap'n'Crunch :D


Now that you mention it, this super advanced AI project sounds like a failure of software standardization rather than a triumph of technology, lol.


> The 3rd party app has to first discover their API, someone has to understand it and write code to use it, and then deploy that code.

I think this would be a better integration point for AI. It could look at the fields and learn to fill them out automatically (name, age) and prompt the user for anything missing. Then instead of the barber shop needing a universal AI users just need their personal AI (or a script) to interact with the API.
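As a rough sketch of that personal-agent idea (field names and profile data invented), it reduces to fill-what-you-know, ask-for-the-rest:

```python
# Sketch: a personal agent fills known API/form fields from the user's
# profile and prompts only for what's missing. All names are hypothetical.
PROFILE = {"name": "Alice Doe", "phone": "555-0100"}

def fill_form(required_fields, profile):
    filled, missing = {}, []
    for field in required_fields:
        if field in profile:
            filled[field] = profile[field]
        else:
            missing.append(field)  # only these get surfaced to the user
    return filled, missing

filled, missing = fill_form(["name", "phone", "preferred_stylist"], PROFILE)
print(filled)
print("prompt user for:", missing)
```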


Re businesses doing this today: I believe WeChat handles this very well in China.


It reminds me of the maybe apocryphal story how NASA invested time, money, and effort to develop a pen so astronauts could write in zero gravity, and the Soviets used pencils.


A great story, but you are correct about it being apocryphal: https://www.snopes.com/fact-check/the-write-stuff/


And one internet binge later, I am now the proud owner of a matte black Fisher 400B Space Bullet Space Pen. Cool story. (The real one, I mean.)


Sometimes the simple, old, and reliable tool isn't the best one for your environment.


Both used pencils initially, and both switched to this same pen.


Pencils have the disadvantage of putting out conductive graphite dust, which, in 0g, will float around for a very long time.


"Lol, but if you think about it, what stops businesses from doing this today?"

Systems like that are much more expensive than paying a receptionist?


That doesn't sound right. I think you'd pay something like $99 per month for a SaaS product that manages bookings and provides an API. That's about how much the average receptionist earns in one day ($12.25 per hour).


Things that you appear to be missing from your post (but probably totally already know and are just not bothering to mention):

* Most businesses already have a receptionist.

* Most receptionists do not spend measurable fractions of their day answering phonecalls asking when the business is open.

* Taking bookings is really also not the majority of their day.

* Receptionists are capable of a bunch of things that your SaaS booking program is not. Like ordering catering and picking up office supplies.

* A SaaS booking program that is looked over by a human doesn't have to have AI-systems, because they just have I-systems. A human receptionist.

* The inevitable job-post catchall "Other duties as required."

* I had a receptionist bring me a beer once while I was waiting and I'm pretty sure none of your SaaS solutions will do that.


And $20,000 to get someone who knows how to install such a system to do the job.


What is installed when using a SaaS solution?


Well there is initial setup at the very least. Hooking up whatever landline the client has with the SaaS solution.

Then whenever a significant change is required, you need to call back your "expert". New location? $Xk, etc.

But I think the biggest concern is that suddenly the owner no longer understands how his reservation system works. He used to be able to call Joe and know what's going on...


Also, whatever the SaaS system is, I bet you it's significantly less efficient than a receptionist using paper and desktop software.

Web systems are almost never built for productivity.


It is customised for each business model. Some are the same, some are different.

The issue isn't a universal API, but universal data models, which are probably impossible.


Someone needs to maintain that API; update it with the schedule of the stylists or update it for holiday hours, or remove availability for bookings done the old fashioned way.


> This is a problem today because there is no universal Api that all services can use

> It's the lack of a universal API.

I disagree with that. We already have universal APIs. Adopting a newly established universal API is far more painful and has a slower adoption rate than using an existing, globally-reachable one like the telephone. Duplex-like systems address a broader scope of computer verbal communication, and it feels like a step in the right direction.


What's a universal API that everyone agrees on?

It's the old Standards Proliferation problem: https://xkcd.com/927/


If both sides were to use Duplex, it would already know to just send 1010 instead of verbal communications.

Also, if it was unknown whether the other party was a bot, a bot could first send a common test according to some protocol, asking via some kind of sound whether the other one was a bot. In which case both would start sending machine-readable information to each other.


Sounds a lot like a dialup handshake.


Or just hang up automatically.


If I remember correctly, CORBA was all about standardizing APIs. You'd have your distributed "CORBA Objects" whose methods anyone who knew the unique object ID could call. The idea was that there would be a "standard library" for each industry. So all the barbers would implement a standard library "Appointment Schedule" object, all the exchanges would implement a stdlib "Orderbook" object, etc.

CORBA would generate RPC stub objects for you in various OOP languages, and potentially automate discovery, so you could say: give me an array of all the orderbooks of all the bitcoin exchanges, and ask each for the last price.
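In modern terms, the per-industry "standard library" object might be sketched as a shared interface; this Python analogy (all names invented) stands in for the CORBA IDL plus its generated stubs:

```python
from abc import ABC, abstractmethod

# Rough analogy for a per-industry CORBA "standard library" object: every
# barber shop implements the same interface, so clients need no bespoke
# integration code, only the shared contract.
class AppointmentSchedule(ABC):
    @abstractmethod
    def available_slots(self, date): ...

    @abstractmethod
    def book(self, slot, customer): ...

class YeOldeBarberShop(AppointmentSchedule):
    def available_slots(self, date):
        return [f"{date} 10:00", f"{date} 14:30"]

    def book(self, slot, customer):
        return f"confirmed {slot} for {customer}"

shop = YeOldeBarberShop()
print(shop.book(shop.available_slots("2018-05-10")[0], "Alice"))
```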


Lol imagine my phone talking with an automated customer service line. Two machines, talking to one another, not using any of the existing protocols. My phone would have a database of questions to ask, form it into an English sentence, run it through a text-to-speech, transmit this to the other phone. Their "phone" would run a speech-to-text, run NLP, match it with its own database, and do the whole thing again in the opposite direction.

This gives a whole new meaning to "all of UI/UX is basically prettifying database queries".


"all of UI/UX is basically prettifying database queries" - are you saying that generally speaking, or is that a reference to an article or articles?

Would love to read more about it if so-


Pretty sure this is just the idea behind MVC.


me too.


Potentially there would be a way for both sides to recognize that the other side is a machine and they could switch to binary communication.


Binary communication isn't useful by itself. You need a compatible protocol and API.

I can send 010101010101101011111 to a machine all I want; if it doesn't know how it's formatted, it's useless.

Conversational English is a format.


> Conversational English is a format.

And as you said, format isn't enough. You need semantics.

If two applications know enough about the other side to know how to formulate their voice queries, they know at least enough to exchange those same queries as text, and skip the stupidly wasteful text->speech->text process.

(And if world wouldn't be so full of adversarial practices driving engineering stupidity, the developers would agree on an efficient binary format beforehand.)


The simplest "API" would skip the text-to-speech-to-text process and send the text string in binary.
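A back-of-the-envelope comparison shows how much the detour through audio costs, assuming telephone-quality 8 kHz / 8-bit audio and roughly two seconds to speak the sentence (both figures illustrative):

```python
sentence = "I'd like to book a table for four at seven pm"

text_bytes = len(sentence.encode("utf-8"))

# Telephone-quality audio: 8000 samples/s at 8 bits/sample (G.711-like),
# assuming the sentence takes about 2 seconds to speak aloud.
audio_bytes = 8000 * 1 * 2

print(f"text:  {text_bytes} bytes")
print(f"audio: {audio_bytes} bytes (~{audio_bytes // text_bytes}x larger)")
```

And that's before counting the compute spent on speech synthesis and recognition at each end.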


Well they could at least continue doing the same thing, at 100x the speed.


Over the phone? Voice recognition would fail even more. Google also mentioned some of the latency in answering is required by their processing.


Does this sound familiar? ;)

https://www.youtube.com/watch?v=vvr9AMWEU-c

In other words, if M2M handshake works, switch away from voice.


A large part of introducing new tech is enabling the transition from existing tech. Maybe if we scrapped all old cars and allowed only autonomous vehicles we’d be sleeping at 120 km/h next week. But some businesses still run on COBOL.


There are already apps/technologies that transmit information through audio at frequencies not audible to humans. It should be trivial to adapt this so that if two AI systems are interacting they can perform an "AI-handshake" in the audio at the start and then switch to a more efficient form of communication.


but audio telephony equipment generally assumes the signal is in the human vocal range, right?


Correct. There are several levels at which this applies:

Phone hardware (microphones, speakers) are only calibrated to detect 'useful' frequencies for human speech.

The sampling rate used by audio codecs tends to cut off _before_ the human ear's limits, e.g. at 8kHz or 16kHz. They aren't even trying to reproduce everything the ear can detect; just human speech to decent quality.

Codecs are optimized to make human speech intelligible. The person listening to you on the phone isn't receiving a complete waveform for the recorded frequency range. The signal has been compressed to reduce the bandwidth required, and the goal isn't lossless compression; it's decent-quality speech after decompression.

It's completely possible to play tones alongside speech that we won't notice, but in the general case, not tones that the human ear can't detect.
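The cutoff is easy to quantify with the Nyquist limit: a codec sampling at 8 kHz can carry nothing above 4 kHz, far below the ~20 kHz edge of human hearing, so a truly ultrasonic tone never reaches the far end:

```python
def max_carried_frequency(sample_rate_hz):
    """Nyquist limit: the highest frequency a sampled signal can represent."""
    return sample_rate_hz / 2

for codec, rate in [("narrowband (G.711-style)", 8000), ("wideband", 16000)]:
    print(f"{codec}: {rate} Hz sampling -> up to "
          f"{max_carried_frequency(rate):.0f} Hz")

# An 18 kHz "inaudible" tone would need at least 36 kHz sampling:
print("18 kHz tone survives a narrowband call:",
      18000 < max_carried_frequency(8000))
```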


Google Duplex is just a backwards compatibility shim...


Backwards compatibility is not bad, quite the opposite. There are far too many dead data silos, abandoned for a new, shiny and incompatible protocol.


The person who invents a way to replace the little green phone icon with a little human icon, for when you want to talk to a human, will be a zillionaire.

But that is the far future. Realistically, I just don't see this as feasible any time soon.


This doesn't even need to happen on the voice connection. A register of "does this number map to a known system" would be enough. Then it's just up to common APIs.


The issue here is interfacing with ancient telephone systems.

If the two bots were to slip in some subliminal beeps and boops to recognize each other, then they could switch their speech to very quick binary communication.


You mean AT commands... sounds familiar.


You can't just drop compatibility. We will have A.I.-trained voice systems that mimic natural speech just enough to be understood by Duplex while compressing the exchange to a minimum. Data transfer will be measured in microwords per second. Future versions of Duplex will of course detect this kind of compressed speech and reply in kind, falling back to normal speech only if the immediate response is something like North American confusion.


...somewhere, Kevin Mitnick just smiled.

I like to think it was a smile of renewed relevance due to unbelievably poor technical decisions.


This really made me laugh. You are also totally right, we have better solutions to this problem already.


I can already hear the modem handshake playing in my head...


That form of communication exists. It's called REST.


You mean the entire TCP stack and network-to-network communication?


To save CPU/GPU/TPU cycles, there should be a high-frequency sound that people can't hear, so that computers talking to each other can switch to a faster way to communicate. If this is included, you also have a way to detect whether you are talking to a bot/Duplex.


Don't most carriers heavily "compress" the sound, removing all sounds/frequencies that a human can't hear, etc.? https://www.youtube.com/watch?v=w2A8q3XIhu0


Yes, but it could be very subtle and low-bandwidth at first, and once both sides were convinced the other was a machine, switch to a full-speed screeching 56k modem [1].

Or just communicate "hey actually connect to this HTTP/XMPP/whatever address on the internet and we'll continue this from there"

1. Probably a bit slower, I've heard modern VoIP lines don't work well with traditional modems?


Damn, that sounds even more dystopian. Can you imagine it:

"Hello, how can I help you? - Hi, beep, I'd like to reserve a table? - Ok, beep, beep, one second - Mhm-mm beep, beep, screech, 011000101010...."



Also a great story along very similar lines:

http://archive.today/txrAd

(Archive link because that blog now requires authorization to view for some reason.)


This made my day


This is basically what 56k modems do, isn't it?



The whole context of Google introducing this functionality was for the 60% of businesses that don't yet have an online presence at all.


Sure, but they should consider the future. That number will only get smaller, especially if Duplex or other services say "we'll handle all your phone and online bookings for you for $SMALL_FEE, and still forward other inquiries to your phone as before".


[smallprint]...and we'll abruptly shut it down in 18 months.[/smallprint]


So just put it in the audible spectrum. Phones already make all kinds of sounds that no one under the age of 35 has any clue what they mean or why they are needed, and frankly they aren't.


Perfect use case for the endangered fax answering sound.


beep boop.


Yes, and lots of sounds that the human ear can hear but are not used to decode speech. Also the audio is frequently recoded as calls pass from infrastructure to infrastructure.

Good times !


You’re totally right.

However, while this is useful to bootstrap a new technology rollout, 10 years on it's just technical debt.

The amount of tech debt in the system behind credit cards is crazy, because originally charges were phoned in to the card issuer manually, and everything from then on - magstripe, chip & PIN, online-only transactions, etc. - has been built on top, and the leaky abstractions show through in daily difficulties with the card system for end users, like lack of real-time balance (in some cases), lack of transaction metadata, etc.


On the other hand, the credit card system's backcompat does mean that you can still accept credit cards when the power's out. You just write down the number (or use an imprint machine) and let the customer go. And the semantics of credit mean that you can still make that charge even if an online transaction would have resulted in a decline—offline transactions are never declined, they just cause overdrafts.


I wonder if that resilience is worth the immense amount of infrastructure and engineering that is spent on maintenance of the technical debt. Does that maintenance drive up processing fees? I suspect it does, but not in any amount sufficient to explain the size of those fees.


Yeah, but around here, most places just say the system is down and require cash.


True! Although I think that's more of a byproduct, rather than something designed into the system at the moment, and I suspect we could do better with designing it. For example, I doubt many shops have those imprint machines any more.

I also think the tech debt is holding us back a long way. For example, why can't I see itemised receipts in my card statement? Paper receipts are on their way out, email receipts aren't linked to anything or structured data, but being able to see that I've spent $120 on shipping with Amazon in the last 12 months, so a Prime subscription would make sense, would be a great sort of financial tool to have. That isn't possible in the card network at the moment.


The imprint "machines" are still issued - but all of them are just tossed away into storage and never, ever used. (Training users on those? Pointless.)


Instead of a high-frequency tone, just watermark the call itself: the background static, the voice samples, or even the speech patterns. All you really need is something like 30 bits of data to identify a call as a Duplex call with very high probability, and I'm certain you can find a way to imprint that many bits into the frequency spectrum of your background noise.
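For a flavor of what the "30 bits in the spectrum" idea could look like: a toy sketch (entirely hypothetical, nothing Google has described) that hides a 30-bit identifier as faint sine tones at fixed bin frequencies, and recovers it with the Goertzel algorithm:

```python
import math

SAMPLE_RATE = 8000          # typical narrowband telephony rate
FRAME = 4000                # half a second of audio
BIT_FREQS = [1000 + 60 * i for i in range(30)]  # 30 carrier frequencies (illustrative)

def embed_watermark(samples, bits, amplitude=0.002):
    """Add a faint sine at each carrier frequency whose bit is 1."""
    out = list(samples)
    for freq, bit in zip(BIT_FREQS, bits):
        if not bit:
            continue
        for n in range(len(out)):
            out[n] += amplitude * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
    return out

def goertzel_power(samples, freq):
    """Signal power at one frequency (Goertzel algorithm)."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def extract_watermark(samples, threshold=1.0):
    return [1 if goertzel_power(samples, f) > threshold else 0 for f in BIT_FREQS]

# Round-trip demo over silence; the carriers complete whole cycles per frame,
# so the bins are orthogonal and decode cleanly.
payload = [1, 0] * 15                       # a 30-bit identifier
marked = embed_watermark([0.0] * FRAME, payload)
assert extract_watermark(marked) == payload
```

In a real call you'd need a noise-robust threshold, error correction, and carriers that survive the codec - this only shows that 30 bits fit comfortably in half a second of audio.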


I like this. So basically the old-school modem sound, but in a frequency that can't be heard. It would only take a fraction of a second to send out the feeler, and it would not be noticed if a live human picked up. It could even detect a human and send the call over to a live representative without anyone noticing.


It doesn't have to be out of frequency (since that's probably filtered anyway), could be just a really quick burst handshake identifier which could encode an IP address to communicate over instead of a crummy phone line.

Duplex: <beep beep> (I'm available to chat)

Other bot: <boop boop> (Oh hai! Wanna get intimate?)

Duplex: <blaaaaaaaart> (Come find me on duplex://64.233.160.0)

<insert hack attacks and other nonsense here>


<hacker voice="1">I'm in.</hacker>


As an end user, picking up the phone to hear a beep is not pleasant. I'm likely as not to immediately hang up, as I've come to associate beeps at the start of calls with scammers.


What about if the caller makes no such sounds and the recipient makes the offer to handshake?

Anyways, this aspect is more amusing to just think about than anything else. That said, I really hope companies who produce these next-gen AI robo-callers actually have the courtesy of identifying themselves as such. I want to know if I am talking to a human or Duplex. Yes, I may hang up, but I feel uncomfortable being fooled into thinking I am talking to a human when I am not.


There's no reason why it can't be encoded as elevator music - you already hear it all the time, they might even throw in a looping "Thank you for calling. Your call is important to us" to keep you from freaking out.


Hence the idea of doing it in a frequency humans can't hear.


Hence the:

> (since that's probably filtered anyway)

Phone lines are optimized for frequencies humans can hear, though I'm guessing you could get enough bandwidth out of the edges to convince the other side you're a machine without bothering a human too much.


No, let's make it fun. "How can I help you?" "Get off the phone you damn bot!" "Why I oughta...<modem sounds>"


Did we just reinvent dialup?


Human-readable modem handshake.


I want to hear a Duplex voice giving verbal AT commands.


+++ATH0


Yes, it would be nice if, in parallel, Google came up with open machine-friendly protocols for each of the use cases Duplex supports, with a clear migration path away (e.g. businesses could start publishing the endpoints and protocols they support alongside their phone number, so you could skip the call completely).


Or a way to book via a website or app...


"Ba weep granna weep ninny bong"

It is the universal greeting for cybernetic organisms, after all.


Maybe just cut to the chase and do an API for reservations?


Why not just use a pattern of umms and ahs, which it already seems to add to its output? Easier for it to detect, and harder for a human to recognise.


The problem is that computers will now be interacting with people, and people will become unsure of whether they are talking to a computer or a person. It will create a bewildering world full of mistrust. I would argue that there should be a law proclaiming that computers must identify themselves.


They're not unaware of this concern. From the article:

> The Google Duplex technology is built to sound natural, to make the conversation experience comfortable. It’s important to us that users and businesses have a good experience with this service, and transparency is a key part of that. We want to be clear about the intent of the call so businesses understand the context. We’ll be experimenting with the right approach over the coming months.


So they indicate that they are aware of the problem - and instead of doing a straight-forward "hey, I'm a bot", their suggested strategies are "being clear about the intent of the call" and "experiment with the right approach over the coming months"?

To me that quote sounds more like a polite way of saying they definitely won't reveal to callers that they are talking with a bot than them taking the concern seriously.

Some of the conversation examples on the blog page where they invent a sort of story for the caller ("I'm calling for a client") would fit that theory.


People are prejudiced against talking to bots because of how bad they currently are. I despise calling services that have voice recognition; oftentimes it's easier to have an options menu.

If they can get a way of saying "I'm a bot" without people hanging up the calls, I'm all for it -- otherwise, "I'm calling for a client" or similar is the best for everyone involved (assuming everything works).

Businesses also need to have a way to report problems to Google, like if they are getting spammed by Duplex or want to opt out.


> People are prejudiced against talking to bots because of how bad they are currently

I'm prejudiced against talking to bots because they're bots. They don't have empathy, whereas from voice interaction I expect a human I can relate to and desire to help and be courteous with. It's a fundamentally different type of interaction and I will be annoyed anytime that one is confused for the other.


Well, bad news: all the (presumed) humans I've talked to on any scripted call are worse than bots: they do have empathy, but the script forbids them to use it. An out-and-out bot is free from this prison, at the very least.


How about something like "I'm an automated agent calling for a client" - correct, not misleading, but using terminology which isn't likely to be immediately disconnected _right now_.

Of course, if they screw it up, they'll burn that terminology too.


Yeah, something like "Hello, I'm Rogers' Assistant..." to fit current Google naming conventions would probably be fine.


From the samples I've heard, it's trying to sound natural. Hello Uncanny valley, long time no hear.


If we accept your premise that bots become so close to humans that they are virtually indistinguishable, then at that point does it matter who the "person" on the other end of the line is? I'd argue it doesn't because the outcome would be the same.


I think there will always be a difference. When I'm talking to someone, there's an emotional connection and responsiveness there. I'm trying to help a human out -- I'm putting effort into being polite, into considering their point of view etc.

If I found out that they were a robot (this is probably unpreventable; even if the technology gets amazing, surely there will be edge-case breakdowns/bugs/etc.), my trust is broken. That would have an emotional consequence e.g. frustration.

There will always be amazing technology wielded by awful developers, and in this case the outcome is emotionally hazardous. The impact of that is not easy to quantify e.g. by any economic indicators, but it's there.

Also, it's likely that robots will not be as polite back, so we're degrading society's trust and empathy all around. For example, Google's AI call to a restaurant was rude, and not for reasons it seems to yet understand.


What happens then if I'm a human that masquerades as a computer? Seems like a neat way of explaining away a number of social faux pas or drunk-dialling. "Me? Noooo... that must have been my PhoneBot 2000. Supposedly the next firmware update should solve that kind of problem."

That said, I don't necessarily disagree; there is going to need to be lots of these kind of issues that need sorting out before we reach a Culture-level of AI interaction.


Phoning like this you already have to wonder if the person on the end is an idiot who will screw stuff up. I'm not sure it maybe being a google bot will make much difference.


If you can't tell whether you're talking to a computer, does it matter?


Not sure why you're using future tense here.


I love how no matter how amazing something is, someone will eventually say it's trivial, even though it's taken the smartest people on earth decades to figure out how to do this.

>The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.

The end game is clearly to use an api, not this.


If you ever needed proof of the negativity in the tech community, this thread is it.

Google actually delivers a real world product incorporating the most advanced AI we've had the chance to experience, and half the HackerNews comments are "Wow, this is so dumb, can't wait for it to become technical debt in 10 years".


One time, a giant tech company staffed by geniuses released a super-useful tool that saw massive adoption. 10 years later that tool became technical debt.

Wait, it was way less than 10 years.

I'm talking about Google Realtime. Or reader? Or buzz?

No, wait, I'm talking about aggressive Twitter API deprecation/removal.

Wait, nevermind, I'm talking about Facebook.

You get the idea. What's revolutionary today sometimes becomes the substrate for future innovation. Sometimes it gets cast by the wayside, even in the face of significant "user" (developer) popularity.

That's not proof of negativity; just realism. Negativity would be "no new innovation will ever get traction". Optimism would be "all new technologies will change the world" (c.f. https://www.npmjs.com/browse/depended). This is neither.


Because the idea of running machine-to-machine communication through Duplex, going binary -> text -> voice -> text -> binary, is just fucking dumb.

It's not proof of negativity in the tech community. If anything, it's proof that the tech community often can't pause and ask whether a particular idea makes engineering sense.

That's how we ended up with Electron.


> The end game is clearly to use an api, not this.

My understanding is that API integration is what WeChat is in China -- every hair salon and equivalent-of-corner-pizza-shop has some WeChat integration, payment and all.

Voice bots like this will have the advantage of ubiquity. At least until a couple of years ago, before every restaurant had 5 tablets for all their Seamless/Grubhub/Chowhound/whatever apps, pretty much the only reason the fax machine was still around was for restaurant ordering. Although there were clearly better ways of doing it (see how Domino's reinvented itself as a tech company), the sheer ubiquity of fax as the lowest common denominator kept the tech around.

In that light, it's kinda like the cell-phones-leapfrogging-landlines-in-developing-countries argument... part of the wechat story involves a massive population entering the consumer class at a time when everything was digital. Call me out if this is a gross over-generalization, but in a way, the wechat population never had to deal with the backwards-compatibility of people growing up ordering a pizza over the phone.

It'll be interesting to see how the API-centric approach (wechat) plays out versus the lowest-common-denominator ubiquity approach (voicebots). I'd stop short of calling API's the end game though.


Also, in the WeChat model, everyone is tied to WeChat and can't go around it.

This voice based model can be integrated into any existing system. It already has the network effect going for it and it's not tied to the fate of any one company


No, you just become tied to Google. You can't reimplement Duplex yourself without reimplementing both their API and their voice recognition verbatim, and the best way to do that is to simply use their product.


Not necessarily. Given the rate of progress in AI and the number of companies working on it, it's only a matter of time until Duplex-like tech is reimplemented by other large corps like Amazon and Microsoft, and eventually it could even be implemented by startups if there is a decent business case for it.


My understanding is that API integration is what WeChat is in China -- every hair salon and equivalent-of-corner-pizza-shop has some WeChat integration, payment and all.

Meanwhile in NYC, good luck getting the bodega on the corner to even take your debit card.


Eh? The vast majority of bodegas in NYC take credit cards. Many have minimums, or charge a fee if you don't hit the minimum. But I've not been to a bodega in the last ~5 years that didn't accept credit cards in one form or another. > 5 years ago, sure, but now pretty much everyone's got them (even in the rougher neighborhoods).

If they don't accept cards, they almost always have an ATM.

My complaint in NYC is the uptick of "cashless" places, that don't accept legal US tender. I like using cash, I don't want it to go away.


Agree, an API is the end goal, and the WeChat model is amazing.


pretty much the only reason the fax machine was still around was for restaurant ordering

The fax machine is still around now, and heavily used in the medical context: https://www.vox.com/health-care/2017/10/30/16228054/american...


Please go and convince the mom and pop hardware store or bakery in your town to use an API instead of answering the phone.


They’ll do that before getting a robot to answer the phone for them.


Why do you say that?


Because they do it all the time. OpenTable already solved this problem for 90 percent of restaurants.


Ya, I agree that an API is the right solution but the benefit of this is that both sides aren't forced to adopt it at the same time, it's more resilient to changes on either side...


The benefit is that a human could jump in and take over either side of the conversation.

Anyone can have a conversation, but not everyone can author an API.


Where did I say it's trivial? What I'm saying is, people are anthropomorphizing and extrapolating what this system is doing far above what it is actually doing, and then using it to justify fears that Skynet's around the corner.

This system can’t pass the Turing test, it would be fooled probably by a simple question about itself or a subject outside the domain, like the kind of food you like.

You've got people in this thread hyperventilating about AI duping your voice and then becoming a doppelgänger, and therefore we need laws immediately to stop this dystopia? Let's calm your cortisol levels for a second and stop acting like Thanos just got the last gem.


I don't think any technically minded people on HN are over-extrapolating what this system is capable of doing, but they are (rightly, IMO) extrapolating what kinds of systems will be announced in 2, 5, 10, etc. years. I think even HN is greatly underestimating what world-class researchers paired with an army of world-class engineering talent are capable of.


I think it passes a limited Turing test in the domain it's trained on. I doubt any of the people on the other end of the call would even suspect it's a computer. That's an amazing achievement.


"Duplex seems trained against this corpus. The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex."

So, you would just have an automated booking system API which is better handled by not placing calls as its form of communication. Right?


The point is that it's a human-speech API, which is easier for humans to utilize "manually" if they don't have a robot servant.


I prefer not to call them servants, it would be better to start calling them partners actually, just in case we call them master some day.


Yes!

This is an API that requires no computer on the user's end and is portable across different implementations from different companies.

It's not ideal. Actual standardized APIs are better. But, uh, have you ever worked with industry standard APIs? I have, and standardized is not how I would describe them.


I still think there's a need for standardized APIs in this situation. At some point, the context constraints mentioned in the blog post have to get translated into some action with parameters. I'm guessing that action will be API calls to other Google products behind the Duplex Google Assistant UX.

"Ok Google, can you reschedule my Dr. Appointment this Friday for next week? I have a conflict." -> calls the Dr and reschedules -> adapts result to rebooking action with partners (ie, an api call to your Google calendar) -> applies action and responds to you.

There is still quite a bit missing from this to be a useful AI product. It's getting really close though. I can't wait until this makes it into Google Assistant and it can call a restaurant to ask about gluten free options while I'm driving.
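The "action with parameters" the parent describes might be as small as this - a sketch of what a standardized rebooking payload could look like once the conversation is reduced to structured data (all names here are hypothetical, not any real Google or partner API):

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical structured action distilled from a Duplex-style conversation:
# "reschedule my Friday appointment for next week".

@dataclass
class RescheduleRequest:
    appointment_id: str
    old_time: datetime
    new_time: datetime
    reason: str = ""

def to_action(request: RescheduleRequest) -> dict:
    """Serialize the request into the kind of payload a calendar or
    booking-partner integration might accept."""
    return {
        "action": "reschedule",
        "appointment": request.appointment_id,
        "from": request.old_time.isoformat(),
        "to": request.new_time.isoformat(),
        "note": request.reason,
    }

req = RescheduleRequest(
    appointment_id="dr-visit-123",
    old_time=datetime(2018, 5, 11, 14, 0),
    new_time=datetime(2018, 5, 18, 14, 0),
    reason="scheduling conflict",
)
action = to_action(req)
```

The hard part isn't this payload; it's getting every doctor's office, salon, and restaurant to agree on (and actually deploy) one shape for it - which is exactly the gap the phone call papers over.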


You're absolutely right! There's a huge need for standardized APIs for interacting with outside systems for this use-case. In your example, your doctor's office.

In practical terms, there may be some minor issues such as incompatible multiple implementations and adoption costs. But that's made much easier to handle by a very small number of expected consumer systems.

As for interactions with end-result partners, well. I've worked with standards designed to represent such highly general cases (xcbl and cxml). They're invariably rife with interoperability problems and other issues arising from overly broad standards. These tend to not get better over time as much as one might hope, as it's not easy to continuously update standards at a reasonable speed across N target types of partners. Keeping up with how usage evolves is never easy.

The best approaches to this that I've seen in use are those that focus on providing a vehicle for arbitrary data for delivery to the app - like HTTP or TCP. Getting more specific is the route to madness. Which, unfortunately, is probably precisely the bit you'd most like standards around.

You're completely right. There's a very real and very important need for standards here. There just might be some issues worth mentioning that might arise from the attempt to create and rely on them.


It's not just about the user. There's no open, universally agreed upon booking system API...except for human voice.

This is creating a natural-language based booking API that any system or business can tap into


This was my initial reaction too, but APIs need work for coding, integration, testing, making sure data is sent in the right format, etc., while voice-robot-to-voice-robot will just work out of the box.


You're completely abstracting away all the coding, integration, testing, and "data formatting" (read: grammar) involved in Duplex, which seems to be much more complex than a REST API.


But each side only needs to do it once, right? At least once per language. No need to deal with interoperability with multiple APIs and integrations.


Opentable


And NASDAQ:BKNG in general (includes OpenTable).


That end game seems very wasteful and Rube Goldbergian. Why use an 1800s technology as the transport layer when salons, yoga studios, and more already use things like MindBody, which already has an appointments API? I’d honestly be way more interested if this integrated with MindBody, OpenTable, DMV websites, car dealer appointment systems, medical office scheduling systems — all of which already have APIs or at least web pages. But then, saying that you wrote and will maintain some WWW mechanize stuff that posts forms is way less marketable to the general population who see this as magic.

Also, they’ll discontinue it after a year once it gets enough negative press about how it doesn’t work well and loses business for businesses.


Because there are lots of businesses that don't have a booking API and don't see the need for one, or can't afford one. This kind of technology allows interaction with them, because it's easier to interact over a common transport protocol than to expect everyone to change to your preferred one. That being said, I feel like it won't be long until this tech is used for scamming, phishing, and pranks.


In the second example they ask what's the line likely to be like on Wednesday. Show me one API that does that. English is richer than APIs.


Google does provide that information when you look up a restaurant, under "Popular times": https://imgur.com/a/ssdHxn3


Because businesses can offer Duplex for customers to call and speak to after hours if they still want to keep a human phone assistant. It's modular, not Rube Goldbergian.


You also seem to completely miss that outside of silicon valley, large portions of the population actually want to interact with things by phone.

That's changing for sure, but the demand for phone based services is still very high.


Likely it'll be a transition technology for businesses still using phone appointments as their primary API, and eventually it'll ease the path for more direct integration.


It's specifically for the salons, yoga studios, restaurants, etc that don't use any kind of online booking system.


Humans can't speak API.


It's unfortunate that in 2018 we still have to resort to kludgey workarounds like this with so many restaurants and hair salons instead of being able to make reservations online. There are services like OpenTable, but even here in Silicon Valley only a small minority of restaurants use them. It seems like there's a huge opportunity if someone can crack the market.


> There are services like OpenTable but even here in Silicon Valley only a small minority of restaurants use them.

OpenTable takes a cut, no? That's always going to limit availability.


I think it's more unfortunate that so many people are just so opposed to picking up the phone and talking to someone.


That's the universal criticism of every technology.

"I think it's more unfortunate that so many people are just so opposed to looking up directions to wherever they're driving before they get in the car."

"I think it's more unfortunate that so many people are just so opposed to paying their bills every month."

"I think it's more unfortunate that so many people are just so opposed to carrying cash around and counting change."

"I think it's more unfortunate that so many people are just so opposed to coming over and talking in person."

"I think it's more unfortunate that so many people are just so opposed to washing their dishes by hand."

"I think it's more unfortunate that so many people are just so opposed to doing long division."


I don't mind talking to "someone". But I sure mind talking to customer service folks who have absolutely no interest in talking to me and make it as difficult as realistically possible.

Worse, they're probably gonna spend 3/4th of the time trying to sell me shit I don't want and make me fight against it.

Online, I can ignore any prompt and just click next next next finish, and the form won't be in a bad mood. I have no interest in talking to an annoyed clerk, and they obviously don't want to talk to me, so we can just avoid each other.


Story time!

When I was signing up for Internet at my new apartment, there were 3 ways I could do so: by contacting my apartment's official representative, online, and through the regular phone system.

I used all three. First, I contacted the representative, who gave me a price. Then I looked online and found the actual price (considerably lower). When I tried to sign up online, I was told I'd need to provide an extra security deposit because I have my credit reports frozen.

So I called the generic phone system. The agent gave me another price (lower than my official representative, but still higher than the website). I pointed out the website price, and the agent switched me to that price. I asked if I'd need to provide a security deposit and they said no. They finished signing me up, and everything was fine.

The whole process was annoying, I would have loved to have someone else do it for me. This was the perfect time for a phone assistant to step in. But that would have been a really bad idea with Duplex.

The point is - an automated call system probably doesn't protect you from an abusive representative. If I had Google Duplex handle either of my calls, I'd be paying more for my Internet right now, because I guarantee Duplex isn't smart enough to determine if a representative is lying about an advertised price.

95% of the time this probably doesn't matter, because most people I talk to on the phone aren't abusive. But if someone does want to upsell you or bury you in service fees or waste your time, Google Duplex is probably making their job easier, not harder.


Good luck doing that at 3 AM, because that's when you have the time to plan your weekend activities.


It’s the lack of indication that you’re talking to an automated assistant and that fact that it uses human affectations in its speech that creeps me out hardcore.


As crazy as it seems at first glance, a double Duplex system would be a really beneficial result.

What stops businesses from setting up APIs for scheduling services today? It's the lack of a universal API.

If a barber shop wants to make it possible for a 3rd-party app to book appointments, then they have to release some API. But that's not the end of it. The 3rd-party app has to first discover their API, someone has to understand it and write code to use it, and then deploy that code.

I'll repeat for emphasis: this is mainly a problem today because there is no universal API that all services can use.

With Duplex, verbal speech becomes a universal API that every service can parse and use to communicate with every other. Also, discoverability is taken care of by using publicly cataloged phone numbers on services like Google Maps, Yelp, etc.


No, it doesn't.

The problem of a universal API is entirely orthogonal to voice communications. Duplex is not a Turing-complete system; it's just an API behind a voice-recognition layer. All the important problems for universal APIs happen after that layer.

Ultimately, what you describe can work perfectly only when everyone is using Duplex, which is equivalent to everyone using Google-defined API. That's not universal, because you have one entity behind it.

The only way this brings us somewhat closer to universal API is that if you expect it to handle humans as well, it introduces some constraints to the space of possible APIs, which could make it easier for everyone to agree on a common format. Constraints of natural language processing without a human-level AI require your API to be very fuzzy and very lenient. There's nothing stopping one from implementing those same constraints over a text or binary protocol. Nothing except no reason for businesses to do it.


Not everyone has to use Duplex; everyone would just have to use a system that makes it seem like a human is talking on the other end, provided you limit the context to a particular domain (like taking reservations).

This system could be developed by any company with sufficiently advanced ML chops


You gotta be kidding. If it was that easy why didn't you build it yourself? Easy billion right there.


cromwellan didn't say it was easy. It's still hard to do something limited.


Well, cromwellian works at Google, so maybe he did build it himself. :)


>The end game would be for the business to run something like duplex on the other side, and you’d have duplex talking to duplex.

Is this satire? If this is indeed the future, I wonder if there is an irresistible urge to make systems as inefficient as possible. Kind of "like gases expand to fill the container, applications become as inefficient as the power of the hardware allows".


Dude, this system has better conversation skills than me. I mean literally. Well, I'm autistic + ESL (+ kinda too why). But still, it's kinda incredible that a system is actually better.


Yes, that first restaurant reservation would have taken me several times as long to complete.


"The funny thing about AI is that it’s a moving target. In the seventies, someone might ask “what are the goals of AI?” And you might say, “Oh, we want a computer who can beat a chess master, or who can understand actual language speech, or who can search a whole database very quickly.” We do all that now, like face recognition. All these things that we thought were AI, we can do them. But once you do them, you don’t think of them as AI. It has this connotation of some mysterious magical component to it, but when you actually solve one of these problems, you don’t solve it using magic, you solve it using clever mathematics. It’s no longer magical. It becomes science, and then you don’t think of it as AI anymore. It’s amazing how you can speak into your phone and ask for the nearest Thai restaurant, and it will find it. This would have been called AI, but we don’t think about it like that anymore. So I think, almost by definition, we will never have AI because we’ll never achieve the goals of AI or cease to be caught up with it."


Who in the 70s thought AI would be defined by being good at chess? I mean, it can just about brute force the game; how is that intelligent?

The Turing test is way older and seems to have been the standard measure since its inception.


First of all, you can't brute force chess. Algorithms that beat high-level chess players are non-trivial, even with today's computers. In fact, top engines today all use carefully designed heuristics hand-crafted by experts -- this notion that Deep Blue, Deep Fritz, etc. were dumb "brute force search engines" is a misleading tale.

Second, in the 70s there was no computing power for even quite clever algorithms (which probably didn't exist yet) to beat top chess players. Chess was seen as a grand goal requiring the utmost intelligence -- while it is obvious in hindsight, at the time the intuition was probably that extremely "intelligent" humans were required to play chess, and in fact the best chess players were among the most "intelligent" persons -- it was a clearly, exclusively intellectual task that few people were competent at. So many believed that chess would be one of the greatest challenges for AI (the clarity of the rules added the convenience of research and implementation). Things like walking didn't seem intellectually demanding, so the common sense was that they were probably "easy". In fact, today we know that navigating a bipedal robot through a simple environment via visual recognition is vastly more difficult computationally than playing chess well; it is only easier for us because we have highly specialized circuitry in our brains that is well matched to those tasks. Our brain wetware is not very well matched to playing chess.

Also, chatbots have been doing pretty well on Turing's original definition of a Turing test for about a decade now. But now it is being argued that Turing didn't really foresee the "loopholes" people believe the bots are exploiting, and stricter requirements for a Turing test are being proposed.

That's totally in line with Tao's argument that every time we approach a major AI goal, suddenly it is not AI anymore, because there's nothing magical about it, just boring old technology. And human brains are magical, right?

Until every obscure niche capability of humans has been dominated in every possible way by AIs many won't want to concede that it really is AI. And even when it does become better than us in every possible way, I suspect a few will still find arbitrary reasons why it really isn't AI/AGI, e.g. because it is not organic, because the computer lacks a body, because it lacks a "soul", etc.


You might be right about chess, but I can't understand how you think chat bots are "doing pretty well". I've never seen a conversation with one that held up to even the most tolerant hand-holding for more than a few sentences.


I mean they're doing pretty well by Turing's original definition. I agree chatbots using traditional techniques (not sure about newer LSTM chatbots) are not too impressive, just illustrating that we've had to adjust the definition. That's by definition moving the target.

In fact I'm quite sure Turing would be quite impressed by good recent chatbots.

Try this one: https://www.pandorabots.com/mitsuku/

From the point of view of the 1940s, this would seem really close to a veritable "thinking machine"! Although I'm sure he'd recognize a few things are still missing before fully replicating human behavior (or going beyond it).


"who can understand actual language speech" - "we do that now". We have mood classification, but understanding? [citation needed]


By understanding he means translating speech to text, I guess. We now have speech-to-text systems that are better than the median native speaker. Quite amazing, given how central auditory language processing is to our cognition. And most people don't think it's "AI" (and certainly not anywhere near AGI). That's a good example of how AI is a moving target, IMO.


Small businesses do have email. I use email to do this kind of thing all the time and it works extremely well. I feel like this is a case of Silicon Valley solving marginal problems while the world burns.


You have to see it in use too. The Milo demo from Microsoft was much more impressive, but completely fake.


To the extent that the primary application for this is call support, I don't agree with your proposal. This is supposed to close the gap between a tech-savvy group that would be using Duplex and tech-handicapped small businesses. It is much easier / more effective for a restaurant, for example, to hook up with OpenTable than to deploy something like business-Duplex.


Exactly. If a small business doesn't use OpenTable or Duplex, no problem - I can just use Duplex to schedule the reservation for me. OpenTable requires buy-in from the restaurant; Duplex doesn't.


The Turing Test is irrelevant, the "dystopian" stuff is mostly irrelevant, the ethics are highly relevant. It is simply unethical for a computer to converse with a human while misleading them into believing they are talking to a human. There are a zillion reasons why this is so, if none of them seem obvious then I'd suggest investing some time to take an ethics course.


> Most people working in hair salons or restaurants are very busy with customers and don’t want to handle these calls

And those will have online booking systems already - I don't see how this technology is still relevant nowadays. Maybe it was back in the 90's when the internet (and online booking) was a new thing, but now? I can't see there's a big market for this application.


> Most people working in hair salons or restaurants are very busy with customers and don’t want to handle these calls, so I think the reverse of this duplex system, a more natural voice booking system for small businesses would help the immensely free up their workers to focus on customers.

I think this perspective is very short-sighted. You will lose customers to automation, but businesses won't turn away customers because of automation.

Customers and prospects don't want to interact with machines, but businesses should be willing to give customers what they want.

The idea that a tool can be rolled out to millions of consumers, even with limited use cases, without having to get adoption from businesses to be useful is, IMHO, a much bigger opportunity and a much better use case than rolling out a tool to businesses that makes the interaction less personal.

Customers need to trust businesses; businesses only need to collect money from customers.

I think everyone who focuses on chatbots from the business use case perspective is missing the bigger opportunity.

A technology that can give a consumer access to ALL businesses, not just the ones that adopt a new technology, offers much more utility than serving businesses or short-sighted use cases like saving the business time and money.


Uh...businesses have been automating phone calls for decades now


That's partially my point. Everyone is thinking about the business case for cost savings around automation for businesses, but that lacks imagination. Automation tools for consumers to interact with humans at businesses are where the real opportunity lies.

Would you ever voluntarily use an IVR? I wouldn't. If I am going to interact with automation for a business, I want to do it with a different interface than voice... all the hype around NLP and chatbots was uninspired and focused on the wrong side of the interaction... Building conversational interfaces for consumers to use to interact with businesses is a much better use case.


I feel like a lot of people who are pointing out the limitations show a failure of imagination about where this will go in the future. I don't care about the obscure technical limitations now; those are just engineering problems that will be solved in a very short time.

I'm not afraid of the machines going all singularity or skynet or whatever, becoming sentient and taking over the world as some kind of robo-Hitler. That's moronic. But what does worry me is the normalization of having a machine do everything for you, plan your whole life, access every little detail of every bit of your personal data and lifestyle.

Of course we've already had that for a while with the way phones work. But this is another step towards getting public consensus for using it in new ways. Once people are used to this, we'll have more and more systems with conversational software that manages your life for you. Speaks on your behalf. Interfaces with the world for you because doing it yourself is far too stressful and inconvenient.

And of course it'll be a free, advertising-supported model so all that data will have to be shared with, among many things, shady political organizations to try to gain every little advantage possible to manipulate public opinion and steer themselves into enormous power.

Think of where the cell phone started off: just a phone in your pocket. It's so much more now. Remember that when thinking about these AI assistants and what they will develop into. I'm not afraid of the classical AI apocalypse. I'm afraid that these systems will do exactly what they're designed to do. That people are underestimating just how much power lies in these little inconveniences in life, once they're all added up and analyzed and tallied.


> The people losing their marbles over this being some kind of Turing Test passing distopian stuff are missing the point at how limited this domain is.

A colleague of mine is just going on about this.


You’re only looking at what it is today. You just need to extrapolate enough into the future to see how deeply disturbing this really is.


As long as there's a way out of the IVR system... too often, you get trapped when you want something that isn't an option.


When you mash 0 with Comcast, it kind of scoffs at you and tells you that it needs some information before it can help you. If you keep hitting 0 it just hangs up. Pretty terrible experience, but they don’t have to care because they have monopolized so many neighborhoods.


I imagine both the booker and the bookee using a call automator / reception automator to make the appointment in the future.


> People who answer phones to take bookings perform an extremely limited set of questions and responses,

But that in itself is not even true across the industry; some (most) phone bookings are very complex, otherwise they would just use a web interface.


> But that in itself is not even true across the industry, some(most) phone bookings are very complex

Citation needed for that "(most)". I work for a company with a call center and a large part of calls are simple ones that could be easily answered by just reading the FAQ page on our website.

> otherwise they would just use a web interface.

I think the problem is more about resources. My local hairdresser uses his phone and a notebook to take bookings. It takes a bit of his time and could easily be replaced by a web interface, but he doesn't have the resources for that (and some people still prefer using their phones).


> I think the problem is more about resources. My local hairdresser use his phone and a notebook to take bookings. It takes a bit of his time and could easily be replaced by a Web interface but he doesn’t have any resource for that

By "resources", do you mean money? Because if so I can't imagine the purchase and training of Duplex on the business side would come cheap either.


Absolutely correct. And so-called general AI may never happen. Regardless, this is shocking. It immediately needs to be factored into any speculation about what the world will look like in 20 years. Innumerable questions.


Maybe a general AI is just a very big collection of many smaller abilities.


The older I get, the more true I suspect this is.


Of course there is more to human intelligence than a list of abilities like "booking a reservation" or "translate French to English".


Umm...hmm...I got you.


> The people losing their marbles over this being some kind of Turing Test passing distopian stuff are missing the point at how limited this domain is.

Right, and these kinds of comments will continue for a while.

The question is: when will the “this is trivial, move on” type of comments start to fade out? Five years? Ten?


A few observations:

* In the example where Google asked about holiday hours -- they can now automate gathering information about businesses in bulk without having to rely on any APIs or user-supplied info. An interesting thought experiment is Google validating its reviews/business listings by actually calling businesses and speaking to a real human.

* This is going to be fantastic for accessibility. Maybe I struggle to speak, and I want a reservation. I can have the machine do the irritating work, and focus on just having a nice meal or getting a service (like a haircut).

* Google can scale out your requests, one to N. For example, 'Make me a reservation at a 4 star restaurant next Friday.' Google can immediately initiate calls against 15 restaurants and let you pick from the successes, then automatically cancel the reservations for the places you did not choose.
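That fan-out-then-cancel flow could be sketched like this; to be clear, this is purely a speculative illustration, and `book_one` and `cancel` are hypothetical stand-ins for the actual Duplex-style calls:

```python
import concurrent.futures

def book_one(restaurant, time):
    # Hypothetical stand-in for an automated phone call; here it always succeeds.
    return {"restaurant": restaurant, "time": time}

def cancel(booking):
    # Hypothetical stand-in for the follow-up cancellation call.
    print(f"cancelling {booking['restaurant']}")

def fan_out_bookings(restaurants, time, keep=1):
    """Attempt bookings in parallel, keep the first `keep` successes,
    and cancel the rest."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: book_one(r, time), restaurants))
    successes = [b for b in results if b is not None]
    kept, dropped = successes[:keep], successes[keep:]
    for b in dropped:
        cancel(b)
    return kept
```

The asymmetry is visible right in the sketch: one user request fans out into N calls, and all but `keep` of them were wasted effort on the business side.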


> Google can scale out your requests, one to N. For example, 'Make me a reservation at a 4 star restaurant next Friday.' Google can immediately initiate calls against 15 restaurants and let you pick from the successes, then automatically cancel the reservations for the places you did not choose.

This sounds like a nightmare for businesses. This time-commitment asymmetry will be the issue with these systems. Like email spam, it becomes much easier to waste other people's time when you automate time-wasting. If people use it to flake a lot, I could see businesses just not responding to the assistant.


>This sounds like a nightmare for businesses.

They pointed out during the presentation that the system could call a business once to get the hours, then allow hundreds or thousands of users to see that without bothering the business again. Assuming it works, it could save a significant amount of time for some places.
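The call-once idea amounts to a simple cache with an expiry. A rough sketch, where the `fetch` function is a hypothetical stand-in for whatever Duplex-style call would actually gather the hours:

```python
import time

class HoursCache:
    """Call-once cache: the first lookup triggers one (simulated) phone
    call; later lookups within the TTL are served from memory."""
    def __init__(self, fetch, ttl_seconds=24 * 3600):
        self.fetch = fetch      # e.g. a "call the business and ask" function
        self.ttl = ttl_seconds
        self.store = {}         # business -> (hours, fetched_at)

    def hours(self, business):
        entry = self.store.get(business)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]     # cache hit: no new call to the business
        value = self.fetch(business)
        self.store[business] = (value, time.time())
        return value
```

With a day-long TTL, thousands of users asking about the same salon would cost the business one call per day at most.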

>If people use it to flake a lot, I could see businesses just not responding to the assistant.

Then it sounds like incentives are aligned here. Google needs to not allow users to abuse this ability so that businesses will trust and not block them.

If they allow something like the parent commenter pointed out, they would sour relationships with businesses who would promptly seek out ways to block or decline calls from this system.


This works only if Google allows that same data to be used by other businesses who would benefit from having it listed.


Technically it works as long as most users use Google, but I agree it would work better if they share the data!


If they don't share the data, then once other companies start copying this, businesses will be bombarded with bot calls (whether they know they are bots or not), which should lead to regulation. However, given how easily scams already happen over the phone lines, I don't know how this could be regulated and enforced to protect businesses' time.


>>If people use it to flake a lot, I could see businesses just not responding to the assistant.

>Then it sounds like incentives are aligned here. Google needs to not allow users to abuse this ability so that businesses will trust and not block them.

But then they can't take bookings through Google Assistant, which is going to lose them non-trivial amounts of business.

Seems far more likely that they'd pay Google to automatically handle Duplex calls for them.

The worse this technology turns out for businesses, the more pressure there is to pay Google money.


>But then they can't take bookings through google assistant, which is going to lose non-trvial amounts of business.

If it's costing more money than it's bringing in, then it's no longer worth it. If it's bringing in more money than it costs, then it's a good thing for the business.

If a business needs to hire a dedicated phone person because they are getting so many appointments filled, they aren't going to be upset. But if they get so many flakers who won't show up, they are losing money as customers who will show up get pushed out, so they will block Duplex. There are also other ways to solve this problem: require a phone number and name, and block or charge people for missed appointments if they try to reschedule; or require some kind of down payment over the phone when making the appointment. There are tons of solutions to this problem.

At no point is "pay Google to handle the calls" an option. This is really only for places that don't have an online appointment system (one that possibly integrates with Google), so the solution to the Duplex calls would be to invest in one. "Pay Google to handle Duplex" would look a lot like an API to a scheduling system anyway, and an independent one (with integrations into Google's systems) would reach more customers.


Google may not be making money from direct payments from businesses, but getting access to yet another aspect of the life of their users (in this case, appointments made with what business and when), would be invaluable to them.

Just one more way that big business is taking over our lives. The world is becoming incredibly scary.


> If people use it to flake a lot, I could see businesses just not responding to the assistant.

Or being forced to use Google's small-business version of Duplex to handle the increased call load (for a small fee, of course).


Knowing Google, there will likely be a free tier and then higher tiers that handle multiple phone lines, complex call routing, and eventually customer-service case handling.


Or just set up an online booking system.


It has been a nightmare to deal with chatbots and voice-synthesized phone "customer support" gatekeepers. It's time to turn the tables. This doesn't have to only apply to restaurants and such, much better to let this loose on the various unresponsive megacorps. Who knows, they may find it less costly to actually deal with customers than pay for compute against a worthy adversary.


2018: customer support becomes telephone robot arms race. It’s hard to keep believing that capitalism is actually a productive worthwhile pursuit for humanity.


There's a customer on the end of the line with a serious intent to book, restaurants would be dumb not to hire a minimum wage worker to take those calls.

If you could read 15 spam e-mails to have a >50% chance of a $100 dinner reservation, you'd hire someone to read your spam.


The spam factor can go the other way too. Imagine the same type of program automating the spam / scam calls to the elderly.


Yeah, the phone lines getting that polluted is not a pleasant prospect.

This conversation reminds me of "Lenny", a simple bot someone created to talk pointlessly in circles with telemarketers for hours until they hang up in frustration.

https://www.reddit.com/r/itslenny/


Yes, I was thinking the same thing. This Duplex system would be like a super-Lenny. Eventually telemarketers will probably figure out, though, that if they reach someone who is calm and friendly and very receptive to their pitch, yet never quite gets to yes, it's probably an AI.

By the same token, telemarketing companies could employ Duplex to call people. I guess if a Duplex telemarketer reached a Duplex telemarketer-baiter, the conversation could stretch on for a really long time and might make for amusing fodder on Youtube.

It's a strange new world!


In the olden times, I used to look up a business in the phone book and call them to see what their hours were. At some places, every call was picked up by a recorded message that stated what their hours were.

Today, it is much easier to just put in your preferred time and rating into Open Table and see what's available.


The easy solution (and Google should push for it) is that the call receiver is Google Assistant as well, built into a landline phone kept at the business!


> they can now automate gathering information about businesses in bulk without having to rely on any APIs or user supplied info

Besides the business-side run-your-business-like-a-callcenter application, this is the "caller-side" application that's going to make money for Google. The others (virtual personal assistant) have been historically hard to get mass consumers to pay for.

It's straight out of the StreetView playbook... industrialize the scale of data collection, give the results out for free, then monetize the eyeballs (e.g. a more accurate Yelp).

Then again, Google ran that free 411 service for a few years that, it turns out, was just a massive natural-voice-recording data-corpus miner...



"Business model: Google had stated that the company originally implemented GOOG-411 to build a large phoneme database from users' voice queries. This phoneme database, in turn, allowed Google engineers to refine and improve the speech recognition engine that Google uses to index audio content for searching."


I'm not sure what a 4-star restaurant is, given that the Michelin guide gives a maximum of 3 stars, but pretty much all starred restaurants want a deposit when you reserve. If there were seriously people dumb enough to do something like this, booking and cancelling restaurants continuously, then probably everyone would require a deposit when reserving, and the deposit would be much bigger than today's.


I've never heard of deposits for restaurant reservations. And I've at least once made a reservation at a restaurant with a Michelin star.

I'm not sure this is common practice, at least in Europe. I'm sure it's unheard of in Turkey (and I've been lucky enough to make reservations at some high-end and/or very popular restaurants).

May I ask in which countries you've experienced this? I'm genuinely curious.


I’ve had reservations at upscale places in LA ask for a credit card up front and threaten a fee if I were to no-show, and some places have minimums for premium tables (e.g. 71Above window tables). A deposit up front doesn’t seem too far out of the question.


I travel all over the US (Texas, NYC, and Cali mostly) and I've literally never had to put down any kind of deposit or anything of the sort when making dinner reservations, and I frequently make reservations at plenty of nice, upscale places.


Many high-end restaurants in NY, CA, Berlin, Copenhagen, etc. require a deposit. See Tock (all prepaid, https://www.exploretock.com) or Resy (some prepaid, some cancellation fee, https://resy.com).


It's common in San Francisco. Chubby Noodle is an example; I booked it today for a party of 8.


Auberge in Napa requires a valid CC to make a reservation https://aubergedusoleil.aubergeresorts.com/dining/

Funny enough they also use OpenTable, so no need for this Google stuff — just use the API or book through the site. You can also call them if for some reason that is preferable.


Italy, France, UK. Some of them were not even starred.


I live in Portland, OR and I have to pay up front for a number of places I go to. Usually not the full meal (I think one or two places do that) but there's a deposit essentially that is taken off your bill (or given back if you cancel 24-48 hours ahead of time) at a good chunk of the nice places around.


> I’m not sure what is a 4 star restaurant given that the Micheline guide gives maximum 3 stars

It's the user rating from Google Maps users. They are represented as stars and you can filter based on them.


Google itself (Google Search) is extremely wary of being used by "bots" and uses a string of captchas to screen them out. But search is just a machine.

That Google would create bots to talk to real people is horrifying. This is only possible if Google doesn't, in fact, think of the working people answering the phone as really human.

This is like doing war with drones instead of soldiers. This may sound over the top, but bear with me.

The implicit contract in war is that soldiers are legally authorized to kill because they are risking their own life. Killing people at a distance without risking anyone's life on the side of the shooters, breaks that "contract", is fundamentally unfair and fuels terrorism, because terrorism is the only possible answer.

Making a telephone call rests on the same convention: you are allowed to make someone spend time on the phone with you, because you're spending your own time.

But if one side is a robot that has no costs, then the relationship loses balance and becomes unsustainable (and this is the reason why Google bans bots on its own servers). This is one more step breaking society, again.

The only answer is to either stop accepting phone reservations, or put captchas on the other side.


> The implicit contract in war is that soldiers are legally authorized to kill because they are risking their own life. Killing people at a distance without risking anyone's life on the side of the shooters, breaks that "contract", is fundamentally unfair and fuels terrorism, because terrorism is the only possible answer.

I don't think there's any sort of human risk contract like that in war. Wars fought entirely between human armies still produce spite on opposing sides.

Going back to your point about calls, humans and machines call me to ask for polling information, telemarketing, etcetera. I'm not okay with them wasting my time, whether they're human or machine. However, I'll tolerate it below a threshold as part of the costs of having a communication channel. Beyond that threshold I would consider alternate measures like changing my phone number, getting rid of my phone, or paying for a screening machine or service.


In medieval Europe, designated heralds came to the town centre and recited news and gossip - an equally time-consuming contract for both sides.

The invention of the Newspaper eliminated that - wouldn't you say that this change was, while disruptive, for the better?


Newspapers (and listening to heralds) are inherently pull-based - I choose to engage with them. Robocalls and spam are push-based - you force yourself on me, wasting my time.


I think you meant to write "town crier" instead of herald [0], though its wikipedia page does mention herald in the "See also" section.

[0] https://en.wikipedia.org/wiki/Town_crier


This kind and volume of automation pervades written digital communication now, and has already been creeping into voice communication (scam calls and robodials).

Google bans botting against its own services, but can realistically only ban the botting it can detect. If you can detect NN voice botting, you can ban it for your own communications as well.

And if the technology to detect or tooling to effectively filter that doesn’t exist? Sounds like a great business opportunity.


I don't think you'll get far with analogies to bots killing people. Maybe there is a narrative for automation leading to direct harm to humans, but Google Duplex is not currently close enough to that.

Instead, I would change your analogies to real bot/human problems today, such as phone bots scamming individuals for millions[1], or an older problem – email spam.

Basically any platform with a large imbalance in the effort (time/money/labor) spent by two sides (scammer/scammed in my example above) can be abused. But it can also be used for very good things. So we can't make blanket statements about these things.

Captchas and other bot filters are built to balance out the effort ratio so that abuse becomes more costly, and I'm sure if either Google or other companies abuse robocalls, people will have to respond with similar measures. But it's not a new problem, and if Google plays their cards right they may actually reduce the volume of calls that don't lead to business, while maintaining or increasing the volume of calls that do lead to business.

[1] For example, the 212-XXXX-XXX robocalls in NY state, which I'm daily exposed to due to their prevalence: https://www.newyorker.com/news/daily-comment/a-chinese-roboc...


> Making a telephone call rests on the same convention: you are allowed to make someone spend time on the phone with you, because you're spending your own time.

I personally, absolutely do not think of it that way. Whenever I get called by a call center, I always immediately say "not interested" and drop the call. It's rude, but I think these companies are not entitled to my time.

Because of that, I think the problem is already here. This is certainly another step in the wrong direction, but thousands of workers in thousands of call centers are basically already a human botnet.


Google has literally thousands of developer APIs.

That's as far from "Google bans bots on its own servers" as you can get.


> The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.

This part stuck out to me during the Google I/O demo, since an intentional deficiency is an interesting design decision.


> as an intentional deficiency is an interesting design decision.

well, in semantics/pragmatics these discourse particles are often not deficiencies at all. They are signals with practical semantic purpose. "hmms" and "uhs" can signal attentiveness, turn-taking (turn holding, turn yielding, etc), agreement - just to name a few.

For any machine system to be able to pass as human, it will have to be able to control these nuances or people will pick up on something being wrong, though they might not be able to articulate precisely what.


I really enjoyed the machine's "uhs" and "uhms" in the demo speech. However, I felt the "uh-huh"s sounded forced. It's funny how these subtleties are very important in human conversation.


I think probably because "uh-huh" can have many different meanings based on inflection!

As a "non-word", it relies heavily on how it is conveyed.

Imagine someone asks you a question, I bet you can answer using just the word "uh-huh" but conveying these different emotions:

rude, perky, bored, upset, annoyed, dubious, excited

and probably a dozen more.

Even using the "perky" or "happy" one in a situation where it isn't warranted might sound rude or unthoughtful!


It's not a new thing. A famous tax-preparation program introduced a "compute" screen that takes a few seconds, to make people more comfortable with the results even though the computation itself is instantaneous.


It's really just an audio version of a loading bar or spinner - users get really uncomfortable if the UI becomes unresponsive for even a few hundred milliseconds, but they'll wait for several seconds if it looks like something is happening.

See also:

https://en.wikipedia.org/wiki/Comfort_noise


People have learned that the spinner indicates no real progress, though. The progress bar still has some life in it, except that those are often fake, not actually measuring progress.


OS-level cursor spinners like the mac pinwheel have lost credibility, because they don't reliably indicate whether the system is temporarily unresponsive or needs to be restarted. Modern multitasking OSes have a wide range of situations in which they can become mostly unresponsive without actually crashing.

Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.

Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress; in most cases, this could be greatly improved with a bit of educated guesswork.

A completely fictitious progress bar can be extremely credible, because it's totally predictable - if you need to create a 10 second delay, then it's easy to make the bar progress linearly from 0% to 100% in that time. Users learn very quickly that your progress bar tells the truth about how long they'll be waiting, even though it's lying about the reason for the wait.
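A minimal sketch of such a purely time-driven bar; nothing here measures real work, the duration is just an argument:

```python
import sys
import time

def fake_progress(total_seconds=10, steps=50):
    """Fictitious progress bar: advances linearly from 0% to 100% over
    `total_seconds`, regardless of what (if anything) is actually running."""
    for i in range(1, steps + 1):
        pct = int(100 * i / steps)
        bar = "#" * i + "-" * (steps - i)
        # \r redraws the bar in place on the same terminal line
        sys.stdout.write(f"\r[{bar}] {pct}%")
        sys.stdout.flush()
        time.sleep(total_seconds / steps)
    sys.stdout.write("\n")
```

Because the bar is driven only by the clock, it is perfectly smooth and always "tells the truth" about the wait time, which is exactly the predictability the comment above describes.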


> Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress

I disagree with this; I find the progress bars more credible with erratic timing. (And ideally, a display of the task currently at hand, like "Copying tiny file. Copying tiny file. Copying giant file............")

A progress bar that smoothly fills from 0 to 100 looks like an animation that somebody thought it would make you happy to watch. A progress bar that lags at 7% and then rushes the rest of the way looks like the software has some internal metric for task completion, and is reporting according to that metric. This implies that when the number changes, progress has happened, which isn't the case for a progress bar that isn't affected by workload.

The software can't use "how much time has elapsed?" as a progress metric, because it doesn't know how much time things will take, and because the passage of time does not actually cause -- or reflect -- any progress. That progress bar would be a spinner, not a progress bar.


> Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.

Strongly disagree. A spinner on a web UI element that lasts longer than ~1 second indicates to me that the site's JavaScript broke again, and it's time to reload or wait for the devs to notice and fix it.


He's not talking about the cursor.

He's talking about a circular loading animation. Like the one that replaces the submit button when you're making a post on Twitter/Facebook.


I'm talking exactly about that spinner. It's a lie. You quickly learn it has no relation whatsoever to what's happening in the background. And indeed it doesn't, because it's an animated GIF, completely detached from any logic or networking code!

(Compare the CLI spinner/fan - that "/ - \ |" animation used to indicate progress. There you know that each tick of the spinner means work has been done, because it has to be animated from code, and it's much simpler to just update it from the code that does the work.)
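That tick-per-unit-of-work pattern is easy to sketch — here the frame only advances after an item has actually been processed (names are illustrative):

```python
import sys

SPINNER_FRAMES = "|/-\\"

def process_with_spinner(items, work):
    """Advance the classic | / - \\ spinner once per item, so every visible
    tick means one unit of real work has completed."""
    for i, item in enumerate(items):
        work(item)  # do the actual work first...
        sys.stdout.write("\r" + SPINNER_FRAMES[i % len(SPINNER_FRAMES)])  # ...then tick
        sys.stdout.flush()
    sys.stdout.write("\rdone\n")
    return len(items)
```

If the work hangs, the spinner visibly freezes with it — which is the honesty that an animated GIF can't offer.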


That's not true at all. In the websites I and many others build, that loading spinner is linked directly to network code.

The spinner appears when a request is made. It disappears when the request is resolved.


I was talking about animation. Show/hide on request made/resolved gives only binary information about starting and finishing something. But the spinning animation itself does not represent any operations being executed. It may very well be that the request failed and a bug in JS made it not remove the spinner. You end up with a forever-looping animation of "work", even though no work is being done. This makes the spinner an untrustworthy element.


Still better than nothing? Sure, maybe sometimes exceptions aren't handled properly, but at least you know that it was trying to do something, rather than having users click a submit button 10x because there was no UI feedback whatsoever.


The most annoying part of progress bars is the fact that programs so often use multiple bars. What's the point of watching a bar slowly reach 100%, only for it to be replaced with another progress bar that starts from 0 again?


My apps add a second "outer" progress bar for that use-case.


The "please wait while we verify your passcode" on our corporate phone conference system drives me nuts. In the time that it took to speak that sentence, the passcode could have been verified millions of times.


That may be yet another use case of delays: makes bruteforcing (or even plain guessing a few common codes) a lot slower.
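A sketch of that defence, assuming a stored expected code — the fixed delay caps guess throughput, and the constant-time comparison avoids leaking information through timing (the function name and delay are hypothetical):

```python
import hmac
import time

def verify_passcode(supplied, expected, delay_seconds=2.0):
    """Compare in constant time, then impose a fixed delay so an attacker
    gets at most ~1/delay_seconds guesses per second on this line."""
    ok = hmac.compare_digest(supplied.encode(), expected.encode())
    time.sleep(delay_seconds)  # the "please wait" doubles as a rate limit
    return ok
```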


In true market economy fashion, the comfort noise is also a perfect advertising opportunity.

For instance, I frequently deal with ATM machines that display "please wait" screens between every operation. Those screens last usually between 1 and 3 seconds, and it's obviously because the operations take that long, and totally not because they also display a half-screen or full-screen ad...


I've heard the HP12c calculator also slows down its screen refresh on purpose because people couldn't believe the math was right when it first came out and it was blazing fast.


This is a pretty common pattern. Lots of websites also have "establishing secure connection" interstitials for the same reason.


Yep, and the 10 second "deal" compilations for travel packages really happen in a fraction of a second. They just purposefully delay the results to make it seem like they are doing a lot of processing in finding all the possible deals and showing you the best ones.


It can be a more friendly way of rate-limiting expensive DB queries. An interstitial that says "too many queries, try again in 10 seconds" is far more annoying than a loading bar.


Yup. We have a similar thing at my company. Every time we try to test out of the loading animation, conversion and retention goes down. It’s an amazing thing to see.


That's the opposite of what parent and other commenters are saying. Users prefer the loading animations, according to the growth hackers.


I think they were agreeing. "Test out of" seems to be another way of saying "we tried getting rid of the spinners but people didn't like it"?


That famous tax preparation software added several screens to "review" the data.


Most of the flight search companies do the same ("Finding the best/cheapest flights for you"). It's almost instantaneous, but they introduce this artificial wait.


That seems unlikely. Flight search really does take a long time because they need to make API calls to external services for most customer requests and they need to refresh prices roughly hourly and so cannot rely on cached data. Also, even the best flight search websites are frustratingly slow. If that delay was created intentionally then they already lost me as a customer as a result.


I can't seem to find that post right now, but a person (on Quora/Reddit, I guess) who worked on the development team of a flight search company said as much.


I don't know if I'd call it a "deficiency" - if we interpret "disfluency" in a literal sense as "not flowing" without negative connotation, then the interruptions (hmm, uh, okay) are actually communicating useful information to the other party. I might even say that omitting those interruptions (and replacing them with, say, dead silence) might be poor communication.


The "um" isn't a deficiency, but the slow response is. If the response is artificially delayed to give the appearance of slow thinking, and an "um" added to fill the artificially long silence, that's an artificial deficiency.


I interpreted it differently. It isn't to give the appearance of slow thinking. It is to wait for the other person to be ready to accept the answer.

When talking to real humans, I've encountered people who don't do this, and I find it makes communication difficult and frustrating.

I'm not 100% sure why I need this pause, but I know I need it. Maybe I'm considering whether my question made sense or needs corrections/additions, so that I can't focus on the answer yet. Or maybe it takes time to switch the brain from "speaking mode" to "listening mode".

At any rate, when people do this, I have to ask them to repeat the first few words they said because I didn't catch them. And the reason I didn't catch them wasn't mumbling or background noise or anything. Well-formed sounds made it to my ear just fine, but my brain wasn't ready to accept them for a fraction of a second.


It's not a deficiency if understanding is increased. If that fake pause increases the listener's understanding of the sentence (it might), then the 'slow response' is not a deficiency but an improvement.

Edit: should the robot talk at 2x normal speaking speed in order to more quickly convey the necessary information? Slowing the speech down artificially so a human could easily understand it sounds like a deficiency to me. (By your definition).


Ums and other filler words are not as bad as they are made out to be among the public speaking crowd: https://www.eab.com/daily-briefing/2016/07/29/um-filler-word...


Reminds me of comfort noise in the telephone system.

Even though the system encodes silence noise-free (to improve compression), it deliberately inserts noise on the receiving end, because otherwise people think the line is dead.
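A toy version of the idea — generating low-amplitude random samples to play back instead of digital silence (the parameters are invented; real codecs shape the noise to match the line's background spectrum):

```python
import random
import struct

def comfort_noise(num_samples, amplitude=80, seed=None):
    """Return low-level random noise as little-endian 16-bit PCM: quiet
    enough to read as an open line, loud enough that it sounds alive."""
    rng = random.Random(seed)
    samples = [rng.randint(-amplitude, amplitude) for _ in range(num_samples)]
    return struct.pack("<%dh" % num_samples, *samples)
```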


Similar to how when designing a virtual face, it looks more natural if it has some slight asymmetries and "defects", and when designing a synthetic drumbeat, if it's "perfect" it sounds totally robotic.

Imperfection is natural and comfortable. Perfect corners and edges are artificial and weird to the distracting point.


People are more willing to get on board with bots acting like people than the other way around.

The speech disfluencies used by Duplex in the salon and restaurant interactions are perfect examples of why natural speech sounds natural. It's the cadence as well as the timing.


Google maps has the most pleasant and human sounding voice approach I’ve encountered in any such system.

All other gps guidance voices sound incredibly crude and mechanical in comparison.


The Brazilian Portuguese voice is not that great.


In my city I can recognize that google has two different 'voices' or voice libraries. They sound slightly different. I'm curious how that works and why it's not all done with one.


I've noticed this as well. My working hypothesis is that one is for high(er) bandwidth and the other for low bandwidth situations.


My understanding is that the "low-fi", more robotic one uses an offline TTS engine for when there is no connectivity. When connectivity is good, it will switch to better, cloud-based one.


Similar to lens flares in (first person) video games.


If you are building a system that mimics human speech you need to teach it to be imperfect and use common parlance. Otherwise you will fall into the uncanny valley. If you listen to the conversation again there are several points where they lose immersion. For example no one would say 12 pm, they would say 'noon' instead. Google has clearly done some impressive work here, and I'm now a bit more confident/scared that they will be able to successfully fool me in the next few years.


Your argument about calling it "noon" instead of "PM" is just illogical. I'd always use PM instead of noon -- whenever I'm trying to be specific about something (appointments and such). I understand the argument you're trying to make but that wasn't enough.


How is "PM" more specific than "noon"?

Absent the context of this conversation, it's not immediately obvious to me whether 12 PM is midday or midnight, whereas "noon" is unambiguous.
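For what it's worth, software tends to pin the convention down. Python's `strptime`, for example, treats 12 PM as noon and 12 AM as midnight:

```python
from datetime import datetime

# The usual convention: the 12-hour clock's "12 PM" is noon, "12 AM" is midnight.
noon = datetime.strptime("12:00 PM", "%I:%M %p")
midnight = datetime.strptime("12:00 AM", "%I:%M %p")

print(noon.hour)      # 12
print(midnight.hour)  # 0
```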


I think historically "noon" expresses less precision, although I suspect that's less true now that everyone always knows what time it is and has GPS in their pocket to help calculate arrival times.

20 years ago, had I said to someone "I'll be there at 12pm" it would have had a stronger implication of precision than "I'll be there at noon." I don't think it's true today.


12pm is also commonly interpreted as midnight so "12 noon" or "12 midnight" is generally preferred when scheduling meetings or deadlines in order to avoid confusion.


Understood. So the difference is you interpret "noon" as an ill-defined probability distribution centred roughly around midday, rather than a concrete point in time, whereas you interpret "12PM" as a concrete point in time. Fair enough.


Go back further and "noon" is "whenever the sun is at its zenith" so we've definitely made strides.


Isn't that still the case, assuming central position within timezone and no summer time?



This is going to be awesome/terrible for social engineering attacks. We'll be training millions of people to to trust phone calls from "Google". But are you talking to Google-calling-on-behalf-of-your-boss, or Google-calling-on-behalf-of-some-phisher? Or better yet, some custom system that pretends to be Google over the phone? Who knows!


Either that, or it will train people to stop trusting a phone call as some sort of authentication. The whole social engineering problem is that when somebody gets a phone call, they trust that the person at the other end is who they say they are. Which can be a false assumption. When the robots start calling, maybe people will finally stop making that assumption automatically.


So what can you trust? The person calling you might be a simulated voice, crafted to sound like someone you know. The person texting/DMing/emailing you might be an attacker pretending to be someone you know. You can put trust in things like PGP, but they're sparsely adopted and still leave a huge attack surface.


You can trust that they have the private key corresponding the public key you associate with that entity.


At least in the demo, the system didn't identify itself as a bot. It pretended (fairly convincingly) to just be a person, and seemed focused on interactions (making a restaurant reservation, etc.) that don't involve someone recognizing your voice. In other words, I don't think the attack surface is any different than situations where you could just place the call yourself, save perhaps for the possibility of scaling it.


I don't think the scaling concern should be underestimated. That's a big difference. It's mostly a numbers game.


It's not just scaling, but also speed.

If you sequentially attack a subpopulation (e.g., employees at a company, senior citizens), then eventually news about your attack will spread throughout that network (internal security alert, evening news + daily newspapers + AARP + ...). If you attack the entire subpopulation in rapid succession over the course of days or even hours, educational countermeasures become much less effective.


So now I can script a bot to book restaurant reservations all over the city at busy times. Then nobody shows up for the reservations, the busy time has passed, and customers have moved on or gone home.

Restaurants make or break on one or two nights in a month. A calculated social engineering attack like this could bring down hundreds of restaurants in a city, which would cause millions of dollars in lost taxes, and you see where this is going.



Was there something stopping you from doing that before? A lot of places you can even do it online.


Literally, effort. When you lower the attacker's effort and cost to try an attack, the number of attempts generally goes up.


I meant, you could build a bot that calls. We have the technology already, and the people on the other end probably won't notice. Plus the "do it over the Internet" thing where screen scraping and scripting is super easy.


But could you build a bot that calls and is convincing enough to trick the target into actually accepting the request as genuine and reserving the timeslot?


It could most likely be done on any internet reservation system with little to no effort.


Yes, the time commitment of having one person pick up the phone and place 100+ phone calls (and the suspicion on the other end when you call back with a new name but the same voice).

You could write a screen scraper to book online through the various booking systems, but each booking system probably has its own restrictions on how many accounts you can have and how often they can book. You skip all of these protections when you phone your reservation in (arguably, the restaurant staff should be enforcing these protections when they pick up the phone, but restaurant staff are often overworked and apathetic).
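The kind of guard an online booking system can apply per account — and a phoned-in reservation skips entirely — might look like this sketch (the class and limits are hypothetical):

```python
import time
from collections import defaultdict

class BookingThrottle:
    """Allow each account at most max_bookings within a sliding window —
    the sort of limit online systems enforce but phone lines can't."""

    def __init__(self, max_bookings=3, window_seconds=86400):
        self.max_bookings = max_bookings
        self.window = window_seconds
        self.history = defaultdict(list)  # account -> booking timestamps

    def try_book(self, account, now=None):
        now = time.time() if now is None else now
        # Keep only bookings still inside the window, then check the cap.
        recent = [t for t in self.history[account] if now - t < self.window]
        if len(recent) >= self.max_bookings:
            self.history[account] = recent
            return False
        recent.append(now)
        self.history[account] = recent
        return True
```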


Then the restaurants will just stop taking phone reservations. Either no reservations or online-only.


I agree it's a problem. The probable means of mitigation is for restaurants to take your credit card number when you book. Many already do this. I expect it to expand if false bookings become a problem.


To be fair I think any mediocre dev could do that now simply with headless browser automation.


How would you "script a bot"? I think you're underestimating the effort. You'd have better luck trying to attack by mass booking manually.


You adapt. Take valid CC info and charge a cancellation fee.


Sounds like we'll have to start solving voice captchas just to call businesses.


Click all the boxes in this picture that contain a sign.


Has anybody found a way/hack to bypass Recaptcha?


Yes, which is why over time it has gotten progressively more difficult. I find some of the challenges to be almost impossible. (Also why Google started using your session with them to bypass the captcha automatically; it would be too frustrating an experience otherwise.)


Credit card companies have already solved this problem by publishing a well-known 1-800 number that their customers can call back to verify the intent of an unexpected caller.

Google Duplex can do the same thing - without even staffing a call center.


I don't understand what you mean, can you elaborate? Who calls the 1-800 number, and what info do they get from the call?


If you get a call from someone who claims to work at the credit card company you use, instead of trusting the caller (who could be a scammer) and potentially divulging private information, you should just hang up and call the number on the back of your card.


I think the biggest issue won't come from phishers, but from resellers and other occupations that are helped by automation. I could easily call every hot restaurant in San Francisco several times a day to try to snatch a table.

On the other hand, most of these places have online reservations which are already extremely gamed.


That's not how this works; that's not how any of this works. If you have ever been cold called by Google, I would go change your passwords and the like.


Clever social engineering is really customized, but social engineering that works "often enough" is mostly reducible to a flow chart.

If you can automate it, your response rate only needs to be, what, like people clicking on spam? Tiny.

The days of trusting meat sacks with important information might be numbered.

Someone should make the dystopia where Skynet doesn't build terminators, but call centers.


When people talk about AI taking over, they really need to consider this route. Social engineering most of the world simultaneously. It doesn't have to build armies of robots when we'll tear everything down for it.


It's more like AI empowering individuals tremendously, and some of these empowered people using their abilities to do harm.


So you simply tell Google a cookie that it would keep safe, to prove it's Google when calling. Presumably it wouldn't be on speakerphone at the time.

Have you ever wondered the same thing about your favorite OS’s admin credentials dialog? What if an app spoofs it?


On Windows, there's a group policy to require Ctrl-Alt-Del before the UAC prompt asks for your password. It's impossible for an application to hijack that key combination.


And without some indicator of attribution, this is just doubling down on the problem of Caller ID spoofing.

https://www.nytimes.com/2018/05/06/your-money/robocalls-rise...


This isn't that different than all the bad actors calling from the "IRS". Educating people about giving personal details out over the phone has always been a challenge and will continue to be one.


Or maybe we should fix the phone system. Something is terribly broken when my connection to YouTube is better protected and authenticated than when I'm talking to the IRS over the telephone.


You're only ever talking to the IRS over the phone if you initiate the call. Ideally the phone system gets fixed, but that's not likely to happen any time soon. Now you need to deal with current reality. Nobody should trust that an incoming call, purportedly from a business or government, is from who it claims to be, whether asserted verbally or through caller ID.


Why stop there? Why not fix human communication? Or how about we fix human behavior so there are no bad actors? Education may be hard, but scrapping and rebuilding a century-old communication standard in use by billions is harder.

I used the IRS example because the IRS never calls you. This is only known through experience, i.e. education.


On irs.gov, it says:

> However, there are special circumstances in which the IRS will call or come to a home or business, such as when a taxpayer has an overdue tax bill, to secure a delinquent tax return or a delinquent employment tax payment, or to tour a business as part of an audit or during criminal investigations.

So the IRS might call if you owe taxes.

Surely moving from an unsecured to a secured phone system can't be that big of a deal. In the US at least, we experienced something similar when we went from analog to digital television.

Add a secured mode of telephone calls and then telephone owners choose if they want to receive calls or messages from unauthenticated callers.


I stopped answering calls I am not expecting years ago. If it's important they will find a better async way to get in touch.


I don't think the intention of this system is for your boss to use Google to call you and ask for passwords. It seems very limited to scheduling appointments.


There's a great Black Mirror episode in here somewhere. Imagine five years from now, this tech and Google's Smart Reply (https://blog.google/products/gmail/save-time-with-smart-repl...) have evolved to the point where they can basically write entire email responses or have whole conversations without your input. You could develop elaborate friendships where both parties are just having their AIs converse and aren't truly aware of each other. What would the social implications of that be?

Then a couple years later, the AIs learn how to do business strategy, real-world problem solving, programming, etc and start doing more of our jobs for us. A virus goes around that directs AIs to steal our identities and drain our bank accounts and become autonomous, digital versions of us. The humans have to stage an uprising and use a massive EMP to take back the earth, but destroying all electronics in the process and starting another dark age.

I know that's not how AI really works (it's highly specialized and limited), but I'd definitely watch that movie!


> Imagine five years from now, this tech and Google's Smart Reply have evolved to the point where they can basically write entire email responses or have whole conversations without your input.

Ahum. Did you watch the keynote?

"An extension of Gmail’s Smart Reply feature, Smart Compose will suggest complete sentences within the body of an email as you are writing." [0]

[0] https://www.theverge.com/2018/5/8/17331960/google-smart-comp...


That's just one step to the story in Avogadro Corp.


Jesus. We're really reaching an age in which it's too inconvenient to talk to other people. I gotta wonder if the huge spike in anxiety disorders is somehow connected in there somewhere.

We're not using technology to do fantastic things. We're largely using it to enable fantastic laziness and entertain the habitually bored. I wonder what trivial little convenience would actually spur us to draw the line and say "Enough. Get out of my life." I'm envisioning a device which would access all your most secret sexual urges, but it would allow you to fart while sitting down without lifting one ass cheek. I bet we'd leap at that.

I'm extremely disappointed in us.


There was a (bad) book written about this, basically how Gmail's AI took over the world by sending email on behalf of people.


It was Avogadro Corp. and I didn't find it bad. So to counter your opinion, I give my recommendation here as a light sci-fi with plenty of mocking of Google and Apple included, and AI that - unlike in most stories - makes sense.



Oh god that sounds awful. One of the reviews has all the spoilers. It's truly stupid.


Yeah, it was pretty badly written and rather simplistic. Not recommended, you're probably better off reading the spoilered review.


Are you sure it wasn't Google's AI that wrote it?


Sounds a bit like "Death Switch", a short story from "Sum: Forty Tales from the Afterlives" where our personal algorithms outlive us.


To be honest, perhaps a temporary self-imposed technological dark age wouldn't be such a terrible thing


This is one end-game of AI - changing the economics of scams.

Imagine how a 'long con' works today: a scammer befriends a person online through a video game or social platform, and develops a rapport with them over the span of days, weeks, months. After some trust has been gained, the scammer then requests money from the victim. Does this happen a lot today? I don't know, but certainly one reason it doesn't is the economics of the scam. Who wants to spend a significant period of time gaining trust just for the chance of a payout?

AI is going to flip this on its head. Rather than dedicating hundreds of hours of a scammer's time, a scammer could instead use a system like Duplex to befriend hundreds or thousands of victims simultaneously. Let it run for a few months, developing a strong rapport with each target, until the AI finally requests some money from the victim.

Yes, duplex is for completing specific tasks, but how much of a difference is there between "Duplex, book a table for four at 8pm" and "Duplex, ask my victim about how their day was"?


> Yes, duplex is for completing specific tasks, but how much of a difference is there between "Duplex, book a table for four at 8pm" and "Duplex, ask my victim about how their day was"?

The former (booking a table) is much more "constrained", as in the conversation would most likely not go into much of a tangent, because there are only so many responses to a statement like "book a table for four at 8pm" (8pm is full, 8pm works, etc).

Whereas asking someone how their day was would give the "victim" a much bigger breadth of responses (and additional questions!) that would cause the AI to stumble and fail to give a satisfactory answer. That, and running this 1,000 times simultaneously so that no one person would be able to intervene to "help" the AI would just be a highly unscalable operation.

So yes, huge difference.


And you think this is the end of innovation, this is it, no improvements from here?

Duplex is interacting with a human and that person has no idea it's a computer on the other end. Yes, Duplex is limited as it stands, but what makes you think something as described in the grandparent post won't exist in ten years?


This is what existed 50 years ago: https://en.wikipedia.org/wiki/SHRDLU#Excerpt

In terms of NLP, Duplex hardly seems like that much of a jump. The main improvements seem to be on the speech part.


Well, I'm already receiving automated calls about "that car accident" and they connect me to a real person if I happen to respond with a predefined keyword.

I would say that there's a market for that. How long before rogue agents implement good AI for their operations? How long before we implement defences against AI?


Duplex's problem is not specific tasks but the limited domain. If you are engaging with a person, the conversation will wander across multiple domains at random, especially if you are building rapport over a long period of time.

If an AI can handle that, then the singularity has definitively been reached.


This is amazing, but one thing I don't really understand is this: earlier in the presentation, they demoed some new Google Assistant voices. All of them sound like standard computer-synthesized assistants. On the other hand, the synthesized Duplex voices sound indistinguishable from human speech to me, even without the "disfluencies" they include.

If Google has gotten speech synthesis to this point, why isn't Assistant synthesizing speech of this quality?


I have a feeling it's because this is such a limited domain.

During the demo it all sounded very realistic, except for some parts like the times. It would flow naturally then all of a sudden pause awkwardly and then say a time like "12 pm" in a weird way.

I have a feeling they are getting it to sound so realistic because there's a fairly small amount of responses and questions it needs to work with, so they can either pre-record real humans, or heavily tune a ML voice to sound as natural as possible.


Hmm, or maybe it's because you can run the synthesis on a much more powerful computer? Certainly very impressive.


It's possible all the questions are even prerecorded human voices. The corpus is probably fewer than 1,000 phrases.


This is simply because Google Duplex's TTS engine is using Tacotron and WaveNet which are not ready for general use yet.


WaveNet is already used in the Assistant and Google Translate at least. The new voices they announced are powered by WaveNet.


I think all WaveNet speech is still generated server-side instead of on client hardware, so there is a cost associated with it.

If Google Duplex is a paid product, maybe it just enables running WaveNet on Google Cloud with larger models and higher-quality settings.

Speech produced for Assistant doesn't make any money so the server side cost has to be minimised.

One day we'll have all this client side, on specialised ML chips on devices.


I'm reminded of a thing I read about generating "yellow letters" for real estate leads. The letters look like someone handwrote them on a steno pad. If you used a handwriting font they would look fake. But you can't handwrite each one; that'd take too long for thousands of letters. What they do is get a handwriting font made of their handwriting, and sure, it looks fake. But they write most of the letter template out by hand and use the handwriting font for just the parts that change. I wonder if something similar is going on here, where they have taught the speech synthesis a bunch of phrases and they color in the bits with synthetic speech where it's needed. The inflection is also easier to specify in those cases.
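The same slot-filling trick is trivial in text form — a fixed template with only the variable parts swapped in (the template and field names here are invented for illustration):

```python
# Hypothetical template: the surrounding prose is fixed (the "handwritten"
# part); only the bracketed slots vary per recipient (the "handwriting font").
TEMPLATE = (
    "Hi {owner_name}, I drove past your property at {address} the other "
    "day and wondered if you'd ever consider selling. Give me a call!"
)

def fill_letter(owner_name, address):
    """Render one letter by filling the variable slots of the fixed template."""
    return TEMPLATE.format(owner_name=owner_name, address=address)
```

Applied to speech, the analogue would be prerecorded or heavily tuned carrier phrases with synthesized audio spliced into the variable slots.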


Google home's speech recognition can't even reliably turn my lights on and off every time, why would I trust it to book a restaurant reservation? I'd expect to tell it to book a 6pm table for 2 and wind up with an 8pm table for 10.


I'm with you. We've seen a demo, not a real product that people are using. I wish people would temper their expectations a bit. If product development has taught me anything, it's that it's easy to put together a demo to show how well a product can work in the golden path, but once you stray from it, it becomes incredibly difficult to solve all edge cases. With something like conversation, there are lots of edge cases that need to be handled. Once you have a missed edge case, the illusion of reliability falls apart.


I'm so very disappointed by google home. It feels like they left this project behind. It's infuriating how dumb this machine is, and how many times I have to repeat "ok google, hey google" for my old google home device that was working way better at the beginning. Google assistant on android also seems to work much better.


Listen to the audio examples. The person at the restaurant misheard the bot and repeated incorrect information back to it and it handled it in stride.


Should I be surprised that it happened to work in the example they decided to showcase in the release announcement?

I'm afraid I foresee this whole thing going quite hilariously wrong on the order of magnitude of Microsoft's Tay https://en.wikipedia.org/wiki/Tay_(bot)


I just hope it never gets one of the 9 out of 10 humans who don't repeat back important information to prevent miscommunication...


How does that differ from a human having the same conversation?

If I call a restaurant and say "Can I have a reservation for 6 at 8:00" and they write down a reservation for 8 at 6:00 without repeating it back I won't know until I show up at 8:00 with my 5 friends.


True -- I was mainly snarking about people with poor communication practices, not the technology (which seems great -- a godsend for people like me who hate phone communication).


Those humans won't be taking reservations for long, even from other humans, if they don't learn to repeat back the important information.


How would the bot fare worse than a normal human in that situation?


Exactly my thought! I appreciate Google's effort in many fields such as distributed computing but in the past few years, a lot of Google IO announcements feel like vapourware/demoware that they're going to scrap/rebuild in a few years.


You're comparing different generations of technologies.


I'm assuming, perhaps incorrectly, that Google would have one backend service or set of services for all speech to text translation, and that all related products would be powered by it. They've been iterating on speech to text interfaces for at least 15 years now.


The Pixel Buds were supposed to be something similar, but did not live up to the hype... and I've experienced the same things with my home automation with Google Home, and don't get me started on using it to control Spotify, etc.


> I'd expect to tell it to book a 6pm table for 2 and wind up with an 8pm table for 10.

That's a win, consider: it's the right restaurant, on the right date, and 10 is larger than 2 so there'll be room enough. OK, you have to wait two hours, but given that this tiny place is normally so busy you have to book in advance, what you lost in time waiting you'll more than make up for by being the only people there expecting service.


There is something so wrong with this comment that all the problems with today's tech culture can be extrapolated from it...


This is amazing! I'm very excited for a future where more and more tasks can be automated, enabling humans to get higher and higher living standards with the same/fewer input resources.

In short, technology is a wonderful thing that allows very low marginal costs. This is what we need to make the future a better place, given a consistent or growing population.

"Technology is miraculous because it allows us to do more with less." This is a perfect demonstration of that.

See also: https://www.youtube.com/watch?v=rvskMHn0sqQ

"A Selfish Argument for Making the World a Better Place – Egoistic Altruism"


> I'm very excited for a future where more and more tasks can be automated, enabling humans to get higher and higher living standards.

That's been the dream for a long time in some circles. With the enormous productivity gains and ability to leverage external energy sources (fossil fuels, solar, etc.) we could have built a society of wealth and leisure for all.

Maybe we will still. The hope is that if there is enough of a productivity gain in a short enough time period (like the introduction of AGI powered robots) that this could still happen.


The technology promises "wealth and leisure for all". The capital owners promise "it'll trickle down". Technical utopianists need to start tempering their optimism with the realities of human nature and design systems accordingly.


I think our only realistic hope is actually various non-profit foundations and such.

When you think about it, I can download (for zero cost), a high-quality operating system and attendant applications which would have cost hundreds of dollars 20 years ago, and would have cost a fortune 40 years ago. Ditto for educational materials, entertainment, etc.

In that sense, we are quite wealthy in comparison to previous generations. If charitable organizations can leverage the automation of the future to help people, we might then see all humans across the planet lifted out of poverty.

But yeah, I don't expect corporations to do this. And it seems unlikely that most governments will either.


Interesting perspective. I share your pessimism about corporations and governments.

But sadly I think government is the only institution with the necessary leverage (tax base, mandate, etc) to accomplish this. Non-profits are also fairly dubious in their motives, subject to corruption, and generally highly inefficient. I'm not sure they're going to be our saviors either.


Hmm this comment is getting downvoted, so I'll just reference the Red Cross fiasco as one good example of what I mean: https://www.npr.org/2016/06/16/482020436/senators-report-fin....

If you haven't worked in the NGO space, you really don't understand just how bad it is.


The living standards of the average person is much better than it was 200 years ago. 75% of the world has a cell phone.

Yes the rich have it better, but the poor also have improved in ways that the rich of 200 years ago couldn't imagine.


I don't consider the present highly problematic, though it is imperfect.

The problem is the relatively imminent (next 50-100 years) reality of nearly complete human obsolescence. That's when you'll see societal degradation at previously unthinkable levels. And no that's not Luddite fallacy. The next epoch of technological innovation is going to be unlike any that came before.

A lot of people champion UBI, while forgetting that something like 3B+ people currently live on less than $2.50 a day. That's really all the evidence you need to know that the future is going to be pretty grim. Do we think it's more likely that plutocratic systems will award sustainable UBI packages to the mass unemployed via wealth transfer (which is anathema in said systems) or that market forces will discover the absolute minimum survivable income level and create new strata in first-world societies that hover just above pure barbarism?


Average income and wealth has massively increased over the last century, so we are slowly but surely getting to that utopia. But it happens at a pace that is too slow to be experientially perceptible.

I would add however that there does appear to be some barrier limiting the ability of ordinary people to accumulate financial capital, and I attribute that to friction/fixed-costs imposed by regulations.

New services, like Robinhood, and technology, like cryptocurrency, could address this, and allow wider participation in capital markets.


The average American eats better than kings from the past 2,000 years. Wealth and prosperity is 100% relative.


What past productivity improvements and/or economic developments have ever pointed to your fantasy being realized? How will thousands of workers in the Philippines and India losing their call center jobs result in "higher and higher living standards?"

I'm sorry but given historical context and popular capitalist intent, this future you speak of is simply fantastical. The working class would sooner be made extinct before a Jetsons-esque future of leisure for all came to fruition.


I recommend you look at the statistical evidence. The average wage in the developing world has doubled, and the global poverty rate has halved, over the last 20 years. That's all thanks to automation.


Well average wage is a poor metric. Hopefully it is median global household income that has doubled.

Taken to its logical extreme, there is no wage whatsoever if a majority of jobs are automated.


The wage growth has been widely distributed, though I don't know the median wage statistics offhand. I presume they're approximately equivalent to the average wage statistics.

As for jobs, most of those that existed 200 years ago no longer exist or employ a tiny fraction of people they employed at that time, yet we don't have mass unemployment. Automation has never had a broad-based negative impact on the demand for labor. Its effect has been exactly the opposite.


Alright, let's do a small breakdown:

20% of all jobs will be lost to self-driving cars

25% of all jobs will be lost to automated customer support/marketing/non-direct human interaction

10% of all jobs will be lost to automating away all middlemen

5% of all jobs will be lost to automating away rudimentary programming

Now, who is going to win here?


This is really horrifying to me. My heart goes out to the working-class retail people who are going to have to spend their days chatting with the AI assistants of upper middle class people too busy to call themselves.

If the shop owner gets a duplex system to field the calls, then the two robots can subtly signal to the other they aren't actually human, and then start shrieking like a 9600 baud modem to finish the dialog.


As someone who worked in tech support early in their career, I do not.

The entire process of fielding calls is terrible. You never know what's waiting for you when you pick up the phone. Could be someone with a terrible attitude that wants to take it out on you. I had coworkers who got PTSD, and a ringing phone would trigger it.

I would rather talk to a rational robot on the phone than a possibly irate human who, frankly, only wants something transactional from you and treats you like a robot.


It's not going to replace. It's going to add to.

First, because it will fail a lot: the robot won't understand specifics such as which items on the menu you want ("oh, but that one's missing, do you want this instead?").

And then commercial calls will multiply, and spam will arrive on the phone the way it does in email.

Then people will abuse it to harass, annoy, attack competition, etc. Spam a restaurant with robot phone calls for a month and it's done.

Plus google will analyse all this data, because it doesn't know enough about you.

It's a nightmare.

But it will be excellent for anybody with social skills. What I learned living in Africa is that we've become handicapped because we can avoid talking to other humans so much. Going back to France, my social, sexual and work life improved a lot because I had basically zero competition. The next generations will really suck at the game.


... well it won't add to the PTSD at least unless they design the robots to lose their temper. Kind of agree about the loss of social skills.


As someone who worked very closely with tech support and sometimes fielded calls, I echo this. It's batshit insane to field support calls, L1 especially. Honestly, no human deserves the wrath of support calls. You get yelled at for no apparent reason and all you can do is be patient. Who knows what the psychological consequences are. I have seen customer support people lose their temper in otherwise normal situations.


I had a company for 15 years and I still hate it when the phone rings. Phone calls were always messing up whatever I had planned for the day and it made it harder to get shit done.


Well, it'll start sucking when they call back to cancel because the robot found a better appointment.


I do feel for them, but at the same time we have these unintentional designs and systems that cause a lot of suffering all over the place.

For example, I was kept on hold for 2 hours the other day trying to sort out being double charged on my bank card. If this sort of service can handle being on hold, then I'm in.


I'm sure this system can and will be used to handle you on hold. ;) Be prepared for obfuscating phone trees that can talk back.


Deep learning RNN "Retention Department" bots, trained on every customer complaint ever. (kill me now...)


Precisely. The implication of this is AIs getting stuck in recursive loops, or getting into an arms race of training to best each other.


Exactly. People think it will be used for better service. It won't. It will be used to save money, at the expense of the human. When was the last time you were satisfied with a website's help page? They usually suck big time. They have zero ability to help you with your specific situation.

I would feel infuriated to have to talk to a robot making me waste my time. At least during a waiting song I can put the speakers on and do something else.


fine by me if it uses tail recursion.

would be funny if it didn't and somehow blew the phone system's stack.


Some IVR systems do this now, it gives you an estimated wait time, and calls you back when you get near the front of the queue.


Or cancelling your cable.


Or monitronics account


Maybe it'll motivate more businesses to put more of their transactional things online where the bot can do it without bothering a real human, and real humans who don't want to use this can also avoid spending more time on the phone than they have to.

Honestly, the degree to which telephone calls are still necessary in day to day life is absurd. If this hastens their demise then I'm all for it.


Phone calls are necessary for anything that needs flexibility.

Coding flexibility in a website is incredibly hard.

Think about it. A restaurant doesn't have anything veggie on the menu, a last minute guest arrives and is veggie. You call to ask if there is something that can be done.

It's possible to code that, but it's a lot of work to get all the scenarios right.


I said transactional things, by which I mean things that don't require any special figuring for the vast majority of cases.

In the case of a restaurant, having their menu online would let you know in advance whether the restaurant has something you can eat. Lots of restaurants still don't do even this.


A restaurant that doesn't do this is not going to setup a phone answering bot.


I... really think you're missing the thrust of what I'm saying here? Nowhere did I say I thought they would.


Yes! Voice/conversations leads to a fundamentally different UX.


The store is still getting the customer, the retail rep is still getting the same call. What's so horrifying?

I see this as leveling the playing field -- upper middle class people can already afford real assistants a la Tim Ferris; this just lets everyone access it.


Is it not conceivable that the rep would eventually be replaced with a like-kind system?


In short order I would imagine.


I spent two hours on the phone today with customer support, one of the least enjoyable and least productive ways to spend an afternoon.

I try to automate tasks that don't use my strengths or bring me joy and this would be a huge win in that area.


Don't worry, soon they won't have work. That horror awaits us all.

BTW AI's can also solve status problems: http://partiallyclips.com/comic/dome-house/


Have you ever worked in retail or the service industry? I think I'd rather chat with the AI :)


Now I'm wondering what could be code-words to signal that they're AI and go into modem-mode...

"Hi there, I'd like to schedule an appointment" "Are you suuuure?" SCCREEECH KSHHH...


And the secret knowledge that a specific whistle from a cereal box - blown into the phone at just the right time - can land you a Saturday night reservation at French Laundry.


they could just encode the message in a way that’s inaudible to humans.

So a simple “hello” contains a subliminal audio signature that basically is code for “I am a robot. Are you a robot? Feel free to beep bop”


Machine A extends the "hellooo" out to exactly 1.32 seconds, with a 1.32 second pause between that and the next word.

Machine B responds with a greeting exactly 1.32 seconds long, with a 1.32 second pause after it.

Machine A responds with a sentence in which the first word is exactly 1.32 seconds long.

Machine B considers the handshake a success and proceeds in machine language.
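For fun, that handshake can be sketched in a few lines. The 1.32 s value and the 5 ms tolerance are invented for illustration; the point is just that a human can't hit such timings reliably:

```python
# Sketch of the timing handshake described above. The magic duration
# and tolerance are made-up values, not any real protocol.

MAGIC_SECONDS = 1.32
TOLERANCE = 0.005  # humans can't reproduce a duration within 5 ms


def is_magic(duration: float) -> bool:
    """True if a speech/silence segment matches the agreed duration."""
    return abs(duration - MAGIC_SECONDS) <= TOLERANCE


def handshake_complete(segments: list[float]) -> bool:
    """Check the exchange: A's greeting, pause, B's greeting, pause,
    A's first word -- five segments, all exactly the magic length."""
    return len(segments) >= 5 and all(is_magic(s) for s in segments[:5])


print(handshake_complete([1.32, 1.32, 1.321, 1.319, 1.32]))  # True
print(handshake_complete([0.8, 0.4, 1.1, 0.3, 0.9]))         # False
```

After five matching segments both sides can safely switch to machine protocol, since the odds of a human producing that sequence by accident are negligible.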


Oh dear, polyglot audio-steganographic phone calls: the audible message doesn't even have to be the same as the one encoded in micro-gaps and envelope shifts and all kinds of modulations.


That's right- I started thinking about it because of this: https://news.ycombinator.com/item?id=16046329

Surely someone out there is doing audio steganography by modulating the volume/duration of "silence".
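A toy version of that idea, hiding bits in pause durations that a human would hear as ordinary hesitation. All durations here are made up, and a real channel would add jitter and need error correction on top:

```python
# Hide bits in pause lengths: a short pause encodes 0, a longer one
# encodes 1. The 150 ms gap between the two leaves margin for jitter.

SHORT, LONG = 0.30, 0.45               # seconds of silence per bit
THRESHOLD = (SHORT + LONG) / 2


def encode(bits: str) -> list[float]:
    """Map a bit string onto a sequence of pause durations."""
    return [LONG if b == "1" else SHORT for b in bits]


def decode(pauses: list[float]) -> str:
    """Recover the bit string from measured pause durations."""
    return "".join("1" if p > THRESHOLD else "0" for p in pauses)


message = "1011001"
print(decode(encode(message)) == message)  # True
```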


What makes you think this won't be used by people of all classes?

And did you hear the recordings? The AI is better at conversation than the humans it had to deal with. I would gladly use this to avoid talking to people who can't parse a plain sentence without rounds of repetition and clarification.


AI tools, including Google's Duplex, will benefit the bottom ~90% of the planet radically more than it will benefit the wealthy. That has been true of nearly all technology from the past few hundred years at least.

The wealthy will largely continue employing humans to do tasks for them, because they can. It will be just another form of luxury / status. An AI assistant will be considered beneath their class. The technology will be nearly universal and extremely inexpensive, two things rich people dislike as it pertains to signaling their status.


I don't think so. 48% of people have access to the internet, 41% in developing countries. All these people won't be able to benefit from this at all. That doesn't even count the ones that have dialup or comparable speeds (like on Cuba) where transferring anything beyond a couple hundred kilobyte is unbearable.

The wealthy will employ AI the moment it can do the same tasks as a human; unlike humans, the AI is cheap, doesn't take time off work, is always available, never tired or grumpy, etc. The maintenance cost of an AI worker vs. a human worker is just way lower.

The moment a corporation like McDonald's can replace the entire staff with robots without a dip in efficiency, they will do that. 24/7 operation would be only linearly more costly than 9-5 weekday operation.


The wealthy will stop employing humans to do tasks for them the second that AI can do those tasks better, or with less effort.

Then a few seconds after that the Singularity starts.


Don't worry. They'll still get calls from plenty of people who aren't sure what they're looking for, don't have any manners, or are hard to hear on their speaker/car phones.


That would be cool. Time to unearth that old Cap'n Crunch cereal whistle to test whether we are truly talking with another human.


The problem is that I don't want more voice; I want less voice and more automation.

I don't want to have to interact with a human because the only time a company makes me interact with a human is when they think it benefits them (upsell, make leaving difficult, etc.).

What I want are good ways for me to handle this stuff via computer without human interaction.


Well, now Google can provide you with the Human<->Human interaction and then give you the automation and computer interface.


For me, the money quote is:

> Duplex can only carry out natural conversations after being deeply trained in such domains.

The question becomes: how many people will really have the time and money to collect the data needed for Duplex to take over? It might be feasible for large call centers but difficult at the individual level.


Every call center in the world is now looking to sell their recordings.


Curious, where can someone buy these recordings?


Most call centers save recordings for "training and internal purposes".


Well, that will last until the shop owners install AI assistants to answer customers' calls; then we'll have robots talking to robots using human language, which would make the whole idea nonsense...


If the robots can interact with people who can't use a computer, then user accessibility could improve over the current situation. Like most technology, this has potential to improve the human condition for many or few. It depends on how people use it.


The working class people will be replaced by AI


The productivity gains from AI will make new jobs possible and will free up capital to be deployed in new economic creation. That is a cycle that has been essentially non-stop for the last 200-300 years (particularly aggressive and accelerating in that time frame). As populations decline and population growth stagnates, we won't have anywhere near enough people to fill the job openings in the future. The working class will benefit extraordinarily from the AI boom, it will raise their wages, it will produce an immense bounty of new jobs that will mostly go unfilled and it will drastically improve their standard of living.


>> AI

Arguable to some extent

>> new economic creation

This is a dependent variable

>> As populations decline and population growth stagnates

Another

I get it, this is the way it's supposed to work, but we can't just guarantee it.


Was thinking about that too. Even humans could start developing their own personalized shorthand lingo; think cattle auctions, ATC, or other narrow-topic applications that drop formalities and emphasize low ambiguity, speed, and confirmation. Could easily speed up computer-to-human calls a very significant amount.


You need to worry more about the people who work in call centers who're about to lose their jobs soon.


At least this might ease some of the pressure on the housing market in the Bay Area. By transferring minimum wage workers jobs from San Francisco restaurants and shops to datacenters in Northern Virginia and North East Oregon, we can free up valuable real estate for FANG workers struggling to find a room to rent within the budget of somebody earning only $250k per year...


> Upper middle class

Anyone who can afford the cheapest Android phone is not exactly upper middle class


Web scraping is so yesterday.

This is the beginning of robots scraping the real world.

Most of the examples Google showed were crafted to make this look like a friendly agent acting on behalf of users.

But the more powerful use of this, as illustrated by the deemphasized "holiday hours" example, would be for Google to use it to get any information they wanted out of anybody the robots can call and conduct social engineering on.

Imagine coupling this with the knowledge available to someone able to read your gmail inbox.

  "Hey you sent us an email two days ago about A and B..."

  (trust is established)

  "Can you clear up whether you were interested more in A, or more in B?"
OK not the perfect script but you get the point.


I'm really impressed this year. I hadn't realized ML was advancing as fast as it already is.

On one hand it's not very impressive that it calls someone, but on the other hand it's tremendous.

In 2018 we can synthesize voice this well and already understand such a narrow domain. In 2018 a system calls a human.

We are going so fast already; we should treat this Google I/O as a social milestone, otherwise we'll wake up tomorrow and totally miss the moment the future became now.

This is just one more stone on the path to a future where the digital becomes a second reality. The advances in voice will not stop. How long will it take until there's a speech model that can imitate anyone after listening to them for only seconds or minutes?

With voices like this, computers will be able to teach humans. We will be able to scale teaching and a shit ton of other things.

I'm impressed.


Besides the creeping dystopia of it, I find this aesthetically disgusting most of all.

If you listen to the sample call, the computer voice sounds incredibly _rude_, at least to my ears, especially at the end of the call. I would never speak to a service employee that way. I try to always say a clear and proper please and thank you, and would never ever want a robot to subject somebody to an "uhhhhh, thanks <hang up>" on my behalf.

It seems like in addition to the globalization of Californian social and moral standards, the world will now be subjected to Californian manners. What a pity.


Let's not get overdramatic here. To sound natural, the computer system of course has to copy natural behavior, which includes manners. The experimental version of this system happens to model the phone behavior of people who live where the developers work and live, which seems logical. It doesn't mean the same model will be used in other cultural contexts. An ideal model would copy your behavior (and/or voice), so it would be like you calling.

Just have to let Google listen and analyze all your conversations :)


I've worked in customer service. This is what I would consider an average call.


Mechanizing manners is just as creepy. Chik-fil-a employees aren't allowed to say "you're welcome", they have to respond to every expression of gratitude with the phrase, "My pleasure."


That's how most people I know would make a reservation…


Exactly, and you're from California, so you see this as normal. I'm from England. What Californians see as normal, or casual, or "chill", I think we see as rude. I think this extends to personal interactions, professional interactions, and customer/company relations like this one.

I imagine that if I used this service to place an order to a restaurant, it would order with the Californian "cannIgettuhh", which I would be scalded for as a child. I find it very ugly and I hope that this isn't forced upon the rest of the world. But really, it's just one facet of the way technology is destroying a lot of interpersonal respect.

One can even imagine this "deficient design" being taken to its logical conclusion with the inclusion of belching, grunts, and other inconsiderate bodily sounds.

The other poster says not to worry, surely the engineers will be more considerate about other cultures and customs when deploying this technology to the world, but I think that's incredibly naive considering this company's track record.


I guess I should be sorry for bastardizing your language?

But really, a conversation like the one you're looking for would be just as out-of-place in California as the one you saw in the article would be in England. It's important to match regional customs if you're trying to emulate and interact with humans.


I'm from the American southeast, and if anyone bastardized the English language it was us. When I listened to the recordings I cringed at how rude they sounded. If Google can build this thing, they can certainly adapt it to regional customs if they want to. Hopefully they do.


> which I would be scalded for as a child.

I really hope this is a typo and you meant "scolded"

scald: to burn or affect painfully with or as if with hot liquid or steam.

scold: to find fault with angrily; chide; reprimand:


Looking forward to the Irish version of the closing. Would probably sound like this: -Bye now -Bye -Bye bye b-b-bye [...]

This being said, good luck with the accents, the flat Californian is probably a good playground.


At last, putting the power of annoying robot phone trees in the hands of the consumer, to be directed at businesses.

For that reason alone, I'm excited. ;)

(Note: In fairness, the conversation demos are actually really slick and much better than a phone tree. I'll be interested to see how well it works in practice.)


How is it that the current implementation of Google Assistant can't even add stuff to my calendar in a natural way? I just tried the following: "OK Google, add an event to my work calendar for a meeting at Starbucks tomorrow morning at 11:30".

What was expected:
Title: "Meeting"
Calendar: "Work"
Location: "Starbucks"
Date: Tomorrow
Time: 11:30am

(Bonus points for associating an actual location, but possible ambiguity gives this a pass.)

What happened:
Title: "My work calendar for a meeting"
Calendar: Default
Location: None
Date: Tomorrow
Time: 11:30am
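The failure mode reads like greedy title capture during slot filling. Here's a toy sketch that reproduces both behaviors; the regex patterns are invented for illustration and have nothing to do with Google's actual parser:

```python
import re

UTTERANCE = ("add an event to my work calendar for a meeting "
             "at Starbucks tomorrow morning at 11:30")


def naive_parse(text: str) -> dict:
    """Greedy: everything between 'add an event to' and the first ' at '
    becomes the title, reproducing the bad result described above."""
    title = re.match(r"add an event to (.+?) at ", text).group(1)
    time = re.search(r"(\d{1,2}:\d{2})", text).group(1)
    return {"title": title, "calendar": "default", "time": time}


def slot_parse(text: str) -> dict:
    """Peel off known slot phrases first; only then pick the title."""
    slots = {"calendar": "default", "location": None}
    m = re.search(r"to my (\w+) calendar", text)
    if m:
        slots["calendar"] = m.group(1)
        text = text.replace(m.group(0), "")
    m = re.search(r"at ([A-Z]\w+)", text)  # capitalized word => place name
    if m:
        slots["location"] = m.group(1)
        text = text.replace(m.group(0), "")
    slots["time"] = re.search(r"(\d{1,2}:\d{2})", text).group(1)
    slots["title"] = re.search(r"for a (\w+)", text).group(1).capitalize()
    return slots


print(naive_parse(UTTERANCE)["title"])  # my work calendar for a meeting
print(slot_parse(UTTERANCE))
```

The naive version swallows "my work calendar for a meeting" as the title; the slot-aware version yields calendar "work", location "Starbucks", time "11:30", title "Meeting".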


This is a great point. The current Google Now and Google Assistant voice controls don't really work half the time, let alone 95% of the time.

I really hope all this speech AI really comes good, I really do. But at the moment it still is flaky for me.

I really want to use Google Assistant to take instructions and reminders for me. I can't wait until it can smoothly send messages and emails and add calendar entries.

Good luck Google.


Never have we had a clearer example of why Google needs to be regulated, and why laws against what they're doing need to be passed.

The fact that they think it's okay to have bots pretending to be human making phone calls, shortly after demonstrating how quickly they can copy someone's voice (re: John Legend), shows a blatant disregard for what they're creating.


I'm concerned that a lack of understanding and fear will limit the benefits we get from technology. I don't think that calling for regulation and making this illegal is the appropriate next step.

They explicitly say:

> We want to be clear about the intent of the call so businesses understand the context.

You note:

> laws against what they're doing needs to be passed.

What makes you say that robots making phone calls should be illegal? This seems an odd position to have. Do you also believe that it should be illegal for robots to clean floors? Or for your machine learning spam filter to process and filter your emails?

If I sound a little frustrated, I am. I'm scared that the future where technology allows us to do more with less will be blunted due to calls for making robot tasks illegal. I'd love if you'd help me understand what makes you take your position - I'm sure it's logical, and it's just a matter of me being able to understand your perspective. Can you help me understand?


Yeah, and I had your view before the last two years and the 2016 election. All we've seen is that this kind of technology will be abused to the gills to create fake propaganda, robocalls, and phishing scams.

I too want to, absolutely and without a doubt, regulate or slow this stuff down until we better understand how it can be abused, so that when it is abused we have ways to address it and aren't constantly playing catch-up with how people misuse these things.


I am far more afraid of technology we can't control that works for an ad company, not for us, than moving technology forward a little bit slower.

We need to be able to trust our communication, and we're moving away from that at light speed. Where a company whose entire business model is to subtly influence you in the directions profitable to them is the middleman in all of our communications.

I am terrified of this. I am not sure I've ever seen a presentation that terrified me more that wasn't out of a science fiction movie.


> I am far more afraid of technology we can't control that works for an ad company, not for us, than moving technology forward a little bit slower.

I'm not sure what you're arguing for - that you wish companies weren't driving technological advancement? That Google specifically shouldn't?

Also, it's very, very conceivable that there are people for whom this technology DOES work for. The mute, for example. I'm not sure it's appropriate to speak for all of "us".


It's tough here because Google is both an ad company and a product company (funded through ads).

Would you have the same fears if Apple were able to (through some ridiculous miracle) achieve the same thing?


"(through some ridiculous miracle)" -> best part of your comment :-D


You're not alone. This is deeply creepy, wrapped in candy floss.


At the very least, there should be a requirement to identify when challenged.

"Are you a bot?" "Yes, I am Google Duplex v1.2. This call is being recorded; you can see our privacy policy and terms of service at http://google.com/duplex."


There's no way I should be bound by terms of service that are secret unless I know the "code-phrase" with which to ask for it. Google's whole point was to deceive the user into thinking it isn't a bot -- so you wouldn't even know to ask it.

People will be forced to "discover" that they've been lied to, when the bot is caught in a loop and they're going through emotional distress thinking the "person" on the other end is having some sort of mental breakdown.


It's illegal in many states to record a conversation without obtaining consent which is something this product does as a matter of course. This product shouldn't be legal to operate in those states, so you don't have to look very far before finding critical flaws in this approach.

That said, "identify on challenge" is something that Google might be more likely to adopt rather than "identify on call pickup". The latter being something they might spend more lobbyist dollars to oppose.

But what you've said is certainly (or should be) a real concern.


I agree, but the social pressure not to use it would be immense. Imagine the awkward moments where we start confronting people who call us and asking them if they're real.


It's already happening. I'm getting robo calls from some police donation fund that will slow down and restart speaking if you interrupt them (took me a few moments to realize it was a recording).


Charities and political organizations are excluded unfortunately from robocall regulation. Google, though, is not.

If I were to start receiving Duplex calls, and could detect it, I would report each one to the FTC.


Perhaps they should identify themselves upfront? "Hello, this is an automated call on behalf of..."


This reminds me of some of Isaac Asimov's short stories in which the Three Laws are used to force robots to identify themselves.


Recording calls without both parties' knowledge can be illegal.


I think at the very least the bot should disclose that it is a bot right from the beginning.


The irony of someone named ‘...trekkie’ criticizing primitive AI tech is a bit rich, given they’re a fan of a future featuring a military alliance wielding uber-powerful weaponry capable of destroying populations (while continually proclaiming it’s on a mission of peace), computers pretending to be people and doing work for them, androids, and synthetic holographic AR/VR people.

We’ve arrived at full Luddism, where life-saving and time-saving technology in fields like health, transportation, and customer service will be inhibited by hyperbolic fearmongering.

Are you going to pass laws against realistic sounding synthetic voices? Against computers that understand queries “too well”? Against self driving vehicles that drive better and safer than people? All on theoretical harm that actually hasn’t taken place because you’ve watched too many dystopian Netflix sci-fi episodes?

I don’t know what’s more dangerous, the real Skynet, or people who might harm millions by voting for political policies that inhibit real improvements that could be made to help them.

At least, if you want to talk AI being used for harmful things we could discuss feed optimization that parasitizes people’s attention spans and keeps them glued or wasting money on pulling more slot machine levers. Technology that wastes people’s time and money as opposed to things that make people more efficient.


Is it illegal for my secretary or father to make an appointment for me?


Does your secretary or father pretend to be you when making the appointment? Google Duplex is pretending to be a human without the knowledge or permission of the person on the other end of the call.


One of the samples claimed to be an agent acting on the behalf of a client. Google says they're still deciding on the best format for these calls--here's hoping they take the criticism of people like you into account, and go with something like that example.

Of course, I don't really see why there needs to be an attempt to pass as human at all. It would be fine with me if they just reinvented the phone tree with a better ux. Especially since it's billed as a business-facing call.


Would you say the same thing if this tech was created by a small company?

Voice copying is nothing new; many methods exist, so I don't see how Google doing this is somehow bad. It's uncanny to say the least, but laws against tech progress? Come on :)

Did you think that advancing AI will be not creepy especially when it's good?


Come on now, this is cool! The correct initial reaction is to be impressed.

What is concerning is the surprising prevalence of technophobic notions on tech focused boards such as this.

There is something perverse about wanting to punish a company for creating something cool, and it’s certainly not the way our society or politics should function.


This is awesome, awesome tech. I just showed it to a bunch of people in my office.

But at the same time, the last two years have shown us that a lot of technology is being very effectively abused to misinform at a geopolitical level, and we as a society need to better understand or regulate this stuff for sure. Lives are at stake.

That said, I'd never say that Google shouldn't work on this. It's amazing. We just need to better understand not just how it can be used, but also how it can be abused.


It's awesome tech, sure.

But it's also "Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should."

They shouldn't've.


How would you handle deepfakes? This is much more disturbing than what is in the post, and it isn't a corporation who built it. I don't think this has anything to do with regulating google.


The reality changes. Fast. You won't be able to trust photos, audio recordings, or videos. You will be able to trust digital signatures. To have any opinion on whether information about somebody else is real, you will need a working web of trust. It's gonna be a great thing and I can't wait for it. It's going to change news, politics and in general the world we live in. It's going to be bigger than the Internet but way more polarizing and controversial. IMHO

One issue though, unless we hurry up and make this WoT decentralized and with open protocols, well, we will get it from FB and the likes.


Web of Trust and Reputation Networks.


To be perfectly honest, I've always personally been surprised at what business people are willing to transact over the low-bandwidth medium of the telephone. I've heard plenty of stories of people pretending to be celebrities and getting away with stuff on the phone; I don't need a sophisticated computer to do that for me.

If a real problem of deep-fakes in phone conversations emerges, people will just raise the bar of authentication in phone communications. There's nothing magical about telephone that makes it exempt from the general need to protect against social engineering.


Given the two technologies you've highlighted, what specific regulation would you recommend be added to prevent what bad behavior?


In the short/immediate term, it should probably be illegal to conduct a robocall without explicit opt-in of the destination caller. (Currently, if you get a non-telemarketed robocall, it is solicited, such as you scheduled an appointment and they are reminding you of it.)

Furthermore, if a robocall is permitted to be made to an unsolicited destination, a robocaller must clearly identify itself, presumably starting a declaration that it is a bot/automated agent from a given company on behalf of another company or individual.

And if the call is recorded in any way, shape, or form, the terms of service would need to be presented to the callee, giving them the chance to accept or deny said terms. (Note that if the example calls at I/O were not staged, and real calls, I suspect the recording of them would be illegal in many jurisdictions, including California.)


> In the short/immediate term, it should probably be illegal to conduct a robocall without explicit opt-in of the destination caller

Existing robocall law (including the National Do Not Call registry in the US) is focused around telemarketing. This isn't a telemarketing system; it's an automated assistant. I don't see the utility of applying the existing law to the new use case, as the goals are different (people at home don't want to be interrupted to be advertised at; businesses do want to negotiate business transactions).

> Furthermore, if a robocall is permitted to be made to an unsolicited destination, a robocaller must clearly identify itself, presumably starting a declaration that it is a bot/automated agent from a given company on behalf of another company or individual.

Why? If I have an assistant who makes an unsolicited call to a destination, they don't need to formally state they're acting on my behalf. What about automating the assistant's job makes it special?

I agree with you on the third point (I'm assuming Google has its bases covered there, because unilaterally recording a conversation is old and settled law).


As I said, it should be a law, I didn't expect the existing robocall laws to apply here. Note that in no way has Google explicitly limited itself to conducting business transactions merely because that is what it demonstrated here.

The issue is that the intermediary is a Google corporate entity, theoretically acting on the user's behalf, but at the end of the day, acting on Google's behalf. Consider that Google's bots may do things in Google's best interests, not the best interests of the party on either side of the transaction.


My assistant is another human being, theoretically acting on my behalf, but at the end of the day, acting on their own behalf. My recourse if our needs don't align is to fire them.

As we already have assistants who transmit our desires by proxy, I don't see much difference between a human and an automated script in that context---certainly not enough difference to justify the need for special-purpose law to clarify the nature of my assistant (and definitely not enough difference to justify shutting down the technology with only vague risk and no instances of social problems introduced by the tech).


> if the example calls at I/O were not staged, and real calls, I suspect the recording of them would be illegal in many jurisdictions, including California.

Interesting implications there for how Google will measure / improve the effectiveness of Duplex. Where is the line drawn between recording a call and recording data derived from a call? e.g. clearly recording the call unilaterally is illegal. Presumably just storing a hash of the audio data would be legal, but also useless. Is there some middle ground that is legal, but also useful?


The technology is really awesome but I'm not convinced it will fly:

* How much time do people really save, the call in the example is less than a minute. Maybe if you need to call like 10 places it becomes more helpful, but as companies get a bigger online presence (and they are) this technology becomes less useful. You can already make reservations/book appointments, find contracts pretty easily online.

* You can't be 100% sure that the AI won't make a mistake or sound like a total jerk on your behalf. Ok sure, the tech will improve, but it will be a long time before humans fully trust AI to represent them.

* If people find out you're calling them via some automated bot they're going to think you're a tool. Everyone remember Google Glass?

* What the hell is wrong with human interaction anyway?

edit: formatting


yeah, my first thought was how useless this is, since it's going to take a lot more time to program the robocall with the information and read the text digest of the order than it is just to call or text yourself.

And this thing will have to produce a text digest; it can't simply record the call without identifying that it is doing so, and that would destroy the illusion they are crafting.

The tech is extremely impressive, but this use case makes little sense except as a way to humanize it and get people to like it. It has to be more for businesses or some other solution.


I want as little meaningless human interaction as possible and I can't think of a less meaningful conversation than calling to book something.

I also don't care if someone I probably won't even meet thinks I'm a tool for using bleeding edge tech.


I have a deaf friend who would love this feature. Sure, a lot of stuff like this can be handled online now, but not 100%. Sometimes you really need to make a phone call, and for some people that's difficult.


Assuming they're in the US, your deaf friend can use a free relay service. There are a bunch of variations, some that allow signing to an operator, others typing. All are free, and as accessible as this google bot.

(I assume your friend knows about these services, this is an FYI others.)

https://www.verywell.com/internet-relay-services-1046808#typ...


I found the first one (female AI, hairdresser) amazingly compelling, but the second (male AI, restaurant) sounded like The Good Doctor's autistic lead character, and to me, the callee sounded bemused at him.

Part of this is definitely that the girl voice sounds cute, and this partially disables my cognition.

But objectively, her approach is more tentative and polite, whereas the guy voice is more direct and assertive.

They might not want the guy voice to take on those feminine qualities, but it would make the interaction work better - so female AIs dominate.

The effect on the listener may also help - however, I'm not at all sure that other people (especially women) react as I do to the cute voice. They might even find the Good Doctor male approach better - though I can't imagine that.

The trick of inserting "ums" is very helpful, but because they use the same sound-bite in the same way, it sounds mechanical after you've heard several examples. In the examples towards the end of the page, the odd latencies and (surprising) changes in volume were additionally off-putting.

After a few calls, recipients will recognize the patterns (esp if they use the same voice - can they vary voices convincingly?), and it might be better to have an honest reverse-menu system.

All that said, the first girl voice was great, and there will be progress.


I already had a comment above about this but another thing I thought of that I don't see anyone discussing:

What is the purpose of trying to fool the business owner into thinking it's a real person? It seems unethical, dishonest and disrespectful to the receiver to have them believe they are talking to a real person. In the case of an AI failure, at least the receiver will understand what's going on instead of becoming really confused. Sometimes I feel people in SV are oblivious to how their software can affect real human beings.


I would guess the purpose of fooling people is that you get a better result. If the call started with "Hi, I'm a Google assistant calling to request a reservation", you might get more hang-ups. Of course, the reality could be the opposite, as a business owner might make sure to speak more simply and clearly once they know they are speaking to a bot.


THIS FUCKING SUCKS.

I don't care about the awesome technical achievement. The fact that I believe that I am talking to a real person but I am not is the worst.

Can I call the restaurant and say, "Well, actually my wife doesn't like being cold, so I'm not sure the terrace is going to work tonight," and have the computer answer something completely random? That is just so bad.

The problem is not that the computer doesn't sound natural. The problem is that it cannot deal with off-script requests. In fact, the more natural it sounds, the dumber it makes the system appear!


It sounds like you simply want Duplex to be better than it already is, not that you object to the technology in principle. Let's say it can handle requests like position in a restaurant or food specials or <insert complex request here>, would you approve of it?


Funny how most people here are missing the point of this tech and talking about how this could be achieved by using an API or talking binary over the phone with another bot.

The idea here is to leverage an existent merchant base by adapting to how they work today. Suddenly they just integrated to millions of restaurants by adapting to them and not the other way around.

I'm sure Google Assistant will first check if there is a way to use tech like OpenTable to make the reservation and fall back to a phone call if there is no better alternative.
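A rough sketch of that dispatch logic (everything below is invented for illustration; this is just the "try an API first, fall back to a phone call" idea described above, not anything Google has published):

```python
from dataclasses import dataclass

# All names here are hypothetical -- a toy model of "prefer a structured
# booking channel, fall back to a Duplex-style call when none exists".

@dataclass
class Restaurant:
    name: str
    booking_api: object = None  # e.g. an OpenTable integration, if any

class FakeBookingAPI:
    def reserve(self, request):
        return f"API reservation: {request}"

def place_duplex_call(restaurant, request):
    # Stand-in for the expensive speech-based fallback.
    return f"Duplex call to {restaurant.name}: {request}"

def book_table(restaurant, request):
    # Prefer the structured channel when one exists; speech is the fallback.
    if restaurant.booking_api is not None:
        return restaurant.booking_api.reserve(request)
    return place_duplex_call(restaurant, request)

print(book_table(Restaurant("Luigi's", FakeBookingAPI()), "table for 2"))
print(book_table(Restaurant("Luigi's"), "table for 2"))
```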


It seemed to me Google Duplex only works on the receiving side on the phone call. Not sure how much more complex initiating a call is.


And then the salon and the restaurant start using Duplex as well, and all phone conversations become Google talking to Google.

How bout: "Hey Duplex, call this support number and get a top level human manager on the line please."

Then we get a HN article: "Duplex is fighting Alexa!"


> And then the salon and the restaurant start using Duplex as well, and all phone conversations become Google talking to Google.

This must be the most convoluted API protocol ever invented.


Google voice API, like a real API but much slower.


Duplex vs Alexa rap battle!


I wonder who’ll have the last laugh on this.


In the case where the person on the other end isn't a native english speaker, (the calling a restaurant clip) why doesn't Google figure out what language they speak and speak it to them?


I would imagine that a lot of the work that goes into making the speaking so natural would be very specific to English. As in, they would need to repeat 50% of the work for each language.


Do Scots count as native English speakers? What about the Irish? English has a very large number of accents and dialects that all count as 'English', but typically speech recognition software only works one variety.


Scots are definitely native english speakers, reluctant ones at that, but english nonetheless


I think the person you are responding to might have been talking about those who speak the language Scots, not just Scottish people.

Many consider Scots to be an actual language separate from English. There's a good amount of debate about this among linguists, I think.

For anyone curious, I recommend reading these pages from the recent Scots translation of the first Harry Potter book to get a feel for how it differs from English.

https://imgur.com/gallery/wjkDp#gSO4FRW


oh wow I thought scots was a joke language mocking the scottish accent

oops!


Currently I'm working on a similar project. Building a GAN (generative adversarial network) for voice takes a lot of work and testing; you need a big dataset of labeled voices. WaveNet currently supports English[0]. You also need a neural NLP model for the language.

[0]https://cloud.google.com/text-to-speech/docs/voices


That's actually a brilliant idea.


I did a whole write up about my homegrown attempt to do this last summer. That blog post from Google is more technical but eerily similar. https://philandrews.io/post/20000-phone-calls-later-why-siri...

In short, I don't have a lot of faith in this being plausible yet. The major reason is the phone lines. Phone call quality is not solid enough to ensure an accuracy rate high enough to roll this out as a production service. There are others, like legal considerations, but if I had to pick one that would be it.

There's a reason why this rose to number one on HN: people would clamor for it. To think that Apple and Google haven't been thinking the same thing is short-sighted.


It was the most impressive thing in the keynote for sure, but given the current state of AI we should have no reason to believe that this would work well enough to be employed at scale. It also feels weird that it is trying to trick the callee that it is a human. I believe even if they used a mechanical voice, the system could work, because businesses would learn to recognize the "google call" and respond to it the way they respond to any voice-enabled call center. It seems to me it is more of an (impressive) PR thing than something that will get actual use.


I agree. There's a whole world where I imagine businesses with stickers like the Visa accepted here stickers, except something like "Siri accepted here" or "Siri works here", whatever the wording. That way you would know that you could use Siri to interact with that business on your behalf. That takes away part of the resistance to phone bots by people that answer phones, them expecting those calls.


All this is great, and I am sure Google will use the tech responsibly, but I feel we are rapidly approaching a point where humans can no longer use their senses to discern critical truths about our own reality.

I honestly don't know how that's going to pan out. Just imagining it already makes me feel the kind of paranoia creep you get when you are too high.


That's true. People do not take psychological effects of technology seriously. The reason being most of us are already addicts. Who knows if this comment is by a bot or a real person.


Interesting, many people here seem to be fascinated by this, and I am sitting here thinking: oh my god, I hope I will never have to work on a phone receiving such calls.

The technology is kinda cool, but when I think about the poor sound quality of some calls and my experience with voice assistants, I wonder in how many cases this will end in just garbage appointments or very poor experiences for the human on the one side of the phone.

Besides that, I like how Google is pushing to change the current way of making appointments. Maybe this will drive more small/medium businesses to use online services for appointments.


Talking with humans is often a poor experience too. I can imagine employees of small businesses preferring to talk to Google rather than customers directly, as unlike real customers Google should be predictable, and never frustrated or angry.


That's almost worse. Can you imagine having an endless conversation with a human sounding bot? Say you work at a super popular restaurant that has most tables booked out for a month or so, and doesn't have anything ready beyond that, and the bot just walks through a semi-endless list of possible options.

I wonder what it sounds like when it runs out of choices, or when asked to get a dinner reservation @ <insert popular place> any evening at any time for the next two months.

The weirdest thing about these trained neural nets too is the small tweaks that break them in very interesting ways. The future is truly a surreal place.


As someone who has briefly worked at a call center, I'd find something like Duplex calling me preferable to the average caller.


The demos look as exciting as any chatbot demos we saw before, and I think it will fail in practice just like chatbots fail. With the exception of very few verticals in controlled settings, most real world tasks are way more dynamic and fluid than what we can comfortably code in some state machine. The article pointed it out, too:

> One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations.

The question is after you limit it to "closed domains" narrow enough, where it can still be practically useful. It might help with certain functions in enterprise settings. It will definitely work for spammers because they can work with even 1% success rate.


I think it's unethical for a robocaller to incorporate things like "ums" and "ahs" intentionally to deceive people into thinking they're talking to a human. At least, it's disrespectful.


I agree with that. I think it's time for us, the tech industry, to start thinking about the ethical implications of the things we build before we release them.

To make this salient for people: imagine this technology being deployed for political robocalls. An attractive voice masquerading as a person persuading people to vote for someone.


> To make this salient for people: imagine this technology being deployed for political robocalls. An attractive voice masquerading as a person persuading people to vote for someone.

They already hire telemarketing centers to do political calls; is this really so different?


I think robots masquerading as humans is unethical (or at least rude/disrespectful) in that context as well.


I don't see any disrespect in lying about who you are in anonymous interactions. I'm under no obligation to be truthful. It doesn't matter to the restaurant if I book my reservation under a fake name. It doesn't matter if my assistant books it for me and they act like they're me.

Software should be held to the same standard.


I agree about everything you wrote about anonymity. What bothers me is tricking someone into thinking they're having a human interaction with another person -- that the things they say will be heard and matter to somebody.


I disagree; ums and ahs are just part of the protocol of voice communication, hence robots should use them. I think it should just introduce itself as a robot.


I think we are in agreement -- I didn't say I thought ums and ahs are unethical on their own, just as part of deceiving someone into thinking they are talking to a human.


They have listed the problems with such systems in the first paragraph. They claim to overcome these by restricting to very specific domains. But specific domains are usually still wide. A table for two, but in the garden or inside? Inside on which floor? Sometimes it doesn't matter, sometimes it does, and these things are specific and different for every business.

Really skeptical about this. And if this does become a thing, it will dumb down the interaction.


The bot can gather this restaurant specific information over several conversations with the restaurant. This wasn't possible before. This domain isn't too wide.


You are probably right that in principle one could eventually come up with a full catalog of features of a reservation. There would be about, say, 100 of those.

I seriously doubt that they will proceed to define and collect them, since those are probably 10% or less of all reservations, but lets say they would.

Then still, the conversation you make to make the reservation is a process in which you make the decision.

Say, there is a place inside at 20:00 or a place in the garden at 20:30. Are you going to let Google choose between the two options for you?

Do you imagine there would be an api in which you specify to the assistant, before it makes a call, your preferences in that much granularity?


I feel like this is going to make the world even flakier than it already is. If I can waste people's time without even spending the few minutes it normally takes to make a phone call, what's to stop a restaurateur from effectively denial-of-service attacking competing restaurants, or what's to stop me from booking 100 dinners for Friday evening because I'm not exactly sure what I'm going to want to eat (or who I'm going to take out for that matter -- smartphones make it easy to punt on that decision too), so I retain optionality?


What stops a restaurateur from abusing Google's Gmail service? Presumably Google won't allow you to spam with Duplex either.

If you are going to be more enterprising then you can already do this. Just put some HITs on Mechanical Turk and let them place calls. Should cost you a few bucks to flood hundreds of calls.


I don't need reservations that badly. But the difference between having to specify Mechanical Turk work and just talking to a Google appliance is pretty huge... We're talking hours vs. minutes of work.


But I am saying that it is unlikely Google will let you spam reservations. This idea that some kid will now be able to make 1000 reservations via Google is not probable.


I think it's highly probable that either the technology will be rebuilt in a way that makes abuse possible (Google being relatively open with their technology has this side effect), or that Google won't put in enough safeguards to force people to use it responsibly, but we will see. Maybe you need too many AI experts and too much data to build this technology for yourself, and maybe that only exists at Google.


The voice sounds nice, but successful runs aren't that interesting. If it gets a question it doesn't understand, what does it do, and how does it report it back to the user?

It seems like the user is likely to get a certain number of confused recordings sent back to them when it fails, and then get stuck manually calling back and explaining what happened.


Probably by leaving a real human confused or even upset.


This is a masterpiece of framing by Google. Am I the only one who does not believe for a second that this technique will be mostly used to do calls for consumers, but to call WITH consumers? We will get back to asking hotlines strange questions to break the algorithm and reach an actual human.


Sigh, guess it won't be long before we're all asked to jump through some ludicrous human challenge to confirm we are not a robot every time we call someone


I wonder how voice captchas will work. This is inevitable. :)


What's wrong with Wolfie? (Terminator 2). Hopefully the AI won't be quite like that for a while. https://www.youtube.com/watch?v=MT_u9Rurrqg&amp=&feature=you...


> [...] we trained Duplex’s RNN on a corpus of anonymized phone conversation data.

That's alarming, and thinly cloaked in euphemism (IMO). "Conversation data" here means recordings of actual human-to-human calls, as well as their automated transcriptions. Both were used.

Where did that source audio come from?


For years, Google had a 1-800 free 411 service. Even a decade ago it was expected that they would use that data for AI.


The examples they give are quite impressive but I always want to hear some examples of failures as well with these types of technology. They mention that the system is self monitoring and will try to detect a situation it can't handle and redirect to a human operator but I think some examples of situations it can't handle or where it gets things wrong would be very useful in understanding its real world robustness.


There are plenty of places in my town that don't have online ordering or booking... this would (potentially) help with that. But really, I'd rather see more tech built around enabling these companies to "be online", such as an online ordering or booking system. When I find a haircut place, why can't I just book an appointment right from the Google Maps search? I'd rather have that type of convenience instead. Ironically they may add that at some point, and Duplex just calls on my behalf without me even knowing.


> Ironically they may add that at some point, and Duplex just calls on my behalf without me even knowing.

Yup, good observation; seems highly likely, save for the 'without even knowing' bit... I would imagine you'd 'request to book' and an async Duplex operation would run in the background and send you a notification of the outcome / possibilities.


I can imagine this being an extension to robots.txt - i.e. do I mind my business being robo-called, or would I rather send an API call to <x>. This is an interesting point in AI technology.
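Something like this, perhaps (every directive below is invented; nothing like it exists in the real robots.txt convention):

```
# Hypothetical "no robocalls" extension to robots.txt
User-agent: Google-Duplex
Disallow-Call: /
Prefer-API: https://example-salon.com/api/bookings
```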


Can't decide whether this is amazing or scary.


Just wait until a Google Duplex caller "on behalf of a client" calls to schedule a reservation at a restaurant using Google Duplex to answer the phones.


The most inefficient API you could design!


But backwards compatible!


Seriously. Instead of a single, clean restaurant reservation HTTP POST API, the future is two neural nets modulating and demodulating the request to and from inexact and potentially ambiguous English audio.
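For contrast, the "clean" version might be nothing more than a small JSON payload; the field names below are invented, since no such universal endpoint exists:

```python
import json

# Hypothetical reservation request body -- every field name is
# invented for illustration.
reservation = {
    "party_size": 4,
    "date": "2018-05-12",
    "time": "19:00",
    "name": "Smith",
    "notes": "outdoor seating if available",
}

# This is the entire "conversation": one unambiguous POST body,
# versus minutes of synthesized speech.
body = json.dumps(reservation)
print(body)
```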


Silly. The future is a steganographic handshake in the initial greeting, which negotiates an upgrade to a proprietary gRPC8 protocol when the caller and recipient are both Google, which Google uses to get a monopoly on telephone-mediated social interactions which it can then monetize by building a social graph to more efficiently target advertising to captive audiences riding Waymo cars.


You are joking, but it won't be funny in the future. :S


To be fair, we could make the same complaint in regards to the web - it's largely all plain-text on a line, as opposed to some form of compiled bytecode (I know, it's coming).

What we lose in using human speech for precision we make up in it being pretty much universal. Talk about an adaptable interface. You can phone the restaurant and do anything from reserving a table to ordering takeout to informing them that their cat is on fire.


Not backwards compatible. :)

(I mean that as both a joke and a real comment - you could never force every restaurant in the world to learn REST, but you sure can call a bunch of them)


I think Neal Stephenson would make the case that we started down that inefficiency road when we replaced telegraph signal with voice in the first place. ;)

https://www.wired.com/1996/12/ffglass/


Or Alexa answers the phone and refuses to speak to it.


I wonder if it would be able to tell if it was talking to another Duplex bot and instead of speaking in English, it would communicate more efficiently.


I wonder if they detect that and solve whatever task it is directly then.


It'd be trivial to detect if they transmit high-frequency noise over the line, and then fallback to DTMF tones.
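Detecting an in-band signaling tone really is cheap. A minimal sketch using the Goertzel algorithm (the standard trick behind DTMF decoders; the 8 kHz rate and 205-sample block follow common DTMF practice, everything else here is invented):

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Relative power of one frequency bin, via the Goertzel algorithm."""
    n = len(samples)
    k = round(n * freq / sample_rate)              # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# Synthesize the DTMF digit '1' (697 Hz row + 1209 Hz column) at 8 kHz,
# using the conventional 205-sample analysis block.
rate, n = 8000, 205
tone = [math.sin(2 * math.pi * 697 * t / rate)
        + math.sin(2 * math.pi * 1209 * t / rate) for t in range(n)]

present = goertzel_power(tone, rate, 697)   # frequency we actually sent
absent = goertzel_power(tone, rate, 770)    # a DTMF row we did not send
print(present > 10 * absent)  # True: the sent tone stands out clearly
```

A real detector would also check the column frequency and apply absolute thresholds, but the principle is just this: one bin of a DFT, computed in O(N).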


That'd be a creepy conversation to listen in on. "Ah yes, I see you are a robot too." Proceeds to transmit data through dial-up-esque noise


Since the restaurant in this situation is clearly using a digital booking system already why would Google (or whatever service inherits this kind of bot call) not just check on the popular booking sites before placing the call?


moving prank calls to a new level


Yes, it would be better if those businesses had systems/APIs for those transactions, but we still live in a world where scanning, with a phone, a sheet of paper that was printed and mailed to you often reduces friction in moving money between two accounts.

That being said, as someone who has used "assistant as a service" stuff before, I wonder how well this will work or how limited it will have to be, and not just because of the AI itself.

Even with humans on both sides, it's amazing how hard it can be to get an answer let alone a request fulfilled in a single phone call.

Questions about table placement, food allergies or other restrictions could come up, which I wouldn't want going to an operator. I'd rather be told in advance that it is calling the restaurant and have it send me questions in real-time with suggested answer buttons until it learns enough about me not to need them.

In other cases, just having it call and stay on hold for me would be useful. "Ok Google, call my mobile operator so I can talk to someone at around 10am" and having it use data it already has from other calls to place the call at around 9:45 and patch me in at the right time would be useful.


Currently there are about 3 million people working in call centers in the US alone and millions more in other countries [1].

Given the technological trajectory, over time there will be less need for people who serve purely as the ‘interface’ using minimal skills and knowledge. At the same time, we still need many more people to work in the physical world: cooking nutritious meals, construction, and caring for the elderly are some examples.

Since we cannot assume that everyone can develop the skills needed to thrive in demanding technical or knowledge-based jobs, a key priority in many countries should be supporting certain segments of the population in developing the skills and attitude necessary to work in these physical jobs, most of which are too complex for AI and robotics to effectively replace in the next few decades.

In addition, vocational education should be improved and updated to make use of appropriate technology to increase productivity and reduce physical demand on the body.

[1] https://info.siteselectiongroup.com/blog/how-big-is-the-us-c...


During the Battle of the Bulge, the Germans infiltrated the American lines with fake Army officers who would give unproductive and confusing orders. The impostors spoke perfect English and often had been raised in the US.

The GIs unmasked them by asking them questions about baseball and shooting any who gave a wrong answer.


It's worth noting that Google Assistant has been able to make reservations with OpenTable since 2014. No need to have Duplex talking to Duplex; it's just protobufs or JSON or XML or whatever people use nowadays for RPCs.

Google Duplex is about helping the long-tail -- it's work that is done by studying the needs and processes of the smallest of small businesses, and tailoring a product just for them.

I've lost count of the number of college hackathon projects where they say, "oh, push a button and you get a pizza" and they think they'll just put an iPad in the kitchen, and then fizzle out when they get to a real restaurant.

In practice, the restaurant might pass around pieces of paper in the kitchen. So you think, oh, I'll put in a thermal receipt printer. But then you realize that they don't have Wi-Fi or internet, so now you have to put in a $70/month internet bill, on top of the phone bill, and a router or two. So you think, "oh, I'll use a fax machine", or "I'll integrate with the point of sale system". But the fax machine runs out of paper, and the point of sale system is an offline piece of ---- running Windows XP. And even if you do get them using an iPad or OpenTable or Yelp or whatever, before you know it, you have waiters writing on a computer monitor with a whiteboard marker: https://javlaskitsystem.se/2012/02/whats-the-waiter-doing-wi...

But every one of these businesses has a telephone number, whether it's a landline or a cell phone or whatever.

When pg says to talk to your customers (https://twitter.com/paulg/status/898476047263518720), he means, talk to your customers. You'll be surprised by what you learn.

(Disclaimer: I work at Google, but on YouTube, not on this product.)


One positive thing about our dystopian future is that it will force us to meet up more in person, if only to confirm our friends are real.


If you can't tell if your AI friend is real, are they not a real friend? They might not be human, but they will still be your friend, no?


I think they meant: are you really talking to your friend the human, and not a bot impersonating that friend.


But then you would need to shoot them to see if they bleed or not.


And claw at your own eyes to make sure you are not in a VR :)


Nah because with everyone using AR, Google Glass, and Snapchat spectacles, we won't even be able to trust what we're seeing isn't manipulated.


It's already happening, most people I know rarely, if ever, answer calls from unknown phone numbers.


This is awesome, but:

The algorithm of course doesn't understand all contexts. What troubles me is that, in the first example, should we really give the algorithm the freedom to propose a new date for an appointment? It reminds me of one thing that particularly bothers me about Gmail's smart reply feature: when given Monday or Wednesday as options, the suggested reply is 'How about Tuesday', which does make the conversation flow but doesn't really make any logical sense.

It makes a good demo, and I am very much impressed; however, I feel it will run into a LOT of issues, even in just those provided scenarios, should those scenarios become more sophisticated.


I'd expect you to be able to give Duplex the range of dates and times where you're available (or let it get the info from your Google Calendar).


Exactly this. I put together a rough demo of similar AI scheduling in the past and x.ai does this as well. The AI works within a given constraint before proposing new times because the idea is to free up time for you, not make you clean up a scheduling mistake every time something is decided by the AI.
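The core of that constraint logic is tiny. A minimal sketch (all slot data invented): the agent may only counter-propose from the intersection of the user's stated availability and what the business offers, which rules out a "How about Tuesday" that nobody put on the table.

```python
# Minimal sketch of constraint-based counter-proposal (all data invented).
# The agent proposes only times that are both offered by the business and
# inside the user's stated availability -- never an unoffered "Tuesday".
user_ok = {"Mon 10:00", "Mon 12:00", "Wed 12:00"}
business_offered = {"Mon 12:00", "Tue 12:00", "Wed 12:00"}

proposals = sorted(user_ok & business_offered)
print(proposals)  # ['Mon 12:00', 'Wed 12:00']
```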


Ideally this would have to have access to your calendar or be given a time range of options.


I had the impression that people got uncomfortable with how fast and dry it says "k--thx", as if it found talking to the person annoying. In all cases they kind of changed their flow on the hanging-up part. I don't know, some seemed nervous. I guess they feel that the machine is quite curt and dry.

Also, the woman in the last audio got a bit flirty, haha; she even said, rushing, "see you next Friday" (or something) when hanging up.

Once again, my impression.

edit: It came to me that, at some point, it will be able to wander off a little, giggle and stuff. So creepy!


Next what, phone interviews with Duplex?

Interviews are dreaded by most engineers. What are the chances that Google might be testing this in the wild, given the number of applications they receive?


This is cool and all, but why not use a web booking system?

(yes, i'm partly serious. Why engage in developing an area of technology where there's a much more elegant and efficient solution?)


could it be for those who don’t have one?


If you speak with a robot you should know that. If you can't tell, the robot should identify itself.


I would have preferred it if this technology was developed at a university. It seems that the academic world is being replaced by the corporate world at a worrisome pace.


I wonder, if this will be confined to one-party consent states/countries for the foreseeable future.

Google will most likely want to use recordings to keep fine-tuning and improving upon Duplex, and I don't see them announcing "This call is recorded by Google.", when they're going through such great lengths to convince the called parties that they are talking with a human being.


There is something about this that is just a bit unpalatable. It's as if getting businesses to define their products and services in a common computer-usable format and interface were such an insurmountable problem that we would rather build million-to-billion-dollar supercomputers so we can skirt the problem and regress to the lowest common denominator of communication - the spoken word.


I'm actually surprised at how many people are impressed by this. I have been using Google's assistant technology for a while now. It's amazing! I just wish they'd release the new voice ASAP! This is what you can do with it: https://vimeo.com/251603335


I've been building something similar to this for a while, called Interval - it's a plain booking engine for small businesses. Obviously nowhere near as good at speech recognition, but it's also not tied to Google or their baggage. Works over SMS or fbm.

https://www.interval.org/


Google Duplex works on the domain-specific conversations it was trained on. Why can't we have an AI system which can learn the language from A to Z, with all the dictionary words, and then understand, speak, or read anything normally, the way a human does?


"Duplex responding to a sync" makes me oddly emotional. The AI (or "she", as I'm inclined to say) answers the question "are you here" with:

> Yeah, I'm here

I don't know why, but a machine dynamically saying "I'm here" in a completely natural-sounding way and in a very dynamic context really hits me.


> transparency is a key part of that

Transparency would mean starting the call by saying "I'm a bot from Google".


Stuff like this makes me excited for the world my kids will live in.


Let's have Duplex call up Comcast to negotiate or cancel my service. Then we can call this AI.


Just a side question, but related: is there any good and new ML research/model/example for language detection?

For example, I have a conversation in both English and Russian and I want to segment the input by language, then handle each language separately.
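As a very rough baseline for the English/Russian case specifically, you can get surprisingly far segmenting by Unicode script alone; a real system would use a trained language-ID model, and the Cyrillic-range test below is just a stand-in for illustration:

```python
import re

# Crude sketch: segment a mixed English/Russian transcript into runs by
# Unicode script. A real system would use a trained language-ID model;
# the Cyrillic-range test here is just a stand-in for illustration.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def segment(text):
    runs = []  # list of (lang, [words]) pairs
    for word in text.split():
        lang = "ru" if CYRILLIC.search(word) else "en"
        if runs and runs[-1][0] == lang:
            runs[-1][1].append(word)        # extend the current run
        else:
            runs.append((lang, [word]))     # start a new run
    return [(lang, " ".join(words)) for lang, words in runs]

print(segment("hello привет как дела are you ok"))
# [('en', 'hello'), ('ru', 'привет как дела'), ('en', 'are you ok')]
```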


Is any product remotely as close to this with natural-sounding human speech? Are we entering an era where all voice assistants are getting to that "Her" level in terms of human-like vocal quality?

That part is as impressive to me as the semantic parsing it's doing on that call.


Not that I've seen. The most advanced IVR systems that some banks are using have voice fingerprints and recognition etc but the TTS is still noticeably robotic.

Other companies are using huge libraries of recorded human voice for communications and concatenating them together in intelligent ways.


So google is recording the content of our phone conversations? Did this corpus come from Fi?


Can't wait to get a million duplex calls a day about "the car accident"


There's a far-future sci-fi novel where "please" and "thank you" are considered insults when spoken to people... because that's how the wealthy prefaced and ended their computer commands. Vinge, perhaps?


Weirdly, none of the 449 comments before mine mention the word "porn."


If the line is busy I assume they call back as well? That's another time saver for the consumer.

And it's a great spin on the Future of Work. It's offloading the time from the consumer onto paid workers at the businesses.


I hang up on robocalls now... I guarantee this service is going to get hung up on more often than it works. People don't like being treated like crap at home; why do we need to treat them like crap at work?


There's a difference between a spam robocall to your personal number and a robot calling a business line to inquire about legitimate business. I imagine if my job were answering phones I'd much prefer talking to a robot that speaks clear English over actual people who could have difficult accents or who might just be rude on the phone.


If you were getting robocalls from people wanting to buy your product would you hang up?


Interesting. If I knew I was talking to an AI, I might try to provoke it into giving up private details, like the woman in the demo who asked for the client's first name.

"What appointments does [name] have for the rest of the day?"


This is really impressive and all, but what's wrong with saying upfront that this is a robot, letting it speak in its robotic accent, and letting the human know they're speaking to a robot? Assuming this tech is just to bridge the current period when most businesses aren't fully automated, and the end goal is to replace BOTH ends of the conversation with robots, then the conversation protocol doesn't have to be English, or even voice.


Is this like a dystopian version of my people will contact your people?


I want to use this to fight spam calls.


People have already had good success engaging telemarketers with much more primitive robots. Here are some funny samples on YouTube:

Lenny: https://www.youtube.com/watch?v=LgT44DuIaAM&list=PLduL71_GKz...

Jolly Roger: https://www.youtube.com/channel/UC3OxCWLEmoIhNMm-hnvBm9Q


Can it also adjust its accent or tone to make itself understood by the other party? This would be really helpful if I didn't have to spell my name as R for rock, O for ocean...


Am I the only one who sees that this product is the result of work done for the military? Task: Sergeant Google, use everything you have to analyze all audio streams in real time. There are multiple enemies to look for, be alert! Google: Aye aye, chief. Can I use this for my troops to order pizza? Chief: Affirmative. Do it, this is 'Murica, pizza is important.


Just wait for the business to install Duplex on their end, and realize that we have just created the least efficient computer communication protocol ever.


I wonder how this will work in situations in which botting for reservations has been frowned upon, e.g. the restaurant world's version of high-frequency trading:

https://www.buzzfeed.com/jwherrman/how-robots-are-stealing-y...


> This summer, we’ll start testing the Duplex technology within the Google Assistant, to help users make restaurant reservations, schedule hair salon appointments, and get holiday hours over the phone.

I don't see any details on how this is beyond the research phase.

Is Duplex a product or developer API?


I’m wondering about the legal implications of such feature. By making calls and reservations for the customer, does Google Assistant (and therefore Google) become the agent of the customer as a principal? If it does, shall it consequently take up legal obligations of agents (e.g. reasonable care) and potential responsibilities for its nonfeasance thereof?


Does anybody know if the tech powering this has anything to do with Google's quantum computing efforts (D-Wave, Bristlecone, etc.)? I feel like it does, especially the natural speech generation.

Reason I'm asking is, I am interested in understanding what the intersection of/link between AI/machine learning and quantum computing is, if there is one.


Is there going to be a way for businesses (any and all receivers really) to opt-out of receiving these calls?

Could it be opt-in on the receiver-side?


This would be so awesome for getting prices. Imagine having it call the closest twenty mechanics and getting prices for new brakes!


Westworld theme park coming sooner than expected


This could be great for GOTV efforts and general canvassing.

Imagine a system designed in the voice of the candidate that can call you and answer most any questions you have about the platform (or log when you don’t have an answer to be updated later), remind you when to vote, send you a Facebook friend request, etc.


The problem they are going after is that small businesses don't have an online appointment system. Wouldn't it be simpler if they made one and offered it for free to all businesses? The voice demo was fun to borderline creepy, but are we really at the point where this can work at scale?


They could have, but it certainly wouldn't have been as much fun, and they wouldn't have learned nearly as much.


Can someone from the research community comment on how this compares to the state of the art in open research?


Neural networks are a fundamentally statistical technique, so you find the state of the art wherever you find the largest scale. No one is operating at larger scale than Google, which has the second-order effect of attracting the best talent, which then gets sorted to the highest-priority problems. There are a non-trivial number of instances where Google will beat their own state of the art, and I'm quite confident they are sitting on further results to avoid a public embarrassment of riches.


I'm curious to see how this plays out. This assistant has a limited set of voices. Imagine an employee receiving reservation requests at a restaurant, being called up dozens of times on the same day, with the same voice, for dozens of different reservations.


This is bonkers, makes me want to start a small business just to get these calls and goof around :)


My Google calendar has an occasional duplicates problem, I'm not so sure that Google won't one day make 3 identical restaurant reservations for me by mistake.

Hoping that once they've solved robot calls, they'll probably have a go at some of the harder things like synchronization ;)


This is neat. I'm looking forward to the day where two phone AIs get stuck in some amusing conversation loop.

I wonder if this sort of technology will result in some sort of arms race / singularity where everyone, businesses and consumers alike, ends up needing to use phone AIs to stay sane.


This seems really backwards. Talking to people over the phone is a very easy task. Getting through the automated phone system to the person is usually the hard part. Automate that, and you'll be at the top of every app store.


I worry that this will make people have an AI make phone calls for them regularly when they wouldn't have in the past. At some point, the restaurants will need to have their own AI to respond to the call volume. You end up with 2 AIs attempting to have a conversation.


If the main impediment to dining at a restaurant was having to reserve a table, I think that would be a real possibility. But the main impediments are usually cost, complexity in getting to the restaurant, unfamiliarity with how good the food and service are, agreeing with the rest of the party about the timing, etc.. The actual booking of a table is a minor stepping stone on the road from the thought to the actual execution of the plan.


Give your credit card number to one of them and it will be even more fun.


Just a note for the record: this has been on the YCombinator frontpage for over 24 hours.


let's see if this works future google find and quote this comment back to me when i am 85 _use a heavy 17th century ural peasant accent, slurring voice to imitate heavy drinking, then proceed to sing me a song about your lost love, don't read the edited part_


This could be great to answer all the spam calls and keep them engaged so their costs shoot up.


The spammers are the ones who will have this technology, not the call receivers. Your honeypot is a drop in the bucket.


Actually, we did it from the other side (the business side) a year ago; you can watch the demo here: https://dasha.ai/en/


I wonder if we could solve phone spamming with this? By having the bot call back any robocall numbers to get one of their human salespeople on the line, and just waste their time. If enough people did it, would it destroy that shitty industry?


Phone centers got outsourced to India and other places a while back, and while they spoke the language, there was definitely a drop in the quality of the resulting service. I dread to think what this will be like.


The 2nd phone call idea seems to be stolen from Family Guy : https://www.youtube.com/watch?v=rjzQ_1MvmDk


Maybe I won’t have to overcome my fear of scheduling appointments after all.


The most terrifying aspect for me is not massive job replacement, but such AI used for robocalls, fraud, ... Imagine you got a phone call that sounds exactly like your mom and asking for help!


One thing that comes to mind is: robocall hell.

Combine this with the fact Caller ID is no longer reliable.

I think it's time to replace Signaling System 7 with 21st century technology.


I'm pretty sure it's something they have been using themselves in the past few years to gather the opening hours of every single business listed on Google Maps.

And now making a product out of it.


The real application here is that duplex could order you takeout.


I have to wonder if lawmakers will force companies using technologies like this to have their assistants identify as virtual before proceeding with the conversation.


It’s funny to me that most places where you can make any kind of appointment online usually have a captcha. I wonder if they’ll start having a verbal captcha on phone calls.


Great comment. Imagine the person on the other end of the line saying... what? What verbal captcha could a system like this not answer?

What do you feel if I step on your foot?

Do roses smell good?

Even if the name is not "rose"?

What's the color of melancholy?


What do you think is the likelihood of an open-source implementation of this in the near future? Either by them, or by them releasing the research?


Google is pretty amazing for sharing their secrets. Who would ever think they would give away Borg?


Wish they would provide the integrated package as an API.


When the robocallers get this, hell will be unleashed. The only remedy is for us all to have a Duplex-like service answering the phone for us, and let them duke it out.

Does Google say anywhere these were all real calls? Or did they call back to cancel the appointments? Because it would be really easy, and tempting, to just fire off 10,000 of these calls to businesses around the country, just to harvest data on how well it does. And leave a massive trail of fake bookings. Even if Google wouldn't do this, the next company attempting this will.


Finally, I will be able to order pizza from the good place that hasn't heard of the internet without actually having to talk to someone.


This technology totes won't be used to mass call politicians to express an opinion on some vote. Totes.


I think every telemarketer just lost their job.


Avi Ovadya among others has talked about a coming infopocalypse in which fake anything can be generated. Combined with data troves from Facebook or wherever they can simulate actual people to completely steal identities on a large scale and create a society disrupting mass hysteria to start a war or a mass panic.

IMHO, to fix this everything needs two factor authentication generated by a biometric scan in person at a government office. Yes you could use a blockchain for it too.


Agreed through the identifier bit. But the first part, yes.


At the same time, 9 out of 10 times when my wife calls me the system pretends it's ringing, while I see no sign of a phone call. Both of us are on Project Fi. (We tried it standing next to each other, both with full LTE.) I guess Google figured people will blame each other, not the network. Anyway, I can't keep disabling all the smart features of smartphones; I might just get a new Nokia 3310.


This is scary. I want to hear Duplex to Duplex conversations. That will be even scarier.

Amazing result. These guys should be very proud.


So... theoretically I could operate an AI call center that would serve most languages? Well, as a first measure, anyway.


Can we get a conversation between Alexa, Siri and Duplex in the same room? It's a viral video waiting to happen.


I look forward to the day we use this to battle the incessant automated telemarketing calls. Bot vs bot.


Is there any information about when this might be available? Or is it just the demo for now.


The real issue is: how many wiretaps did Google have to do to get this?


Why don’t they think even further ahead and build a robot that cuts your hair?


I’d love to get one of those calls and then try to irritate it as much as possible.


Next time a telemarketer calls, I'd like this to answer the phone


Crazy! Looks similar to Upcall.com, where they have real people calling for you.


Never thought the movie Her would come true in my lifetime. And here we are.


As a semi-Kurzweil fan I've kind of been expecting his Turing-test-by-2029 prediction. Things seem pretty much on schedule, perhaps even a tad ahead, though we'd need some breakthroughs to move closer to strong AI. But the hardware is on track and there are an awful lot of top-of-class PhDs working on this stuff.


I'm wondering: Will people now start to do audio phone captchas?


So that’s what “this call may be used for training purposes” means!


Voight-Kampff test: PASS.


This is truly amazing work! Congrats Google for achieving this :)


How many Google Voice users do we have here?

"To obtain its high precision, we trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation, the parameters of the conversation (e.g. the desired service for an appointment, or the current time of day) and more."

"Anonymized!" "Honest!" says the surveillance-capitalism advertising mega corp...

Luckily I've never been able to use Google Voice - but I doubt they're the only threat actor using phone conversations and metadata to train neural nets... Pretend anonymized or not...


Add me to the list. I love GV as a great tool for so many use cases. Great for handing out a phone number that is not really your phone number - e.g. to students, for certain office hours.


This is only viable for businesses that still use phones. Won't this become obsolete as everything moves online eventually?


Amazing features


Interesting prototype, but not practical beyond trivial use cases yet (plus this is clearly a guy's version of booking a hairdresser). Generally I'd have other requests, like: it's a date, can we have a quieter table; or a view of the game on TV; or it's a birthday party... Also impressed that it could interact with a person "naturally", but ethically the other person should be told that it's a bot and have the option to ask for a callback.



