I implemented a similar BiLSTM-CRF model at my current job. The architecture itself is really interesting, but it runs into scaling issues: with LSTMs you're constrained to wait on previous timesteps and cache those results as well. Although TensorFlow now offers cuDNN RNNs and fused kernels to speed up computation, I'd have thought that at Amazon's scale an attention/transformer-based architecture would serve them better.
I also notice a lot of dismissive comments about "black box models" or simple solutions like just splitting on whitespace. My two cents:
1. Models with hand-crafted rules perform WORSE than learned representations, especially when you have an end-to-end model with pre-trained embeddings. This is shown by one of the seminal papers on this model, Ma and Hovy (2016): https://arxiv.org/pdf/1603.01354.pdf.
"However, even systems that have utilized distributed representations as inputs have used these to augment, rather than replace, hand-crafted features (e.g. word spelling and capitalization patterns). Their performance drops rapidly when the models solely depend on neural embeddings"
2. Human speech and human written text are messy. Having a rule for human speech will inevitably lead to a massive list of rules and exceptions to those rules.
3. This model is multi-domain, meaning you don't just need rules for one domain, but rules for multiple domains and for the interactions between those domains. Considering Amazon's hefty amount of data, it's much more efficient to learn these representations through a machine learning model than to constantly play cat-and-mouse keeping your hand-crafted rules up to date.
Spot on. Just a heads up, there is a decent amount of work on using convolutions to condense the initial representations, which can reduce computation time in the same way as your max pooling. A lot of these tasks can be handled via hyperparameter search over CNNs, so you can easily reach parity using a CNN-LSTM approach with the same number of parameters.
In text parsing, this whole machine-learning approach would be implemented as "ignore whitespace as the delimiter" (that's how I implemented it in one of my projects).
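For text, that baseline really is a one-liner; a minimal sketch (function name is mine) showing both why it works and where it breaks:

```python
def parse_list(text: str) -> list[str]:
    """Naive baseline: treat any run of whitespace as an item delimiter."""
    return text.split()

# Works for single-word items...
parse_list("eggs milk bread")          # → ['eggs', 'milk', 'bread']
# ...but breaks multi-word items, which is the whole problem here:
parse_list("eggs milk peanut butter")  # → ['eggs', 'milk', 'peanut', 'butter']
```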
You don't need commas to do that in text parsing either. Newlines, for example, will do.
When spoken, a shopping list is not a sentence. There's a small pause and/or different emphasis on the start of each item that can be learned (humans, for one, can discern it).
"Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter".
(Besides it can easily learn that peanut, singular, is not a thing people order: it's either "peanuts" or "peanut-butter" etc).
You're basically saying that they should build a speech recognition system that links words that should belong together with a hyphen... great then: that's exactly what this article is about.
Yes, they do. There's evidence to suggest that punctuation marks were devised as pronunciation guides, indicating how to inflect and when to pause, rather than syntactic markers in their own right. Commas in particular indicate a distinctive inflection and short pause in speaking; such would be detectable by Alexa especially if it uses a neural net or similar to analyze human speech.
That was not the point of what I said at all. It could interpret the vocal cues, yes. As far as I know this is still not a solved problem for speech-to-text, and going the other way, guessing the punctuation from the text, is still more reliable.
Back to what I really was getting at: I'm pretty sure the person I replied to was suggesting Alexa could just split(',') and call it a day. With text, yes. With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly. I am certain humans use a mix of vocal cues and interpretation to place the commas in their heads.
- ignore the comma, point and space as delimiter and compare the values / entities against a dictionary for neighbouring words.
- don't ignore the comma and compare the values against a dictionary.
Put a priority (or, in machine learning terms, a classifier) on both outcomes, because the comma is not reliable in spoken language. That way it would interpret "peanuts butter" as [peanuts, butter] and "peanut butter" as [peanut butter].
PS. Now I hope that speech-to-text translates a spoken "peanuts" correctly to [peanuts] and not [peanut], because that would fail.
PS2. The article itself doesn't mention the punctuation problem.
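A rough sketch of the two-outcome idea above: generate every segmentation and rank them by hits against a dictionary of known items (the dictionary here is invented for illustration):

```python
# Hypothetical known-items dictionary; a real one would come from catalog data.
KNOWN_ITEMS = {"peanuts", "butter", "peanut butter", "milk", "eggs"}

def segmentations(tokens):
    """Yield every way to split tokens into contiguous items."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in segmentations(tokens[i:]):
            yield [head] + rest

def best_split(utterance):
    """Prefer the segmentation with the most dictionary hits, then fewer items."""
    tokens = utterance.replace(",", " ").split()
    return max(
        segmentations(tokens),
        key=lambda seg: (sum(item in KNOWN_ITEMS for item in seg), -len(seg)),
    )

best_split("peanut butter")    # → ['peanut butter']
best_split("peanuts, butter")  # → ['peanuts', 'butter']
```

The comma is deliberately thrown away before scoring, which is the point: the dictionary, not the punctuation, is doing the work.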
>PS2. The article itself doesn't mention the punctuation problem.
It doesn't go into detail but it does seem to mention it.
>Off-the-shelf broad parsers are intended to detect coordination structures, but they are often trained on written text with correct punctuation. Automatic speech recognition (ASR) outputs, by contrast, often lack punctuation, and spoken language has different syntactic patterns than written language.
>Understanding pauses/inflection changes doesn't have to be a "solved problem" to work for cases such as discerning common shopping list style items.
Okay... But Alexa isn't just shopping lists. You only know you are dealing with a shopping list after parsing the text.
Even if you did go back, is the narrower use case any more solved than the general one? Guessing from text alone turns out to be fairly accurate, so even if you could do this decently, it would have to be notably better to be worth the trouble.
>That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.
Irrelevant to this though, considering that problem has mostly been solved at this juncture.
This is surprising, I didn't know Alexa was capable of this. Whenever I say "Alexa, add milk, eggs, bread, & laundry detergent to my shopping list" it shows up as one long sentence in a single entry.
This is only surprising because a human would parse the waveform as two words and then back into a single concept. A computer could parse it as a single concept directly, or in syllable-length chunks to be reconfigured however.
A human really wouldn't. It's the same thing as telling the difference between black bird and blackbird in running speech; peanut butter may be spelled as two words, but it's spoken as a single thing. If the concept was new enough that you were consciously talking about a type of butter made from (of all things) peanuts, it would be a black bird vocal entity rather than a blackbird one.
You're still talking about words, but the issue is waveform sound. An English speaker would hear "black bird" and use rhythm and context clues to parse it into words and then into the appropriate idea. A non-English speaker might hear something like "blekabirt" as the waveform sound is interpreted through their particular linguistic habits; then it would be rejected as a non-word sound.
It's true that we do alter our speech to provide context clues, it is also true that without them we're _still_ capable of piecing them together.
If someone says, in an unnaturally drawn-out way, "I like peanut butter sandwiches", then I will have no problem detecting the situation and re-parsing it correctly.
Your brain thinks of these two words completely differently and it's only through conscious effort that you think of them together. They are different words even though they sound and are spelled the same, regardless of the space.
A better example I think is "bear feet" vs "bare feet"
"Blackbird" refers to several species of actual bird, not just the plane. "The black bird ate seeds" and "The blackbird ate seeds" are both reasonable sentences, and they do potentially sound different.
I'm not convinced. It takes effort for me to break apart blackbird into two separate words in my head, as they are so commonly found together. When speaking "black bird" I would insert a long pause between the two and emphasise the "b" on bird to show that I'm not talking about a "blackbird".
I recently heard about a book called "Anything You Want" and wanted to add it to my shopping list. I tried repeatedly saying "Alexa, add 'Anything You Want' book to my shopping list" and every variation thereof. Alexa was unable to add that book title no matter what I tried.
This seems like a fundamentally hard problem. If I ask Alexa, "Play songs by Simon and Garfunkel", I may want to include their solo work ("Play songs by ([Paul] Simon) and ([Art] Garfunkel)") or not ("Play songs by (Simon and Garfunkel)"). The choice is probably more likely for some artist groups than others. It may even vary by user. It's hard to imagine a single trained AI that can handle that variance without a ton of very quickly-changing domain knowledge.
In my experience doing stuff like this for artist/song record linkage, the key is really to take a "query expansion" approach rather than a "normalization" approach, because choosing a single normalized form is impossible. So it's better to embrace that there are dozens, hundreds, or even thousands of interpretations and choose probabilistically.
A great example is trying to deal with the "sort name" of artists: e.g. "Presley, Elvis".
It's easy to assume that "Hazlewood, Lee & Nancy Sinatra" means "Lee Hazlewood & Nancy Sinatra".
How about "Sinatra, Frank & Nancy"? Now the rules are different: the expansion could either be "Frank Sinatra & Nancy Sinatra" (correct) or "Frank Sinatra & Nancy" (but there's no singer who just goes by "Nancy", or is there?)
Now how about "Peter, Paul & Mary"? In that case it's already the literal expanded form referencing three people, not two people named "Paul Peter & Mary Peter" or "Paul Peter & Mary".
So, you just assume they are all possible and rank them based on real-world data. You're right, not always easy!
(Treating them as an unordered bag of tokens can either help or hurt accuracy: that approach has its own problems when you consider how short and similar many titles are, and how some artists deliberately name themselves as jokes/riffs on a more famous one. Not to mention that after all this it could still be ambiguous: MusicBrainz knows about six artists all named "Nirvana". So context is key!)
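To make the query-expansion idea concrete, here's a hedged sketch. The parsing rules and names are purely illustrative; a real system would rank the candidates against real-world data rather than trust any single rule:

```python
def expand_sort_name(sort_name: str) -> list[str]:
    """Candidate readings of an artist 'sort name'. The literal string is
    always kept as a candidate ("Peter, Paul & Mary" is already expanded)."""
    candidates = [sort_name]
    if "," in sort_name:
        last, rest = [p.strip() for p in sort_name.split(",", 1)]
        if "&" in rest:
            first, second = [p.strip() for p in rest.split("&", 1)]
            candidates.append(f"{first} {last} & {second} {last}")  # shared surname
            candidates.append(f"{first} {last} & {second}")         # second act is a mononym
        else:
            candidates.append(f"{rest} {last}")
    return candidates

expand_sort_name("Presley, Elvis")
# → ['Presley, Elvis', 'Elvis Presley']
expand_sort_name("Sinatra, Frank & Nancy")
# → ['Sinatra, Frank & Nancy', 'Frank Sinatra & Nancy Sinatra', 'Frank Sinatra & Nancy']
```

The deliberate design choice is that the function never picks a winner; it only enumerates readings for a downstream ranker.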
I presume they will statistically choose the more likely option - ie, people listen to 'Simon and Garfunkel' more than they listen to 'Simon' or 'Garfunkel'.
Humans will also screw this up. They don't have statistics about which you most likely meant but they do have context which an AI may not.
State of the art virtual assistants offer little intelligence over a command line interface, except instead of typing the command line in, you say it. Besides that, not much difference; the syntax is rigid and the computer doesn't understand your utterance more intelligently than GCC understands "gcc -o my_prog my_prog.c".
It would be interesting if AI assistants were described honestly in this manner as a "spoken command line interface" instead of as equivalent to human speech recognition.
And even with a rigid structure they only work for common cases. Using Siri in Italian and trying to get an English album playing is nigh impossible, and vice versa.
I use a Google Home, not Alexa, but I ask it oddly-worded things all the time. Here are two from the past week that worked: "Can you see if Reply All has a new episode and if so play it?" "Can I still use these green peppers I got two weeks ago?"
That first one sounds promising. I'll have to see if I can adapt it.
My devices are basically just gateways to audible, radio, and general timers. I have begun using the announcement features, but it is amusing to see the kids basically having announcement wars.
Alexa's shopping list definitely has room for improvement if it wants to be more natural. If I want to add bread and peanut butter to my shopping list, I need to say "Alexa, add peanut butter to my shopping list <pause> Alexa, add bread to my shopping list". If I say "Alexa, add bread and peanut butter to my shopping list", it adds a single item called "bread and peanut butter"
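For what it's worth, a toy sketch of what better handling might look like: split on "and" but keep known multi-word items whole. The KNOWN_ITEMS dictionary is invented; this is nowhere near what Alexa actually does:

```python
# Hypothetical item dictionary; a real one would come from catalog/list data.
KNOWN_ITEMS = {"bread", "peanut butter", "milk", "eggs"}

def split_items(phrase: str) -> list[str]:
    """Split a dictated phrase into list items, keeping known items whole."""
    if phrase in KNOWN_ITEMS or " and " not in phrase:
        return [phrase]
    return [item for part in phrase.split(" and ") for item in split_items(part)]

split_items("bread and peanut butter")  # → ['bread', 'peanut butter']
split_items("milk and eggs and bread")  # → ['milk', 'eggs', 'bread']
```

Even this naive version would beat a single "bread and peanut butter" entry, though it still fails on items whose names legitimately contain "and".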
How is the model adversarial? Also, the best configuration was already found two years back in an ACL paper: http://www.aclweb.org/anthology/N/N16/N16-1030.pdf. Isn't it cheating to claim somebody else's result as your own? The industry is full of fraudsters nowadays.
Well this is not the key indicator. The key indicator is the absence of pause between the words, that's how we differentiate between "peanut butter" and "peanut, butter".
Far more important than the lack of a pause between the words is the a priori fact that "peanut butter" is a common single item and "peanut, butter" is an uncommon list of items. That fact is why you require a pause between the words to indicate "peanut, butter".
If you ordered "butter, peanuts" for example, it would probably get that it was two items even without the pause between words.
I don't see why they can't both be important signals. I would hazard a guess that a combined approach is what humans do.
I'm not a linguist or anything, but it seems like in practice people may pronounce "peanut butter" a little differently when they say the two words together. Something like "peanubutter". Or maybe they convert the "t" in "peanut" into a glottal stop.
Anyway, if the "t" is absent when you're talking about peanut butter but present when you're talking about two separate items, I don't see why you shouldn't feel free to use that signal. But I also don't see why you shouldn't use probabilities as well.
I agree. When parsing speech, we humans listen for many cues all at once. Spoken language even has intentional redundancy so we can identify and disregard inconsistent cues. For example, a child or foreign speaker might replace "peanut" with "peanuts" and most people would still have no problem understanding "peanuts butter" as long as the rest of the cues are consistent.
They are both signals. But the prior probability is more important in this case because it is so much more common to say "peanut butter" than "peanut, butter". In other cases the pause might be a more important signal.
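A toy illustration of how the two signals could combine. All the numbers here are invented for illustration, including the pause-length expectations:

```python
import math

# Invented log-priors: "peanut butter" is far more common on shopping
# lists than the two-item reading "peanut, butter".
LOG_PRIOR = {
    ("peanut butter",): math.log(0.98),
    ("peanut", "butter"): math.log(0.02),
}

def log_pause_likelihood(reading, pause_ms):
    """Crude pause model: item boundaries come with longer pauses.
    The 50 ms / 250 ms expectations are assumptions, not measurements."""
    expected = 250 if len(reading) > 1 else 50
    return -abs(pause_ms - expected) / 50

def best_reading(pause_ms):
    """Combine prior and pause cue; pick the highest-scoring reading."""
    return max(LOG_PRIOR, key=lambda r: LOG_PRIOR[r] + log_pause_likelihood(r, pause_ms))

best_reading(pause_ms=40)   # → ('peanut butter',)
best_reading(pause_ms=600)  # → ('peanut', 'butter')
```

With these made-up numbers the prior wins except for a conspicuously long pause, which matches the point above: both are signals, but here the prior usually dominates.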
I would disagree here. Especially in the context of a shopping list.
If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me to ask for confirmation because that would be enough to create doubt.
'Peanut' and 'butter' are two unrelated words and it is the absence of pause that creates a pseudo single word.
Pauses are extremely important in spoken language and should be exploited.
>If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me
I'm on the parent's side on this. "Peanut butter" is such a common item that if someone were to pause in between the words I would assume they just got distracted for some reason. In the context of a shopping list, "peanut" singular just doesn't make sense.
A better example would be something like "yogurt ice cream" which is technically incorrect but it's still something people might say. In that case, I'd expect a shorter pause than in the case of yogurt, ice cream. However, if you were dictating a list to me I'd probably ask for confirmation in either case because there's enough ambiguity.
If we're assuming human-level AI, maybe, and not even then. Non-native English speakers from languages without a singular/plural distinction often use singulars where a plural is called for.
Domain knowledge around the intent (making a shopping list) and around language conventions helps a lot, not only for NLU, but also for NLS.
But even without a huge NLU dictionary, a simple MC or CRF segmentation model would work. As said earlier, almost no one says, "Add peanut to the shopping list." They say "Add peanuts", or, if they're very atypical, "Add a peanut". A bare "peanut" is simply unlikely.
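A minimal sketch of what such a segmentation model might look like: a unigram "item model" with invented probabilities, where a bare "peanut" gets almost no mass as a list entry, searched with a Viterbi-style dynamic program:

```python
import math

# Invented unigram log-probabilities over list items; note the tiny mass
# on a bare "peanut" (almost nobody lists a single peanut).
ITEM_LOGPROB = {
    "peanuts": math.log(0.30),
    "butter": math.log(0.30),
    "peanut butter": math.log(0.38),
    "peanut": math.log(0.02),
}

def segment(tokens):
    """Best segmentation of tokens into known items (Viterbi-style DP).
    Sketch only: assumes the token sequence is coverable by the lexicon."""
    best = {0: (0.0, [])}  # chart: prefix length -> (log-prob, items so far)
    for j in range(1, len(tokens) + 1):
        for i in range(j):
            phrase = " ".join(tokens[i:j])
            if i in best and phrase in ITEM_LOGPROB:
                score = best[i][0] + ITEM_LOGPROB[phrase]
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + [phrase])
    return best[len(tokens)][1]

segment(["peanut", "butter"])   # → ['peanut butter']
segment(["peanuts", "butter"])  # → ['peanuts', 'butter']
```

The low probability of "peanut" alone is exactly what pushes the model toward the compound reading, without any pause information at all.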
I think the catch is that human detection of pauses is far more fraught with difficulty than you'd imagine. Overly emphasized pauses are easy to discern, but quite a lot of speech pulls words far closer together than you would think. And this is just in english. Not all languages work with emphasis syllables and such.
Exactly, and the difference of the length of pauses between 'syntactic' words (peanut, butter) and between compounded words (peanut-butter) will vary based on speaker and moreover, things like focus/emphasis can cause lengthening of pauses even for the latter case. ("No, Alexa, I'm not shopping for Whistler's Mother; I need to buy PEANUT BUTTER.")
This. I'd assume it's like "creamer", which is a fake milk substitute from the USA, I think? But it could be coffee-flavoured cream; or "coffee creams", which are biscuits in the UK; or maybe some sort of confection?
As others have said, there's the pluralization of "peanut[s]" to distinguish between the two. This is a useful feature of English: the adjective-like role of a noun in a complex noun phrase is (almost?) always singular.
- Computer engineer
- NOT computers* engineer
- Toothbrush
- NOT teethbrush*
- Foot doctor
- NOT feet* doctor
- Alarm clock
- NOT alarms* clock, even when it supports multiple alarms!
Additionally, there's phrasal intonation. If the intonation and stress decrease throughout the phrase, it's a single item. If the intonation and stress reset for "butter," then it's a new item.
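The singular-modifier observation can be written down as a tiny heuristic. A real system would treat it as one feature among many, not a hard rule (and this naive suffix check misfires on irregular plurals and words like "glass"):

```python
def modifier_is_plural(first_word: str) -> bool:
    """English compound modifiers are (almost?) always singular, so a
    plural-looking first word suggests a list boundary:
    'peanuts butter' → two items, 'peanut butter' → maybe a compound."""
    return first_word.endswith("s") and not first_word.endswith("ss")

modifier_is_plural("peanut")   # False → could be a compound like "peanut butter"
modifier_is_plural("peanuts")  # True  → likely "peanuts" then "butter"
```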
"Attorney generals" is a noun phrase (admittedly of questionable adjectivity).
"Attorneys general" is a blind idiot translation of a phrase in a language with different grammatical rules (Latin, IIRC).
"Attorney" is a noun. "General" as used here is an adjective.
It's unusual in that the adjective follows the noun without a hyphen, but it's common enough, and it's where prepositional phrases are seen, like "Big man on campus" and "powers that be".
No one ever gets a single "peanut". So unless you mush-mouth the "S", the reasonable expectation for both your cohabitant and the robot is to bring peanuts and butter.
A better question is "coconut, milk" versus "coconut milk".
"Coconut" is still an anomalous grocery item. You'd want one of
- a coconut
- [number] coconuts
- shredded coconut
"Coconut" is best matched to that last option, but it's not a natural word choice. (Although it is a natural list entry... do people think of themselves as dictating to Alexa, or as writing the list themselves while happening to use their voice?)
Yes, I agree. If you're writing a list for yourself, a bare "coconut" is a typical entry. But if you're dictating a shopping list to someone else, you're quite unlikely to say "coconut" because that isn't grammatical.
So it turns into a question of how people think about dictating to Alexa.
This is a good point - in reality Alexa doesn't really have to do a great job transcribing at all if it's just constructing a list as a reminder for you later.
If this is a precursor to being able to quickly voice order stuff off amazon to be delivered though it's a different story.
This is a very interesting observation. The broader point about speech-to-text models being biased towards the US, in both training data and innovation, holds not only for the larger things (gender/race/religion) but also for small things like this. And these small things are likely to cause daily problems.
If it correctly understands "peanutS", it will classify it as "more likely two items", considering it would check everything against some sort of dictionary, which contains "peanuts", "butter", and "peanut butter".
PS. I implemented something similar without machine learning, and that's how I did it. With text it's easier, though; I suppose in NLU it could have a parameter for "pause time between words" which could also contribute to a different conclusion.
It depends. If I'm enunciating clearly, as I tend to do when I dictate to voice assistants, I'm going to distinctly pronounce the "t" at the end of "peanut" and the "b" at the beginning of "butter", which tends to produce something of a pause between the two words.
It wouldn't help much with tokenization of spoken words since it's hard to tell from its pronunciation whether an English phrase is compounded, hyphenated or separate words.
I know there's still a difference in length, but don't let English orthography fool you, English likes big compounds too: "high voltage electricity grid systems engineer team supervisor" is structurally the same as the unspaced German monstrous compounds. (I.e. the real difference between English and German in this area is just a matter of orthography and not actual linguistic structure.)
French would be easier to understand, as peanut butter is "beurre d'arachides" (butter of peanuts). The "de" (or "d'" in that example) gives you the context: what the butter is made of. Same for apple juice: it's "jus de pommes", and so on.
It would be easier in languages that make heavy use of inflection, especially Slavic languages, where, unusually among language families, declension also depends on whether the word is a noun or an adjective.
Reminds me of the time I tested a web shop's ordering form for a paper catalogue, using a colleague's real address. He got something in the mail a few weeks later...
While these types of blogs might not be revolutionary, they're still useful to people new to the subject who might just be getting into search or are having to implement lite search functionality into their applications.
I'm actually working on an application now where the initial spec called for "search" and it was implemented as exact token matching. A bug was immediately filed because searches for "wlk", "walk", and "walk event" all returned different results.
Have you often seen adversarial training used for sequence labeling to improve generalization across domains? The LSTM-CRF model appears to be the same model proposed by Lample et al (2016). I’d agree that it is common now to use that architecture.
If you google Elijah Wood, under "People also ask:" one of the options is "is elijah a wood", and I wonder if that's from speech-to-text that misunderstood.
"Alexa add paper towels milk and eggs to my shopping list" (punctuation intentionally left out)
It's a really cool problem to try and solve, and while I don't have an Alexa, I do have a google home which gets this kind of stuff right often enough that I don't really think about it any more (and kind of laugh on the rare chance it gets it wrong).
This would require that you have an exhaustive list of priorities typed out in a grammar for each language. Word embeddings are closer to semi-supervised learning. There is no way grammars could cover all the cases in a scalable way.
True, so maybe the best approach is to use machine learning to generate the exhaustive list of priorities based on labeled human speech. Then your end product is something we can understand and tweak, instead of a black-box neural network.
Because Alexa knows that there is a thing called "peanut butter", and you most probably meant that. I think humans behave in a similar fashion too: a person who has never heard of "peanut butter" and is taking your dictated shopping list would most likely raise a question and ask you to spell it.
I think the explanation there is as simple as "there are lots of different people." I can't write software half as well as a significant percentage (maybe a majority) of HN users, but the average person can't write software half as well as me.
Of course, I could just be dead wrong. I'm just conjecturing while I procrastinate.
All of these "AIs" are not actually AI. They are client-server voice commands. True AI requires zero internet connection, which no commercially included AI on any smartphone or tablet can offer today.