I implemented a similar BiLSTM-CRF model at my current job. The architecture itself is really interesting, but it runs into scaling issues: with LSTMs you're constrained to wait on previous timesteps and cache those results as well. Although TensorFlow now offers cuDNN RNNs and fused kernels to speed up computation, I'd have thought that at Amazon's scale an attention/transformer-based architecture would serve them better.
I also notice a lot of dismissive comments about "black box models" or simple solutions like just splitting on whitespace. My two cents:
1. Models with hand-crafted rules perform WORSE than learned representations, especially when you have an end-to-end model with pre-trained embeddings. This is shown by one of the seminal papers on this model, Ma and Hovy (2016): https://arxiv.org/pdf/1603.01354.pdf.
"However, even systems that have utilized distributed representations as inputs have used these to augment, rather than replace, hand-crafted features (e.g. word spelling and capitalization patterns). Their performance drops rapidly when the models solely depend on neural embeddings"
2. Human speech and human written text are messy. Having a rule for human speech will inevitably lead to a massive list of rules and exceptions to those rules.
3. This model is multi-domain, meaning you don't just need rules for one domain, but rules for multiple domains and for the interactions between those domains. Considering Amazon's hefty amount of data, it's much more efficient to learn these representations through a machine learning model than to constantly play cat-and-mouse keeping your hand-crafted rules up to date.
Spot on. Just a heads up, there is a decent amount of work on using convolutions to condense the initial representations, which can reduce computation time in the same way as your max pooling. A lot of these tasks can be handled via hyperparameter search over CNNs, so you can easily reach parity using a CNN-LSTM approach with the same number of parameters.
In text parsing, this whole machine-learning approach would be implemented as "ignore whitespace as the delimiter" (that's how I implemented it in one of my projects).
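For text, that baseline really is a one-liner; a minimal sketch (function name is mine) showing both why it works and where it breaks:

```python
def parse_list(text: str) -> list[str]:
    """Naive baseline: treat any run of whitespace as an item delimiter."""
    return text.split()

# Works for single-word items...
parse_list("eggs milk bread")          # → ['eggs', 'milk', 'bread']
# ...but breaks multi-word items, which is the whole problem here:
parse_list("eggs milk peanut butter")  # → ['eggs', 'milk', 'peanut', 'butter']
```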
You don't need commas to do that in text parsing either. Newlines, for example, will do.
When spoken, a shopping list is not a sentence. There's a small pause and/or different emphasis on the start of each item that can be learned (humans, for one, can discern it).
"Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter".
(Besides it can easily learn that peanut, singular, is not a thing people order: it's either "peanuts" or "peanut-butter" etc).
You're basically saying that they should build a speech recognition system that links words that should belong together with a hyphen... great then: that's exactly what this article is about.
Yes, they do. There's evidence to suggest that punctuation marks were devised as pronunciation guides, indicating how to inflect and when to pause, rather than syntactic markers in their own right. Commas in particular indicate a distinctive inflection and short pause in speaking; such would be detectable by Alexa especially if it uses a neural net or similar to analyze human speech.
That was not the point of what I said at all. It could interpret the vocal cues, yes. As far as I know this is still not a solved problem for speech-to-text, and going the other way, guessing the punctuation from the text, is still more reliable.
Back to what I really was getting at: I'm pretty sure the person I replied to was suggesting Alexa could just split(',') and call it a day. With text, yes. With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly. I am certain humans use a mix of vocal cues and interpretation to place the commas in their heads.
- ignore the comma, point and space as delimiter and compare the values / entities against a dictionary for neighbouring words.
- don't ignore the comma and compare the values against a dictionary.
Put a priority (or, in machine learning terms, a classifier) on both outcomes, because the comma is not reliable in spoken language. That way it would interpret "peanuts butter" as [peanuts, butter] and "peanut butter" as [peanut butter].
PS. Now I hope that speech-to-text translates a spoken "peanuts" correctly to [peanuts] and not [peanut], because that would fail.
PS2. The article itself doesn't mention the punctuation problem.
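A rough sketch of the two-outcome idea above: generate every segmentation and rank them by hits against a dictionary of known items (the dictionary here is invented for illustration):

```python
# Hypothetical known-items dictionary; a real one would come from catalog data.
KNOWN_ITEMS = {"peanuts", "butter", "peanut butter", "milk", "eggs"}

def segmentations(tokens):
    """Yield every way to split tokens into contiguous items."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in segmentations(tokens[i:]):
            yield [head] + rest

def best_split(utterance):
    """Prefer the segmentation with the most dictionary hits, then fewer items."""
    tokens = utterance.replace(",", " ").split()
    return max(
        segmentations(tokens),
        key=lambda seg: (sum(item in KNOWN_ITEMS for item in seg), -len(seg)),
    )

best_split("peanut butter")    # → ['peanut butter']
best_split("peanuts, butter")  # → ['peanuts', 'butter']
```

The comma is deliberately thrown away before scoring, which is the point: the dictionary, not the punctuation, is doing the work.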
>PS2. The article itself doesn't mention the punctuation problem.
It doesn't go into detail but it does seem to mention it.
>Off-the-shelf broad parsers are intended to detect coordination structures, but they are often trained on written text with correct punctuation. Automatic speech recognition (ASR) outputs, by contrast, often lack punctuation, and spoken language has different syntactic patterns than written language.
>Understanding pauses/inflection changes doesn't have to be a "solved problem" to work for cases such as discerning common shopping list style items.
Okay... But Alexa isn't just shopping lists. You only know you are dealing with a shopping list after parsing the text.
Even if you did go back, is the narrower use case any more solved than the general one? Guessing from text alone turns out to be fairly accurate, so even if you could do this decently, it would have to be notably better to be worth the trouble.
>That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.
Irrelevant to this though, considering that problem has mostly been solved at this juncture.
This is surprising, I didn't know Alexa was capable of this. Whenever I say "Alexa, add milk, eggs, bread, & laundry detergent to my shopping list" it shows up as one long sentence in a single entry.
This is only surprising because a human would parse the waveform as two words and then back into a single concept. A computer could parse it as a single concept directly, or in syllable-length chunks to be reconfigured however.
A human really wouldn't. It's the same thing as telling the difference between black bird and blackbird in running speech; peanut butter may be spelled as two words, but it's spoken as a single thing. If the concept was new enough that you were consciously talking about a type of butter made from (of all things) peanuts, it would be a black bird vocal entity rather than a blackbird one.
You're still talking about words, but the issue is waveform sound. An English speaker would hear "black bird" and use rhythm and context clues to parse it into words and then into the appropriate idea. A non-English speaker might hear something like "blekabirt" as the waveform sound is interpreted through their particular linguistic habits; then it would be rejected as a non-word sound.
It's true that we do alter our speech to provide context clues, it is also true that without them we're _still_ capable of piecing them together.
If someone says, in an unnaturally drawn-out way, "I like peanut butter sandwiches", then I will have no problem detecting the situation and re-parsing it correctly.
Your brain thinks of these two words completely differently and it's only through conscious effort that you think of them together. They are different words even though they sound and are spelled the same, regardless of the space.
A better example I think is "bear feet" vs "bare feet"
"Blackbird" refers to several species of actual bird, not just the plane. "The black bird ate seeds" and "The blackbird ate seeds" are both reasonable sentences, and they do potentially sound different.
I'm not convinced. It takes effort for me to break apart blackbird into two separate words in my head, as they are so commonly found together. When speaking "black bird" I would insert a long pause between the two and emphasise the "b" on bird to show that I'm not talking about a "blackbird".
I recently heard about a book called "Anything You Want" and wanted to add it to my shopping list. I tried repeatedly saying "Alexa, add 'Anything You Want' book to my shopping list" and every variation thereof. Alexa was unable to add that book title no matter what I tried.
This seems like a fundamentally hard problem. If I ask Alexa, "Play songs by Simon and Garfunkel", I may want to include their solo work ("Play songs by ([Paul] Simon) and ([Art] Garfunkel)") or not ("Play songs by (Simon and Garfunkel)"). The choice is probably more likely for some artist groups than others. It may even vary by user. It's hard to imagine a single trained AI that can handle that variance without a ton of very quickly-changing domain knowledge.
In my experience doing stuff like this for artist/song record linkage, the key is really to take a "query expansion" approach rather than a "normalization" approach, because choosing a single normalized form is impossible. So it's better to embrace that there are dozens, hundreds, or even thousands of interpretations and choose probabilistically.
A great example is trying to deal with the "sort name" of artists: e.g. "Presley, Elvis".
It's easy to assume that "Hazlewood, Lee & Nancy Sinatra" means "Lee Hazlewood & Nancy Sinatra".
How about "Sinatra, Frank & Nancy"? Now the rules are different: the expansion could either be "Frank Sinatra & Nancy Sinatra" (correct) or "Frank Sinatra & Nancy" (but there's no singer who just goes by "Nancy", or is there?)
Now how about "Peter, Paul & Mary"? In that case it's already the literal expanded form referencing three people, not two people named "Paul Peter & Mary Peter" or "Paul Peter & Mary".
So, you just assume they are all possible and rank them based on real-world data. You're right, not always easy!
(Treating them as an unordered bag of tokens can either help or hurt accuracy: that approach has its own problems when you consider how short and similar many titles are, and how some artists deliberately name themselves as jokes/riffs on a more famous one. Not to mention that after all this it could still be ambiguous: MusicBrainz knows about six artists all named "Nirvana". So context is key!)
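To make the query-expansion idea concrete, here's a hedged sketch. The parsing rules and names are purely illustrative; a real system would rank the candidates against real-world data rather than trust any single rule:

```python
def expand_sort_name(sort_name: str) -> list[str]:
    """Candidate readings of an artist 'sort name'. The literal string is
    always kept as a candidate ("Peter, Paul & Mary" is already expanded)."""
    candidates = [sort_name]
    if "," in sort_name:
        last, rest = [p.strip() for p in sort_name.split(",", 1)]
        if "&" in rest:
            first, second = [p.strip() for p in rest.split("&", 1)]
            candidates.append(f"{first} {last} & {second} {last}")  # shared surname
            candidates.append(f"{first} {last} & {second}")         # second act is a mononym
        else:
            candidates.append(f"{rest} {last}")
    return candidates

expand_sort_name("Presley, Elvis")
# → ['Presley, Elvis', 'Elvis Presley']
expand_sort_name("Sinatra, Frank & Nancy")
# → ['Sinatra, Frank & Nancy', 'Frank Sinatra & Nancy Sinatra', 'Frank Sinatra & Nancy']
```

The deliberate design choice is that the function never picks a winner; it only enumerates readings for a downstream ranker.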
I presume they will statistically choose the more likely option - ie, people listen to 'Simon and Garfunkel' more than they listen to 'Simon' or 'Garfunkel'.
Humans will also screw this up. They don't have statistics about which you most likely meant but they do have context which an AI may not.
State of the art virtual assistants offer little intelligence over a command line interface, except instead of typing the command line in, you say it. Besides that, not much difference; the syntax is rigid and the computer doesn't understand your utterance more intelligently than GCC understands "gcc -o my_prog my_prog.c".
It would be interesting if AI assistants were described honestly in this manner as a "spoken command line interface" instead of as equivalent to human speech recognition.
And even with a rigid structure they only work for common cases. Using Siri in Italian and trying to get an English album playing is nigh impossible, and vice versa.
I use a Google Home, not Alexa, but I ask it oddly-worded things all the time. Here are two from the past week that worked: "Can you see if Reply All has a new episode and if so play it?" "Can I still use these green peppers I got two weeks ago?"
That first one sounds promising. I'll have to see if I can adapt it.
My devices are basically just gateways to audible, radio, and general timers. I have begun using the announcement features, but it is amusing to see the kids basically having announcement wars.
Alexa's shopping list definitely has room for improvement if it wants to be more natural. If I want to add bread and peanut butter to my shopping list, I need to say "Alexa, add peanut butter to my shopping list <pause> Alexa, add bread to my shopping list". If I say "Alexa, add bread and peanut butter to my shopping list", it adds a single item called "bread and peanut butter"
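For what it's worth, a toy sketch of what better handling might look like: split on "and" but keep known multi-word items whole. The KNOWN_ITEMS dictionary is invented; this is nowhere near what Alexa actually does:

```python
# Hypothetical item dictionary; a real one would come from catalog/list data.
KNOWN_ITEMS = {"bread", "peanut butter", "milk", "eggs"}

def split_items(phrase: str) -> list[str]:
    """Split a dictated phrase into list items, keeping known items whole."""
    if phrase in KNOWN_ITEMS or " and " not in phrase:
        return [phrase]
    return [item for part in phrase.split(" and ") for item in split_items(part)]

split_items("bread and peanut butter")  # → ['bread', 'peanut butter']
split_items("milk and eggs and bread")  # → ['milk', 'eggs', 'bread']
```

Even this naive version would beat a single "bread and peanut butter" entry, though it still fails on items whose names legitimately contain "and".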
How is the model adversarial? Also, the best configuration was already found two years back in an ACL paper: http://www.aclweb.org/anthology/N/N16/N16-1030.pdf. Isn't it cheating to claim somebody else's result as your own? The industry is full of fraudsters nowadays.
Well this is not the key indicator. The key indicator is the absence of pause between the words, that's how we differentiate between "peanut butter" and "peanut, butter".
Far more important than the lack of a pause between the words is the a priori fact that "peanut butter" is a common single item and "peanut, butter" is an uncommon list of items. That fact is why you require a pause between the words to indicate "peanut, butter".
If you ordered "butter, peanuts" for example, it would probably get that it was two items even without the pause between words.
I don't see why they can't both be important signals. I would hazard a guess that a combined approach is what humans do.
I'm not a linguist or anything, but it seems like in practice people may pronounce "peanut butter" a little differently when they say the two words together. Something like "peanubutter". Or maybe they convert the "t" in "peanut" into a glottal stop.
Anyway, if the "t" is absent when you're talking about peanut butter but present when you're talking about two separate items, I don't see why you shouldn't feel free to use that signal. But I also don't see why you shouldn't use probabilities as well.
I agree. When parsing speech, we humans listen for many cues all at once. Spoken language even has intentional redundancy so we can identify and disregard inconsistent cues. For example, a child or foreign speaker might replace "peanut" with "peanuts" and most people would still have no problem understanding "peanuts butter" as long as the rest of the cues are consistent.
They are both signals. But the prior probability is more important in this case because it is so much more common to say "peanut butter" than "peanut, butter". In other cases the pause might be a more important signal.
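A toy illustration of how the two signals could combine. All the numbers here are invented for illustration, including the pause-length expectations:

```python
import math

# Invented log-priors: "peanut butter" is far more common on shopping
# lists than the two-item reading "peanut, butter".
LOG_PRIOR = {
    ("peanut butter",): math.log(0.98),
    ("peanut", "butter"): math.log(0.02),
}

def log_pause_likelihood(reading, pause_ms):
    """Crude pause model: item boundaries come with longer pauses.
    The 50 ms / 250 ms expectations are assumptions, not measurements."""
    expected = 250 if len(reading) > 1 else 50
    return -abs(pause_ms - expected) / 50

def best_reading(pause_ms):
    """Combine prior and pause cue; pick the highest-scoring reading."""
    return max(LOG_PRIOR, key=lambda r: LOG_PRIOR[r] + log_pause_likelihood(r, pause_ms))

best_reading(pause_ms=40)   # → ('peanut butter',)
best_reading(pause_ms=600)  # → ('peanut', 'butter')
```

With these made-up numbers the prior wins except for a conspicuously long pause, which matches the point above: both are signals, but here the prior usually dominates.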
I would disagree here. Especially in the context of a shopping list.
If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me to ask for confirmation because that would be enough to create doubt.
'Peanut' and 'butter' are two unrelated words and it is the absence of pause that creates a pseudo single word.
Pauses are extremely important in spoken language and should be exploited.
>If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me
I'm on the parent's side on this. "Peanut butter" is such a common item that if someone were to pause in between the words I would assume they just got distracted for some reason. In the context of a shopping list, "peanut" singular just doesn't make sense.
A better example would be something like "yogurt ice cream" which is technically incorrect but it's still something people might say. In that case, I'd expect a shorter pause than in the case of yogurt, ice cream. However, if you were dictating a list to me I'd probably ask for confirmation in either case because there's enough ambiguity.
If we're assuming human-level AI, maybe, and not even then. Non-native English speakers from languages without a singular/plural distinction often use singulars where a plural is called for.
Domain knowledge around the intent (making a shopping list) and around language conventions helps a lot, not only for NLU, but also for NLS.
But even without a huge NLU dictionary, a simple MC or CRF segmentation model would work. As said earlier, almost no one says, "Add peanut to the shopping list." They say "Add peanuts", or, if they're very atypical, "Add a peanut". A bare "peanut" is simply unlikely.
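A minimal sketch of what such a segmentation model might look like: a unigram "item model" with invented probabilities, where a bare "peanut" gets almost no mass as a list entry, searched with a Viterbi-style dynamic program:

```python
import math

# Invented unigram log-probabilities over list items; note the tiny mass
# on a bare "peanut" (almost nobody lists a single peanut).
ITEM_LOGPROB = {
    "peanuts": math.log(0.30),
    "butter": math.log(0.30),
    "peanut butter": math.log(0.38),
    "peanut": math.log(0.02),
}

def segment(tokens):
    """Best segmentation of tokens into known items (Viterbi-style DP).
    Sketch only: assumes the token sequence is coverable by the lexicon."""
    best = {0: (0.0, [])}  # chart: prefix length -> (log-prob, items so far)
    for j in range(1, len(tokens) + 1):
        for i in range(j):
            phrase = " ".join(tokens[i:j])
            if i in best and phrase in ITEM_LOGPROB:
                score = best[i][0] + ITEM_LOGPROB[phrase]
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + [phrase])
    return best[len(tokens)][1]

segment(["peanut", "butter"])   # → ['peanut butter']
segment(["peanuts", "butter"])  # → ['peanuts', 'butter']
```

The low probability of "peanut" alone is exactly what pushes the model toward the compound reading, without any pause information at all.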
I think the catch is that human detection of pauses is far more fraught with difficulty than you'd imagine. Overly emphasized pauses are easy to discern, but quite a lot of speech pulls words far closer together than you would think. And this is just in english. Not all languages work with emphasis syllables and such.
Exactly, and the difference of the length of pauses between 'syntactic' words (peanut, butter) and between compounded words (peanut-butter) will vary based on speaker and moreover, things like focus/emphasis can cause lengthening of pauses even for the latter case. ("No, Alexa, I'm not shopping for Whistler's Mother; I need to buy PEANUT BUTTER.")
This. I'd assume it's like "creamer", which is a fake milk substitute from the USA, I think? But it could be coffee-flavoured cream; or "coffee creams", which are biscuits in the UK; or maybe some sort of confection?
As others have said, there's the pluralization of "peanut[s]" to distinguish between the two. This is a useful feature of English: the adjective-like role of a noun in a complex noun phrase is (almost?) always singular.
- Computer engineer
- NOT computers* engineer
- Toothbrush
- NOT teethbrush*
- Foot doctor
- NOT feet* doctor
- Alarm clock
- NOT alarms* clock, even when it supports multiple alarms!
Additionally, there's phrasal intonation. If the intonation and stress decrease throughout the phrase, it's a single item. If the intonation and stress reset for "butter," then it's a new item.
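The singular-modifier observation can be written down as a tiny heuristic. A real system would treat it as one feature among many, not a hard rule (and this naive suffix check misfires on irregular plurals and words like "glass"):

```python
def modifier_is_plural(first_word: str) -> bool:
    """English compound modifiers are (almost?) always singular, so a
    plural-looking first word suggests a list boundary:
    'peanuts butter' → two items, 'peanut butter' → maybe a compound."""
    return first_word.endswith("s") and not first_word.endswith("ss")

modifier_is_plural("peanut")   # False → could be a compound like "peanut butter"
modifier_is_plural("peanuts")  # True  → likely "peanuts" then "butter"
```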
"Attorney generals" is a noun phrase (admittedly of questionable adjectivity).
"Attorneys general" is a blind idiot translation of a phrase in a language with different grammatical rules (Latin, IIRC).
"Attorney" is a noun. "General" as used here is an adjective.
It's unusual in that the adjective follows the noun without a hyphen, but it's common enough, and it's where prepositional phrases are seen, like "Big man on campus" and "powers that be".
No one ever gets a single "peanut". So unless you mush-mouth the "S", the reasonable expectation for both your cohabitant and the robot is to bring peanuts and butter.
A better question is "coconut, milk" versus "coconut milk".
"Coconut" is still an anomalous grocery item. You'd want one of
- a coconut
- [number] coconuts
- shredded coconut
"Coconut" is best matched to that last option, but it's not a natural word choice. (Although it is a natural list entry... do people think of themselves as dictating to Alexa, or as writing the list themselves while happening to use their voice?)
Yes, I agree. If you're writing a list for yourself, a bare "coconut" is a typical entry. But if you're dictating a shopping list to someone else, you're quite unlikely to say "coconut" because that isn't grammatical.
So it turns into a question of how people think about dictating to Alexa.
This is a good point - in reality Alexa doesn't really have to do a great job transcribing at all if it's just constructing a list as a reminder for you later.
If this is a precursor to being able to quickly voice order stuff off amazon to be delivered though it's a different story.
This is a very interesting observation. The broader point about speech-to-text models being biased towards the US, in both training data and innovation, holds not only for the larger things (gender/race/religion) but also for small things like this. And these small things are likely to cause daily problems.
If it correctly understands "peanutS", it will classify it as "more likely two items", considering it would check everything against some sort of dictionary, which contains "peanuts", "butter", and "peanut butter".
PS. I implemented something similar without machine learning, and that's how I did it. With text it's easier, though; I suppose in NLU it could have a parameter for "pause time between words" which could also contribute to a different conclusion.
It depends. If I'm enunciating clearly, as I tend to do when I dictate to voice assistants, I'm going to distinctly pronounce the "t" at the end of "peanut" and the "b" at the beginning of "butter", which tends to produce something of a pause between the two words.
It wouldn't help much with tokenization of spoken words since it's hard to tell from its pronunciation whether an English phrase is compounded, hyphenated or separate words.
I know there's still a difference in length, but don't let English orthography fool you, English likes big compounds too: "high voltage electricity grid systems engineer team supervisor" is structurally the same as the unspaced German monstrous compounds. (I.e. the real difference between English and German in this area is just a matter of orthography and not actual linguistic structure.)
French would be easier to understand, as peanut butter is "beurre d'arachides" (butter of peanuts). The "de" (or "d'" in that example) gives you the context: what the butter is made of. Same for apple juice: it's "jus de pommes", and so on.
It would be easier in languages that make heavy use of inflection, especially Slavic languages, where, unusually among language families, declension also depends on whether the word is a noun or an adjective.
Reminds me of the time I tested a web shop's ordering form for a paper catalogue, using a colleague's real address. He got something in the mail a few weeks later...
While these types of blogs might not be revolutionary, they're still useful to people new to the subject who might just be getting into search or are having to implement lite search functionality into their applications.
I'm actually working on an application now where the initial spec called for "search" and it was implemented as exact token matching. A bug was immediately filed because searches for "wlk", "walk", and "walk event" all returned different results.
Have you often seen adversarial training used for sequence labeling to improve generalization across domains? The LSTM-CRF model appears to be the same model proposed by Lample et al (2016). I’d agree that it is common now to use that architecture.
If you google Elijah Wood, under "People also ask:" one of the options is "is elijah a wood", and I wonder if that's from speech-to-text that misunderstood.
"Alexa add paper towels milk and eggs to my shopping list" (punctuation intentionally left out)
It's a really cool problem to try and solve, and while I don't have an Alexa, I do have a google home which gets this kind of stuff right often enough that I don't really think about it any more (and kind of laugh on the rare chance it gets it wrong).
This would require that you have an exhaustive list of priorities typed out in a grammar for each language. Word embeddings are closer to semi-supervised learning. There is no way grammars could cover all the cases in a scalable way.
True, so maybe the best approach is to use machine learning to generate the exhaustive list of priorities based on labeled human speech. Then your end product is something we can understand and tweak, instead of a black-box neural network.
Because Alexa knows that there is a thing called "peanut butter", and you most probably meant that. I think humans behave in a similar fashion too: a person who has never heard of "peanut butter" and is taking your dictated shopping list would most likely raise a question and ask you to spell it.
I think the explanation there is as simple as "there are lots of different people." I can't write software half as well as a significant percentage (maybe a majority) of HN users, but the average person can't write software half as well as me.
Of course, I could just be dead wrong. I'm just conjecturing while I procrastinate.
All of these "AIs" are not actually AI. They are client-server voice commands. True AI requires zero internet connection, which no commercially included AI on any smartphone or tablet can offer today.