How Alexa knows “peanut butter” is one shopping-list item, not two (amazon.com)
246 points by georgecarlyle76 on Dec 18, 2018 | hide | past | favorite | 140 comments



I implemented a similar BiLSTM-CRF model at my current job. The architecture itself is really interesting, but it runs into scaling issues. With LSTMs you're constrained by having to wait on previous timesteps and cache those results as well. Although TensorFlow now offers cuDNN RNNs and fused kernels to speed up computation, I'd have thought that at Amazon's scale an attention/transformer-based architecture would serve them better.
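
For context, here's the rough shape of such a tagger as a minimal, illustrative PyTorch sketch (my own simplification, not Amazon's or the paper's code; a real BiLSTM-CRF would put a CRF layer, e.g. the third-party pytorch-crf package, on top of the emission scores):

    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        """Tags each token (B/I/O) so "peanut butter" can span a single item."""
        def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                                batch_first=True)
            self.to_tags = nn.Linear(2 * hidden, num_tags)

        def forward(self, token_ids):               # (batch, seq_len)
            hidden_states, _ = self.lstm(self.emb(token_ids))
            return self.to_tags(hidden_states)      # per-token emission scores

    # With tags {0: "O", 1: "B-Item", 2: "I-Item"}, "add milk eggs peanut butter"
    # decodes as O B-Item B-Item B-Item I-Item, i.e. "peanut butter" is one item.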

I also notice a lot of dismissive comments about "black box models" or simple solutions like just splitting on whitespace. My two cents:

1. Models with hand-crafted rules perform WORSE than learned representations, especially when you have an end-to-end model with pre-trained embeddings. This is shown by one of the seminal papers on this model, Ma and Hovy (2016): https://arxiv.org/pdf/1603.01354.pdf.

"However, even systems that have utilized dis-tributed representations as inputs have used theseto augment, rather than replace, hand-crafted fea-tures (e.g. word spelling and capitalization pat-terns). Their performance drops rapidly when themodels solely depend on neural embeddings"

2. Human speech and human written text are messy. Having a rule for human speech will inevitably lead to a massive list of rules and exceptions to those rules.

3. This model is multi-domain, meaning that you don't just need rules for one domain, but rules for multiple domains and the interactions between those domains. Considering Amazon's hefty amount of data, it's much more efficient to learn these representations through a machine learning model rather than constantly playing cat-and-mouse maintaining your hand-crafted rules.


Spot on. Just a heads up, there is a decent amount of work on using convolutions to condense the initial representations, which can cut computation time by roughly your max-pooling factor. A lot of these tasks can be handled via hyperparameter search over CNNs, so you can easily reach parity using a CNN-LSTM approach with the same number of parameters.


In text parsing, this whole machine-learning setup would be implemented as "ignore whitespace as a delimiter" (that's how I implemented it in one of my projects).

PS: I'm aware this will not be a popular comment.


Sentences spoken aloud don't have commas.


You don't need commas to do that in text parsing either. Newlines, for example, will do.

When spoken, a shopping list is not a sentence. There's a small pause and/or different emphasis on the start of each item that can be learned (humans, for one, can discern it).

"Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter".

(Besides, it can easily learn that "peanut", singular, is not a thing people order: it's either "peanuts" or "peanut-butter", etc.)


You're basically saying that they should build a speech recognition system that links words that should belong together with a hyphen... great then: that's exactly what this article is about.


No. They're not saying anything about "words that should belong together". They're saying two things:

1) There is a pronunciation difference between one item with two words, and two items with one word each.

2) You can also use per-word information here, because "peanut" is not something that goes on shopping lists.


No, I'm just using a hyphen on my "transcription" to show the two cases are spoken differently.


> "Eggs milk peanut-butter" sounds different than "Eggs milk peanut butter"

True, but I've never gone to the grocery store to buy a singular peanut.

I've bought a bag of peanuts, but not just one peanut.


Eggs milk avocado-oil

Eggs milk cheese-buns

Eggs milk chicken-salad


  pudding pudding pudding applesauce
  pudding pudding pudding applesauce
  pudding pudding pudding applesauce
  applesauce applesauce applesauce


Which I cover directly below.


Or pauses between words


> In text parsing


Are we comparing this to what Alexa is doing or not?


OK, bullets then. At any rate ... a one-line-of-code solution is a flimsy excuse for a product placement.


I don't understand. You realize the speech recognition algorithm gives you raw words with no punctuation, right?


What single line of code could solve this problem?


Yes, they do. There's evidence to suggest that punctuation marks were devised as pronunciation guides, indicating how to inflect and when to pause, rather than syntactic markers in their own right. Commas in particular indicate a distinctive inflection and short pause in speaking; such would be detectable by Alexa especially if it uses a neural net or similar to analyze human speech.


That was not the point of what I said at all. It could interpret the vocal cues, yes. As far as I know this is still not a solved problem for speech-to-text, and going the other way and trying to guess the punctuation from the text is still more reliable.

Back to what I really was getting at: I'm pretty sure the person I replied to was suggesting Alexa could just split(',') and call it a day. With text, yes. With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly. I am certain humans use a mix of vocal cues and interpretation to place the commas in their heads.


No, I'm suggesting:

- ignore the comma, period, and space as delimiters, and compare the values/entities against a dictionary for neighbouring words.

- don't ignore the comma and compare the values against a dictionary.

Put a priority (or in machine learning terms: a classifier) on both outcomes, because the comma is not reliable in spoken language, so that it would interpret "peanuts butter" as [peanuts, butter] and "peanut butter" as [peanut butter].

PS: Now I hope that speech-to-text transcribes a spoken "peanuts" correctly as [peanuts] and not [peanut], because otherwise this would fail.

PS2: The article itself doesn't mention the punctuation problem
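
Roughly what I mean, as a toy sketch (illustrative only, with a made-up dictionary, not Alexa's actual code):

    KNOWN_ITEMS = {"peanut butter", "peanuts", "butter", "eggs", "milk"}

    def segment(words):
        """Greedily prefer the longest known item starting at each position."""
        items, i = [], 0
        while i < len(words):
            two_word = " ".join(words[i:i + 2])
            if two_word in KNOWN_ITEMS:           # "peanut butter" -> one item
                items.append(two_word)
                i += 2
            else:                                 # "peanuts" -> its own item
                items.append(words[i])
                i += 1
        return items

    segment("eggs milk peanut butter".split())   # ['eggs', 'milk', 'peanut butter']
    segment("eggs milk peanuts butter".split())  # ['eggs', 'milk', 'peanuts', 'butter']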


>PS2: The article itself doesn't mention the punctuation problem

It doesn't go into detail but it does seem to mention it.

>Off-the-shelf broad parsers are intended to detect coordination structures, but they are often trained on written text with correct punctuation. Automatic speech recognition (ASR) outputs, by contrast, often lack punctuation, and spoken language has different syntactic patterns than written language.


>As far as I know this is still not a solved problem for speech to text

Understanding pauses/inflection changes doesn't have to be a "solved problem" to work for cases such as discerning common shopping list style items.

>With voices this would be irritatingly unreliable. Everyone talks differently and sometimes people stumble weirdly.

That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.


>Understanding pauses/inflection changes doesn't have to be a "solved problem" to work for cases such as discerning common shopping list style items.

Okay... But Alexa isn't just shopping lists. You only know you are dealing with a shopping list after parsing the text.

Even if you did go back, is the narrower use case any more solved than the general one? Guessing with text alone turns out to be fairly accurate and so even if you could do this decently, it would have to be notably better to be worth the trouble.

>That's an argument against discerning "milk" from "silk" or "coke" from "cork", but that's still managed satisfactorily enough.

Irrelevant to this though, considering that problem has mostly been solved at this juncture.


Amazon is pulling ahead. Siri still adds 'All man milk' to my shopping list instead of almond milk.


And someone asking Siri to set a reminder to "prepare the dough" only to find that Siri heard something much more morbid.

https://twitter.com/danielpunkass/status/1073723442179031041


Lately Siri gives me an error when setting the timer, but upon inspection it has actually set the timer correctly. It's equally infuriating and comical.


Siri has been behind Google and Alexa basically since the beginning and shows little sign of catching up.



This is surprising, I didn't know Alexa was capable of this. Whenever I say "Alexa, add milk, eggs, bread, & laundry detergent to my shopping list" it shows up as one long sentence in a single entry.


It's a fairly new feature.


What about "Alexa, add fork handles to the shopping list"? https://www.youtube.com/watch?v=gi_6SaqVQSw .


You never know when you might need some new fork handles.


This is only surprising because a human would parse the waveform as two words and then back into a single concept. A computer could parse it as a single concept directly, or in syllable-length chunks to be reconfigured however.


A human really wouldn't. It's the same thing as telling the difference between black bird and blackbird in running speech; peanut butter may be spelled as two words, but it's spoken as a single thing. If the concept was new enough that you were consciously talking about a type of butter made from (of all things) peanuts, it would be a black bird vocal entity rather than a blackbird one.


You're still talking about words, but the issue is waveform sound. An English speaker would hear "black bird" and use rhythm and context clues to parse it into words and then into the appropriate idea. A non-English speaker might hear something like "blekabirt" as the waveform sound is interpreted through their particular linguistic habits; then it would be rejected as a non-word sound.


It's true that we alter our speech to provide context clues; it is also true that without them we're _still_ capable of piecing things together.

If someone says "I like peanut butter sandwiches" in an unnaturally drawn-out way, then I will have no problem detecting the situation and re-parsing it correctly.


The space is irrelevant. Consider the sentences:

The black bird ate seeds.

The blackbird flew at mach 3.

Your brain thinks of these two words completely differently and it's only through conscious effort that you think of them together. They are different words even though they sound and are spelled the same, regardless of the space.

A better example I think is "bear feet" vs "bare feet"


"Blackbird" refers to several species of actual bird, not just the plane. "The black bird ate seeds" and "The blackbird ate seeds" are both reasonable sentences, and they do potentially sound different.


I'm not convinced. It takes effort for me to break apart blackbird into two separate words in my head, as they are so commonly found together. When speaking "black bird" I would insert a long pause between the two and emphasise the "b" on bird to show that I'm not talking about a "blackbird".


The difference for me (maybe this is regional?) is that for “blackbird” the stress is on “black” whereas for “black bird” the stress is on “bird”.


Yes, it's mostly about the stress.


I get the impression that US English pronunciation runs the words "peanut butter" together much more than "international" English does.


I recently heard about a book called "Anything You Want" and wanted to add it to my shopping list. I tried repeatedly saying "Alexa, add 'Anything You Want' book to my shopping list" and every variation thereof. Alexa was unable to add that book title no matter what I tried.


This seems like a fundamentally hard problem. If I ask Alexa, "Play songs by Simon and Garfunkel", I may want to include their solo work ("Play songs by ([Paul] Simon) and ([Art] Garfunkel)") or not ("Play songs by (Simon and Garfunkel)"). The choice is probably more likely for some artist groups than others. It may even vary by user. It's hard to imagine a single trained AI that can handle that variance without a ton of very quickly-changing domain knowledge.


In my experience doing stuff like this for artist/song record linkage, the key is really to take a "query expansion" approach rather than a "normalization" approach, because choosing a single normalized form is impossible. So it's better to embrace that there are dozens, hundreds, or even thousands of interpretations and choose probabilistically.

A great example is trying to deal with the "sort name" of artists: e.g. "Presley, Elvis".

It's easy to assume that "Hazlewood, Lee & Nancy Sinatra" means "Lee Hazlewood & Nancy Sinatra".

How bout "Sinatra, Frank & Nancy"? Now the rules are different: the expansion could either be "Frank Sinatra & Nancy Sinatra" (correct) or "Frank Sinatra & Nancy" (but there's no singer who just goes by "Nancy", or is there?)

Now how about "Peter, Paul & Mary"? In that case it's already the literal expanded form referencing three people, not two people named "Paul Peter & Mary Peter" or "Paul Peter & Mary".

So, you just assume they are all possible and rank them based on real-world data. You're right, not always easy!

(Treating them as an unordered bag of tokens can either help or hurt accuracy – that has its own problems when you consider how short and similar many titles are, and how some artists deliberately name themselves as jokes/riffs on a more famous one. Not to mention that after all this it could still be ambiguous: MusicBrainz knows about six artists all named "Nirvana". So context is key!)
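
To make the "expand, then rank" idea concrete, a toy sketch (names, expansion rules, and counts all made up; a real system handles many more patterns):

    # Candidate readings of a sort name, scored against real-world usage counts.
    OBSERVED_COUNTS = {                              # made-up numbers
        "Frank Sinatra & Nancy Sinatra": 9500,
        "Frank Sinatra & Nancy": 12,
        "Lee Hazlewood & Nancy Sinatra": 4100,
    }

    def expand(sort_name):
        """Candidate expansions of a 'Last, First & Other' sort name."""
        head, _, rest = sort_name.partition(" & ")
        last, _, first = (p.strip() for p in head.partition(","))
        if not first:
            return [sort_name]                       # nothing to expand
        return [f"{first} {last} & {rest}",          # "... & Nancy"
                f"{first} {last} & {rest} {last}"]   # "... & Nancy Sinatra"

    def best_reading(sort_name):
        return max(expand(sort_name), key=lambda c: OBSERVED_COUNTS.get(c, 0))

    best_reading("Sinatra, Frank & Nancy")          # "Frank Sinatra & Nancy Sinatra"
    best_reading("Hazlewood, Lee & Nancy Sinatra")  # "Lee Hazlewood & Nancy Sinatra"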


I presume they will statistically choose the more likely option - i.e., people listen to 'Simon and Garfunkel' more than they listen to 'Simon' or 'Garfunkel'.

Humans will also screw this up. They don't have statistics about which you most likely meant but they do have context which an AI may not.


This yet again shows just how far computer science is from real natural language processing, despite all the "AI" companies' claims.

Unless they have everything hardcoded, even such natural things are impossible for the "natural" language processing programs to process.

All cloud "AI" and "natural" language processing services should really be called "lots and lots of hardcoded stuff language processing"


State-of-the-art virtual assistants offer little intelligence over a command-line interface; it's just that instead of typing the command in, you say it. Besides that, not much difference: the syntax is rigid and the computer doesn't understand your utterance any more intelligently than GCC understands "gcc -o my_prog my_prog.c".


It would be interesting if AI assistants were described honestly in this manner as a "spoken command line interface" instead of as equivalent to human speech recognition.

AI is mechanical turks all the way down.


And even with a rigid structure they only work for common cases. Using Siri in Italian and trying to get an English album playing is nigh impossible, and vice versa.


My girlfriend routinely asks questions with convoluted grammar and gets a relevant response much of the time. I'm always amazed when it happens.


Examples? I seem to constantly get nothing from simple grammar questions. :(


I use a Google Home, not Alexa, but I ask it oddly-worded things all the time. Here are two from the past week that worked: "Can you see if Reply All has a new episode and if so play it?" "Can I still use these green peppers I got two weeks ago?"


That first one sounds promising. I'll have to see if I can adapt it.

My devices are basically just gateways to audible, radio, and general timers. I have begun using the announcement features, but it is amusing to see the kids basically having announcement wars.


Alexa's shopping list definitely has room for improvement if it wants to be more natural. If I want to add bread and peanut butter to my shopping list, I need to say "Alexa, add peanut butter to my shopping list <pause> Alexa, add bread to my shopping list". If I say "Alexa, add bread and peanut butter to my shopping list", it adds a single item called "bread and peanut butter"


That is literally the opposite of what this article says.


How is the model adversarial? Also, the best configuration was already found two years back in an ACL paper: http://www.aclweb.org/anthology/N/N16/N16-1030.pdf. Isn't it cheating when you claim somebody else's result as your own? The industry is full of fraudsters nowadays.


Because who orders a single peanut.


I immediately think of Fezzik (sp.?) in The Princess Bride

https://i.kym-cdn.com/entries/icons/original/000/022/158/Pea...


Well this is not the key indicator. The key indicator is the absence of pause between the words, that's how we differentiate between "peanut butter" and "peanut, butter".


Far more important than the lack of a pause between the words is the a priori fact that "peanut butter" is a common single item and "peanut, butter" is an uncommon list of items. It is that fact that means you require a pause between the words to indicate "peanut, butter".

If you ordered "butter, peanuts" for example, it would probably get that it was two items even without the pause between words.

It's all about the prior probabilities.


I don't see why they can't both be important signals. I would hazard a guess that a combined approach is what humans do.

I'm not a linguist or anything, but it seems like in practice people may pronounce "peanut butter" a little differently when they say the two words together. Something like "peanubutter". Or maybe they convert the "t" in "peanut" into a glottal stop.

Anyway, if the "t" is absent when you're talking about peanut butter but present when you're talking about two separate items, I don't see why you shouldn't feel free to use that signal. But I also don't see why you shouldn't use probabilities as well.


I agree. When parsing speech, we humans listen for many cues all at once. Spoken language even has intentional redundancy so we can identify and disregard inconsistent cues. For example, a child or foreign speaker might replace "peanut" with "peanuts" and most people would still have no problem understanding "peanuts butter" as long as the rest of the cues are consistent.


They are both signals. But the prior probability is more important in this case because it is so much more common to say "peanut butter" than "peanut, butter". In other cases the pause might be a more important signal.


I would disagree here. Especially in the context of a shopping list.

If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me to ask for confirmation because that would be enough to create doubt.

'Peanut' and 'butter' are two unrelated words and it is the absence of pause that creates a pseudo single word.

Pauses are extremely important in spoken language and should be exploited.


>If I were to say 'peanut' <pause> 'butter' my interlocutor would probably interrupt me

I'm on the parent's side on this. "Peanut butter" is such a common item that if someone were to pause in between the words I would assume they just got distracted for some reason. In the context of a shopping list, "peanut" singular just doesn't make sense.

A better example would be something like "yogurt ice cream" which is technically incorrect but it's still something people might say. In that case, I'd expect a shorter pause than in the case of yogurt, ice cream. However, if you were dictating a list to me I'd probably ask for confirmation in either case because there's enough ambiguity.


Deliberate pause sounds a bit different than a pause caused by distraction.


If we’re assuming human-level AI, maybe, and not even then. Non-native English speakers from languages without a singular/plural distinction often use singulars where a plural is called for.

Domain knowledge around the intent (making a shopping list) and around language conventions helps a lot, not only for NLU, but also for NLS.

But even without a huge NLU dictionary, a simple MC or CRF segmentation model would work. As said earlier, almost no one says, “Add peanut to the shopping list.” They say, “Add peanuts”, or if they’re very atypical, “Add a peanut”. A bare “peanut” is simply unlikely.


I think the catch is that human detection of pauses is far more fraught with difficulty than you'd imagine. Overly emphasized pauses are easy to discern, but quite a lot of speech pulls words far closer together than you would think. And this is just in english. Not all languages work with emphasis syllables and such.


Exactly, and the difference of the length of pauses between 'syntactic' words (peanut, butter) and between compounded words (peanut-butter) will vary based on speaker and moreover, things like focus/emphasis can cause lengthening of pauses even for the latter case. ("No, Alexa, I'm not shopping for Whistler's Mother; I need to buy PEANUT BUTTER.")


A better example is "coffee cream" (Google fails on this one)


Well, what is it?

I imagine lists with post-specifiers are near impossible to parse too:

    Coffee, instant
    Custard, powdered
    Sugar
Bit contrived, but still.


I see your point for longer lists, but most humans would correctly interpret "Hey, add coffee cream to my shopping list" as one item rather than two.


Assuming most cultures have "coffee cream" in their common vernacular seems risky. I've never heard of it.


This. I'd assume it's like "creamer" which is a fake milk substitute from USA I think? But, it could be coffee flavoured cream; or "coffee creams" which are biscuits in the UK, or maybe some sort of confection?

Yeah, I should just look it up ...


"coffee cream" AKA "half and half" in much of the US or "half cream" in the UK


I (native UK) have never heard of "half cream", is that like "single cream"?


Half cream is less than single cream. AFAIK single cream has half the milkfat of whipping cream, while half-and-half/half cream has about 1/3.


That seems easy. Keep a pointer to the last item, and if you encounter a 'modifier', modify it.


Yeah, but which way? `Add custard powdered sugar to my shopping list`: is it powdered custard, or powdered sugar?



I had the exact same realization.


What happens if I say "peanuts, butter"? I expect two items, but will Alexa give me one?


As others have said, there's the pluralization of "peanut[s]" to distinguish between the two. This is a useful feature of English: the adjective-like role of a noun in a complex noun phrase is (almost?) always singular.

    - Computer engineer
    - NOT computers* engineer

    - Toothbrush
    - NOT teethbrush*

    - Foot doctor
    - NOT feet* doctor

    - Alarm clock
    - NOT alarms* clock, even when it supports multiple alarms!
Additionally, there's phrasal intonation. If the intonation and stress decrease throughout the phrase, it's a single item. If the intonation and stress reset for "butter," then it's a new item.

    - 'PEA ,Nut but ter
    - 'PEA nut 'BUT ter
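
(If you wanted to use just the pluralization cue as a feature, it could be as crude as this; my own toy sketch, not what Alexa actually does:)

    def likely_two_items(first, second, known_compounds):
        """Crude cue: a known compound ("peanut butter") is one item; a plural
        first word ("peanuts butter") suggests two, since plural modifiers are
        (almost?) never used in English noun compounds."""
        if f"{first} {second}" in known_compounds:
            return False                 # known compound -> one item
        return first.endswith("s")       # plural modifier -> probably two items

    likely_two_items("peanut", "butter", {"peanut butter"})   # False: one item
    likely_two_items("peanuts", "butter", {"peanut butter"})  # True: two items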


Proudfeet!


Alexa, please tell me about the...

Attorneys general Senators elect

Ahhhhhhh!


The difference is that these are phrases with adjectives, not nouns being used adjectivally.


"Attorney generals" is a noun phrase (admittedly of questionable adjectivity). "Attorneys general" is a blind idiot translation of a phrase in a language with different grammatical rules (Latin, IIRC).


"Attorney" is a noun. "General" as used here is an adjective. It's unusual in that the adjective follows the noun without a hyphen, but it's common enough, and it's where prepositional phrases are seen, like "Big man on campus" and "powers that be".

Did ancient Romans have attorneys general?


No one ever gets a single "peanut". So unless you mush mouth the "S", the reasonable expectation for both your cohabitator and the robot is to bring peanuts and butter.

A better question is "coconut, milk" versus "coconut milk".


> A better question is "coconut, milk" versus "coconut milk".

Sure, but if you were dictating to a human that would still be an easy one for them to get wrong, depending on how long you paused.

I find this interesting with phone numbers. In some countries you hear people say "thirty three sixty two" and they mean 303602


"Coconut" is still an anomalous grocery item. You'd want one of

- a coconut

- [number] coconuts

- shredded coconut

"Coconut" is best matched to that last option, but it's not a natural word choice. (Although it is a natural list entry... do people think of themselves as dictating to Alexa, or as writing the list themselves while happening to use their voice?)


If I'm making a list as a reminder to actually pick up items... coconut will suffice.


Yes, I agree. If you're writing a list for yourself, a bare "coconut" is a typical entry. But if you're dictating a shopping list to someone else, you're quite unlikely to say "coconut" because that isn't grammatical.

So it turns into a question of how people think about dictating to Alexa.


This is a good point - in reality Alexa doesn't really have to do a great job transcribing at all if it's just constructing a list as a reminder for you later.

If this is a precursor to being able to quickly voice-order stuff off Amazon to be delivered, though, it's a different story.


The one I thought of was "peanut butter M&Ms"


This is a very interesting observation. The point about speech-to-text models being biased towards the US, in terms of training data and innovation, holds not only for the larger things (gender/race/religion) but also for small things like this. And these are likely to cause daily problems.


And that's what makes it interesting. Peanut butter versus peanuts, butter is easy. No one gets a single peanut.

As for the phone number, that's why anyone in a serious occupation (aviation, military, etc.) treats each digit as stand-alone.

Three-zero-three-six-zero-two.


>No one gets a single peanut

You'd be surprised: https://www.youtube.com/watch?v=HoPFQm9PQ_M


A relevant question is, if you asked that to your spouse, would you be angry if they came home with a jar of peanut butter?


If it correctly understands peanutS, it will classify it as "more likely two items", considering it would check everything against some sort of dictionary, which contains "peanuts", "butter", and "peanut butter".

PS: I implemented something similar without machine learning and that's how I did it. With text it's easier, though; I suppose in NLU it could have a parameter for "pause time between words" which could also contribute to a different conclusion.


The German language is superior in this regard because there it would be either "peanut, butter" or "peanutbutter".


It's the same in English in this case. There's no pause when you talk about "peanut butter", but there's a significant one when listing the two.


How is this relevant? Peanut butter isn't spoken like "peanut, butter" it's spoken like "peanutbutter".


It depends. If I'm enunciating clearly, as I tend to do when I dictate to voice assistants, I'm going to tend to distinctly pronounce the "t" at the end of peanut and the "b" at the beginning of butter, which tends to produce something of a pause between the two words.


It wouldn't help much with tokenization of spoken words since it's hard to tell from its pronunciation whether an English phrase is compounded, hyphenated or separate words.


I'd say it's probably worse in this respect; doesn't German have a lot more of these "compound" words that would need to be parsed this way?


Exactly, car insurance is much better than Kraftfahrzeug-Haftpflichtversicherung.


If you want to make German look bad, sure, but a "better" translation of car insurance would be Autoversicherung.

I mean, "car insurance" is much better than "motor vehicle liability insurance" too.. ;-)


I know there's still a difference in length, but don't let English orthography fool you, English likes big compounds too: "high voltage electricity grid systems engineer team supervisor" is structurally the same as the unspaced German monstrous compounds. (I.e. the real difference between English and German in this area is just a matter of orthography and not actual linguistic structure.)


French would be easier to understand, as peanut butter is "beurre d'arachides" (butter from peanuts). The "de" (or "d'" in that example) gives you the "context"/what the butter is made of. Same for apple juice: it's "jus de pommes", etc.


It would be easier in languages that make use of a high degree of inflection, especially in Slavic languages, where (unusually among language families) declension also depends on whether the word is a noun or an adjective.


What about "peanut butter cookies"?


Homer: "Alexa, please order: peanut, butter cookie, d'oh!"


In that case you'd get peanut, butter cookie, and dough, so I suppose you could make a peanut and butter-cookie bread. lol.


There's only one way to find out.


Reminds me of the time I tested a web shop's ordering form for a paper catalogue, using a colleague's real address. He got something in the mail a few weeks later...


Pretty much every entity detector that I have used operates on a similar principle. Not sure what's new here.


While these types of blogs might not be revolutionary, they're still useful to people new to the subject who might just be getting into search or are having to implement lite search functionality into their applications.

I'm actually working on an application now where the initial spec called for "search" and it was implemented as exact token matching. A bug was immediately filed because searches for "wlk", "walk", and "walk event" all returned different results.


Have you often seen adversarial training used for sequence labeling to improve generalization across domains? The LSTM-CRF model appears to be the same model proposed by Lample et al (2016). I’d agree that it is common now to use that architecture.


I wonder what happens when you order 101 Dalmatians.


Basic probability theory and information theory can fix most of the problem: just calculate mutual information between adjacent words.
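
A hedged sketch of that (toy counts, assuming you have unigram/bigram frequencies from a corpus of shopping lists):

    import math

    def pmi(x, y, unigram, bigram, total):
        """Pointwise mutual information: log P(x, y) / (P(x) * P(y))."""
        p_xy = bigram.get((x, y), 0) / total
        p_x, p_y = unigram[x] / total, unigram[y] / total
        return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")

    unigram = {"peanut": 1000, "butter": 5000, "milk": 8000}     # made-up counts
    bigram = {("peanut", "butter"): 950, ("butter", "milk"): 3}
    total = 1_000_000

    # High PMI means the pair co-occurs far more than chance, so glue it together.
    pmi("peanut", "butter", unigram, bigram, total)  # large positive -> one item
    pmi("butter", "milk", unigram, bigram, total)    # negative -> two items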


If you google Elijah Wood, under "People also ask:" one of the options is "is elijah a wood" and I wonder if that's from speech-to-text that misunderstood.


no that's a meme lol


Great. Is "milk chocolate" one item or two?


and what would you add in this sentence:

"Alexa add paper towels milk and eggs to my shopping list" (punctuation intentionally left out)

It's a really cool problem to try and solve, and while I don't have an Alexa, I do have a google home which gets this kind of stuff right often enough that I don't really think about it any more (and kind of laugh on the rare chance it gets it wrong).


Same for "chocolate milk".


I'm pretty sure you could do this in a standard LALR(1) grammar and give [peanut butter] precedence over [peanut] [butter].


This would require that you have an exhaustive list of priorities typed out in a grammar, for each language. Word embeddings are more of a semi-supervised approach. There is no way grammars could cover all the cases in a scalable way.


True, so maybe the best approach is to use machine learning to generate the exhaustive list of priorities based on labeled human speech. Then your end product is something we can understand and tweak, instead of a black-box neural network.


You'd have to human-label that speech. That already won't scale very well, and requires per-language annotation.

Understanding and tweaking it should be done with hyperparameters, not semantic libraries.


You would need to look ahead recursively as different word orders may be acceptable in a human language. Not all languages are like French.


Because Alexa knows that there is a thing called "peanut butter" and you most probably meant that. I think humans behave in a similar fashion: if you are dictating a shopping list to a person who has never heard of "peanut butter", they would most likely raise a question and ask you to spell it.


So IOBES style NER...how surprising...
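
(For anyone who hasn't seen the acronym: IOBES just marks whether a token Begins, is Inside, Ends, or is a Single-token entity, with O for outside. Applied to the shopping-list case, hypothetically:)

    tagged = [("add", "O"), ("milk", "S-Item"), ("eggs", "S-Item"),
              ("peanut", "B-Item"), ("butter", "E-Item"),
              ("to", "O"), ("my", "O"), ("list", "O")]
    # B = begins a multi-token item, I = inside, E = ends it,
    # S = single-token item, O = outside any item.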


How come we are doing so much with machine learning but we still have situations like "You have 1 things in your shoping wagon"?


I think the explanation there is as simple as "there are lots of different people." I can't write software half as well as a significant percentage (maybe a majority) of HN users, but the average person can't write software half as well as me.

Of course, I could just be dead wrong. I'm just conjecturing while I procrastinate.


None of these "AIs" are actually AI. They are client-server voice commands. True AI requires zero internet connection, which no commercially included AI on any smartphone or tablet can offer today.


"True AI" (or any kind of AI for that matter) is not defined by the presence or absence of an internet connection.


Did Alexa figure out how to tell when I'm saying "Computer" to it vs. one of my friends?

I want to be able to talk to my computer like in Star Trek, dammit.



