20B-parameter Alexa model sets new marks in few-shot learning (amazon.science)
112 points by reckel on Aug 2, 2022 | 86 comments



Question for those familiar with the backend of things like Alexa, Google Home, Siri:

At what point can we say things like “turn off the bedroom light in 5 minutes” or “stop music and turn off all the lights”? Even something like “keep the lights on” in a motion sensor system seems impossible. To me these feel like low-hanging fruit, and yet despite all the advances in machine learning, and these systems being around for the better part of a decade, anything but the simplest single-task, no-modifier command results in a “sorry… I didn’t understand that” or in completely unpredictable behavior. Is there something inherently difficult about these types of queries?


These are all subtly different problems I think, but in general most of these architectures currently assume there is a single intent.

> stop music and turn off all the lights

This is probably the easiest of the bunch because you are asking it to perform two distinct actions.

> turn off the bedroom light in 5 minutes

This is much more complex, because you are asking the application to set up some sort of workflow - after it understands what you want it to do, it then has to work out how to execute that, which means utilising the device APIs/services. This is a simple example, but there are lots of permutations of different actions here. For example, you might want to say "turn off the sound system once this song finishes playing", which assumes the assistant can understand that you want it to create a task waiting specifically for the trigger of a particular song finishing, and that it has the ability to set up that trigger.
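
As a rough illustration of what "setting up that trigger" could look like under the hood, here is a sketch in Python - purely hypothetical, the event names and the EventBus are invented for the example and are not anything Alexa actually exposes:

    # Hypothetical sketch: a one-shot trigger that fires an action when an event arrives.
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class EventBus:
        _triggers: list[tuple[str, Callable[[], None]]] = field(default_factory=list)

        def register(self, event_name: str, action: Callable[[], None]) -> None:
            self._triggers.append((event_name, action))

        def emit(self, event_name: str) -> None:
            # fire and discard any one-shot triggers waiting on this event
            for trigger in [t for t in self._triggers if t[0] == event_name]:
                self._triggers.remove(trigger)
                trigger[1]()

    bus = EventBus()
    # "turn off the sound system once this song finishes playing"
    bus.register("song_finished", lambda: print("sound system: off"))
    # ...later, the media player reports the event:
    bus.emit("song_finished")

The hard part isn't this code, it's that the assistant has to know that a "song_finished" event even exists for that particular device and wire it up on the fly.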

> "keep the lights on” in a motion sensor system

Now this is where the orchestration gets tricky -

The assistant has to:

* Work out that the lights are being affected by a motion sensor system, which is likely outside its own platform.

* Work out that your intent is that you want the assistant to override that.

* Understand how to connect to the platform in order to control it.

* Work out what parameter it is supposed to alter to achieve this task.

* Override the existing user's settings, and presumably reinstate the settings after some portion of time.


> This is much more complex, because you are asking the application to set up some sort of workflow - after it understands what you want it to do, it then has to work out how to execute that, which means utilising the device APIs/services.

I don't see conceptually the difference between the first and the second example. You're still executing two distinct actions, first being the waiting for x amount of time?


The deferred execution makes this more complex, as the yet-to-be-executed part now needs to be queued somewhere.

It could be issuing a command in the same way you do for the first one, but most lights probably just support being switched on/off right now, and won't take a delay argument.

So, if you can't just do the same thing you were doing, how do you do it? If there's some kind of local support, you can issue a delay command to some device in the network that will switch the lights on, but you still need to be able to control that action as the user might want to cancel or adjust the delay too.

If the execution needs to happen remotely because there's no local support, then you don't even know that you'll be able to reach the device in 5 minutes (Internet down, router not letting random connections in). And keeping the request queued needs some infra on the server side too, together with the necessary APIs to allow adjusting and cancelling requests.


Just put everything in queues, even immediate things.

The AI feeds the action queues, and another isolated component performs the actions, including delays and follow ups.
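
A minimal sketch of that split, assuming a single in-process queue - a real assistant would need durable, server-side storage so a deferred command survives restarts and can be cancelled or adjusted, and every name here is made up:

    # The assistant only enqueues actions; a separate worker executes them when due.
    import heapq
    import itertools
    import time

    _seq = itertools.count()
    queue: list[tuple[float, int, str]] = []  # (due_time, tie-breaker, description)

    def enqueue(description: str, delay_seconds: float = 0.0) -> None:
        heapq.heappush(queue, (time.time() + delay_seconds, next(_seq), description))

    def worker_step() -> None:
        # execute (here: print) every queued action whose due time has passed
        now = time.time()
        while queue and queue[0][0] <= now:
            _, _, description = heapq.heappop(queue)
            print(f"executing: {description}")

    # "stop music and turn off all the lights" -> two immediate actions
    enqueue("stop music")
    enqueue("turn off all lights")
    # "turn off the bedroom light in 5 minutes" -> one deferred action
    enqueue("turn off bedroom light", delay_seconds=300)

    while queue:  # in a real system this loop is a long-running service
        worker_step()
        time.sleep(1)

Cancelling or adjusting "that bedroom light thing I asked for" then becomes its own problem: the queue entries need stable IDs, and the voice front end needs a way to refer back to them.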


> If there's some kind of local support, you can issue a delay command to some device in the network that will switch the lights on, but you still need to be able to control that action as the user might want to cancel or adjust the delay too.

I'm pretty sure there is some form of local support, at least for Google. When I last used it you had to choose a speaker that automations executed from.


So it requires things software engineers do all the time.

Where is the hard part?


But you can apparently tell it “remember I put my keys upstairs”, and when you ask where your keys are, it tells you upstairs.

There is storage going on already


> You're still executing two distinct actions, first being the waiting for x amount of time?

It's not a fixed amount of time. What if you rewind in the song a bit to catch something you missed in the lyrics?


It's not distinct; the second action depends on the first one being finished.

I think this highlights that AI assistants are not really the "I" in "AI": they are good at doing fuzzy, hard-to-specify things like understanding speech, but there must still be an engine behind them that interprets the text and transforms it into execution steps, and that still has to be built by developers, so it will be limited.


Notwithstanding the voice->text part:

Is it just me or does it seem like these tasks would be not all that hard if you just, you know, programmed them rather than trying to be so fancy with ML?


> Is it just me or does it seem like these tasks would be not all that hard if you just, you know, programmed them rather than trying to be so fancy with ML?

That's how you do it at the moment. For example, in Google Home you have "Routines" [1], which are a list of sequential actions triggered by a key phrase (or optionally something else).

[1] https://support.google.com/googlenest/answer/7029585?


I think the issue is the number of permutations of the tasks.

I.e. you can program the use case "turn off the lights in 5 minutes", but that doesn't cover the use cases of "turn off the lights when I leave the house", "turn off the lights and turn on the tv" or "turn on the lights at 10pm tonight" - and there are so many potential scenarios here that the work quickly scales up.


Language models can generate code from text instructions. They just need a training set to learn the target APIs. I expect in the next couple of years to see automated desktop operation (RPA) from text commands, generalising access over human interfaces.

It's really a shame the good language models are not deployed as voice assistants. It would probably be expensive to offer and they don't have the scale necessary. Just to load one of these models you need a $100K computer.


It also depends on what the biggest priority is - I would assume there is a bigger 'quick win' from becoming more reliable at single-intent actions from a market/customer-experience perspective than from pursuing highly complex multi-intent statements.

>99% of commands will be single intent, and they probably work 80% of the time at the moment, so getting those to 99% will have a much bigger short-term impact than focussing on solving the 1% (with the added benefit that once you have solved the first case of getting single-intent right all the time, solving the second, more complex class of queries will be easier, as you will have built a more robust base).


I'm curious: what's the leading edge in commercial chatbots?


Hello, I'm Gretta! I'm here to help. It seems you're looking for information on enterprise chatbots geometry. Answer "yes" to access your account's balance or press # for more options.


> turn off the bedroom light in 5 minutes

I can already do something similar with Siri. "Remind me to take out the trash in 5 minutes". Seems odd that "turn off the bedroom light" isn't trivial to support.


The iOS reminders app already has a concept of "remind user <x> at <y> time" so Siri is just creating one of those.

The bedroom light may not have such a structured concept already existing (or may not expose the full-featured API to Siri).


If it can turn lights on and off, and separately play some sort of reminder sound after five minutes, then it can turn the lights off after five minutes.


It just pipes the command to the appropriate app/service. That's why it doesn't work: the reminder service supports that, but the lights service doesn't.

There is no "general ai" doing stuff in the background.


The lights service doesn't have to know anything about the reminder service.

The reminder service needs "one line of code (tm)" to go "After five minutes, send <this> to the lights service".


The problem is that there are so many different permutations of this that the reminder service (and every other service) would then need to accommodate hundreds of other scenarios, such as:

- When I arrive at home, turn on the lights
- After five minutes, play this song
- Lock the front door at 10pm

And then the reminders service (which is a to-do app!) suddenly becomes 100x more complicated than it was, to accommodate all these strange use cases that are outside of its bounded context.

It's easy to implement a single use case, it's practically impossible to manually code every possible use case (which would also probably create a huge amount of tech debt).


Google assistant/home supports it. Just tried it.


> At what point can we say things like “turn off the bedroom light in 5 minutes”

I often get ready to leave the house and say "start the robovac in five minutes" and it just works. The Google home gadget confirms and states that it will start robovac "at $time-of-day". I've not got too many things connected but I assumed that it's generic among those things that can be turned on/off.


It’s sad that the only option is a voice assistant that must learn how to interpret my words through this slow, error-prone process. I would much rather have a pure speech-to-text option where I must learn the exact words to say to get a reliable result.


There was a scifi story where people on a particular space station used a whistle-based UI, where you would whistle a pattern of tones at the computer and it would beep back.

I'd go for one of those.


So, in essence you want a language that will tell the AI what to do? A programming language, if I may say so?

It's almost as if that natural language thing is highly ambiguous, which is why we needed to create a new grammar to give precise instructions for machine-driven processes.


Nope, no AI, no NLP. Being so in love with that stuff is why everything is so disappointing now. I’d like a pure speech-to-text mode where I must get the syntax correct or the system rejects the command. E.g. I must say “set a four minute timer”, not “timer for four minutes”.


Yeah. I wouldn't even mind learning weird syntax or grammar, like

Thread.sleep(5 * 60 * 1000); light.off();


This is basically what iOS Shortcuts is hoping to be. You can do some simple coding with widgets like that, and then you can invoke your shortcut with `hey siri, <shortcut>`.

Avoids reciting the steps you want and gives you reliability, but I suppose requires any 'variables' to be hardcoded (or for you to create multiple instances)


My benchmark for Siri will be when it learns to do “Siri, wake me up an hour earlier tomorrow”. What it currently does is set a new alarm for 1 hour in the future.


I have Google Home and the first one worked. The second did not, but I don't think I'd ever say that because I just have routines where I say "I'm leaving" or "I'm going to bed" and everything shuts off at once.


I swear there was a golden age in 2016 where voice assistants were magical and worked very well, and then around 2017/2018 it started to:

* Not recognize my voice/command

* Load infinitely when I give a command

* Do the wrong thing entirely


Or my new favorite (Alexa): “It’s going to be 43° today with clear skies. … By the way, did you know I can tell you about great deals on gifts or everyday items? Just say ‘Alexa, tell me about Prime Savings’. Would you like to hear today’s?”

Me: already down the hall cursing my smart speaker.


My guess is that it's a very different (and difficult) problem to generalize that way. Interpreting intent and taking action are different aspects. Someone needs to write code to call a vendor's API to execute those actions, and that's a super specialized task. The next step is probably instructing a Copilot-like tool to do it.


> turn off the bedroom light in 5 minutes

This actually works already with Siri (and, as mentioned in a sibling comment, with Google Home as well). I just tried that for fun a few days ago and was surprised that it actually worked.


It didn’t work for me.

“Turn off the lights in 5 minutes” did, but “turn off the floorstanding lamp in 5 minutes” did not.

Honestly it's more frustrating when it's not uniform, and now I have to remember this weird behavior.


I wonder if the problem is that it doesn't understand what "the lights" is? It must depend on how you've tagged these things, right?

In the Philips* Hue app at least, they have the idea of lights independently, but you can also group them into rooms and stuff like that. So the multiple bulbs in my floor lamp are in a "lamp" grouping but also a "living room" grouping. It all seems quite flexible.

What about something like "all lights" "living room lights" etc etc?

* Or rather, whoever bought that brand from Philips


Does Home have some label for that? Can you say "turn off the floorstanding lamp" and it works? If so, that would indeed be very weird. I have a sleeping room in Home and said it should turn off the light in the sleeping room in 5 minutes, and it worked. Maybe it only works with rooms.


>turn off the bedroom light in 5 minutes

This works for me, except when it mishears what I'm saying as "turn off the light for 5 minutes", which gets a bit annoying


These systems have important complexity layers that may not be immediately apparent: latency and edge hardware limits.

Yes, maybe a GPU cluster on a server can understand you quickly, but taking whatever model you have and getting it to respond quickly enough for people not to be upset is a giant problem.


Anecdotal story: if you have a group of lights, you can't just say to Alexa "turn lights off"; she will keep asking which one you want. I ended up hardcoding "turn lights off" as a routine trigger.

You had one job, Amazon!


Also, why doesn’t Siri work at all in Honda Civics and Honda HR-Vs when connected to CarPlay and driving on the highway with no radio playing?

Google Assistant works fine, for the most part anyway.


Weird. Works fine in an Insight, which is in all respects a Civic Hybrid.


> We follow Hoffmann et al. (2022) and pre-train the model for roughly 1 Trillion tokens (longer than the 300B token updates of GPT-3).

If I'm understanding the discussion of the Chinchilla paper correctly [0], then this should offer a significantly better boost than increasing the number of parameters would have. Also really cool that they make the model easy(ish) to run and play with!

[0]: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...
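
Back-of-the-envelope, using the roughly-20-tokens-per-parameter rule of thumb people commonly take from the Chinchilla results (my numbers, not anything from the AlexaTM paper):

    params = 20e9
    chinchilla_optimal_tokens = 20 * params           # ~400B tokens for a 20B model
    actual_tokens = 1e12                              # ~1T tokens per the paper
    print(actual_tokens / chinchilla_optimal_tokens)  # ~2.5x the compute-optimal budget

So they train well past the compute-optimal token count for a model this size, trading extra training compute for a smaller, cheaper-to-serve model.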


Not sure how much scaling laws apply here, since this is a seq-to-seq model instead of an autoregressive causal model. It's interesting to see AlexaTM performing better than GPT-3 on SuperGLUE and SQuADv2, but it fails on chain-of-thought prompting, which is a bummer. So, is it because it's a different model, or because it is positively leveraging multilingual tokens? I wish they had compared this architecture to a classic GPT-family model.


20 billion parameters and the UI for voice is still cringe level terrible.

Or is it just me and I’ve turned into a get-off-my-lawn curmudgeon when it comes to audio interfaces?

> Find a reservation far from my work location in eight hours for 8 people at Union Auto Company.

Said absolutely no one ever, right? I guess if this is what it’s trained on it’s no wonder.


I can't parse that sentence.


Yeah many options:

- (Find a reservation) (far from my work location in eight hours) (for 8 people at Union Auto Company).

- (Find a reservation far from my work location) (in eight hours) (for 8 people at Union Auto Company).

- (Find a reservation far from my work location in eight hours for 8 people) (at Union Auto Company).

Lots more options, but it gets confusing if you start to dig in.


How interesting! (BTW I don't see where this sentence is coming from, and I looked in the papers too)

Here's how Spacy sees it: [1]

So in that case it thinks the command is "in 8 hours find a reservation that is far from my work location and is for 8 people at Union Auto Company".

This parse is almost certainly incorrect.

[1] https://explosion.ai/demos/displacy?text=Find%20a%20reservat...



Yeah but where did they get that sentence?


It's in the figure - in the yellow bubble under "Few input examples in English".

Interestingly on iPhone I could copy/paste the text, but on desktop Chrome I can't, and Ctrl-F doesn't work on it. Vector text in a .webp file?


Thanks!

It's not vector text (webp doesn't do that). You are seeing the magic text recognition that iOS can do on images.


Wow, that is a very pleasing and seamless UX to just treat it like actual embedded text.

Not often am I actually pleasantly surprised by something new in UI. Apple Pay on the Web might have been my last ‘Woah’ moment actually.


Yeah it's pretty nice isn't it. I was confused because I'd done Cmd-F search on every page - maybe Firefox (which I use) needs to incorporate a similar feature. Hmm.


That's actually an interesting thought I've been having: will audio interfaces eventually train humans to speak in a certain way?


A boss of mine was adamant that it took humans a year or so to learn to reliably communicate in a new workgroup, for one person to say a thing and have everyone else understand what was meant.


Oh we try - it still doesn’t work.


This is the type of thing you learn to say when you're dealing with a slot-filling algorithm that allows for overspecification. By putting it all in one utterance, you avoid it coming back and saying "Where do you want a reservation?" "When do you want your reservation?" "How many people are in your party?"
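
Roughly what a slot-filling layer does with an over-specified utterance like that - a toy sketch, with invented slot names and intent:

    # Toy slot filler for a "book_restaurant" intent: fill what you can from one
    # utterance, then ask follow-up questions only for whatever is still missing.
    REQUIRED_SLOTS = ["location", "time", "party_size", "venue"]

    def next_prompts(extracted: dict) -> list[str]:
        return [f"What is the {slot}?" for slot in REQUIRED_SLOTS if slot not in extracted]

    # "Find a reservation far from my work location in eight hours for 8 people at Union Auto Company"
    extracted = {
        "location": "far from my work location",
        "time": "in eight hours",
        "party_size": 8,
        "venue": "Union Auto Company",
    }
    print(next_prompts(extracted))  # [] -> nothing left to ask, no back-and-forth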


The most notable thing about this model is that they use fewer parameters (20 billion) than many of the other LLMs, which makes it less resource-intensive to train and easier to run.

They also use an encoder-decoder architecture, which is common for machine translation, unlike most large language models which are decoder-only.

https://community.libretranslate.com/t/alexatm-a-20b-multili...


As a complete outsider: has ML research just become a phallus measuring contest to see who can stuff the most parameters into a model? In other words, who can acquire the most Nvidia cards? The model size seems to always be the headline in stuff I see on HN.


This is a small model keeping up with the big guys. 20B parameters might fit on 2 beefy GPUs; that's a bargain compared to GPT-3.
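
Rough memory math, counting fp16 weights only (activations and serving overhead are extra):

    weights_gb = lambda n_params, bytes_per_param=2: n_params * bytes_per_param / 1e9
    print(weights_gb(20e9))   # ~40 GB  -> fits across two 24-48 GB cards
    print(weights_gb(175e9))  # ~350 GB -> GPT-3-sized models need a small cluster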


Yeah, this is the opposite: they did impressively well with fewer parameters.

In general, larger models and more data have been an effective strategy for getting better performance, but getting the right ratio is also important: https://www.deepmind.com/publications/an-empirical-analysis-...


+1, also this is a teacher model. The implications are huge here, as AWS will likely spin this into an offering like they did with their other AI products. Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however, 20B is small enough that it would be easy to fine-tune it on a smaller dataset for a specific task.

You could then distill that model and end up with something that’s a fraction of the size (6B parameters, for example, just under 1/3, would fit on consumer GPUs like 3090s). There are some interesting examples of this with smaller models like BERT/BART or PEGASUS in the Hugging Face Transformers seq2seq distillation examples.
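
For reference, the core of that kind of distillation is just training the smaller student to match the teacher's softened output distribution. A minimal sketch of the loss term, assuming PyTorch (real setups usually mix this with the ordinary hard-label loss):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
        """KL divergence between softened teacher and student distributions."""
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        # batchmean + T^2 scaling is the standard Hinton-style formulation
        return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2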


As others pointed out, this paper tries to do more with fewer params.

But you've identified a trend that does actually describe large language models for the past few years (they've been getting bigger, and bigger has been better). Like microprocessors have the famous tick/tock cycle (https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model), I think models might be seeing something similar emerge naturally (make models bigger --> make models better (shrink) --> make models bigger again --> make models better (shrink again)).

Also, most of this LLM stuff is probably not trained on Nvidia hardware -- at scale it's probably cost-prohibitive, if not also hard to set up. Google's TPUs, MSFT/Amazon's equivalent custom hardware, or other specialized accelerators are more economical overall.


Ok now how about having Alexa turn off the music, when I say “Alexa, turn off the music”?


How about the wake word even working reliably anymore, or not offering me an upsell on every other interaction?


This blows my mind. How is it even possible to validate a model that incorporates 20B parameters? How do you even test something this complex and non-deterministic?

I assume some kind of infallible automated tooling is used to write tests that validate this monster. I would LOVE to see what that tooling looks like.


It _is_ deterministic (same input gives same output).

You typically don't "test" pairs of inputs/outputs for a model. Instead you measure its performance by defining metrics e.g. "what's the ROUGE-2 score on summarization after fine-tuning AlexaTM 20B using N examples from dataset Y"

You can test some aspects of ML models, like sync testing (if you train on hardware A and run on hardware B, their results are not always the same). But generally you test the code that embeds the model, not the model itself.
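
E.g. with the Hugging Face `evaluate` library, measuring a summarization fine-tune looks roughly like this (a sketch; the predictions would come from the fine-tuned model and the references from the eval set):

    import evaluate

    rouge = evaluate.load("rouge")
    predictions = ["the model summarises the article"]       # model outputs
    references = ["the model summarizes the article well"]   # gold summaries
    print(rouge.compute(predictions=predictions, references=references))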


How do you define validate? These models aren't formally proven to work in all cases or anything. They're just tested on a load of data, and if it's found that they work pretty well, then they get released.


So, like 90% of the non-FAANG software that we use every day


>"excels other large language models"

It's also good that its announcement excels at having a grammatically correct subtitle.



Should we worry if it achieves sentience? Any reason why we shouldn't?


I would assume consciousness/sentience requires significant feedback loops, so the "thought process" can "keep going". I don't think any of these models have real feedback loops.


They do during training, but not when you use them


No more than you should worry about your phone, your toaster, or a boulder suddenly becoming self-aware.

Extremely theoretically, a consciousness could spontaneously form at any point [1]. In practice, there is no reason to worry about this - it's not a likely event. There is nothing about this model (or any other one in existence) that increases the likelihood of it achieving sentience when compared to anything else ever.

[1] https://en.wikipedia.org/wiki/Boltzmann_brain


Probably, for ethical reasons. But we should worry a lot more if it achieves sapience. I'm also pretty sure that anything (sapience, sentience, whatever) requires closing the loop so the perceiver can perceive itself perceiving, though the idea that I am merely a single Boltzmannesque forward pass of a language model will certainly run in the back of my head as I'm trying to drift off to sleep tonight.


Assuming the premise of this achieving sentience...

> Should we worry if it achieves sentience?

No, because it's easy to kill.


So should we worry if it learns to propagate itself? :)


I don't really buy the whole "it teaches itself to hack itself out" idea. I think a sentient AI would be able to introspect and manipulate its operating environment if it was specifically designed to have that capability. Another scenario might be that somebody installs the sentient AI inside a drone submarine with nuclear-armed SLBMs.

You can contrive scenarios where the AI is given capabilities that make it hard to kill. But barring such scenarios, if the program scares you then just SIGKILL it.


I meant that solely as a joke. I can kind of buy that someday systems can replicate themselves. But I don't see that on any near horizon; such that I agree with you on this.


Depends on what you define sentience as.

Is it the presence of a soul, or the presence of capability to learn stuff it is not trained for?


Alexa is trash.



