Question for those familiar with the backend of things like Alexa, Google Home, Siri:
At what point can we say things like “turn off the bedroom light in 5 minutes” or “stop music and turn off all the lights”? Even something like “keep the lights on” in a motion sensor system seems to be impossible. To me these feel like low-hanging fruit, and yet despite all the advances in machine learning, and these systems being around for the better part of a decade, anything but the simplest single-task, no-modifier command results in a “sorry… I didn’t understand that” or completely unpredictable results. Is there something inherently difficult about these types of queries?
These are all subtly different problems I think, but in general most of these architectures currently assume there is a single intent.
> stop music and turn off all the lights
This is probably the easiest of the bunch because you are asking it to perform two distinct actions.
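Very roughly, "two distinct actions" means the utterance can be split into commands a single-intent pipeline already knows how to handle. A toy sketch (made-up function names, not any vendor's actual API):

```python
# A minimal sketch: split a compound utterance into independent
# single-intent commands and feed each to the existing handlers.

def split_compound(utterance: str) -> list[str]:
    # Naive conjunction split; real NLU would use a parser or a model.
    return [part.strip() for part in utterance.split(" and ")]

def handle(command: str) -> str:
    # Stand-in for the existing single-intent handlers.
    if "stop music" in command:
        return "music.stop()"
    if "turn off all the lights" in command:
        return "lights.off(group='all')"
    return "unknown intent"

for cmd in split_compound("stop music and turn off all the lights"):
    print(cmd, "->", handle(cmd))
```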
> turn off the bedroom light in 5 minutes
This is much more complex, because you are asking the application to set up some sort of workflow - after it understands what you want it to do, it then has to work out how to execute that, which means utilising the device APIs / services. This is a simple example, but there are lots of permutations of different actions here. For example, you might want to say "turn off the sound system once this song finishes playing", which assumes the assistant can understand that you want it to create a task waiting specifically for the trigger of a particular song finishing, and that it has the ability to set up that trigger.
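To make the "workflow" point concrete, here's a sketch of what "in 5 minutes" forces on the assistant: the utterance becomes a trigger plus an action rather than an immediate device call (everything below is hypothetical, not a real assistant API):

```python
# The deferred part of the command has to live somewhere until the
# trigger fires, and it now also needs to be cancellable.
import threading

def turn_off_bedroom_light():
    print("bedroom light -> off")   # would call the device API here

def schedule(action, delay_seconds: float) -> threading.Timer:
    timer = threading.Timer(delay_seconds, action)
    timer.start()
    return timer

# "turn off the bedroom light in 5 minutes" becomes:
pending = schedule(turn_off_bedroom_light, delay_seconds=5 * 60)
# ...and the assistant now also has to support "never mind":
# pending.cancel()
```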
> "keep the lights on” in a motion sensor system
Now this is where the orchestration gets tricky. The assistant has to (see the rough sketch after this list):
* Work out that the lights are being affected by a motion sensor system, which is likely outside its own platform.
* Work out that your intent is that you want the assistant to override that.
* Understand how to connect to the platform in order to control it.
* Work out what parameter it is supposed to alter to achieve this task.
* Override the existing user's settings, and presumably reinstate them after some period of time.
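A rough sketch of what that orchestration amounts to, with entirely hypothetical APIs (the point is the pause-then-restore bookkeeping, not the specific calls):

```python
# Pause the third-party motion automation, then remember to restore it.
import threading

class MotionPlatform:
    """Stand-in for a third-party motion-sensor hub integration."""
    def __init__(self):
        self.motion_automation_enabled = True

    def set_automation(self, enabled: bool):
        self.motion_automation_enabled = enabled
        print("motion automation:", "on" if enabled else "off")

def keep_lights_on(platform: MotionPlatform, restore_after_s: float = 3600):
    platform.set_automation(False)          # override the user's automation
    threading.Timer(                        # and undo the override later
        restore_after_s, platform.set_automation, args=(True,)
    ).start()

keep_lights_on(MotionPlatform(), restore_after_s=5)
```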
> This is much more complex, because you are asking the application to set up some sort of workflow - after it understands what you want it to do, it then has to work out how to execute that, which means utilising the device APIs / services.
I don't see the conceptual difference between the first and the second example. You're still executing two distinct actions, the first being waiting for x amount of time?
The deferred execution makes this more complex, as now the yet-to-be-executed part needs to be queued somewhere.
It could be issuing a command in the same way you do the first, but probably most lights just support switching them on/off right now, and won't take a delay argument.
So, if you can't just do the same you were doing, how do you do it?
If there's some kind of local support, you can issue a delay command to some device in the network that will switch the lights on, but you still need to be able to control that action as the user might want to cancel or adjust the delay too.
If the execution needs to happen remotely because there's no local support, then you don't even know that you'll be able to reach the device in 5 minutes (Internet down, router not letting random connections in).
And keeping this request queued needs some infra on the server side too, together with the necessary APIs to allow adjusting and cancelling requests.
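In code terms, the server-side bookkeeping being described is roughly this (hypothetical names, just to show why "adjust" and "cancel" need the deferred command to have an identity):

```python
# Deferred commands need an id so that later utterances
# ("cancel that", "make it ten minutes") can find and modify them.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class DeferredCommand:
    device: str
    action: str
    run_at: float
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

class CommandQueue:
    def __init__(self):
        self._pending: dict[str, DeferredCommand] = {}

    def enqueue(self, device: str, action: str, delay_s: float) -> str:
        cmd = DeferredCommand(device, action, time.time() + delay_s)
        self._pending[cmd.id] = cmd
        return cmd.id

    def adjust(self, cmd_id: str, new_delay_s: float):
        self._pending[cmd_id].run_at = time.time() + new_delay_s

    def cancel(self, cmd_id: str):
        self._pending.pop(cmd_id, None)

queue = CommandQueue()
cmd_id = queue.enqueue("bedroom_light", "off", delay_s=300)
queue.adjust(cmd_id, new_delay_s=600)   # "actually, make it ten minutes"
queue.cancel(cmd_id)                    # "never mind"
```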
> If there's some kind of local support, you can issue a delay command to some device in the network that will switch the lights on, but you still need to be able to control that action as the user might want to cancel or adjust the delay too.
I'm pretty sure there is some form of local support, at least for Google. When I last used it you had to choose a speaker that automations executed from.
It's not distinct, the second action depends on the first one being finished.
I think this highlights that AI assistants are not really "I": they are good at fuzzy, hard-to-specify things like understanding speech, but there must still be an engine behind that which interprets the text and transforms it into execution steps, and that still has to be built by developers, so it will be limited.
Is it just me or does it seem like these tasks would be not all that hard if you just, you know, programmed them rather than trying to be so fancy with ML?
> Is it just me or does it seem like these tasks would be not all that hard if you just, you know, programmed them rather than trying to be so fancy with ML?
That's how you do it at the moment. For example, in Google home you have "Routines"[1] which are a list of sequential actions triggered by a key phrase (or optionally something else).
I think the issue is the number of permutations of the tasks.
i.e. you can program the use case "turn off the lights in 5 minutes" but that doesn't cover the use cases of "turn off the lights when i leave the house", "turn off the lights and turn on the tv" or "turn on the lights at 10pm tonight" - and there are so many potential scenarios here that it can quickly scale up.
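For context, a "Routine" really is about this simple under the hood - a hardcoded key phrase mapped to a fixed list of actions run in order. A sketch (the action strings are placeholders, not a real vendor API):

```python
# A routine: key phrase -> fixed, sequential list of actions.
ROUTINES = {
    "i'm leaving": ["lights.all.off", "thermostat.eco", "door.lock"],
    "i'm going to bed": ["music.stop", "lights.all.off"],
}

def run_routine(phrase: str):
    for action in ROUTINES.get(phrase.lower(), []):
        print("executing:", action)   # would dispatch to the device here

run_routine("I'm leaving")
```

Which is exactly why it covers the one phrasing you programmed and nothing else.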
Language models can generate code from text instructions. They just need a training set to learn the target APIs. I expect in the next couple of years to see automated desktop operation (RPA) from text commands, generalising access over human interfaces.
It's really a shame the good language models are not deployed as voice assistants. It would probably be expensive to offer and they don't have the scale necessary. Just to load one of these models you need a $100K computer.
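Back-of-envelope on the "just to load it" claim, assuming 16-bit weights and ignoring activations and optimizer state:

```python
# Rough memory footprint of the weights alone, at 2 bytes per parameter.
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(20))    # AlexaTM 20B: ~40 GB of weights
print(weight_memory_gb(175))   # GPT-3-sized 175B: ~350 GB
```

40 GB already rules out consumer GPUs; the larger models need multiple accelerators just for inference.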
It also depends on what the biggest priority is - I would assume there is a bigger 'quick win' from becoming more reliable at single-intent actions from a market/customer-experience perspective rather than pursuing highly complex multi-intent statements.
More than 99% of commands will be single intent, and they probably work 80% of the time at the moment, so getting those to 99% will have a much bigger short-term impact than focussing on solving the 1% (with the added benefit that once you have solved the first case of getting single-intent right all the time, solving the second, more complex queries will be easier, as you will have built a more robust base).
Hello, I'm Gretta! I'm here to help. It seems you're looking for information on enterprise chatbots geometry. Answer "yes" to access your account's balance or press # for more options.
I can already do something similar with Siri. "Remind me to take out the trash in 5 minutes". Seems odd that "turn off the bedroom light" isn't trivial to support.
If it can turn lights on and off, and separately play some sort of reminder sound after five minutes, then it can turn the lights off after five minutes.
The problem is that there are so many different permutations of this that the reminder service (and every other service) would then need to accommodate hundreds of other scenarios, such as:
- When I arrive at home, turn on the lights
- After five minutes, play this song
- Lock the front door at 10pm
And then the reminders service (which is a to-do app!) suddenly becomes 100x more complicated than it was, to accommodate all these strange use cases that are outside of its bounded context.
It's easy to implement a single use case; it's practically impossible to manually code every possible use case (which would also probably create a huge amount of tech debt).
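A toy illustration of that blow-up: if every trigger/action pairing has to be a hand-written case (rather than triggers and actions composing generically), the table grows multiplicatively. Names below are illustrative only:

```python
# One hand-written handler per (trigger, action) pairing scales badly.
TRIGGERS = ["in 5 minutes", "when I arrive home", "at 10pm", "when this song ends"]
ACTIONS = ["turn off the lights", "play this song", "lock the front door", "start the robovac"]

handlers = {(t, a): f"handler_{i}_{j}"
            for i, t in enumerate(TRIGGERS)
            for j, a in enumerate(ACTIONS)}

print(len(handlers), "hand-coded cases for just",
      len(TRIGGERS), "triggers and", len(ACTIONS), "actions")
```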
> At what point can we say things like “turn off the bedroom light in 5 minutes”
I often get ready to leave the house and say "start the robovac in five minutes" and it just works. The Google home gadget confirms and states that it will start robovac "at $time-of-day". I've not got too many things connected but I assumed that it's generic among those things that can be turned on/off.
It’s sad that the only option is a voice assistant that must learn how to interpret my words through this slow, error-prone process. I would much rather have a pure speech-to-text option where I must learn the exact words to say to get a reliable result.
There was a scifi story where people on a particular space station used a whistle-based UI, where you would whistle a pattern of tones at the computer and it would beep back.
So, in essence you want a language that will tell the AI what to do? A programming language, if I may say so?
It's almost as if that natural language thing is highly ambiguous, which is why we needed to create a new grammar to give precise instructions for machine-driven processes.
Nope, no AI, no NLP. Being so in love with that stuff is why everything is so disappointing now. I’d like a pure speech-to-text mode where I must get the syntax correct, or the system rejects the command. E.g. I must say “set a four minute timer”, not “timer for four minutes”.
This is basically what iOS shortcuts is hoping to be. You can do some simple coding with widgets like that and then you can invoke your shortcut like `hey siri, <shortcut>`.
Avoids reciting the steps you want and gives you reliability, but I suppose requires any 'variables' to be hardcoded (or for you to create multiple instances)
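For what it's worth, the strict-syntax idea a couple of comments up is basically a rigid grammar: anything that doesn't match the known phrasing is rejected rather than guessed at. A toy sketch (the grammar and command names are made up):

```python
# Reject anything that doesn't match the exact expected phrasing.
import re

TIMER = re.compile(r"^set a (\d+) minute timer$")

def parse(utterance: str):
    m = TIMER.match(utterance.lower().strip())
    if not m:
        raise ValueError("command rejected: does not match known syntax")
    return ("timer.set", int(m.group(1)))

print(parse("set a 4 minute timer"))   # -> ('timer.set', 4)
# parse("timer for four minutes")      # -> ValueError: command rejected
```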
My benchmark for Siri will be when it learns to do "Siri, wake me up an hour earlier tomorrow". What it currently does is set a new alarm for 1 hour in the future.
I have Google Home and the first one worked. The 2nd did not, but I don't think I'd ever say that because I just have routines where I say "I'm leaving" or "I'm going to bed" and everything shuts off at once.
Or my new favorite (Alexa): “It’s going to be 43° today with clear skies. … By the way, did you know I can tell you about great deals on gifts or everyday items? Just say ‘Alexa, tell me about Prime Savings’. Would you like to hear today’s?”
Me: already down the hall cursing my smart speaker.
My guess is that it's a very different (and difficult) problem to generalize that way. Interpreting intent and taking action are different aspects. Someone needs to write code to call a vendor's API to execute those actions and that's a super specialized action. Next step is probably instruct a CoPilot-like tool to do it.
This actually works already with Siri (and as mentioned in a sibling comment with google Home as well). I just tried that for fun a few days ago and was surprised that it actually worked.
I wonder if the problem is that it doesn't understand what "the lights" is? It must depend on how you've tagged these things, right?
In the Phillips* Hue app at least, they have the idea of lights independently, but you can also group them into rooms and stuff like that. So the multiple bulbs in my floor lamp are in a "lamp" grouping but also a "living room" grouping. It all seems quite flexible.
What about something like "all lights" "living room lights" etc etc?
* Or rather, whoever bought that brand from Phillips
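The grouping idea described above is roughly this kind of data structure: a bulb can belong to several groups, and a command resolves a group name to the set of bulbs it should touch (names below are made up, not the Hue API):

```python
# Bulbs belong to multiple overlapping groups; commands target group names.
GROUPS = {
    "lamp": {"bulb_1", "bulb_2", "bulb_3"},
    "living room": {"bulb_1", "bulb_2", "bulb_3", "bulb_4"},
    "bedroom": {"bulb_5"},
}
GROUPS["all lights"] = set().union(*GROUPS.values())

def resolve(target: str) -> set[str]:
    return GROUPS.get(target.lower(), set())

print(resolve("living room"))   # bulbs touched by "turn off the living room lights"
```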
Does Home have some label for that? Can you say "turn off the floor-standing lamp" and it works? If so, that would indeed be very weird. I have a sleeping room in Home and said it should turn off the light in the sleeping room in 5 minutes, and it worked. Maybe it only works with rooms.
These systems have important complexity layers that may not be immediately apparent: latency and edge hardware limits.
Yes, maybe a GPU cluster on a server can understand you quickly, but taking whatever model you have and getting it to respond quickly enough for people not to be upset is a giant problem.
Anecdotal story: if you have a group of lights, you can't just say to Alexa "turn lights off"; she will keep asking which one you want. I ended up hardcoding "turn lights off" as a routine trigger.
> We follow Hoffmann et al. (2022) and pre-train the model for roughly 1 Trillion tokens (longer than the 300B token updates of GPT-3).
If I'm understanding the discussion of the Chinchilla paper correctly[0] then this should offer a significantly better boost than increasing the number of parameters would have. Also really cool that they make the model easy(ish) to run and play with!
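Rough arithmetic, assuming the ~20-tokens-per-parameter rule of thumb that is often quoted from the Chinchilla paper (not a claim about what the AlexaTM authors actually targeted):

```python
# How the reported 1T training tokens compares to the Chinchilla heuristic.
params = 20e9
chinchilla_tokens = 20 * params      # ~0.4T tokens would be "compute-optimal"
actual_tokens = 1e12                 # the paper reports roughly 1T
print(actual_tokens / chinchilla_tokens)   # ~2.5x more data per parameter
```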
Not sure how much scaling laws apply here, since this is a seq-to-seq model instead of an autoregressive causal model. It's interesting to see AlexaTM performing better than GPT-3 on SuperGLUE and SQuADv2, but it fails on chain-of-thought prompting, which is a bummer. So, is it because it's a different model or because it is positively leveraging multilingual tokens? I wish they had compared this architecture to a classic GPT-family model.
Yeah it's pretty nice isn't it. I was confused because I'd done Cmd-F search on every page - maybe Firefox (which I use) needs to incorporate a similar feature. Hmm.
A boss of mine was adamant that it took humans a year or so to learn to reliably communicate in a new workgroup, for one person to say a thing and have everyone else understand what was meant.
This is the type of thing you learn to say when you're dealing with a slot filling algorithm that allows for overspecification. By putting it in one utterance, you avoid it saying coming back and saying "Where do you want a reservation?" "When do you want your reservations?" "How many people are in your party?"
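A sketch of the slot-filling pattern being described: the intent has required slots, and the system only asks follow-up questions for the ones the utterance didn't already fill, so overspecifying up front skips the back-and-forth (slot and prompt names are made up):

```python
# Only prompt for slots the utterance left empty.
RESERVATION_SLOTS = ["restaurant", "time", "party_size"]
PROMPTS = {
    "restaurant": "Where do you want a reservation?",
    "time": "When do you want your reservation?",
    "party_size": "How many people are in your party?",
}

def missing_prompts(filled: dict) -> list[str]:
    return [PROMPTS[s] for s in RESERVATION_SLOTS if s not in filled]

# Fully specified utterance: nothing left to ask.
print(missing_prompts({"restaurant": "Luigi's", "time": "7pm", "party_size": 4}))
# "Book me a table at Luigi's": two follow-up questions remain.
print(missing_prompts({"restaurant": "Luigi's"}))
```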
The most notable thing about this model is that they use fewer parameters (20 billion) than many of the other LLMs, which makes it less resource-intensive to train and easier to run.
They also use an encoder-decoder architecture, which is common for machine translation, unlike most large language models which are decoder-only.
As a complete outsider: has ML research just become a phallus measuring contest to see who can stuff the most parameters into a model? In other words, who can acquire the most Nvidia cards? The model size seems to always be the headline in stuff I see on HN.
+1, also this is a teacher model. The implications are huge here as AWS will likely spin this into an offering like they did with their other AI products. Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however 20b is small enough that it would be easy to finetune this on a smaller dataset for a specific task.
You could then distill that model and end up with something that’s a fraction of the size (6b parameters for example, just under 1/3, would fit on commercial GPUs like 3090s). There are some interesting examples of this with smaller models like BERT/BART or PEGASUS in Huggingface Transformer’s seq2seq distillation examples.
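A minimal sketch of the distillation step being gestured at, in plain PyTorch rather than the actual Hugging Face seq2seq distillation script: the small student is trained to match the big teacher's softened output distribution.

```python
# Standard knowledge-distillation loss: KL between temperature-softened
# student and teacher distributions (the classic Hinton-style recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(8, 32000)   # (batch, vocab) from the small model
teacher_logits = torch.randn(8, 32000)   # same shape from the 20B teacher
print(distillation_loss(student_logits, teacher_logits))
```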
As others pointed out, this paper tries to do more with fewer params.
But you've identified a trend that does actually describe large language models for the past few years (they've been getting bigger, and bigger has been better). Like microprocessors have the famous tick/tock cycle (https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model), I think models might be seeing something similar emerge naturally (make models bigger --> make models better (shrink) --> make models bigger again --> make models better (shrink again)).
Also, most of this LLM stuff is probably not trained on NVidia hardware -- at scale it's probably cost prohibitive if not also hard to set up. Google's TPUs, MSFT/Amazon's equivalent custom hardware, or other specialized accelerators are more economical overall.
This blows my mind. How is it even possible to validate a model that incorporates 20B parameters? How do you even test something this complex and non-deterministic?
I assume some kind of infallible automated tooling is used to write tests that validate this monster. I would LOVE to see what that tooling looks like.
It _is_ deterministic (same input gives same output).
You typically don't "test" pairs of inputs/outputs for a model. Instead you measure its performance by defining metrics e.g. "what's the ROUGE-2 score on summarization after fine-tuning AlexaTM 20B using N examples from dataset Y"
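Concretely, "measure a metric" looks something like this - here using the Hugging Face `evaluate` package (assuming it and its ROUGE backend are installed); you score an evaluation set, you don't unit-test individual outputs:

```python
# Score generated text against references with ROUGE instead of
# asserting exact outputs.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the lights were turned off after five minutes"],
    references=["the lights turned off after five minutes"],
)
print(scores["rouge2"])
```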
You can test some aspects of ML models, like sync testing (if you train on hardware A and run on hardware B, their results are not always the same). But generally you test the code that embeds the model, not the model itself.
How do you define validate? These models aren't formally proven to work in all cases or anything. They're just tested on a load of data, and if it's found that they work pretty well, then they get released.
I would assume consciousness/sentience requires significant feedback loops, so the "thought process" can "keep going". I don't think any of these models have real feedback loops.
No more than you should worry about your phone, your toaster, or a boulder suddenly becoming self-aware.
Extremely theoretically, a consciousness could spontaneously form at any point [1]. In practice, there is no reason to worry about this - it's not a likely event. There is nothing about this model (or any other one in existence) that increases the likelihood of it achieving sentience when compared to anything else ever.
Probably, for ethical reasons. but we should worry a lot more if it achieves sapience. I'm also pretty sure that anything (sapience, sentience, whatever) requires closing the loop so the perceiver can perceive itself perceiving, though the idea that I am merely a single Boltzmannesque forward-pass of a language model will certainly run in the back of my head as I'm trying to drift off to sleep tonight.
I don't really buy the whole "it teaches itself to hack its way out" idea. I think a sentient AI would be able to introspect and manipulate its operating environment if it was specifically designed to have that capability. Another scenario might be that somebody installs the sentient AI inside a drone submarine with nuclear-armed SLBMs.
You can contrive scenarios where the AI is given capabilities that make it hard to kill. But barring such scenarios, if the program scares you then just SIGKILL it.
I meant that solely as a joke. I can kind of buy that someday systems can replicate themselves. But I don't see that on any near horizon; such that I agree with you on this.