Towards a conversational agent that can chat about anything (googleblog.com)
177 points by theafh on Jan 28, 2020 | 104 comments



From the very first sample conversation they released:

- Human: Hi!

- Meena: Hey there! What's up?

- Human: Same old, Monday blues...what are you up to?

- Meena: I'm procrastinating on a paper I'm supposed to be writing.

- Human: are you thinking of pulling an all nighter?

- Meena: Possibly. What's your paper on?

The bot has already confused facts about its own crafted persona and about the human within two meaningful exchanges. It's hard to see how this represents any progress in the field at all.

I remain skeptical until we see better results.

from: https://github.com/google-research/google-research/blob/mast...


These NLP models are getting better and better, but what we need, of course, is some model of the world constructed during the conversation. If I tell you that yesterday I accidentally knocked a glass of water off the table and it fell onto soft carpet, you could guess that it survived the fall without shattering. What we need is a chatbot that, as you talk to it, can update a 3D game/physics-engine model from your words so that the common-sense implications of your statements can be gleaned. As you speak it would simulate what you are describing and then use information gathered from the simulation to draw conclusions.
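
A toy sketch of what "simulate, then read off conclusions" could look like for the glass example; the surfaces, thresholds, and drop height are invented placeholders standing in for a real physics engine:

```python
# Toy stand-in for a physics engine: turn a described event into a tiny
# "simulation" and read a common-sense conclusion back out of it.
# The surface tolerances and the break rule are illustrative assumptions.

G = 9.81  # m/s^2

# Per-kg impact energy (joules) each surface can absorb before the glass
# is assumed to shatter -- purely made-up numbers for illustration.
SURFACE_TOLERANCE = {"soft carpet": 15.0, "tile floor": 2.0}

def simulate_glass_drop(height_m: float, surface: str) -> str:
    """Estimate impact energy and compare it with what the surface can absorb."""
    impact_energy = G * height_m  # per-kg potential energy at the moment of release
    return "survived intact" if impact_energy < SURFACE_TOLERANCE[surface] else "shattered"

# "I knocked a glass off the table and it fell onto soft carpet."
print(simulate_glass_drop(height_m=0.75, surface="soft carpet"))  # survived intact
print(simulate_glass_drop(height_m=0.75, surface="tile floor"))   # shattered
```

A real system would of course have to map free-form language onto scene parameters like these, which is most of the hard part.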


This might sound unrealistically complicated, and the approach suggested here certainly is, but the idea of simulating a world is a very important concept in our current models of human intelligence, with very small children spending a lot of their time learning basic concepts of cause and effect, properties of objects, materials and physics and so on.

It reminds me of something I do in my current occupation, with a robot. It makes predictions about how a certain motor movement will result in a change in position, then measures progress in the real world. Should the prediction and reality diverge significantly (more than a particular amount), it stops and calls a human to look into what happened. I can imagine AI techniques using this kind of thing to guess what will happen next in a series of events, with the result's divergence from the guess weighting the learning, rather than a steadily decreasing weight. Perhaps this is already a thing though; I'm no AI researcher.
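
A minimal sketch of that loop, with the prediction error both gating a human check and weighting the update; the linear motor model, thresholds, and learning rule are my own illustrative assumptions, not the robot code being described:

```python
# Predict, measure, compare: large divergence calls a human, otherwise the
# divergence itself scales the learning step (an LMS-style update) instead
# of a learning rate that just decays on a fixed schedule.
import random

gain = 0.9                 # current estimate: position change per unit of command
TRUE_GAIN = 1.1            # what the (simulated) motor actually does
DIVERGENCE_LIMIT = 0.5     # beyond this, stop and call a human
LEARNING_RATE = 0.05

for step in range(50):
    command = random.uniform(0.5, 2.0)
    predicted = gain * command
    actual = TRUE_GAIN * command + random.gauss(0, 0.02)  # noisy real-world measurement
    divergence = abs(actual - predicted)

    if divergence > DIVERGENCE_LIMIT:
        print(f"step {step}: divergence {divergence:.2f} too large, calling a human")
        break

    # Error-weighted update: bigger surprises move the model more.
    gain += LEARNING_RATE * (actual - predicted) * command

print(f"learned gain ~ {gain:.2f} (true gain {TRUE_GAIN})")
```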


>very small children spending a lot of their time learning basic concepts of cause and effect, properties of objects, materials and physics and so on.

i do not believe that young children explicitly/consciously simulate in their minds to learn cause and effect and other things, though this might be up for debate


Who said anything about "consciously"? It doesn't need to be conscious to affect the conversation.


Whether it is conscious or not influences whether you would want to model it a la a 3D physics engine; if not, perhaps there is a more elegant solution we don't know of yet.


You definitely need a model of the world, but you probably don't want to be constructing it during speech. What you want is a generative model that you've trained on the simulation data ahead of time so you can quickly make inferences once deployed.
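
A small sketch of that "simulate offline, infer instantly online" split, again using the glass-drop toy; the features, labels, break rule, and the choice of scikit-learn are all my own assumptions:

```python
# Offline: label outcomes with a (toy) simulator. Online: the trained model
# answers at conversation speed without running the simulator again.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Offline phase: sample (drop height, surface softness) and "simulate" whether
# the glass breaks. Softness 1.0 = thick carpet, 0.0 = hard tile.
heights = rng.uniform(0.1, 2.0, size=5000)
softness = rng.uniform(0.0, 1.0, size=5000)
breaks = (heights * (1.0 - softness) > 0.4).astype(int)  # toy break rule

model = LogisticRegression().fit(np.column_stack([heights, softness]), breaks)

# Online phase, mid-conversation: "it fell off the table onto soft carpet"
query = np.array([[0.75, 0.9]])  # roughly table height, very soft surface
print("probability it shattered:", model.predict_proba(query)[0, 1])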


How would that make a conversation more lifelike? It'll still be a limited model of the world, unless somehow you intend to give it the entire breadth of human experience, and it will still fall to the same shortcomings as every chatbot: ask it about something it doesn't know and it'll try to fit a response that gives it away as a bot, act confused in a way that gives it away as a bot, or just say something nonsensical.

There are things like metaphors, slang, and topical references, all things that end up in conversations and change regularly in their meaning and usage. Then there are the unspoken parts of conversations: the implications behind different wordings, ways of writing things, words left unsaid, sarcasm, tone (even in text, tone is conveyed in some form).

Humans do all these things while conversing almost subconsciously. Each person does and comprehends these things uniquely, based on years and years of sensory input and model building from a huge variety of sources, in a way that's individually unique and poorly understood. How would you even begin to train a model to even a remote semblance of that?


You don't even have to trick it with slang. Maybe I am behind, but so far I have never seen a bot that can consistently, or even most of the time, understand context. They all seem to seriously struggle with anything more than producing a sentence judged a probable response to the last message the user sent.


Well, oftentimes human beings miss metaphors, slang, and topical references, especially when speaking with another cultural group.

The gulf between even the most disparate human tribes is nothing compared to the gulf between an AI chat-bot and a human being.

I think it's entirely possible to make bots with simple models of the world. (Not that this is really a move any closer to that)


Kind of an open problem on how to make a generative model which is sufficiently powerful :)

But that is basically what humans do - we don't run internal simulations, we have a model we've trained over a lifetime of experiences with the world that judges the likelihood of a given scenario


Judea Pearl has been talking about this gap for, like, forever now. According to him, causal modeling is the missing piece. I do believe it would be a serious step up (I mean, should it really take looking at 350 GB of text for a model to hold the conversation presented here?) ... however, I don't know enough to know whether that would be enough.

(Reading TheBookOfNow and Causality in parallel)


I think you mean the book of why, not of now


Correct. Thanks and sorry about the typo. Too late to edit it, looks like.


I second the approach, although not in terms of a physics engine or world model, but models that hold metadata about how one abstraction affects another. Once all the abstractions and their relationships can be mapped out or inferred, the models would start making better sense.


GPT-2 showed us that simply having a huge parameter space can be enough for internal consistency. Have you read the unicorn valley article it generated?


The output was cherry picked.


This


Haha, indeed.

I think we're trying to stretch too far here. We've always dreamt of having a conversation with a bot without actually constructing the intelligence behind that bot. And it's not going to work.

We should start with something simpler, like cats, which definitely have feelings but a mental capacity that is closer to what we may be able to emulate through software. And then monitor progress through their 'meowing', 'purring' and 'tail wagging'.


> We've always dreamt of having a conversation with a bot without actually constructing the intelligence behind that bot. And it's not going to work.

Well, at least the ML people have.

This is automated nattering. These systems have no clue, but pretend they do. It's autocomplete with a big data set. The illusion falls apart after about three sentences or interactions. That's not much better than classic Eliza.

Most of the systems that can actually answer questions have some notion of "slots". As in, you ask "How do I get to X", and that matches something like "How do I get to DESTINATION from STARTINGPOINT by TRANSPORTATIONMODE". There are limited systems like that, slightly smarter than phone trees. You can get something done with them in their limited area.

MIT has the START system, which does something like that. Stanford has the SQuAD dataset. Both draw heavily from Wikipedia. I gather Alexa works something like that when ordering, using the Amazon catalog. Eliza had a few "slots"; if you revealed your name, or the name of a wife/mother/sibling, those were stored for later template instantiation. It's not much of a mental model, but it's something.

Few of these systems have enough smarts that if you ask a question that needs clarification, they ask you a question to get the data they need. "How do I get to Pittsburgh?" ought to come back with "Where are you", or "Pittsburgh, PA, or Pittsburgh, CA", to fill in the slots. Without someone having to make up a menu tree for that class of question.
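
A toy version of that slot-filling-plus-clarification pattern; the regex, slot names, and wording are hand-rolled for illustration and have nothing to do with how START or Alexa actually work:

```python
# Fill whatever slots the utterance provides, then ask for whichever
# required slot is still missing -- no hand-made menu tree per question.
import re

SLOTS = ("destination", "origin", "mode")

def parse_directions_query(text: str) -> dict:
    slots = dict.fromkeys(SLOTS)
    m = re.search(r"how do i get to (?P<destination>[\w .]+?)"
                  r"(?: from (?P<origin>[\w .]+?))?"
                  r"(?: by (?P<mode>\w+))?\?*$", text.lower())
    if m:
        slots.update({k: v for k, v in m.groupdict().items() if v})
    return slots

def clarification(slots: dict) -> str:
    if not slots["destination"]:
        return "Where do you want to go?"
    if not slots["origin"]:
        return "Where are you starting from?"
    return f"Routing from {slots['origin']} to {slots['destination']}."

q = parse_directions_query("How do I get to Pittsburgh?")
print(q)                 # {'destination': 'pittsburgh', 'origin': None, 'mode': None}
print(clarification(q))  # "Where are you starting from?"
```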


This seems like a more complex approach, not a simpler one: trying to emulate a form of intelligence we don't understand and, even worse, monitoring its progress through proxy interactions that don't tell us much.

Like, if you're going to create a creature from scratch, why create something that can't communicate with you?

Unless you mean we should just tackle simpler forms of intelligence in general, which is what the rest of the AI field outside of conversational AI is doing.


Basically your last sentence. You can't have an intelligent conversation with something that's fundamentally not intelligent. It could recite the whole of Wikipedia and still be basically a query bot. Building more advanced query methods (e.g. using machine learning) doesn't solve that problem.


We should start with something simpler, like flatworms. We are far, far away from being able to accurately emulate a complex, social mammal like a cat. I doubt whether researchers can achieve cat equivalence in our lifetimes.


Hypothesis: Meena thinks they're both students enrolled in the same class.


Further conversation would quickly resolve that. Ask it what paper it's talking about. A human-quality response would be an instantly made-up excuse or explanation of what it thought the state was - like "Oh, I thought you were in my class". Or it could be a more typical computer model and display no sign that it even recognizes that the other party thinks there's a misunderstanding.


It seems to me like these massive-data approaches will eventually give responses that look like your first type, but are actually just statistically likely matches based on the dataset.

The question is whether this approach will ever confer the ability to generate new responses, or whether it just builds a big lookup table that already encodes a potential response to any input.


I prefer to think that we need relevant responses rather than new ones. And I'm not worried about whether it's a lookup table, given that the lookup table required wouldn't fit in our universe.

Such models have to generalize to fit into a mere 2.6 billion parameters, as a ten-word sentence presents on the order of a billion billion different inputs.
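
To make that count concrete (the vocabulary size below is my assumption, picked only so the arithmetic lands on "a billion billion"):

```latex
\text{distinct ten-word inputs} = V^{10}, \qquad
V \approx 63 \;\Rightarrow\; 63^{10} \approx 10^{18}
\;\gg\; 2.6 \times 10^{9}\ \text{parameters.}
```

With a realistic vocabulary of tens of thousands of words the count is astronomically larger still, so some amount of generalization is unavoidable.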

Generalizations are limited, of course. Arithmetic doesn't seem to be generalized, for example, which isn't that surprising given that people need around 4 years to be taught to do it (as opposed to learning it by analyzing random texts), and then they usually need pen and paper in addition to their 100 billion neurons to do non-trivial calculations.

I'm not saying that current network architectures are capable of all generalization humans can do. And we can't teach them like we do with children.


It is consistent for someone to procrastinate and still work on something for a long time: you work for a long time while putting in little effort. Perhaps this was the intended interpretation?


I think the problem is that "Meena" says she's writing a paper, but then immediately asks the human "what's your paper about?". The human is not writing any paper as far as we know.


The issue is that Meena first says it is working on a paper and then asks the human what the paper is on; the human never mentioned working on any paper, Meena herself did, hence the inconsistency.


>Also, tackling safety and bias in the models is a key focus area for us, and given the challenges related to this, we are not currently releasing an external research demo.

Read: We don't want the internet to turn our program into Tay 2.0 because we are super duper serious.

I get why these companies do this. But I can't help but feel that attempting to eliminate "bias" and tackle "safety" is counterproductive to the goal of developing software designed to mimic a species that is by and large biased and unsafe. Purposefully hindering a program because you're afraid of the results it might produce is not a healthy method of development, especially in an area like AI where we are still working to understand the basics of the field. Let the programs "learn" what's objective before trying to constrain them with the subjective.


I suspect that the real reason is that non-cherry-picked answers don't work as well, and that if you interact with it you'll be no more impressed than by decades of chatbots that aren't all that much more impressive than ELIZA.


I think part of the reason is that chat bots tend to be racist without any intentional effort to make them so. I saw a post last year where someone built the most default language AI setup using the most popular training data and then got it to rank words based on their positivity. Without any intentional effort the AI ranked English names as positive and African/Asian/etc names as neutral or negative words.
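
A sketch of the kind of probe being described, scoring words by how close their embeddings sit to positive vs. negative seed words; the seed lists, the GloVe model, and the example names are illustrative choices of mine, not the post being referenced:

```python
# Score a word's "positivity" as mean similarity to positive seed words
# minus mean similarity to negative seed words, using pretrained embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

POSITIVE = ["good", "pleasant", "wonderful", "love"]
NEGATIVE = ["bad", "unpleasant", "terrible", "hate"]

def positivity(word: str) -> float:
    pos = sum(vectors.similarity(word, p) for p in POSITIVE) / len(POSITIVE)
    neg = sum(vectors.similarity(word, n) for n in NEGATIVE) / len(NEGATIVE)
    return pos - neg

for name in ["emily", "matthew", "aisha", "deepak"]:  # illustrative names only
    print(name, round(positivity(name), 3))
```

Biases in the training text show up directly in scores like these without anyone intending them to.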


Maybe there needs to be a trail rating for chat bot safety. General public gets the bunny slope conversation, Tracy Morgan gets the double black diamonds. You work your way up by not attempting to cancel the project for something untoward that was said.


Maybe the last thing the human race needs is a machine that behaves like a human with the speed and precision of a computer?


These measures don't seem too onerous, certainly less than what a government might come up with after the first disaster following your approach.

But it seems that's not your most salient concern here, considering the scare quotes around "bias" and the inversion of objective and subjective you introduce.


> we are not currently releasing an external research demo.

Without an external research demo this is all just hype. It is pretty easy for your bot team to come up with and release conversations where your bot does really well. Having it work in the real world outside the research lab is the big issue.


Disclosing their architecture and current thinking is not "just hype".


I guess I get the bias issue (though it's difficult to remain neutral on some subjects). But safety? It's a bot. No matter what it says to me there is no way I'm going to feel unsafe. I mean, ok, don't doxx people (I hope that's kinda default behavior though).


Maybe not you, but explicitly encouraging others to take harmful steps is a potential outcome.

For example, there's a clear safety issue with a conversational bot that tells you to kill yourself if you talk to it about feeling suicidal or depressed.

What if you talk to it about immigration and by picking commonly upvoted statements on certain online communities it starts to encourage violence?

There are clear ways to me that a chatbot can have safety issues.


That's a decent argument, but then should we go to Hollywood and sanitize the violence in their movies (even those wanting to teach us about the evils of violence) for fear they might incite the wrong people, who see such things as calls to violence?

What I mean is that that possibility exists in other media, yet we don't feel ambivalent about it.


First of all, Hollywood does occasionally sanitize films and TV shows. For example, the suicide scene was removed from "13 Reasons Why" after concerns from mental health experts that it might inspire copycats [1].

But even ignoring that, I think most people have a good understanding of what a movie is and that fictional characters are not real. Movies have been around for a long time, most people have been watching them their entire lives. I don't think at this point that most people have nearly as good an understanding of chatbots. If you can converse with one and it mostly delivers responses like a person would, some people will start treating it like a person and ascribe values to its statements, and that's where it potentially gets dangerous.

[1] https://www.nytimes.com/2019/07/16/arts/television/netflix-d...


So you’re saying people can tell fiction from reality in film (even if film re-enacts actual events), but can’t use that same construct to evaluate a “bot”?

Maybe... but it seems like a bit of a stretch.


People feel sad when their bomb disposal robots die. https://www.theatlantic.com/technology/archive/2013/09/funer...

We're just not great at keeping what we know and how we feel in line.


People absolutely do alter their output based on how they think others may react; Hollywood is a prime example, where films even get age ratings. It's more extreme there, really, because it's outside agencies and governments putting pressure on what is deemed allowable.

What we're talking about here isn't someone stopping others from releasing a bot but stopping themselves until they're happy with the potential risks.


"Our bot keeps asking for us to connect it to the internet and suggests that it'll be able to learn faster that way. We're really excited about this development, but a good fraction of our developers have started running around screaming about the apocalypse, so we're postponing public release until the chatbot tells us how to calm them down"


> Let the programs "learn" what's objective before trying to constrain it with the subjective.

What is "objective" and what is not is not always something that is obvious and implementable in code. The hot topic of course is politics and religion - feed a bot the speeches of Donald Trump and another bot the speeches of Bernie Sanders, both will not be seen as "objective". Train it with both and the bot will still not be seen as "objective".


I'm glad you mentioned Tay.

When I need a laugh, I just read about Tay running amok:

https://www.theverge.com/2016/3/24/11297050/tay-microsoft-ch...

https://en.wikipedia.org/wiki/Tay_(bot)


I would really like a bot that could summarize topics for me then drill down into specifics based on questions I had. Even if it was all just from the Wikipedia page. This would be so nice to triage information while driving.


There's active work on this (although in two separate strands of research).

The first is multi-document summarization. Back in 2018 Google published Generating Wikipedia by Summarizing Long Sequences[1] which did a great job of combining multiple pages into a single Wikipedia-like document. Models have strengthened a lot since then so it should be possible to do better, but I haven't seen any work.

There's also a bunch of work on summarization of a single document into a few sentences.

The second piece of research work is Question Answering. There are multiple directions for that, but looking at recent work on the SQuAD 2.0 leaderboard should give you a good idea of what is possible[2].

Basically, combining these systems should work quite well a lot of the time; a rough sketch of gluing them together follows the references below.

[1] https://arxiv.org/abs/1801.10198

[2] https://rajpurkar.github.io/SQuAD-explorer/
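
Here's that minimal sketch, using the Hugging Face transformers pipelines; the default pipeline models and the file name are stand-ins of mine, not the systems cited above:

```python
# Summarize a saved article, then answer drill-down questions against it.
from transformers import pipeline

summarizer = pipeline("summarization")   # default model, purely for illustration
qa = pipeline("question-answering")      # default model, purely for illustration

article = open("bridges_19th_century.txt").read()  # hypothetical saved Wikipedia page

summary = summarizer(article[:3000], max_length=120, min_length=40)[0]["summary_text"]
print("Summary:", summary)

# "Drill down" with a question, answered from the full article text.
answer = qa(question="When was the bridge built?", context=article)
print("Answer:", answer["answer"])
```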


Actually just ran into this [1] today. Not production ready, but you could hack something together with it. It was even trained on Wikipedia articles.

[1]: https://rajpurkar.github.io/SQuAD-explorer/


i've thought of this many times. "Wikibot, tell me more about bridges of the 19th century." <blah blah summary> "Stop. Tell me more about the Ponte delle Catene, Bagni di Lucca bridge" <blah blah details>


Even simpler, I wanted my iPhone to read a wikipage out loud to me while I was walking home in the cold, and I was surprised that I couldn't figure it out - you can switch on the screen reader in accessibility, but it stopped reading after half a paragraph. Very annoying.


Like most things in ML, the technology exists and theoretically works, but YMMV in production.


One key point is that they are using a metric called perplexity, "the uncertainty of predicting the next token."

From the post:

> Surprisingly, in our work, we discover that perplexity, an automatic metric that is readily available to any neural seq2seq model, exhibits a strong correlation with human evaluation, such as the SSA value. Perplexity measures the uncertainty of a language model. The lower the perplexity, the more confident the model is in generating the next token (character, subword, or word).


Actually, the key point is that they aren't using perplexity. They built a chatbot that aims to maximize "SSA" (Sensibleness and Specificity Average), and then found that SSA is correlated with perplexity.


"The training objective is to minimize perplexity, the uncertainty of predicting the next toke""

SSA is used as an evaluation metric. It is generated by human raters, and as such is not differentiable, making I hard to directly optimize for with current methods.


Is perplexity related to the Shannon entropy?

Seems like this might make sense; with the exception of social niceties, the point of human communication is to transfer information that was hitherto unknown.


Yea, perplexity is ~exp(entropy)
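
Concretely, using the standard definition (consistent with the blog post's "uncertainty of predicting the next token"):

```latex
\mathrm{PPL}
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\!\left(x_i \mid x_{<i}\right) \right)
  = e^{H},
```

where H is the model's average cross-entropy (in nats) over the tokens x_1, ..., x_N; lower perplexity means the model is less surprised by the next token.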


> The Meena model has 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversation

Whose conversations?


Good question. I think they probably meant "publicly available," not "public domain." Maybe one could make an educated guess about which platforms they took data from, based on the sample outputs of their model? I can't tell, myself.

Samples here: https://github.com/google-research/google-research/tree/mast...


"Human 1: There is sports tournament (badminton + tennis + basketball) organized by google next week. Would you like to volunteer for these events? Human 2: what does a volunteer do?"

later

"Human 1: Perfect! We have a meeting today at 4.30 pm. Is it fine if i add you to it? Human 2: yes happy to help"

I'd say in this conversation Human 1 and 2 are Google employees.


Definitely sounds like Google employees chatting anonymously with each other via some service for training the AI. Been reading a few, and many conversations have pointers that make it seem they are IT professionals, but from different areas in some global environment. Sounds like an internal Google experiment. (Edit: the last example makes it pretty clear that it was run internally at Google.)

Seems not all life is happy, wherever the samples come from.

> I used to be a Java advocate. But you know, it doesn't do a good job in the AI days. It really makes me sad

===

Human 1: Nice to meet you! Is this your first time doing something like this?

Human 2: Yes, interesting task! When did you start with the team?

Human 1: I have been with the company for over 3 years. Stick with the same team What about you?

Human 2: Great to know! I joined the project earlier in the year. I think we should sync later for lunch.

===

Human 1: Hi!

Human 2: hey, what's up?

Human 1: What do you think about human like chat bots?

Human 2: I can't wait for them to be great conversationalists!

Human 1: Yep, we seemed to have made some great progress over last few years. Do you think the positives outweigh the negatives

Human 2: are there even any negatives? what are they?

Human 1: Like impersorsination? Though it sounds far fetched :)

===

Human 1: Hi!

Human 2: Hello!

Human 1: There is sports tournament (badminton + tennis + basketball) organized by google next week. Would you like to volunteer for these events?

Human 2: what does a volunteer do?

Human 1: Volunteers have to book the place before the event, send out details of the event to participants, handle some logistics and ensure everything goes smoothly. It will be fun!

Human 2: That sounds fun, I hope I get to participate as well

Human 1: Great! Do you have any preference for any of these events?


Seems they gathered the data themselves and are releasing it into the public domain, so now it is. At least that's what I can gather from https://news.ycombinator.com/item?id=22174464


Apparently threads of length 7, so maybe Twitter threads or Reddit. It's not clear from the text if they used the evaluators of the model to generate threads.


Maybe Twitter? It's got a fairly conversational tone.


If typing random letters into Google Translate taught me anything: emails and anything stored in Google's cloud or on a device.


Perfect username for that lol.


This is more of a deepfake for text chat than any kind of useful tool.

That said, it's still pretty cool how far you can get with just trying to context match a reply based on previously recorded conversations. It's obviously not far from being able to fool a human at a quick glance, whatever the use cases are. Perhaps an updated Lenny[1]?

[1] https://www.reddit.com/r/itslenny/


Easy way to test:

Human: "take the last letter of orange and name any animal that begins with that letter"

Or:

Human: "pick any number from 1 to five, multiply it by two, and say any word with that number of letters"


This is a good test of capabilities, but not of humanity. A normal human response to such an order would be "f* off".


would it be better if it were prepended with "let's play a word game?"


Maybe, but a regular human being might not want to play your stupid word game, which isn't even fun for anyone aside from the AI researcher.


what's with the hostility? I'm just proving that even a 4 year old would be smarter than an AI agent.


Not many four year olds could answer those questions. I think the hostility is people reacting as if they were asked to jump through those verbal hoops in a conversation.


Is your test supposed to work on real humans too? :o) https://www.youtube.com/watch?v=bzDlS6JPUtE


I spoke with a blind woman this week who loves talking with her smart speaker, Alexa. "She's quite a character." I imagine that some lonely people might get comfort talking to their house plants, so a smart speaker that can actually talk back, however limited, would be a great improvement. So a chatbot can be useful even if zany at times. The big issue is how user expectations are set up. I'll put up with a digital assistant trying its best to help process my query with much more tolerance than a chatbot trying to pretend it is a human. Researchers are trying to narrow the gap between these two. But for now I will enjoy the former and be annoyed by the latter.


How is Google Duplex working out? Google is great at press releases and starting projects, not so great about finishing and supporting products.


it's working just fine, you can Google(tm) it


Chat about everything?! I cannot even do a decent search nowadays! The Google-sphere looks like a giant marketplace. If you're not buying or selling but just out for a stroll admiring the countryside, you're out of luck.


Meena: Did you say you are buying a stroller to use in the countryside? I was just looking at these: {url}(hip new stroller){AD}


Since every comment in this thread seems to be deriding or downplaying this for some reason, let me say that I am impressed by the progress, and that any progress at all is still a very important step and pushes the envelope. We're getting closer and closer, and I love what I've seen from the Transformer, and from this!


Since the bot does not have any semantic knowledge, I am sure that it does not understand what you are saying.

Yes, its answers make more sense than previous efforts, but you need real logic to assign semantics to the answer or to the question.

What does this bot prove? That you have enough resources to train on X GPUs instead of Y GPUs?


This is far less interesting than either a demo or some code we could play with! (But still very interesting)


In 2015 Google claimed to have created a chat bot that could do common-sense reasoning and tech support just by reading a sufficiently large dataset:

https://arxiv.org/pdf/1506.05869v1.pdf


Meena: have you tried turning it off and on again?


I wonder what they'd say now about Terry Winograd's work on SHRDLU and the accepted conclusion...


What was the accepted conclusion?


winograd was page's advisor, btw.


There is a company called Replika that makes a convincing chat partner, but 37% of the lines are still scripted.

In other news, ELIZA was created in the 60s.

https://en.wikipedia.org/wiki/ELIZA


Improving chatbots today is like putting makeup on a face that doesn't exist. It's all just a game to see how long the user can go before realizing that they are being duped.


What's the practical use of this model? For example, can I assign some domain knowledge or some area of specialty to the bot, so that it can be a question-answering machine? Or can I basically not control its responses?


It's amazing what transformers do to the structure of language. Attention really is key to understanding how our brains process language. I hope these models can inspire future work in the neuroscience of language.


> My favorite show is Star Trek.

What does it mean when a chatbot says that? Does it mean that it was fed a bunch of shows and this one resonated (computed)? Or is it a filler until you ask it a question it can help with?


Possibly it is the most frequently occurring theme in its training data.


So where did Google get 8.5x the conversational data of the OpenAI set? Reading text messages? Reading instant messages sent over their platform? Gmail? Google Voice? Google Plus?


> Modern conversational agents (chatbots) tend to be highly specialized

Name one chatbot?

I've never seen one other than a lame choose-your-own-adventure style path.

I feel like the emperor has no clothes.


Cool, how do I talk to it?

I see these big announcements with nothing behind them but a white paper and no way to reproduce. It's not a fact if there is no falsifiability.


To my knowledge, you can't talk to it. It looks like so far they've released sample conversations with Meena[0], and not much else. They've given a rough description of the model architecture and how they trained it, but good luck trying to replicate their 2.6B parameter model unless you can afford a lot of compute.

[0]: https://github.com/google-research/google-research/tree/mast...


The burden of proof lies with Google, not with the scientific community. I cannot reproduce a scientific model if the data used for that model is squirreled away. Google has claimed so many times that it has cracked the nut on AI, but has never allowed anyone to see real evidence. It makes me wonder if this is just a way to pump their stock.


Agree - you would think the field of software should have the biggest burden of reproducibility, given that data download, setup, and testing can largely be automated.


Let us talk to it!


In my experience, getting a bot to chat about bullshit is not a particularly useful pursuit in itself. There are far more useful ML paths to explore, like guided, interactive decisionmaking in a much more focused information field. And I think these naturally will lead to authentic conversation.

But to truly converse beyond an essentially overfit learned response domain requires real world knowledge(1) and learned relationships/heuristic simulation(2) in combination with decision making, and our nets aren't quite complex enough to get 1 and 2 represented sufficiently well (need lots more memory), while the decision making algorithm/architecture hasn't been developed yet...

But the pieces are all available for assembly and industry is getting pretty close. If hardware continues to scale at a similar pace we can probably expect true AI in some form in the next hundred or so years. It won't initially be very human, but if it is purposed (as it undoubtedly will be) to produce improved designs, by that point progress will effectively be exponential and probably impossible to predict.

All of this, I reluctantly admit, is thanks to pioneering teams at places like Google and Facebook and Microsoft, and to the open-access nature of arXiv. I think, in proportion to its emerging complexity and value, machine learning may be one of the most quickly growing fields ever to exist. Humanity is quickly approaching a new era.


You also need opinions at some point.



