You may have seen my recent post about [Chatistics: a Python tool to parse your Messenger/Hangouts/WhatsApp/Telegram chat logs into DataFrames](https://news.ycombinator.com/item?id=22069699).
This notebook uses the exported chat logs to train a simple GPT/GPT2 conversational model! It uses Google Colab, a notebook platform that allows you to train complex models online for free.
The approach is super simple: it takes all your chat logs, turns them into this format:
> <speaker1> Hi
> <speaker2> Hey - how are you?
> <speaker1> Great, thanks!
> ...
...then simply trains a GPT model on this corpus. In practice, I found that the default parameters (including using GPT rather than GPT2) give the best results for this setup.
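For the curious, the conversion step is roughly this (a sketch only - the `senderName`/`text` column names and file paths are assumptions for illustration, not necessarily what the notebook uses):

```python
# Sketch of the log-to-corpus conversion. Column names ("senderName",
# "text") and the pickle path are assumptions for illustration.
import pandas as pd

def to_corpus(df: pd.DataFrame, me: str) -> str:
    """Turn a chat-log DataFrame into <speaker1>/<speaker2> lines."""
    lines = []
    for _, row in df.iterrows():
        tag = "<speaker1>" if row["senderName"] == me else "<speaker2>"
        lines.append(f"{tag} {row['text']}")
    return "\n".join(lines)

df = pd.read_pickle("chatistics_export.pkl")  # hypothetical export path
with open("corpus.txt", "w") as f:
    f.write(to_corpus(df, me="Your Name"))
```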
I got a bit tricked by the title here on HN. Maybe we can replace `talk` with `write`? I thought this was something that could learn how I speak and generate sound from that, but it seems to just be about written language, which is not nearly as interesting (for me).
Yeah, Microsoft has had NN-based speech generators that can mimic your own voice for about a year now. I thought this was going to be a competing service.
I'm disappointed that this is about typed text rather than actual talking - I had hoped that training something that talked like me might assist technology vendors in actually creating voice recognition technology that works for me.
And yes my problems with voice recognition are probably due to my Scottish accent.... ;-)
I've been playing with training different sizes[0] of GPT on my own chat data precisely for this reason.
Coincidentally, I was even planning to publish my latest post and notebook today, for training gpt2-1.5b and then chatting with oneself through the model. I left it for tomorrow though... maybe a mistake.
There is quite a lot you can do, and talking to my trained model, which responds to me as me, can be really weird at times. It's definitely the most engaged I've been with GPT while talking to myself.
Having said that, you seem to be training on very little data here. Still - cool demo.
I would be very curious to see your notebook - while this simple approach works well with GPT, we are not getting the results we'd want with a more complex question/answer model that uses GPT2. So I'd love to see your implementation details!
> Having said that you seem to train here on very little.
The datasets provided in the notebook are really meant as fallbacks for people who are not willing to use their own chat log data. When training on my own data, I have about 500k messages, which starts to be enough to get interesting results.
edit: wow, I see you're training on "14M facebook messages", that's impressive - do you actually chat that much?!
I just pushed it; the blog post (which includes the notebook) is here[0].
It's 14MB of data - I'm not sure how many messages that actually comes to, but FB Messenger has been my main platform for talking to friends for the last decade.
I run a discord which is just a collection of people from my local city, with no real fixed subject or agenda. I trained the 345M GPT-2 against the "general" channel, and then set up a discord bot such that every message has a 2% chance that the past 5 messages will be read in as context, and a couple of sentences spat out in response.
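The trigger logic is about as simple as it sounds - roughly this sketch with discord.py, where `generate()` is a hypothetical stand-in for sampling from the fine-tuned model (newer discord.py versions also need an `intents` argument):

```python
# Sketch of the bot's trigger logic using discord.py. generate() is a
# hypothetical stand-in for sampling from the fine-tuned 345M GPT-2.
import random
import discord

client = discord.Client()  # newer discord.py also requires intents=...

def generate(context: str) -> str:
    raise NotImplementedError  # sample a couple of sentences from GPT-2

@client.event
async def on_message(message):
    if message.author == client.user:
        return  # never respond to our own output
    if random.random() < 0.02:  # 2% chance per message
        # read the past 5 messages back in as context, oldest first
        history = [m async for m in message.channel.history(limit=5)]
        context = "\n".join(m.content for m in reversed(history))
        await message.channel.send(generate(context))

client.run("BOT_TOKEN")
```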
It's sometimes very lucid, sometimes insane, but most of all it's very entertaining.
As a complete aside, I also tried 'transfer learning' against a huge amount of Marxist literature and then a small amount of erotic fiction, but that experiment hasn't quite worked out. Sexy Robot Marx will have to wait.
This is cool - might be worth training a simple discriminator model to identify your utterances, and then you can use the plug-and-play language model (PPLM - https://github.com/huggingface/transformers/blob/master/exam...) to generate utterances modeling a specific speaker without special tokens. Could also take less time to fine-tune.
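For the discriminator part, even a toy bag-of-words classifier illustrates the task (a sketch only - PPLM proper trains a small head over the language model's hidden states rather than using TF-IDF features):

```python
# Toy utterance discriminator: is this line "mine" or someone else's?
# Training data here is made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Hey - how are you?", "Great, thanks!", "ok see you then", "lol no way"]
labels = [0, 1, 0, 1]  # 1 = my utterance, 0 = someone else's

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict_proba(["thanks, that sounds great"])[0, 1])  # P(mine)
```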
A computer trained to talk like me would spend a lot of time swearing and whining about how it can't take it anymore, which I admit would be pretty funny.
I've never used PyTorch before... is this running on my local machine, or is there some API in here that's also sending data to Google to train their models? Asking from a privacy point of view...
The Python notebook is hosted on Google Colab, which executes on free (for you) Google servers. If you're concerned about privacy, probably do not upload your personal chat logs. You could also download the notebook and install the dependencies on a machine you control. There also look to be alternative datasets to test with (Obama and movie dialogues).
When I was a teenager I wrote a very graphic and very disturbing work of fiction that was archived on a popular erotica text website. I have had anxiety for many years now that eventually someone will tie the authorship of that story to my identity. If people in my real life discover my fantasies from years back because of my writing signature, I do not want to guess where that will leave me. I am not looking forward to the future!
Along these lines, I worked on a team project in a university course to create an automated Q&A system using IBM Watson. We chose to focus on a Q&A system for business regulation in the state of Illinois. However, just using existing FAQs isn't sufficient. To build a corpus, we scraped several websites belonging to the state of Illinois for any information that would be relevant to businesses operating in Illinois. Then we created sample question-answer pairs, with answers taken directly from the corpus. Using both the provided QA pairs and the rest of the unlabeled corpus, Watson trained a model to answer questions it hadn't seen by providing excerpts from the corpus. By ensuring that the model provided excerpts from the corpus, we didn't have to worry that we were providing (too much) incorrect information; most of the time, the answers were relevant, too. Of course, you could create a similar system without using proprietary IBM software.
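The non-Watson version of that excerpt-retrieval idea can be surprisingly small - e.g., a TF-IDF ranker over corpus passages (a toy sketch with made-up passages; Watson's pipeline was of course far more involved):

```python
# Toy excerpt-retrieval QA: return the corpus passage most similar to
# the question. Passages here are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "New businesses must register with the Illinois Secretary of State.",
    "Certain professions require a license from the IDFPR.",
    "Sales tax registration is handled by the Department of Revenue.",
]

vectorizer = TfidfVectorizer(stop_words="english")
passage_vecs = vectorizer.fit_transform(corpus)

def answer(question: str) -> str:
    scores = cosine_similarity(vectorizer.transform([question]), passage_vecs)[0]
    return corpus[scores.argmax()]  # best-matching excerpt

print(answer("Where do I register a new business?"))
```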
Maybe we're already there. Example: I've got a friend who worked in tech support long enough that he built a soundboard of recordings of his voice asking typical tech support questions in response to user problems.
I really don't send many emails through Gmail, but when I do it is INSANELY accurate in its suggested sentence completion. Sometimes simple stuff like an address or whatever, but it can get really creepy when I'm sending something to my wife as a reference for some bill or interaction with our landlord and it knows exactly what I'm trying to say after just a word or two (sometimes, something like "Hey, I just..." and it has the rest of the sentence ready to go).
This notebook will be part of our workshop "Meet your Artificial Self" happening this Saturday at AMLD 2020 in Lausanne, Switzerland: https://appliedmldays.org/workshops/meet-your-artificial-sel...
Feedback is welcome! :D