You may have seen my recent post about [Chatistics: a Python tool to parse your Messenger/Hangouts/WhatsApp/Telegram chat logs into DataFrames](https://news.ycombinator.com/item?id=22069699).
This notebook uses the exported chat logs to train a simple GPT/GPT2 conversational model! It uses Google Colab, a notebook platform that allows you to train complex models online for free.
The approach is super simple: it takes all your chat logs, turns them into this format:
> <speaker1> Hi
> <speaker2> Hey - how are you?
> <speaker1> Great, thanks!
> ...
...then simply trains a GPT model on this corpus. In practice, I found that the default parameters (including using GPT rather than GPT2) give the best results for this setup.
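For the curious, the conversion step is roughly this (a sketch only - the `senderName`/`text` column names and file paths are assumptions for illustration, not necessarily what the notebook uses):

```python
# Sketch of the log-to-corpus conversion. Column names ("senderName",
# "text") and the pickle path are assumptions for illustration.
import pandas as pd

def to_corpus(df: pd.DataFrame, me: str) -> str:
    """Turn a chat-log DataFrame into <speaker1>/<speaker2> lines."""
    lines = []
    for _, row in df.iterrows():
        tag = "<speaker1>" if row["senderName"] == me else "<speaker2>"
        lines.append(f"{tag} {row['text']}")
    return "\n".join(lines)

df = pd.read_pickle("chatistics_export.pkl")  # hypothetical export path
with open("corpus.txt", "w") as f:
    f.write(to_corpus(df, me="Your Name"))
```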
I got a bit tricked by the title here on HN. Maybe we can replace `talk` with `write`? I thought this was something that could learn how I speak and generate sound from that, but it seems to just be about written language, which is not nearly as interesting (for me).
Yeah, Microsoft has had NN-based speech generators that can mimic your own voice for about a year now. I thought this was going to be a competing service.
I'm disappointed that this is about typed text rather than actual talking - I had hoped that training something that talked like me might assist technology vendors in actually creating voice recognition technology that works for me.
And yes my problems with voice recognition are probably due to my Scottish accent.... ;-)
I've been playing with training different sizes[0] of GPT on my own chat data precisely for this reason.
Coincidentally, I was even planning to publish my latest post and notebook today, for training gpt2-1.5b and then chatting with oneself through the model. I left it for tomorrow though... maybe a mistake.
There is quite a lot you can do, and talking to my trained model, which responds to me as me, can be really weird at times. It's definitely the most engaged I've been with GPT while talking to myself.
Having said that, you seem to be training on very little data here. Still - cool demo.
I would be very curious to see your notebook - while this simple approach works well with GPT, we are not getting the results we'd want with a more complex question/answer model that uses GPT2. So I'd love to see your implementation details!
> Having said that you seem to train here on very little.
The datasets provided in the notebook are really meant as fallbacks for people who are not willing to use their own chat log data. When training on my own data, I have about 500k messages, which starts to be enough to get interesting results.
edit: wow, I see you're training on "14M facebook messages", that's impressive - do you actually chat that much?!
I just pushed it; the blog post (which includes the notebook) is here[0].
It's 14MB of data - I'm not sure how many messages that actually comes to, but FB Messenger has been my main platform for talking to friends for the last decade.
I run a discord which is just a collection of people from my local city, with no real fixed subject or agenda. I trained the 345M GPT-2 against the "general" channel, and then set up a discord bot such that every message has a 2% chance that the past 5 messages will be read in as context, and a couple of sentences spat out in response.
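The trigger logic is about as simple as it sounds - roughly this sketch with discord.py, where `generate()` is a hypothetical stand-in for sampling from the fine-tuned model (newer discord.py versions also need an `intents` argument):

```python
# Sketch of the bot's trigger logic using discord.py. generate() is a
# hypothetical stand-in for sampling from the fine-tuned 345M GPT-2.
import random
import discord

client = discord.Client()  # newer discord.py also requires intents=...

def generate(context: str) -> str:
    raise NotImplementedError  # sample a couple of sentences from GPT-2

@client.event
async def on_message(message):
    if message.author == client.user:
        return  # never respond to our own output
    if random.random() < 0.02:  # 2% chance per message
        # read the past 5 messages back in as context, oldest first
        history = [m async for m in message.channel.history(limit=5)]
        context = "\n".join(m.content for m in reversed(history))
        await message.channel.send(generate(context))

client.run("BOT_TOKEN")
```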
It's sometimes very lucid, sometimes insane, but most of all it's very entertaining.
As a complete aside, I also tried 'transfer learning' against a huge amount of Marxist literature and then a small amount of erotic fiction, but that experiment hasn't quite worked out. Sexy Robot Marx will have to wait.
This is cool - might be worth training a simple discriminator model to identify your utterances, and then you can use the plug-and-play language model (PPLM - https://github.com/huggingface/transformers/blob/master/exam...) to generate utterances modeling a specific speaker without special tokens. Could also take less time to fine-tune.
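For the discriminator part, even a toy bag-of-words classifier illustrates the task (a sketch only - PPLM proper trains a small head over the language model's hidden states rather than using TF-IDF features):

```python
# Toy utterance discriminator: is this line "mine" or someone else's?
# Training data here is made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Hey - how are you?", "Great, thanks!", "ok see you then", "lol no way"]
labels = [0, 1, 0, 1]  # 1 = my utterance, 0 = someone else's

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict_proba(["thanks, that sounds great"])[0, 1])  # P(mine)
```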
A computer trained to talk like me would spend a lot of time swearing and whining about how it can't take it anymore, which I admit would be pretty funny.
I've never used PyTorch before... is this running on my local machine, or is there some API in here that's also sending data to Google to train their models? Asking from a privacy point of view...
The Python notebook is hosted on Google Colab, which executes on free (for you) Google servers. If you're concerned about privacy, probably do not upload your personal chat logs. You could also download the notebook and install the dependencies on a machine you control. There also look to be alternative datasets to test with (Obama and movie dialogues).
When I was a teenager I wrote a very graphic and very disturbing work of fiction that was archived on a popular erotica text website. I have had anxiety for many years now that eventually someone will tie the authorship of that story to my identity. If people in my real life discover my fantasies from years back because of my writing signature, I do not want to guess where that will leave me. I am not looking forward to the future!
Along these lines, I worked on a team project in a university course to create an automated Q&A system using IBM Watson. We chose to focus on a Q&A system for business regulation in the state of Illinois. However, just using existing FAQs isn't sufficient. To build a corpus, we scraped several websites belonging to the state of Illinois for any information that would be relevant to businesses operating in Illinois. Then we created sample question-answer pairs, with answers taken directly from the corpus. Using both the provided QA pairs and the rest of the unlabeled corpus, Watson trained a model to answer questions it hadn't seen by providing excerpts from the corpus. By ensuring that the model provided excerpts from the corpus, we didn't have to worry that we were providing (too much) incorrect information; most of the time, the answers were relevant, too. Of course, you could create a similar system without using proprietary IBM software.
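The non-Watson version of that excerpt-retrieval idea can be surprisingly small - e.g., a TF-IDF ranker over corpus passages (a toy sketch with made-up passages; Watson's pipeline was of course far more involved):

```python
# Toy excerpt-retrieval QA: return the corpus passage most similar to
# the question. Passages here are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "New businesses must register with the Illinois Secretary of State.",
    "Certain professions require a license from the IDFPR.",
    "Sales tax registration is handled by the Department of Revenue.",
]

vectorizer = TfidfVectorizer(stop_words="english")
passage_vecs = vectorizer.fit_transform(corpus)

def answer(question: str) -> str:
    scores = cosine_similarity(vectorizer.transform([question]), passage_vecs)[0]
    return corpus[scores.argmax()]  # best-matching excerpt

print(answer("Where do I register a new business?"))
```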
Maybe we're already there. Example: I've got a friend who worked in tech support long enough that he built a soundboard of recordings of his voice asking typical tech support questions in response to user problems.
I really don't send many emails through Gmail, but when I do it is INSANELY accurate in its suggested sentence completion. Sometimes simple stuff like an address or whatever, but it can get really creepy when I'm sending something to my wife as a reference for some bill or interaction with our landlord and it knows exactly what I'm trying to say after just a word or two (sometimes, something like "Hey, I just..." and it has the rest of the sentence ready to go).
This notebook will be part of our workshop "Meet your Artificial Self" happening this Saturday at AMLD 2020 in Lausanne, Switzerland: https://appliedmldays.org/workshops/meet-your-artificial-sel...
Feedback is welcome! :D