
I'll push back on this. The quality of the read speech should be a higher concern than having parallel data. Unless OP's wife is a teacher or an actor/voice actor, the boredom of reading LibriSpeech transcripts will come out in the speech.

I think OP would ideally want the model to pick up on more natural intonation, instead of monotone dictation. Record everything from now on, as best you can with similar recording context, and hopefully that data will be enough to cover more natural nuances.


Hey, speech ML researcher here. Make sure you have recordings from a variety of contexts. fifteen.ai's best TTS voices use ~90 min of utterances, some separated by emotion. If you're having her read a text, make sure it's engaging--we do a lot of unconscious voicing when reading aloud. Tbh, if she has a non-Anglophone accent, you're going to need more data, because the training corpora are biased towards UK/US speakers.

If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes.
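The core conditioning trick is tiny; here's a toy PyTorch sketch (shapes and module names are made up just to show the idea, not anyone's actual implementation):

    import torch
    import torch.nn as nn

    # A frozen speaker encoder turns a few seconds of reference audio into a
    # fixed-size embedding; that embedding is broadcast across time and
    # concatenated onto the text encoder outputs before decoding.
    class SpeakerConditioner(nn.Module):
        def __init__(self, text_dim=512, speaker_dim=256):
            super().__init__()
            self.proj = nn.Linear(text_dim + speaker_dim, text_dim)

        def forward(self, text_encodings, speaker_embedding):
            # text_encodings: (batch, time, text_dim)
            # speaker_embedding: (batch, speaker_dim), L2-normalized
            expanded = speaker_embedding.unsqueeze(1).expand(-1, text_encodings.size(1), -1)
            return self.proj(torch.cat([text_encodings, expanded], dim=-1))

    # Dummy tensors standing in for a real encoder output and speaker embedding:
    text_enc = torch.randn(1, 120, 512)
    spk_emb = nn.functional.normalize(torch.randn(1, 256), dim=-1)
    print(SpeakerConditioner()(text_enc, spk_emb).shape)  # torch.Size([1, 120, 512])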

There are a few open-source implementations, but they're mostly outdated--the better ones are kept private for business or privacy reasons.

There's a lot of work on non-parallel transfer learning (i.e., the subjects are saying different things), so TTS has progressed rapidly and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi.
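For a feel of NeMo's TTS side, something like this runs pretrained models end to end (a sketch; the checkpoint names drift between NeMo releases, so check the current model catalog):

    import soundfile as sf
    from nemo.collections.tts.models import Tacotron2Model, HifiGanModel

    # Text -> mel spectrogram, then mel spectrogram -> waveform.
    spec_gen = Tacotron2Model.from_pretrained("tts_en_tacotron2")
    vocoder = HifiGanModel.from_pretrained("tts_hifigan")

    tokens = spec_gen.parse("This is a quick test sentence.")
    spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    sf.write("test.wav", audio.to("cpu").detach().numpy()[0], 22050)

Fine-tuning on your own data starts from checkpoints like these rather than training from scratch.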

Edit: Important note! Utterances are usually clipped of silence before/after so take that into account when analyzing corpus lengths. The quality of each utterance is much much more important than the length--fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data.
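A quick way to see how much of a corpus is actual speech once you strip leading/trailing silence (a sketch; tune top_db to your noise floor and point it at your own directory of wavs):

    import glob
    import librosa

    total, trimmed = 0.0, 0.0
    for path in glob.glob("corpus/*.wav"):
        y, sr = librosa.load(path, sr=None)
        y_t, _ = librosa.effects.trim(y, top_db=30)  # drop leading/trailing silence
        total += len(y) / sr
        trimmed += len(y_t) / sr

    print(f"raw: {total / 60:.1f} min, after trimming: {trimmed / 60:.1f} min")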


I came here to say this. My brother has a PhD in chemistry and no coding experience, and he was able to create a voice model of himself using basic NVIDIA example generators in a week. My dad lost his voice, and it would have been very nice to have a TTS that was much closer to his own. I personally think it would have been worth it to have that database.

But obviously attend to the human matters as well, e.g., spending time together.


I work in pathological speech processing/synthesis so I'm unfortunately familiar with your father's position. It really sucks that these people didn't know that archiving their voice would've been useful. I hear snippets that people manage to glean from family videos right after listening to their current voices and it makes me really sad.

On the upside, your father can choose any celebrity he wants to voice him! Tons of celeb data is publicly available (VoxCeleb 1 & 2).


Are there any simple how-tos anywhere that describe the process in as simple terms as possible, without needing to know the cool toolkits du jour?

Something like:

- Download these texts
- Record in WAV at 48 kHz or higher
- Record each line in a separate file
- Do 3 takes of each line: flat, happy, despair

Maybe even a minimal set and a full set depending on how much effort you are willing to put in.

A plain description of how to capture a raw base corpus which, within reason and current technology, could be used as a baseline for the most common toolkits.

I've looked into this myself (for fun), but I felt I needed a very good understanding of the toolkits before even starting to feed in data. And for my admittedly unimportant use case it seemed a huge investment to create a corpus I wasn't even confident would work. I ended up taking the low road and used an existing voice.


Not really; this is the only thing I know of in terms of collection: https://www.isca-speech.org/archive/Interspeech_2018/pdfs/24... Usually you base your recipe off the ones for existing datasets (TIMIT, WSJ, LibriSpeech, etc.).


For recording training audio:

https://github.com/daanzu/speech-training-recorder

The recorder works with Python 3.6.10. You'll also need to pip install webrtcvad.


Is Morgan Freeman the most used celebrity?


A ranking of used voices would be fascinating. Especially broken down by user statistics.


I'd go for Stephen Hawking, myself.

(Not using his voice synth directly, but reconstructing it using ML, because it should sound more natural that way ;-)


I recall that the "say" program on the SGI from the mid 90s was approximately Hawking's voice. Hawking also gave his speech for the White House Millennium Lecture at SGI, and while I wasn't able to attend, I found the transcript of it and fed it in there... there were some jokes that only really came through with the intonation and pacing of a voice synth -- it's the ultimate deadpan voice.

https://clinton.presidentiallibraries.us/items/show/16112 https://youtu.be/orPUQm1ZRSI

And his voice was his - even with the American accent.

https://www.news.com.au/technology/innovation/why-stephen-ha...

> “It is the best I have heard, although it gives me an accent that has been described variously as Scandinavian, American or Scottish.”

> ...

> “It has become my trademark and I wouldn’t change it for a more natural voice with a British accent.

> “I am told that children who need a computer voice want one like mine.”

Somewhere, I recall a NOVA(?) program from the mid 80s that showed him using the speech synthesizer, and the thing he said with it that still sticks in my mind is "please excuse my American accent". In later years he was given the opportunity to upgrade to a more natural-sounding voice - but that voice was his.


Near the end of his life, his original voice computer started to fall apart. He managed to get in touch with the people who wrote the software, who started a mad scramble to find source, and ultimately ended up emulating the whole setup on a Pi.

https://theweek.com/articles/769768/saving-stephen-hawkings-...


Hawking’s original voice synth was the default sound of the DECtalk hardware speech synth.

It would not surprise me if SGI's software implementation were similar to the most popular hardware synth of the 1980s.


Unfortunately, Dad passed 21 years ago. But the options now are much better. I'm just projecting my past experience onto the obvious delta.


Which generator works the best, qualitatively? I come from a vision/ML background but haven't played with speech at all, so it's completely new to me, and I'm wondering what the state of the art is.

I've been wanting to create a TTS of myself so I can take phone calls using headphones and type back what I want to say so that I don't have to yell private information out loud in public locations. Would be nice if during non-COVID times I could sit in a train seat and take phone calls completely silently.


Much of the work in speech synthesis has been about closing the gap in vocoders, which take a generated spectrogram and output a waveform. There's a clear gap between practical online implementations and computational behemoths like WaveNet. As you implied, it's hard to quantitatively judge which result is better; papers usually use listening surveys.
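If you want to hear what the non-neural baseline sounds like, round-trip any clean clip through a mel spectrogram and Griffin-Lim; the gap to WaveNet-class vocoders is immediately obvious (a sketch, the filename is a placeholder):

    import librosa
    import soundfile as sf

    y, sr = librosa.load("sample.wav", sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    # Griffin-Lim phase reconstruction happens inside mel_to_audio:
    y_inv = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
    sf.write("sample_griffinlim.wav", y_inv, sr)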

Here's a recent work that has a good comparison of some vocoders: https://wavenode-example.github.io/

Edit: WaveRNN struck a good balance for me in the past but is not shown in the link. Tons of new work coming out though!


WaveRNN (and even slimmer versions, like LPCNet) are great, and run for a tiny fraction of the compute of the original WaveNet. Pruning is also a good way to reduce model sizes.
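Pruning itself is nearly a one-liner these days; here's a toy example with PyTorch's built-in utilities (dummy layer only; in practice you'd prune the vocoder's weights and fine-tune afterwards):

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Conv1d(80, 512, kernel_size=3)
    prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero the 50% smallest weights
    prune.remove(layer, "weight")  # bake the mask into the tensor

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"weight sparsity: {sparsity:.0%}")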

I'm not sure what's up with the WaveGlow (17.1M) example in the linked wavenode comparison... The base WaveGlow sounds reasonable, though. They're also using all female voices, which strikes me as dodgy; pitch tracking for lower male voices is often harder to get right, and a bunch of comparisons that avoid the harder cases or failure modes makes it seem like they're covering something up.

(I've run into a bunch of comparisons for papers in the past where they clearly just did a bad job of implementing the prior art. There should be a special circle of hell...)


Agreed. I didn't have a better comparison at hand.

I'm looking at you, GAN papers.


This sounds pretty cool (your brother making the voice model, not your dad losing the voice)...do you have a link to this example? I would love to play with this.


I have an open source web service for rapidly recording lots of text prompts to flac: https://speech.talonvoice.com (right now the live site prompts for single words because I’m trying to build single word training data, but the prompts can be any length)

You can set it up yourself with a bit of Python knowledge from this branch: https://github.com/talonvoice/noise/tree/speech-dataset

There are keyboard shortcuts - up/down/space to move through the list and record quickly.

If you want to use it on arbitrary text prompts, you can modify this function to return each line from a text file: https://github.com/talonvoice/noise/blob/speech-dataset/serv...
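Roughly, the replacement just needs to yield one prompt per line (the function name here is illustrative, not what's actually in the repo):

    import random

    def load_prompts(path="prompts.txt"):
        with open(path, encoding="utf-8") as f:
            prompts = [line.strip() for line in f if line.strip()]
        random.shuffle(prompts)  # optional: randomize reading order
        return prompts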

If you use this, before recording too much, do some test recordings and make sure they sound ok. Web audio can be unreliable in some browsers.

The uploaded files are named after the short name, so make sure you can map each short name back to its original text prompt, e.g. with string_to_shortname().

If you aren’t easily able to do this yourself, I’d be happy to spin up an instance of it for you with text prompts of your choosing.


Somewhat OT question: after taking a quick look at this I stumbled on the eye-tracking video you made using pops to click... and I'm curious, can eye trackers not detect and report blinking?

Also, I noted the VLC demo says it doesn't use DNS! That's awesome...


I can detect blinking, yes. However, your eyelids have very small muscles that are not meant to be consciously controlled all day, and I don't recommend straining them. Your eyelids twitching from muscle strain is rather uncomfortable (from experience).

The VLC demo was using macOS speech recognition. In the beta now I'm shipping my own engine + trained models based on Facebook's wav2letter, which is going pretty well.


Also buy a (half) decent mic! They're much cheaper than you might expect.


Seconding this, it's worth a hundred or so for at least a Yeti or something, considering it's not like you'll get another chance to do this.


Also, in a similar vein as testing backups, make sure to test + listen to the recorded audio, if you can.


I have about 500 hours of high-quality, channel-isolated audio (separate from the person I was speaking to). It comes from my podcast, which I've done for many years. It's probably closer to 75-100 hours of me actually speaking, since I'm mostly the interviewer.

Is that something that would be useful to a researcher in any context? I am intrigued by the idea of having my voice preserved (you know, ego), but also am happy to donate the sound files if they would help researchers in any way for datasets.

If so: chris@theamphour.com


Do you have transcripts, even just for some of the episodes? Unsupervised learning is possible but more difficult.

In general, yes, this is probably useful data in some way for speech recognition or TTS.
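If some episodes have no transcripts, one rough way to bootstrap them is a pretrained ASR model; here's a sketch with NeMo (checkpoint name and transcribe() arguments vary between releases, and the filename is a placeholder):

    from nemo.collections.asr.models import EncDecCTCModel

    asr = EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")
    hypotheses = asr.transcribe(["episode_001_host_channel.wav"])
    print(hypotheses[0])

The output won't be perfect, but it's usually good enough to make the audio usable as weakly supervised training data.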


Make sure to get recordings of her true, honest laughter too.


Depending on your desired level of hedging...

I would say also consider recording a variety of honest utterances of all kinds, situations, and emotions. Anger outbursts, apathetic grunts, even sexual if you so desire (hence the throwaway account)... Please don't be offended by this, I'm just thinking of all scenarios for you to decide for yourself...


Came here to post this. Glad someone else thought of this first!


Thanks so much for the resources and the well thought out reply!


I wrote a simple little Python GUI app to record training audio. Given a text file containing prompts, it will choose a random selection and ordering of them, display them to be dictated by the user, and record the dictation audio and metadata to a .wav file and recorder.tsv file respectively. You can select a previous recording to play it back, delete it, and/or re-record it. It comes with a few selections of sentences designed to cover a broad diverse range of English (Arctic, TIMIT). Pretty simple and no-nonsense.

https://github.com/daanzu/speech-training-recorder

Originally intended for recording data for training speech recognition models [0], it should work just as well for recording to be used for speech synthesis.

[0] https://github.com/daanzu/kaldi-active-grammar
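If you use it for a big recording session, a quick sanity check over the output catches empty or near-silent takes before you've recorded hundreds of them (a sketch; the column index assumes the wav filename comes first in recorder.tsv, so adjust to the actual layout):

    import csv
    import wave

    with open("recorder.tsv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            wav_path = row[0]  # assumed: first column holds the wav filename
            try:
                with wave.open(wav_path) as w:
                    seconds = w.getnframes() / w.getframerate()
            except (FileNotFoundError, wave.Error):
                seconds = 0.0
            if seconds < 0.5:
                print(f"suspiciously short or missing: {wav_path} ({seconds:.2f}s)")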


Did you figure out why half of Shervin’s audio was empty? I would hesitate to recommend this if there’s still a chance half of the data isn’t usable after recording.

