
I wrote https://trumped.com

You ideally want five hours of clean speech (good microphone, no background noise, high sample rate). It should be spoken clearly, in a single tone or mood. My model sounds awful because the data isn't consistent, and the room tone and microphones are terrible.

If you want different prosody or moods, don't mix them in the same data set.

You can experiment right now with transfer learning from an LJSpeech-pretrained model using NVIDIA's Tacotron 2. Glow-TTS is also promising.
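For reference, the warm-start flow with the NVIDIA Tacotron 2 repo looks roughly like this (flags and the `tacotron2_statedict.pt` checkpoint name are as I recall them from that repo's README; check it before running):

```shell
# Clone the repo and grab the published LJSpeech-pretrained checkpoint,
# then warm-start training on your own filelists:
git clone https://github.com/NVIDIA/tacotron2
cd tacotron2

# --warm_start loads the pretrained weights but skips layers that depend
# on the speaker/dataset, so the model adapts to your voice data.
python train.py --output_directory=outdir --log_directory=logdir \
    -c tacotron2_statedict.pt --warm_start
```

You point it at your own audio by editing the filelists (audio path + transcript pairs) referenced in `hparams.py`.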

You'll start to get results with fifteen minutes of sample data, but for high quality you want a lot of audio.

Have your wife read a book and record it. The training chunks will be ~10 seconds apiece, so keep that in mind for how to segment the audio.
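A minimal sketch of how the ~10-second chunking works out (the `chunk_bounds` helper is hypothetical; a real pipeline would cut on silence, e.g. with pydub's `split_on_silence`, so words aren't clipped mid-chunk):

```python
def chunk_bounds(total_seconds, target=10.0):
    """Yield (start, end) second offsets for ~`target`-second chunks."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + target, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A 25-second recording becomes three training chunks.
print(chunk_bounds(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

The point is that an hour of reading yields a few hundred chunks, each of which needs a matching transcript line.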

Focus on getting lots of good-sounding data. Hours. The models will improve, but this may be your only shot at acquiring the data.

Download the LJSpeech dataset and listen to it. See how it sounds, how it's separated. That is a fantastic dataset that has yielded tremendous results, and you can use it for inspiration.
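LJSpeech pairs a `wavs/` folder with a `metadata.csv` of pipe-separated lines: file ID, raw transcription, normalized transcription. A quick sketch of parsing one line (the sample text here is illustrative, not quoted from the dataset):

```python
# One line of LJSpeech-style metadata.csv: ID | raw text | normalized text
sample = "LJ001-0001|Printing, in the only sense|Printing, in the only sense"

file_id, raw, normalized = sample.split("|")
print(file_id)  # LJ001-0001
```

If you mirror this layout for your own recordings, most open-source TTS training scripts can consume the data with little or no modification.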





