My wife will be undergoing significant oral surgery in a few weeks and there is a SMALL chance she may lose the ability to speak. I'd like to prepare, just in case, so we have technology that can reproduce her voice from keyboard or other input.
My ideal would be an open source "deepfake toolkit" that allows me to provide pre-recorded samples of her speech and then TTS in her voice. Unfortunately most articles and tools I'm finding are anti-deepfake. Any recommendations?
Fallback would be recording her speaking "phonetic pangrams" and then using her pre-recorded phonemes to recreate speech that sounds like her. I feel like the deepfake toolkit is the way to go. Appreciate any recommendations... There must be open source tools for this??
If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes.
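To make the conditioning idea concrete, here's a toy pure-Python sketch (every name here is made up for illustration, not from any real SV2TTS codebase): a speaker encoder collapses variable-length reference audio into a fixed-length embedding, and that embedding is concatenated onto every text-encoder frame so the decoder sees both *what* to say and *whose voice* to say it in. Real systems use learned networks for both steps.

```python
# Toy sketch of SV2TTS-style conditioning (hypothetical, illustration only).
# A real speaker encoder is a trained network; here it just averages
# per-frame features into a fixed-length "d-vector".

def speaker_embedding(frames, dim=4):
    """Collapse variable-length reference frames into a fixed vector."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(dim)]

def condition(text_features, embedding):
    """Concatenate the speaker embedding onto every text frame so the
    decoder is conditioned on the target speaker's identity."""
    return [frame + embedding for frame in text_features]

# Reference audio: 3 frames of 4 features each (fake numbers).
ref = [[0.1, 0.2, 0.3, 0.4],
       [0.3, 0.2, 0.1, 0.0],
       [0.2, 0.2, 0.2, 0.2]]
emb = speaker_embedding(ref)     # fixed 4-dim speaker vector
text = [[1.0, 0.0], [0.0, 1.0]]  # 2 text frames, 2 features each
cond = condition(text, emb)      # each frame is now 2 + 4 = 6 features
```

The key property is that the embedding has a fixed size regardless of how much reference audio you provide, which is why a few minutes of clean recordings can be enough to clone a voice.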
There are a few open-source implementations, but they're mostly outdated--the better ones stay private for business or privacy reasons.
There's a lot of work on non-parallel transfer learning (i.e., the subjects are saying different things), so TTS has progressed rapidly, and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi.
Edit: Important note! Utterances are usually clipped of leading/trailing silence, so take that into account when analyzing corpus lengths. The quality of each utterance matters much more than the total length--fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data.
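The silence clipping itself is simple: in practice you'd use something like `librosa.effects.trim` or a proper VAD, but the underlying idea is just an energy threshold, as in this pure-Python sketch (function name and threshold values are my own, for illustration):

```python
# Energy-threshold silence trimming (toy sketch; real pipelines use
# librosa.effects.trim or a VAD, but the principle is the same).

def trim_silence(samples, frame_len=4, threshold=0.01):
    """Drop leading/trailing frames whose mean absolute amplitude is
    below `threshold`; keep everything in between untouched."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    energy = [sum(abs(s) for s in f) / len(f) for f in frames]
    voiced = [i for i, e in enumerate(energy) if e >= threshold]
    if not voiced:
        return []  # the whole clip is silence
    start, end = voiced[0], voiced[-1] + 1
    return [s for f in frames[start:end] for s in f]

# 8 silent samples, 8 "speech" samples, 8 silent samples
audio = [0.0] * 8 + [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, -0.4] + [0.0] * 8
trimmed = trim_silence(audio)  # keeps only the middle 8 speech samples
```

If you compute corpus duration before trimming, you'll overestimate how much usable speech you actually have, which is the pitfall the note above is warning about.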