It's unfortunate that there aren't any high-quality projects for the other way around: text to speech.
By high quality I mean something as good as Tacotron 2 [1]. However, I have no illusions of Google ever releasing code or trained models.
Someone will likely re-implement it from the paper, but it's unlikely to be anywhere near as good as those samples, due to worse tuning and the lack of a good training set [2].
This effect can be seen if you compare Baidu's samples [3] for Deep Voice 3 with those from an implementation by a third party [4].
[1]: https://google.github.io/tacotron/publications/tacotron2/ind...
[2]: Common Voice, for example, will not work well, since you want lots of data from a single speaker for good results. See, for example, how samples get progressively worse as the number of speakers in the training data increases [3].
[3]: http://research.baidu.com/deep-voice-3-2000-speaker-neural-t...
[4]: https://r9y9.github.io/deepvoice3_pytorch/
Tacotron 2 was trained on 24 hours of single-speaker transcribed audio, which is comparable to the freely available LJ Speech Dataset. We know that it's feasible to train using unaligned transcribed speech, which broadens the opportunities to reuse an existing corpus. (Aside: does anyone know the legal status of training a DL model on copyrighted content?) Tacotron 2 was trained on a 32-GPU cluster, which is large but not absurdly so; we are seeing drastic performance increases in low-precision compute, which should hopefully start to trickle down as Volta reaches the market.
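For a rough sense of what that 24 hours looks like in practice, here's a minimal Python sketch that tallies the total duration of an LJ Speech-style corpus. It assumes the standard LJ Speech layout (a pipe-delimited metadata.csv of "id|transcript|normalized transcript" rows plus a wavs/ directory of 16-bit PCM mono WAV files); the "LJSpeech-1.1" path is just a placeholder.

    # Rough sketch: total hours of audio in an LJ Speech-style corpus,
    # for comparison against the ~24 h Tacotron 2 was reportedly trained on.
    # Assumes metadata.csv rows of the form "id|transcript|normalized transcript"
    # and wavs/<id>.wav files (16-bit PCM mono, as in LJ Speech).
    import wave
    from pathlib import Path

    def corpus_hours(corpus_dir):
        corpus = Path(corpus_dir)
        total_seconds = 0.0
        with open(corpus / "metadata.csv", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                utt_id = line.split("|", 1)[0]  # first field is the clip id
                wav_path = corpus / "wavs" / (utt_id + ".wav")
                with wave.open(str(wav_path), "rb") as wav:
                    total_seconds += wav.getnframes() / wav.getframerate()
        return total_seconds / 3600.0

    print("%.1f hours of audio" % corpus_hours("LJSpeech-1.1"))  # placeholder path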
Expertise is a bigger challenge at this stage, although that is gradually changing. A huge number of developers are taking a serious interest in deep learning, so hopefully we'll start to see more active contributors to DL FOSS projects. The teams at DeepMind and Baidu Research are clearly highly skilled but relatively small, which suggests that their efforts could be replicated by a small but determined team of FOSS developers.
It's not just about the amount of data, though. The speech and audio quality of Google's TTS data is likely better than that of the LJ Speech dataset (disclaimer: I've only listened to samples from the latter, which have some audible reverberation). Ideally, you'd use a professionally trained speaker and record them in a (semi-)anechoic chamber.
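Short of proper listening tests, a crude automated check can at least flag obviously problematic recordings. The sketch below is my own rough heuristic, not anything from the papers: it frames a 16-bit PCM mono WAV, compares the RMS energy of the loudest and quietest frames as a noise-floor proxy, and reports the fraction of clipped samples. The filename is a placeholder.

    # Crude recording-quality heuristic: the spread between loud (speech) and
    # quiet (room noise) frames gives a rough SNR proxy, and the fraction of
    # full-scale samples flags clipping. Assumes 16-bit PCM mono WAV input;
    # it won't detect reverberation, and it's no substitute for listening.
    import wave
    import numpy as np

    def quality_report(path, frame_ms=50):
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
        x = pcm.astype(np.float64) / 32768.0
        frame_len = int(rate * frame_ms / 1000)
        n_frames = len(x) // frame_len
        frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
        rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
        speech = np.percentile(rms, 95)  # loudest frames ~ speech
        floor = np.percentile(rms, 5)    # quietest frames ~ background noise
        return {
            "snr_proxy_db": 20.0 * np.log10(speech / floor),
            "clipped_fraction": float(np.mean(np.abs(x) >= 32767.0 / 32768.0)),
        }

    print(quality_report("LJ001-0001.wav"))  # placeholder filename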