Recurrent Neural Nets for Speech Synthesis (arxiv.org)
88 points by Katydid on Jan 18, 2016 | 11 comments



Went looking for audio samples; here are some from one of the researchers:

http://www.zhizheng.org/demo/is15_mte/demo.html
http://www.zhizheng.org/demo/dnn_tts/demo.html



I thought this would be about text-to-speech applications, but this seems more like an encoder-decoder problem (make the network learn a pattern and then let it reproduce it). I'm wondering how long it will be until we see working TTS based on LSTM RNNs.


Yeah, can someone explain exactly what problem "statistical parametric speech synthesis" solves? I can't find a general overview of the problem itself.


I'm a newbie to all this, but I can imagine it could be useful for speech compression.


This paper focuses on statistical parametric speech synthesis (SPSS). SPSS is only half of the text-to-speech problem.

SPSS is the problem of going from linguistic features, phonemes, etc., to speech audio. These features are more or less golden, either derived from the audio itself or hand-labeled. So things like tonality, cadence, and emphasis on words are already encoded as features, which is why these samples sound so good.

Deriving these features from pure text is very hard, and that failing is the main reason most text-to-speech systems sound so dull and tone-deaf.
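
To make the split concrete, here's a rough sketch of the SPSS half: an acoustic model mapping per-frame linguistic features to vocoder parameters. This is Keras with made-up dimensions, not the paper's actual architecture:

    from keras.models import Sequential
    from keras.layers import LSTM, TimeDistributed, Dense

    # Hypothetical dimensions: 300 linguistic features per frame in,
    # 62 vocoder parameters (e.g. mel-cepstrum + f0 + voicing) out.
    model = Sequential()
    model.add(LSTM(256, return_sequences=True, input_shape=(None, 300)))
    model.add(LSTM(256, return_sequences=True))
    model.add(TimeDistributed(Dense(62)))
    model.compile(loss='mse', optimizer='adam')
    # model.fit(linguistic_feats, acoustic_feats, ...)

A vocoder then turns the predicted parameters back into a waveform; the hard front-end (text to linguistic features) is a separate system entirely.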

That being said, these results are seriously impressive, sounding very natural. Would love to see someone try to train an end-to-end system from pure text to speech. I think we'd see big improvements, like what Baidu has done for end-to-end speech-to-text.


The most interesting part of this paper is the RNN structure they propose, which is simpler than an LSTM.
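
For reference, here's a vanilla RNN step next to an LSTM step in plain NumPy. The unit in the paper differs; this is just to show why a simpler recurrence is appealing:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def vanilla_rnn_step(x, h, Wx, Wh, b):
        # One weight pair per step: cheap, but prone to vanishing
        # gradients on long sequences.
        return np.tanh(x @ Wx + h @ Wh + b)

    def lstm_step(x, h, c, W):
        # W packs the four gate transforms (input, forget, cell,
        # output); biases omitted. Roughly 4x the parameters and
        # compute of the vanilla step.
        i, f, g, o = np.split(np.concatenate([x, h]) @ W, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c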


Slightly unrelated question: has there been any work on hardware acceleration of such networks? How amenable are modern machine learning algorithms to hardware acceleration?


GPUs are already pretty well optimized for the sorts of operations an RNN needs.
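
To a first approximation, an unrolled RNN is just a loop of dense matrix multiplies (GEMMs), which is exactly what GPU BLAS libraries are tuned for. A NumPy sketch of the shapes involved (any of the GPU frameworks runs the same loop through cuBLAS):

    import numpy as np

    batch, steps, d_in, d_h = 32, 100, 300, 256
    X = np.random.randn(steps, batch, d_in)
    Wx = np.random.randn(d_in, d_h)
    Wh = np.random.randn(d_h, d_h)
    h = np.zeros((batch, d_h))
    for t in range(steps):
        # Two GEMMs per timestep; batching many sequences
        # together keeps the GPU saturated.
        h = np.tanh(X[t] @ Wx + h @ Wh)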

There have also been a few efforts to make actual silicon neurons, plus the whole neuromorphic movement, but the results generally fell short of expectations: slow, and difficult to interface with.


I've seen some work that attempts to recreate "spiking" neural networks (i.e. neurons that fire when their inputs pass a threshold), intended to mimic the biochemistry of real neurons.

That work seems to spin its contribution as reducing the power required to evaluate the network, though. If I recall correctly, the accuracy of those models on everyday tasks is typically much lower than that of conventional ANNs, and they're a pain to train. So, still not very common.
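
The threshold idea in a nutshell, as a leaky integrate-and-fire neuron (constants made up for illustration):

    import numpy as np

    def lif(currents, dt=1.0, tau=20.0, v_rest=0.0,
            v_thresh=1.0, v_reset=0.0):
        # Membrane potential integrates input current, leaks back
        # toward rest, and emits a spike (then resets) when it
        # crosses the threshold.
        v, spikes = v_rest, []
        for i in currents:
            v += dt * (i - (v - v_rest)) / tau
            if v >= v_thresh:
                spikes.append(True)
                v = v_reset
            else:
                spikes.append(False)
        return spikes

    print(sum(lif(np.full(200, 1.5))))  # spike count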


That is exactly what I made circa 2008. I used the Izhikevich model for spiking (sketch at the end of this comment). It was certainly faster on the GPU (~2000x), but, yeah, getting the network to converge on anything was terrible. Debugging it was fun/awful, though:

1:"Hey, do you see the first squiggle with the two fuzzes after."

2:"Next to Beaker's eyebrows?"

The low-power work seems to have been aiming to be a rough first-stage filter rather than a full system. Still fun to use.
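
For the curious, the Izhikevich model is only a few lines. Here's a sketch using the regular-spiking constants from the 2003 paper, not my old GPU code:

    def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0,
                   dt=0.25, steps=4000):
        # v' = 0.04*v^2 + 5*v + 140 - u + I
        # u' = a*(b*v - u)
        # spike: if v >= 30 mV, then v <- c and u <- u + d
        v, u, spikes = c, b * c, []
        for _ in range(steps):
            v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
            u += dt * a * (b * v - u)
            if v >= 30.0:
                spikes.append(True)
                v, u = c, u + d
            else:
                spikes.append(False)
        return spikes

    print(sum(izhikevich(I=10.0)))  # steady regular spiking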



