A TensorFlow Implementation of DeepMind's WaveNet Paper (github.com/ibab)
185 points by ot on Sept 14, 2016 | 30 comments



Do you know how long it took to train using that dataset and with what hardware configuration?


Doing a forward pass for every sample sounds like it would be prohibitive for real-time applications.


It absolutely is. DeepMind reported that generating 1 second of audio takes about 90 minutes.


Assuming it's computation bound, that's a factor of 5400 (~13 doublings in CPU power required to get to real time, assuming no algorithmic improvements).
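The back-of-the-envelope arithmetic behind that estimate, taking the reported 90 minutes per second of audio at face value:

```python
import math

# Reported cost: ~90 minutes of compute per 1 second of generated audio.
slowdown = 90 * 60  # 5400x slower than real time

# Number of hardware-speed doublings needed to reach real time,
# assuming no algorithmic improvements.
doublings = math.log2(slowdown)
print(round(doublings, 1))  # ~12.4, i.e. about 13 doublings
```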


If I'm not mistaken, the current limitation is that a dependent sequence of audio must be produced sequentially, but independent sentences could perhaps be run simultaneously on copies of the net, assuming no memory limitations. I wonder if it's already possible to create an audiobook, for instance, in reasonable time.


Do they mention it was CPU trained? I assumed GPU. If it was CPU trained, I wonder what the operations keeping it off the GPU were?


Google has special neural net ASICs now.


Google has never stated that they use those to train models, as far as I know. They seem to be used primarily to save energy when deploying trained models at scale.


There's no reason they couldn't use them to train, as long as they can account for the lower-precision operations. I think it would be much cheaper to train on them, at that scale anyway.


Afaik the Google TPU does inference only, at 8 bits. I don't think it's possible to train a neural network at 8-bit precision at this point in time. FP16 works for training, though, and is twice as fast as FP32 on certain Nvidia chips.


Backpropagation can work at any precision, as long as you use stochastic rounding (so that the rounding errors are not correlated). Without stochastic rounding, even 16 bits will have rounding-error bias.

http://arxiv.org/abs/1412.7024
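A minimal sketch of the idea (hypothetical, not taken from the linked paper): stochastic rounding rounds up with probability equal to the fractional part, so the rounding error is zero in expectation rather than systematically biased.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each value up or down at random, with probability equal
    to its fractional part, so E[stochastic_round(x)] == x."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
# Deterministic round-to-nearest would give 0.0 everywhere (biased);
# stochastic rounding averages back out to ~0.3.
print(stochastic_round(x, rng).mean())
```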


OK. I was going by this - https://petewarden.com/2016/05/03/how-to-quantize-neural-net...

I haven't seen 8-bit training implemented in any (public) frameworks yet - that's not to say it's not possible. If it works, then that's great, especially for specialised hardware.


That doesn't imply they can run WaveNet yet - for inference, this net is close to worst-case serial. Their TPU ASIC is almost certainly highly parallel, like a GPU; it actually has to be that way for energy efficiency (which is its claimed benefit).

WaveNet actually looks like it could have been designed to run on CPUs in production, at least after some further optimization. Sampling is very slow right now because it requires an enormous number of tiny, dependent TF ops, and thus kernel launches whose overhead dwarfs the tiny amounts of work they do. A custom implementation could probably avoid that by evaluating all the layers sequentially in local cache on a fast CPU.

Or they just designed it without much concern for production plausibility yet.
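The serial bottleneck described above comes from autoregressive sampling: each output sample depends on all the samples before it, so generation is a strict loop that runs one full forward pass per sample. A hypothetical sketch, with `predict_next` standing in for the network:

```python
import numpy as np

def generate(predict_next, seed, n_samples):
    """Autoregressive sampling: each step consumes the samples
    generated so far, so the loop cannot be parallelized."""
    audio = list(seed)
    for _ in range(n_samples):
        probs = predict_next(np.array(audio))  # one full forward pass per sample
        audio.append(int(np.argmax(probs)))
    return audio

def toy_net(history):
    """Toy stand-in for WaveNet: a 256-way categorical distribution
    over quantized audio levels (here, deterministically 'previous + 1')."""
    probs = np.zeros(256)
    probs[(history[-1] + 1) % 256] = 1.0
    return probs

print(generate(toy_net, [0], 5))  # [0, 1, 2, 3, 4, 5]
```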


I'm not sure how this algorithm is serial. The neural net layers still involve huge convolutions that can all be done in parallel.
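The distinction is training versus sampling: in a dilated causal convolution, every output depends only on already-known inputs, so during training all time steps can be computed at once; only sampling is serial. A hypothetical NumPy sketch with kernel size 2:

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution (kernel size 2): every output
    position depends only on past inputs and can be computed
    independently, so training parallelizes across the sequence."""
    padded = np.concatenate([np.zeros(dilation), x])  # causal zero-padding
    return w[0] * padded[:-dilation] + w[1] * padded[dilation:]

x = np.arange(8, dtype=float)
y = dilated_causal_conv(x, np.array([1.0, 1.0]), dilation=2)
print(y)  # each y[t] = x[t-2] + x[t], with zero padding before t=2
```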


Building an ASIC for it would be another option to speed things up on the computation side.


Was that in the paper? I was looking for a source for it last night but couldn't find one.


Why would an honest researcher mention the downsides of their work in a paper? No, it was on Twitter: https://www.reddit.com/r/MachineLearning/comments/51sr9t/dee...


https://news.ycombinator.com/item?id=12463263

Looks like the source deleted their tweet.


Can we just use 90 cores?


Unfortunately no, see Amdahl's Law.

https://en.wikipedia.org/wiki/Amdahl%27s_law
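Amdahl's Law caps the speedup by the serial fraction of the work, no matter how many cores you add. A quick illustration (the 90% parallel fraction here is a hypothetical figure, not measured):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: overall speedup = 1 / (serial + parallel/n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even if 90% of the work parallelized perfectly,
# 90 cores would get nowhere near a 90x speedup:
print(round(amdahl_speedup(0.90, 90), 1))  # ~9.1x
```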


Even if we did, strong scaling is unlikely to be perfect.


We're still a couple of papers away from getting the computation down to a reasonable amount. Or eventually Moore's Law will take care of it. It might also have applications that aren't real-time. I'm writing a video game in my spare time, and I was wondering how I would do the foley. If I could feed in some sounds and synthesize a library of sound effects that all sound different enough that it won't be repetitive to hear the same exact footstep sound for the entire game, then I'd consider that a win. So, you know, this is cutting-edge research we're talking about.


I had plans to do the same, I'm glad somebody beat me to it.


That is an admirable mindset - I can't help but be a bit frustrated when someone independently implements my ideas first. On the one hand, it validates the idea. On the other, building it would have been fun, but now that someone else has done it, doing it again would be akin to reinventing the wheel, and it's more productive to turn your attention to something new (unless you think you can execute the project much better).


In this case, it's DeepMind's idea anyway :)

I get more disappointed when the opposite happens. I think something like, "Yeah, I'm totally going to add support in torch for noisy activation functions like in this paper!" (https://arxiv.org/pdf/1603.00391.pdf). Then I procrastinate and put it off. Then I think, "No matter, someone else has surely done it by now". Then they haven't.


This project (and others in different frameworks) is far from complete. The prize for reproducing WaveNet in open source remains unclaimed.


Here's a Theano implementation: https://github.com/huyouare/WaveNet-Theano


I wonder whether accents could be layered on top of the trained data?


Probably. You can attach 'speaker' as a piece of metadata to the samples (this is what's meant by 'conditioning on') and teach the net to speak like different people. If you have a diverse sample of speakers and add 'accent' as another variable, it might well learn to disentangle individual speakers from their accents, and then you could control the generated accent by changing the metadata.
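A rough sketch of what that metadata could look like (hypothetical; the paper's actual global conditioning feeds learned embeddings into every layer, but one-hot codes show the idea):

```python
import numpy as np

# Hypothetical vocabularies for global conditioning.
SPEAKERS = ["alice", "bob"]
ACCENTS = ["us", "uk", "scottish"]

def conditioning_vector(speaker, accent):
    """Concatenate one-hot codes for speaker and accent. The net sees
    this alongside the audio, so generation can be steered by changing
    either field independently of the other."""
    s = np.eye(len(SPEAKERS))[SPEAKERS.index(speaker)]
    a = np.eye(len(ACCENTS))[ACCENTS.index(accent)]
    return np.concatenate([s, a])

# Same speaker, different accent: only the accent half changes.
print(conditioning_vector("alice", "us"))        # [1. 0. 1. 0. 0.]
print(conditioning_vector("alice", "scottish"))  # [1. 0. 0. 0. 1.]
```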


I'm really interested in building an RNN and training it on something like movies. I'd love to then take it, feed in a song, and have it translate the song into a composite music video. I'm also interested in the legal ramifications of doing such a thing...

Does anyone know of prior art?



