This can be used to implement seamless voice performance transfer from one speaker to another:
1. Train a WaveNet with the source speaker.
2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.
3. Record raw audio from the source speaker.
Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.
To do that, usually you convert all the numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and whatnot); there's a toy sketch of this right after the steps below.
4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options.)
5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.
Result: the raw audio will have the same overall performance and speech content as the source speaker, but rendered completely naturally in the target speaker's voice.
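Not WaveNet itself, but here's a minimal sketch of the idea behind steps 3-4: treat the trained renderer as a function of its inputs, compare its output against the recorded audio, and nudge the inputs by gradient descent, with the gradients coming from dual numbers as mentioned above. The "renderer" below is a made-up toy stand-in and all the names are invented for illustration, but the mechanics are the same.

    # Toy sketch of "running a renderer in reverse" by optimising its inputs.
    # The renderer below is a made-up stand-in, not WaveNet; gradients come from
    # forward-mode automatic differentiation with dual numbers.

    class Dual:
        """Dual number a + b*eps with eps**2 == 0; 'deriv' carries the derivative."""
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)
        __radd__ = __add__

        def __sub__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value - other.value, self.deriv - other.deriv)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.value * other.deriv + self.deriv * other.value)
        __rmul__ = __mul__

    def render(params):
        """Stand-in 'renderer': maps two input parameters to a short 'waveform'."""
        a, b = params
        return [a * t + b * t * t for t in (1.0, 2.0, 3.0, 4.0)]

    def loss(params, target):
        """Squared error between the rendered output and the observed audio."""
        out = render(params)
        return sum((o - t) * (o - t) for o, t in zip(out, target))

    # "Recorded" output that we pretend came from unknown inputs (truth: 2.0, -1.0).
    target = render([2.0, -1.0])

    # Gradient descent on the inputs; each gradient entry is one dual-number pass.
    params = [0.0, 0.0]
    for _ in range(2000):
        grads = []
        for i in range(len(params)):
            duals = [Dual(p, 1.0 if j == i else 0.0) for j, p in enumerate(params)]
            grads.append(loss(duals, target).deriv)
        params = [p - 0.002 * g for p, g in zip(params, grads)]

    print(params)  # ends up very close to the true inputs [2.0, -1.0]

In practice you'd swap the toy renderer for the real generation code (or reach for a black-box optimizer if you can't differentiate through it), which is where the computational expense in step 4 comes from.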
Another fun fact: this actually happens with (cell) phone calls.
You don't send your speech over the line; instead you send some parameters, which are then, at the receiving end, fed into a white(-ish) noise generator to recover the speech.
Edit: not by using a neural net or deep learning, of course.
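In miniature, that looks roughly like the toy LPC-style loop below: each frame of audio is reduced to a handful of filter coefficients plus a gain (the "parameters" that would go over the line) and reconstructed by pushing white noise through that filter. Real codecs (CELP and friends) are far more sophisticated; the signal, frame size, and filter order here are arbitrary stand-ins for illustration.

    # Toy "send the parameters, not the waveform" loop.
    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order=10):
        """Estimate LPC coefficients via autocorrelation + Levinson-Durbin."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc += a[j] * r[i - j]
            k = -acc / err
            new_a = a.copy()
            for j in range(1, i):
                new_a[j] = a[j] + k * a[i - j]
            new_a[i] = k
            a = new_a
            err *= (1.0 - k * k)
            if err <= 1e-12:          # numerically perfect prediction; stop early
                break
        return a, err

    def synthesize(a, gain, n):
        """Reconstruct a frame by filtering white(-ish) noise with 1/A(z)."""
        return lfilter([1.0], a, gain * np.random.randn(n))

    fs = 8000
    t = np.arange(fs) / fs
    # Stand-in "speech": two harmonics plus a little noise.
    speech = (np.sin(2 * np.pi * 150 * t)
              + 0.3 * np.sin(2 * np.pi * 450 * t)
              + 0.05 * np.random.randn(fs))

    frame_len = 160                      # 20 ms frames at 8 kHz
    decoded = []
    for start in range(0, len(speech) - frame_len + 1, frame_len):
        frame = speech[start:start + frame_len]
        a, err = lpc(frame)                          # the parameters "on the line"
        gain = np.sqrt(max(err, 1e-12) / frame_len)  # excitation level
        decoded.append(synthesize(a, gain, frame_len))
    decoded = np.concatenate(decoded)
    print(len(speech), len(decoded))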
> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"
Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).
I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse it, and inference is solved.
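The local-minima point in one dimension, with a function and a starting point contrived for the purpose: try to invert a toy f by gradient descent on the squared output error and watch it settle where the gradient vanishes even though the output is nowhere near the target.

    def f(x):
        return (x * x - 1.0) ** 2       # neither injective nor monotonic

    def df(x):
        return 4.0 * x * (x * x - 1.0)  # derivative of f

    y = f(2.0)                           # observed output; the true input was 2.0

    x = -0.5                             # unlucky initial guess
    for _ in range(500):
        grad = 2.0 * (f(x) - y) * df(x)  # gradient of (f(x) - y)**2
        x -= 0.01 * grad

    print(x, f(x))  # x ends up at ~0.0, where f(x) = 1.0, nowhere near y = 9.0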
This also makes me think of "inverse problems", in the context of mathematics, physics.
E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.
The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.
Inverse problems are typically much harder to deal with and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation, i.e. introducing strong structural assumptions about what the expected solution should look like. This can be quite reasonable from a Bayesian perspective.
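A tiny numerical illustration of that last point (the forward operator, sizes, and noise level are arbitrary choices): a Gaussian blur as the forward problem, where naive inversion amplifies the measurement noise by orders of magnitude, while Tikhonov regularisation, i.e. a structural assumption that the solution should have small norm, keeps the estimate reasonable.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    idx = np.arange(n)

    # Forward operator: a smoothing kernel, which nearly destroys fine detail.
    A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
    A /= A.sum(axis=1, keepdims=True)

    x_true = np.sin(2 * np.pi * idx / n) + (idx > n // 2)   # "initial conditions"
    b = A @ x_true + 0.01 * rng.standard_normal(n)          # noisy observation

    # Naive inversion: the tiny singular values of A blow the noise up.
    x_naive = np.linalg.solve(A, b)

    # Tikhonov / ridge: minimise ||A x - b||^2 + lam * ||x||^2 instead.
    lam = 1e-3
    x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

    print(np.linalg.norm(x_naive - x_true))  # many orders of magnitude too large
    print(np.linalg.norm(x_reg - x_true))    # modest, far closer than the naive one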
Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters, not the inputs. What they did was train the same model with multiple readers' voices, using one of the inputs to keep track of which voice the model was currently being trained on. So the model can switch between different voices, but only between those it was trained on.
"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."
These are the "inputs" I'm talking about recovering (from the link):
"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."
The raw audio from Step 3 was (in principle) generated by those inputs on a properly trained WaveNet. We need to recover those inputs so we can transfer them to the target WaveNet.
How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.
Lincoln died in 1865, but the oldest recordings are from the 1860s. The video is definitely a hoax (http://www.firstsounds.org/research/others/lincoln.php), but it's at least theoretically possible his voice could have been recorded. In fact I believe there are some even older recordings from the 1850s, but I don't think those have been successfully recovered yet.
These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.
It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer speech from Trump to Clinton, first run speech-to-text on Trump's speech and then give that text to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.
> It seems like you're using WaveNet to do speech-to-text
I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).
In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.
I see. I still think it's easier to apply DeepMind's feature transform to text rather than to try to invert a neural network. Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?
Text -> features -> TrumpWaveNet -> Trump saying your text
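Spelled out with stand-in stubs so the data flow is concrete (none of these names come from any released DeepMind code; the front end and the "WaveNet" below are placeholders that only keep the shapes plausible):

    import numpy as np

    def text_to_linguistic_features(text):
        # Stand-in for the real linguistic front end: one integer code per
        # character, where the real system emits phoneme/syllable/word features.
        return np.array([ord(c) for c in text.lower()])

    class FakeSpeakerWaveNet:
        """Placeholder for a WaveNet trained on a single speaker."""
        def generate(self, features, samples_per_feature=80):
            # A real WaveNet samples audio one step at a time, conditioned on the
            # features; this just returns noise of a plausible length.
            return np.random.randn(len(features) * samples_per_feature)

    features = text_to_linguistic_features("Any sentence you like.")
    audio = FakeSpeakerWaveNet().generate(features)  # "Trump saying your text"
    print(audio.shape)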
> Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?
Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.
Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.
Nitpicking: surjectivity doesn't say anything about whether an output has a unique preimage; you'd rather talk about non-injective functions. I agree with your point, though.
(surjective != non-injective, in the same way that non-increasing != decreasing. E.g. f(x) = x^2 on the reals is non-injective because f(2) = f(-2), whether or not you consider it surjective onto its range.)