This can be used to implement seamless voice performance transfer from one speaker to another:
1. Train a WaveNet with the source speaker.
2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.
3. Record raw audio from the source speaker.
Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.
To do that, usually you convert all the numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and whatnot); there's a toy sketch of this right after the steps below.
4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options.)
5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.
Result: the raw audio will have the same overall performance and speech content as the source speaker, but rendered completely naturally in the target speaker's voice.
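Not WaveNet itself, but here's a minimal sketch of the idea behind steps 3-4: treat the trained renderer as a function of its inputs, compare its output against the recorded audio, and nudge the inputs by gradient descent, with the gradients coming from dual numbers as mentioned above. The "renderer" below is a made-up toy stand-in and all the names are invented for illustration, but the mechanics are the same.

    # Toy sketch of "running a renderer in reverse" by optimising its inputs.
    # The renderer below is a made-up stand-in, not WaveNet; gradients come from
    # forward-mode automatic differentiation with dual numbers.

    class Dual:
        """Dual number a + b*eps with eps**2 == 0; 'deriv' carries the derivative."""
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)
        __radd__ = __add__

        def __sub__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value - other.value, self.deriv - other.deriv)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.value * other.deriv + self.deriv * other.value)
        __rmul__ = __mul__

    def render(params):
        """Stand-in 'renderer': maps two input parameters to a short 'waveform'."""
        a, b = params
        return [a * t + b * t * t for t in (1.0, 2.0, 3.0, 4.0)]

    def loss(params, target):
        """Squared error between the rendered output and the observed audio."""
        out = render(params)
        return sum((o - t) * (o - t) for o, t in zip(out, target))

    # "Recorded" output that we pretend came from unknown inputs (truth: 2.0, -1.0).
    target = render([2.0, -1.0])

    # Gradient descent on the inputs; each gradient entry is one dual-number pass.
    params = [0.0, 0.0]
    for _ in range(2000):
        grads = []
        for i in range(len(params)):
            duals = [Dual(p, 1.0 if j == i else 0.0) for j, p in enumerate(params)]
            grads.append(loss(duals, target).deriv)
        params = [p - 0.002 * g for p, g in zip(params, grads)]

    print(params)  # ends up very close to the true inputs [2.0, -1.0]

In practice you'd swap the toy renderer for the real generation code (or reach for a black-box optimizer if you can't differentiate through it), which is where the computational expense in step 4 comes from.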
Another fun fact: this actually happens with (cell) phone calls.
You don't send your speech over the line; instead you send some parameters, which are then, at the receiving end, fed into a white(-ish) noise generator to recover the speech.
Edit: not by using a neural net or deep learning, of course.
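In miniature, that looks roughly like the toy LPC-style loop below: each frame of audio is reduced to a handful of filter coefficients plus a gain (the "parameters" that would go over the line) and reconstructed by pushing white noise through that filter. Real codecs (CELP and friends) are far more sophisticated; the signal, frame size, and filter order here are arbitrary stand-ins for illustration.

    # Toy "send the parameters, not the waveform" loop.
    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order=10):
        """Estimate LPC coefficients via autocorrelation + Levinson-Durbin."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc += a[j] * r[i - j]
            k = -acc / err
            new_a = a.copy()
            for j in range(1, i):
                new_a[j] = a[j] + k * a[i - j]
            new_a[i] = k
            a = new_a
            err *= (1.0 - k * k)
            if err <= 1e-12:          # numerically perfect prediction; stop early
                break
        return a, err

    def synthesize(a, gain, n):
        """Reconstruct a frame by filtering white(-ish) noise with 1/A(z)."""
        return lfilter([1.0], a, gain * np.random.randn(n))

    fs = 8000
    t = np.arange(fs) / fs
    # Stand-in "speech": two harmonics plus a little noise.
    speech = (np.sin(2 * np.pi * 150 * t)
              + 0.3 * np.sin(2 * np.pi * 450 * t)
              + 0.05 * np.random.randn(fs))

    frame_len = 160                      # 20 ms frames at 8 kHz
    decoded = []
    for start in range(0, len(speech) - frame_len + 1, frame_len):
        frame = speech[start:start + frame_len]
        a, err = lpc(frame)                          # the parameters "on the line"
        gain = np.sqrt(max(err, 1e-12) / frame_len)  # excitation level
        decoded.append(synthesize(a, gain, frame_len))
    decoded = np.concatenate(decoded)
    print(len(speech), len(decoded))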
> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"
Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).
I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse it, and inference is solved.
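The local-minima point in one dimension, with a function and a starting point contrived for the purpose: try to invert a toy f by gradient descent on the squared output error and watch it settle where the gradient vanishes even though the output is nowhere near the target.

    def f(x):
        return (x * x - 1.0) ** 2       # neither injective nor monotonic

    def df(x):
        return 4.0 * x * (x * x - 1.0)  # derivative of f

    y = f(2.0)                           # observed output; the true input was 2.0

    x = -0.5                             # unlucky initial guess
    for _ in range(500):
        grad = 2.0 * (f(x) - y) * df(x)  # gradient of (f(x) - y)**2
        x -= 0.01 * grad

    print(x, f(x))  # x ends up at ~0.0, where f(x) = 1.0, nowhere near y = 9.0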
This also makes me think of "inverse problems", in the context of mathematics, physics.
E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.
The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.
Inverse problems are typically much harder to deal with and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation, i.e. introducing strong structural assumptions about what the expected solution should look like. This can be quite reasonable from a Bayesian perspective.
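A tiny numerical illustration of that last point (the forward operator, sizes, and noise level are arbitrary choices): a Gaussian blur as the forward problem, where naive inversion amplifies the measurement noise by orders of magnitude, while Tikhonov regularisation, i.e. a structural assumption that the solution should have small norm, keeps the estimate reasonable.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    idx = np.arange(n)

    # Forward operator: a smoothing kernel, which nearly destroys fine detail.
    A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
    A /= A.sum(axis=1, keepdims=True)

    x_true = np.sin(2 * np.pi * idx / n) + (idx > n // 2)   # "initial conditions"
    b = A @ x_true + 0.01 * rng.standard_normal(n)          # noisy observation

    # Naive inversion: the tiny singular values of A blow the noise up.
    x_naive = np.linalg.solve(A, b)

    # Tikhonov / ridge: minimise ||A x - b||^2 + lam * ||x||^2 instead.
    lam = 1e-3
    x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

    print(np.linalg.norm(x_naive - x_true))  # many orders of magnitude too large
    print(np.linalg.norm(x_reg - x_true))    # modest, far closer than the naive one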
Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters, not the inputs. What they did was train the same model with multiple readers' voices, using one of the inputs to keep track of which voice the model was currently being trained on. So the model can switch between different voices, but only between those it was trained on.
"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."
These are the "inputs" I'm talking about recovering (from the link):
"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."
The raw audio from Step 3 was (in principle) generated by those inputs on a properly trained WaveNet. We need to recover those inputs so we can transfer them to the target WaveNet.
How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.
Lincoln died in 1865, but the oldest recordings are from the 1860s. The video is definitely a hoax (http://www.firstsounds.org/research/others/lincoln.php), but it's at least theoretically possible his voice could have been recorded. In fact I believe there are some even older recordings from the 1850s, but I don't think those have been successfully recovered yet.
These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.
It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer speech from Trump to Clinton, first run speech-to-text on Trump's speech and then give that text to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.
> It seems like you're using WaveNet to do speech-to-text
I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).
In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.
I see. I still think it's easier to apply DeepMind's feature transform to text rather than to try to invert a neural network. Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?
Text -> features -> TrumpWaveNet -> Trump saying your text
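Spelled out with stand-in stubs so the data flow is concrete (none of these names come from any released DeepMind code; the front end and the "WaveNet" below are placeholders that only keep the shapes plausible):

    import numpy as np

    def text_to_linguistic_features(text):
        # Stand-in for the real linguistic front end: one integer code per
        # character, where the real system emits phoneme/syllable/word features.
        return np.array([ord(c) for c in text.lower()])

    class FakeSpeakerWaveNet:
        """Placeholder for a WaveNet trained on a single speaker."""
        def generate(self, features, samples_per_feature=80):
            # A real WaveNet samples audio one step at a time, conditioned on the
            # features; this just returns noise of a plausible length.
            return np.random.randn(len(features) * samples_per_feature)

    features = text_to_linguistic_features("Any sentence you like.")
    audio = FakeSpeakerWaveNet().generate(features)  # "Trump saying your text"
    print(audio.shape)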
> Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?
Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.
Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.
Nitpicking: surjectivity doesn't say anything about whether an output has a unique preimage; you'd rather talk about non-injective functions. I agree with your point, though.
(surjective != non-injective, in the same way that non-increasing != decreasing. E.g. f(x) = x^2 on the reals is non-injective because f(2) = f(-2), whether or not you consider it surjective onto its range.)