This is an incorrect description of what the algorithm is doing.
Two networks--a "generator" and a "discriminator"--play a minimax game: the generator maps random vectors to images, and the discriminator tries to distinguish those images from real photographs of celebrities. When the discriminator is (close to) optimal, gradients of its predictions with respect to the generated images are passed back to the generator, so that the generator can learn the feature-space distribution that makes up human faces.
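For concreteness, here's a minimal sketch of that training step in PyTorch. The tiny architectures, image size, and hyperparameters are placeholders for the sake of a runnable example, not what this model actually uses:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real generator/discriminator (assumption: 512-dim latents,
# flattened 64x64 RGB images, just to keep the sketch small).
G = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 3 * 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 64 * 64, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):                      # real_images: (batch, 3*64*64)
    batch = real_images.size(0)
    z = torch.randn(batch, 512)                   # random latent vectors

    # Discriminator step: learn to tell real images apart from generated ones.
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: gradients flow back through D into G, nudging G(z)
    # toward images that D scores as "real".
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```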
> Images of real people have been scored along the adjustable metrics, and then as you click the +/- adjustments, the source images are very craftily "blended" to produce a finite spectrum of results.
There is no reason to believe the source images are actually embedded in the latent space.
The descriptors that you can edit in the GUI don't necessarily form an orthogonal basis in that latent space, so some of them are correlated, which is why editing one value can change others. Additionally, there is no a priori reason to believe that the manifold of "human face-like images" at 628x1024 is 512-dimensional, so there are areas of the space that still don't map well to real images. The network's ability to cover this space is limited by the number of unique training images it sees, how long it is trained, and its architecture.
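As a rough sketch of what the GUI is presumably doing (the "smile" direction and the `generator` call below are hypothetical placeholders): each +/- click just moves the latent code along a learned feature direction and re-renders, with no source photos being blended.

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal(512)            # latent code of the face currently shown

# Hypothetical learned "smile" direction in the 512-dim latent space (unit vector).
d_smile = rng.standard_normal(512)
d_smile /= np.linalg.norm(d_smile)

step = 0.5                              # one click of the "+" button
z_plus = z + step * d_smile             # move the latent code along the feature direction

# image = generator(z_plus)             # hypothetical: the trained generator renders the edited code
```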
> The descriptors that you can edit in the GUI don't necessarily form an orthogonal basis in that latent space, so some of them are correlated, which is why editing one value can change others.
I think both you and the author of the article are making the same mistake here. (Although at least you use "orthogonal" and "correlated," whereas the author calls nonorthogonal vectors "entangled" for some reason.)
If you have a nonlinear function f on a vector space, there's no reason why an orthogonal basis for that space will give a better parameterization than a nonorthogonal basis. Even if you have a linear function, there's no reason why that should make a difference.
(For example, take f(x,y) = (x-y,y). Then f(x,0)=(x,0) and f(y,y)=(0,y), so "correlated" input directions (1,0) and (1,1) are mapped to "independent" or orthogonal outputs.)
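A quick NumPy check of that arithmetic, for anyone who wants to see it run:

```python
import numpy as np

f = lambda v: np.array([v[0] - v[1], v[1]])          # f(x, y) = (x - y, y)

a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0])    # the "correlated" input directions
fa, fb = f(a), f(b)
print(fa, fb, fa @ fb)                                # [1. 0.] [0. 1.] 0.0 -> outputs are orthogonal
```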
I think it is a bit of a mystery why Gram-Schmidt orthogonalization makes a difference here. Perhaps the author should experiment more with different inner products.
>I think both you and the author of the article are making the same mistake here.
Maybe?
>If you have a nonlinear function f on a vector space, there's no reason why an orthogonal basis for that space will give a better parameterization than a nonorthogonal basis.
I don't think I made that claim. Here's all I'm saying: To whatever degree the features of interest are linearized in the latent space (and there's really no guarantee that they are), we don't have any guarantee that those linear features are orthogonal to one another, so tuning the latent representation along one feature will also impact others.
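To put a number on that: if two feature directions overlap (hypothetical unit vectors below, deliberately built to be non-orthogonal), a step along one necessarily moves the coordinate along the other by the step size times that overlap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unit feature directions with a built-in overlap.
d_smile = rng.standard_normal(512)
d_smile /= np.linalg.norm(d_smile)
noise = rng.standard_normal(512)
d_age = 0.4 * d_smile + noise / np.linalg.norm(noise)
d_age /= np.linalg.norm(d_age)

overlap = d_smile @ d_age                 # cosine similarity between the two directions
step = 1.5
delta = step * d_smile                    # tune only the "smile" feature

print("overlap between directions:", overlap)
print("induced change along d_age:", delta @ d_age)   # = step * overlap, not zero
```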
> (For example, take f(x,y) = (x-y,y). Then f(x,0)=(x,0) and f(y,y)=(0,y), so "correlated" input directions (1,0) and (1,1) are mapped to "independent" or orthogonal outputs.)
That's true, but remember that the nonlinear mapping is from our latent space (spanned by uniformly random 512-element input vectors) to pixel space. We really don't care about linear algebra in pixel space. I have zero expectation that we would preserve orthogonality from latent to pixel space.
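As a toy illustration of why I don't expect angles to survive the trip (a random one-layer map, nothing to do with the actual generator): even a simple nonlinearity sends exactly orthogonal inputs to clearly non-orthogonal outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((256, 512))        # random "layer"; a stand-in, not the real network

relu = lambda v: np.maximum(v, 0.0)
g = lambda v: relu(W @ v)                  # toy nonlinear map from "latent" to "output" space

a = rng.standard_normal(512)
b = rng.standard_normal(512)
b -= (a @ b) / (a @ a) * a                 # make b exactly orthogonal to a

cos = lambda u, v: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print("input cosine: ", cos(a, b))         # ~0 by construction
print("output cosine:", cos(g(a), g(b)))   # typically around 0.3: orthogonality is not preserved
```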
I don't think any part of the GAN objective requires that these interesting features actually be linearized in the latent space (they obviously are not in pixel space), but the approach is to use a GLM to find the latent directions that best fit the features anyway. Whether or not the directions you identify with the GLM really retain their semantic meaning throughout the latent space, they're also clearly not orthogonal, so changing the latent representation along one of them also changes the others.
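A sketch of that fitting step, with an ordinary least-squares fit standing in for whatever GLM the article actually uses, and made-up latents and attribute scores just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins: latent codes plus two attribute scores per image (e.g. "smiling", "age"),
# however those scores were actually obtained.
Z = rng.standard_normal((5_000, 512))
y = rng.standard_normal((5_000, 2))

# Fit each attribute as a linear function of the latent code;
# each column of coef is the candidate direction for one attribute.
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)       # shape (512, 2)
d0 = coef[:, 0] / np.linalg.norm(coef[:, 0])
d1 = coef[:, 1] / np.linalg.norm(coef[:, 1])

# Nothing in the fit forces these directions to be orthogonal; with real attribute data
# their cosine similarity is generally nonzero, which is the coupling described above.
print("cosine between fitted directions:", d0 @ d1)
```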
This book is a good practical introduction that walks you through the core ideas as you build some basic functionality yourself: http://neuralnetworksanddeeplearning.com/
I'm often pretty skeptical of e-books and self-publications, but the above link is pretty good (and the video series linked here references it as well). The Goodfellow book that another commenter mentioned is a high-quality survey of the field and a nice, high-level overview of different research directions in deep learning, but it isn't as pragmatic an introduction.