> The model also struggles to make a recognisable reconstruction when the scene is very low contrast, especially with faces.
It could be getting this wrong if his error function treats the given image pixels as linear data, when they are actually in the decidedly non-linear sRGB colorspace. That would make it badly underestimate any error in a dark image.
A quick check of the PIL docs turns up no mention of gamma compensation, so it was probably overlooked. People usually do overlook it.
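For concreteness, here is a rough sketch (mine, not code from the project) of what gamma compensation before the error calculation could look like, assuming the frames arrive as sRGB values scaled to [0, 1]:

```python
# Sketch only: convert sRGB to linear light before measuring reconstruction error.
# The piecewise formula is the standard IEC 61966-2-1 sRGB decoding curve.
import numpy as np

def srgb_to_linear(srgb):
    srgb = np.asarray(srgb, dtype=np.float64)
    return np.where(srgb <= 0.04045,
                    srgb / 12.92,
                    ((srgb + 0.055) / 1.055) ** 2.4)

def linear_mse(original_srgb, reconstructed_srgb):
    # Mean squared error computed in linear light rather than on raw sRGB values.
    diff = srgb_to_linear(original_srgb) - srgb_to_linear(reconstructed_srgb)
    return float(np.mean(diff ** 2))
```

(Whether you actually want the error measured in linear light is debatable, as the reply below points out.)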
Actually, it would be even worse if it were run on linear pixels (values proportional to light intensity). The sRGB space is designed to approximate human perception, which, if anything, is an advantage for this application.
The autoencoder converts an image to a reduced code and then back to (an approximation of) the original image. The idea is similar to lossy compression, but it is geared specifically to the dataset it was trained on.
According to the defaults in the code, it uses float32 arrays of the following sizes:
image: 144 x 256 x 3 = 110,592
code: 200
Note that the sequence of codes that the movie is converted to could possibly be further compressed.
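For anyone who wants to see those numbers in context, here is a minimal sketch (not the author's code, which I haven't read; it just mirrors the sizes quoted above) of an autoencoder squeezing a 144x256x3 float32 frame through a 200-float bottleneck:

```python
# Minimal sketch of an autoencoder with the dimensions quoted above.
# Dense layers keep the example short; the real model presumably uses conv layers.
import torch
import torch.nn as nn

IMG_DIM = 144 * 256 * 3   # 110,592 values per frame
CODE_DIM = 200            # bottleneck / "code" size

class FrameAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(IMG_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, CODE_DIM),
        )
        self.decoder = nn.Sequential(
            nn.Linear(CODE_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, IMG_DIM), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)            # frame -> 200 numbers
        return self.decoder(code), code   # 200 numbers -> reconstructed frame

model = FrameAutoencoder()
frame = torch.rand(1, IMG_DIM)                 # one flattened frame in [0, 1]
recon, code = model(frame)
loss = nn.functional.mse_loss(recon, frame)    # reconstruction error to minimise
```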
Point taken! I have edited the article now; there has obviously been some confusion, and it was an oversight on my part not to have explained that properly.
Correct me if I'm wrong (I haven't looked into this closely), but one glaring problem is that all the results are from the training set. So it's not surprising you get something movie-ish by running the network over a movie it was trained on; the network has already seen what the output of the movie should look like.
Seems like his goal is regenerating the training set (with lossy recall) rather than predicting never-before-seen frames. When I hear old songs I haven't listened to in years, I can still remember the next line just based on the previous line. That's a lot of data being encoded somehow in my brain.
He only trained it on Blade Runner, I think. (This is doable because a single movie has a lot of frames.) So all of the other movies should be out of sample.
I'm a little confused by the article: it appears to me that the input to the neural net is a series of frames, and the output is a series of frames? So it works as a filter? Or is the input key-frames, and so the net extrapolates intermediary frames from keyframes?
[ed: does indeed appear from the github page, that the input is a series of png frames, and the output is the same number of png frames, filtered through the neural net. No compression, but rather a filter operation?]
What I found most interesting was that A Scanner Darkly, which is rotoscoped, looked live action in several of the coherent frames that had been filtered through his Blade Runner-trained network.
Correct me if I am wrong, but it is not so much "reconstruction" as "compression". (Or maybe I've got it wrong, or the description is simply unclear about what is being reconstructed from what.)
If it is the compression case, I am curious about the size of the compressed movie.
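A back-of-the-envelope estimate, assuming roughly two hours of film at 24 fps (neither number is from the article) and storing only the 200-float code per frame:

```python
# Rough upper bound on code-only storage; ignores any further compression of the codes.
FPS = 24
SECONDS = 2 * 60 * 60          # assume ~2 hours of film
CODE_FLOATS = 200
BYTES_PER_FLOAT = 4            # float32

frames = FPS * SECONDS                                   # ~172,800 frames
total_bytes = frames * CODE_FLOATS * BYTES_PER_FLOAT
print(f"{total_bytes / 2**20:.0f} MiB")                  # roughly 132 MiB
```

So on the order of a hundred megabytes before compressing the code sequence itself, and that ignores the size of the trained network, which you would need in order to decode anything.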
I'm a developer who knows nothing about AI but is fascinated by the recent painting/music/"dreaming" applications of it.
What would be some good resources for 1. getting the bare minimum knowledge required to use existing libraries like TensorFlow, and 2. going a bit further and gaining at least a basic understanding of how the most popular ML/AI algorithms work?
What is the difference between using a neural network to do this and using a filter that obtains the same or similar effect by distorting the frames of the input randomly?
I guess I feel like there's no practical result here. It's only interesting from an aesthetic point of view.
After reading the article, I'm still not sure what the purpose of the training is. If they're trying to reconstruct a film from stills, it seems like a failure, since they wind up with all sorts of swirly stuff rather than, say, the original film.
If they're trying to create interesting swirly stuff, where do they intend to go after that?
I mean, sure, it's aesthetically interesting, though not on the level of weirdness of the Deep Dream modifications.
I am not an expert, but: this approach creates powerful embeddings for images. It can convert an image into embedding space and, vice versa, generate images back from embeddings. It is built to function like a perception and imagination module. The embeddings are much lower dimensional and the latent variables are disentangled: there is a component for "has glasses", for example, which you can flip to get the same image plus glasses. This would obviously be very useful for building all sorts of classifiers, image generators and agents (because agents need to compute reward over the state and action space, and disentangled representations of the state space are good for that).
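A toy illustration of that "flip a latent component" idea. Everything here is hypothetical: the encode/decode functions, the attribute index and the step size are stand-ins, since the article doesn't expose any of this:

```python
# Toy sketch: edit one (hypothetically disentangled) latent dimension and decode.
import numpy as np

def add_glasses(image, encode, decode, glasses_dim, amount=3.0):
    z = np.array(encode(image), dtype=float)   # image -> low-dimensional embedding
    z[glasses_dim] += amount                   # push along the "has glasses" axis
    return decode(z)                           # embedding -> edited image
```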
Um, are you sure we are there yet? To me it seems that the only atoms it has learned are the long-lasting static scenes, rather than eyes or mouths. Maybe it is just a matter of scoring and can be improved, but still...
It's a cool thing this guy did. It would be interesting to see how small the files generated in this process are. Just low-pass filtering the video, as somebody else suggested, would probably achieve a similarly lossy image.
I guess what I take away from this is that maybe the way we store info in our brains looks kind of like this? Kinda fuzzy versions of the real thing?
It would be interesting if somebody could one of these days actually reconstruct, to a high level of fidelity, what our brain is "seeing". I bet it would look kind of like this.
On a side note, I had never heard this voice-over version before; I am only used to the Harrison Ford voice-over. This one lets me understand why many didn't like the VO.
Now, back to the article: can someone explain roughly how many passes it takes before it gets to near film quality? Can it extrapolate missing frames eventually?
The article lacks some details (I guess many can be found in the cited papers), but it definitely seems like a giant step toward usable large-scale image analysis (producing a meaningful description). Maybe this could benefit from Google's new TPU...
Absolutely. The point of an autoencoder is dimensionality reduction: boil a big set of data down to a few hundred or thousand numbers in a vector which summarizes it. You could treat it either as lossy compression and store just the encoding, or you can treat it as a hybrid format in which the autoencoder lossy encoding is then corrected to lossless by additional bits in the stream.
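A sketch of that hybrid idea, assuming a trained autoencoder exposed as encode()/decode() functions that work on 8-bit frames (hypothetical names, not any particular library):

```python
# Lossy code + residual = lossless round trip. The residual of a good autoencoder
# is mostly small values, so it entropy-codes much better than raw pixels.
import numpy as np

def compress(frame_u8, encode, decode):
    code = encode(frame_u8)                            # a few hundred floats
    recon = decode(code)                               # lossy reconstruction (uint8)
    residual = frame_u8.astype(np.int16) - recon.astype(np.int16)
    return code, residual                              # store both; entropy-code the residual

def decompress(code, residual, decode):
    recon = decode(code).astype(np.int16)
    return np.clip(recon + residual, 0, 255).astype(np.uint8)   # exact original back
```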
In practice, even the hyper-efficient compression algorithms used in something like zpaq tend to use only very small shallow predictive neural networks because no one wants to wait days for their data to be compressed or ship around big neural nets as part of their archives, so it's more of an information-theoretic curiosity. Few enough people will even use 'xz'.
Last I checked, PAQ only uses a shallow (two layer) neural network as a last step to weight the predictions from the multiple handmade next-bit prediction models it contains.
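For the curious, the kind of mixer being described is roughly this (a from-memory sketch of logistic mixing, not PAQ's actual code): each hand-made model outputs a probability for the next bit, and a tiny learned weight vector combines them in the logit domain.

```python
# Sketch of a logistic "mixer": combine several next-bit probabilities and
# adapt the weights toward whichever models predicted the bit that actually occurred.
import math

def stretch(p):
    p = min(max(p, 1e-6), 1 - 1e-6)      # avoid log(0)
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(probs, weights, bit, lr=0.002):
    error = bit - mix(probs, weights)     # positive if the mixture under-predicted a 1
    return [w + lr * error * stretch(p) for w, p in zip(weights, probs)]
```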
I was intrigued by this a while back. I think of training a NN as generating a function (an equation) from a training set which, given a specific input, outputs a prediction. If you can come up with an equation-input pair that, when executed, (a) approximates some data accurately enough and (b) requires less space than the original file, you have achieved (most likely lossy) compression.
This is too powerful to use for product work, I think.
Professionals like to use something they can understand, and when making a Blu-ray (BD) or streaming source they know what a compression artifact looks like and which frames' bitrates to tweak to hide it. They pretty much sit there all day and just do that.