Glow: Better Reversible Generative Models (blog.openai.com)
208 points by ryanmercer on July 9, 2018 | 35 comments



Ethics/Implication question: How much time is left until video/photographic evidence ceases to be a reliable factor in courts? I know that while projects like this are happening, so are projects that work to identify fakes. If the former manages to outpace the latter, this could become a real problem. My mind tends to think of a dystopian extension of cops planting drugs on suspects, where they could literally invent photographic evidence.


This was already being discussed with deepfakes. If a group of redditors managed to build a network architecture that put celebs into porn with surprisingly decent results (in some cases), I think a sponsored research group can get even better results.

- https://www.brookings.edu/blog/order-from-chaos/2018/05/25/t...

- https://www.msnbc.com/hallie-jackson/watch/fake-obama-warnin...

- https://techcrunch.com/2018/06/04/forget-deepfakes-deep-vide...


Would it be possible for phones/cameras to embed cryptographic information in video/photo files in a way that proves their origin and is tied to the actual pixels of the image/video?

I don't know if that's a silly suggestion.


Sure, you could design a camera that signs photos with a private key stored in a hardware security module, then only trust signed photos and videos. It would have to be resistant to physical tampering and side channel attacks. One way to do this would be to stream bytes to the HSM as they're recorded, which would then output the bytes and their signature for local storage.

The tricky bit here is closing the analog loophole (using this camera to record a carefully constructed, high resolution fake) and preventing the HSM from signing anything which wasn't recorded by the camera lens.
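Rough sketch of what I mean, in Python with the cryptography package -- purely illustrative (a real camera would do the signing inside the HSM and the key would never be extractable):

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def sign_recording(chunks, private_key):
        # Hash the bytes as they stream in, then sign the final digest.
        digest = hashlib.sha256()
        stored = []
        for chunk in chunks:
            digest.update(chunk)
            stored.append(chunk)  # the raw bytes go to local storage
        return b"".join(stored), private_key.sign(digest.digest())

    key = Ed25519PrivateKey.generate()  # would live inside the HSM, not in software
    video, sig = sign_recording([b"frame-0", b"frame-1"], key)
    key.public_key().verify(sig, hashlib.sha256(video).digest())  # raises if tampered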


If GPS satellites were to cryptographically sign their signals, would that be sufficient to prevent spoofing of GPS signals (assuming the receiver of the signals treated the signing appropriately)?

Like, I can imagine maybe there could be an attack where one could record gps signals at some nearby place, and then play them back in slightly different orders/rates to try to fool a receiving device into thinking it is somewhere else (nearby).

But I don't know how much of a time delay would be needed in order to do that. Could a timestamping service do timestamping quickly enough to prevent this attack for internet connected devices?

I tried looking at how the signals for GPS work (like, how frequently it sends time information and how detailed it is for civilians) but it seemed complicated and I got confused and/or didn't try hard enough, so I didn't arrive at an answer for how long it would take to spoof positions if one could only delay real signals that one received.

Edit: the purpose of making the GPS coords unspoofable would be to ensure that even if the screen-in-front-of-camera attack were carried out, it would have to be done at the same location and time as the event the footage claims to show.


> If GPS satellites were to cryptographically sign their signals, would that be sufficient to prevent spoofing of GPS signals (assuming the receiver of the signals treated the signing appropriately)?

Nope, classic replay attack case (just record the GPS signal and replay it to the device at the desired location). You'd need a true time signal within the device, e.g. an atomic clock, to make it work (so you'd authenticate the signed time against true time).

---

There is another way, however. If we assume the hardware is tamper-proof (otherwise drastically different methods are needed), then with strict timings we can devise a challenge-response system that's immune to replay attacks thanks to relativity: simply transmit a signal A, have a known third party (e.g. US government servers in cellphone towers) sign your signal Sig(A) and retransmit it, and check that the delay matches the propagation delay you'd expect from the cellphone tower distance, plus the fixed (and immutable, since it would be gov-controlled) processing delay. Your tamper-proof crypto-camera would record its location and whether it trusts that location. Using cellular signals is also better because GPS doesn't work indoors and is sensitive to interference.
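Back-of-the-envelope version of that timing check (all constants here are made up for illustration):

    SPEED_OF_LIGHT = 299_792_458.0    # m/s
    TOWER_PROCESSING_DELAY = 50e-6    # fixed, gov-controlled delay (made up)
    TOLERANCE = 5e-6                  # allowed slack (made up)

    def location_plausible(round_trip_s, tower_distance_m):
        # Accept the signed response only if the round trip matches physics:
        # signal out, fixed processing at the tower, signal back.
        expected = 2 * tower_distance_m / SPEED_OF_LIGHT + TOWER_PROCESSING_DELAY
        return abs(round_trip_s - expected) <= TOLERANCE

    print(location_plausible(round_trip_s=117e-6, tower_distance_m=10_000))  # True
    print(location_plausible(round_trip_s=500e-6, tower_distance_m=10_000))  # False (relayed from farther away)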

Since we're adding a cellular connection to our device, it would also be a good idea to log its position on the state-controlled servers (again this can be done with cryptographic safety assuming a non-tampered device), along with some kind of intrusion detection system. As soon as it detected an attempt at tampering, it would relay this attempt to the servers, storing the intrusion and invalidating the authenticity of subsequent recordings; self-destruction of the key would probably also be wise.

---

And now that I think about it, you'd probably want to put several keys/auths in the device, from different organizations -- not only governments. That way, if the government authentication is positive but the NGOs' don't match, you can suspect a government-backed forgery attempt (and vice versa).


Sounds easy to thwart via key extraction from the hardware or simply recording a high-res screen playing the neural-network generated image or video.


If the signature was located in the least significant bits of the pixels, you couldn't recover it just by filming the screen. If you altered the video, you would need to re-sign it.
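Something like this (numpy toy; names are made up, and a real scheme would sign a hash of the other bits so verification survives the embedding itself):

    import numpy as np

    def embed_bits(image, payload_bits):
        # Overwrite the least significant bit of the first len(payload_bits) pixels.
        flat = image.reshape(-1).copy()
        flat[:len(payload_bits)] = (flat[:len(payload_bits)] & 0xFE) | payload_bits
        return flat.reshape(image.shape)

    def extract_bits(image, n):
        return image.reshape(-1)[:n] & 1

    img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
    sig_bits = np.random.randint(0, 2, 16, dtype=np.uint8)  # stand-in for a real signature
    assert np.array_equal(extract_bits(embed_bits(img, sig_bits), 16), sig_bits)
    # Re-filming a screen re-quantizes every pixel, so these bits don't survive.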


I think he's suggesting that you play the video for the camera, and then let the camera sign what it saw.


Gotcha

I guess in the case of faking political speeches and the like, it becomes some trust model regarding who owns the camera.


I don't know that I agree with all of their architectural choices, but this company has developed a first crack at that very thing: https://www.snapshotdna.com/



Even fakeable evidence can be useful. If a hair is found at a crime scene today, you have to take into account the fact that it's possible (if unlikely) that someone could have planted it there. We'll probably end up in a similar situation when you pull video from a security camera, and you'll have to weigh the possibilities of it being real or fake, even if you can't tell either way.


Right: falsifying evidence leaves its own trails, and the likelihood of something being falsified can be investigated.


"ceases to be a reliable factor in courts" and "dystopian extension of cops planting drugs on suspects" seem to be mutually exclusive. Also, as I commented on a HN thread 6 months ago[0], let's remember that there was no 'objective' record of events for the vast majority of human history. It's only in the last 150-200 years that we've had light and sound recording technology.

[0] https://news.ycombinator.com/item?id=16014047 [Google claims near-human accuracy at imitating a person speaking from text]


Still some time yet; try recovering a young Geoffrey Hinton from his photo from 2011 or so. However, once we can train for and tweak such knobs at will, it's going to get real IMO.


Can someone please explain to me what a "1x1 convolution" is?

I've been wondering ever since I first started reading about 1x1 convolutions a while back.

My background is not in artificial neural networks, but I understand their single-neuron operation: a linear combination of inputs (plus an optional bias/offset), so this part behaves like any linear correlator, followed by a nonlinear but typically differentiable compressive sigmoid.

I understand how convolutional neural networks operate, and that the synaptic weights correspond to filter kernel weights (like point spread functions, or impulse responses).

Given this engineering-like interpretation I have, can someone explain to me what use convolution with a 1x1 filter has??


The trick is that convolutional neural nets operate on multichannel images. Your intuition that the synaptic weights form point spread functions is correct, but you have to remember that the activation of each neuron depends on a sum of point spread functions, one for each channel in the previous layer.

I like to think of 1x1 convolutions as pixel-wise dense layers.

Here’s a toy example of a useful 1x1 convolution: you could convert a color image to greyscale by doing a 1x1 convolution with the kernel (.33, .33, .33)
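In PyTorch that toy example looks roughly like this (sketch only, with the weights hard-coded to the 0.33s above):

    import torch
    import torch.nn as nn

    rgb = torch.rand(1, 3, 8, 8)  # (batch, channels, height, width)
    to_grey = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1, bias=False)
    with torch.no_grad():
        to_grey.weight[:] = torch.tensor([0.33, 0.33, 0.33]).view(1, 3, 1, 1)

    grey = to_grey(rgb)  # shape (1, 1, 8, 8)
    # Each output pixel is a weighted sum over the input channels at that same
    # spatial location -- a "pixel-wise dense layer".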


thank you very much, I suppose we can all agree that 1x1x3 (Width x Height x Color) convolution would have been a lot clearer.

again, thanks for the succinct and clear explanation!


Just to elaborate a bit more, 1x1 convolution is often used to reduce the number of parameters from the previous layer to keep the total number of parameters manageable.
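For example (PyTorch sketch; the channel counts are arbitrary):

    import torch.nn as nn

    bottleneck = nn.Conv2d(256, 64, kernel_size=1)             # 256 -> 64 channels
    conv3x3    = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # then a cheap 3x3

    # 1x1 bottleneck: 256*64 + 64 = 16,448 parameters
    # a 3x3 conv directly on 256 channels would be 256*64*9 + 64 = 147,520 parameters
    print(sum(p.numel() for p in bottleneck.parameters()))     # 16448
    print(sum(p.numel() for p in conv3x3.parameters()))        # 36928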


Yes, thank you, I understand the motivation. The reason I never understood it before is quite simple: the description "1x1 convolution" (correctly) conveys that they are not doing convolutions across pixels.

It is good to point out what they aren't doing, but they should also point out what they are doing.

Humor me by considering:

"Today I did not go to the zoo, I did not go to school, ..., and I did not go by foot, nor by vehicle without weels, nor by a vehicle with only one wheel, nor by a vehicle that had 3 or more wheels, and the vehicle was not motorised, and I did not need to stand up on the vehicle"

While true, and possibly important to point out what I didn't do, it's generally more helpful to describe what I am doing, like "I rode my bicycle to the supermarket"


Manipulating Neil deGrasse Tyson's face in the demo yields some hilariously bad-looking results.


If you max out smiling and beard, the result kinda looks like Tim Meadows. Granted, that could be just me being yet another annoying white person. I'll ask my wife about it later just in case.


Interesting thing about the default image: "beard" on a woman creates a more typically "masculine" jawline but doesn't add facial hair.

Increasing blondeness also increases smile.


# EDIT: it seems the post below me is actually correct, and my post is incorrect in this case.

Just as an aside here: "Blondeness" and "beard" are probably just labels the authors found correspond the most to the latent variables in this case. This means that there won't be a perfect translation between those words and what these variables directly respond to in the network.

So although the training data may have been biased with more smiling blonde people, it doesn't necessarily have to have been so. It might be that what this latent variable encodes just does something else in edge cases where there are few examples.


Not so. They split the data between "bearded faces" and "non-bearded faces" and compute the vector from one to the other, and use that to alter a given face. (It reminds me of the typical word2vec example of man - woman + queen = king.)

See their code snippet halfway down their page, or "Semantic Manipulation" on page 8 of their linked paper.
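Roughly this (hypothetical numpy sketch of the idea; the real version encodes and decodes with the flow itself, see their snippet):

    import numpy as np

    rng = np.random.default_rng(0)
    z_bearded   = rng.normal(size=(100, 512))  # latents of training faces labelled "beard"
    z_unbearded = rng.normal(size=(100, 512))  # latents of the rest

    beard_direction = z_bearded.mean(axis=0) - z_unbearded.mean(axis=0)

    z = rng.normal(size=512)                   # latent of the face being edited
    z_more_beard = z + 0.66 * beard_direction  # decode(z_more_beard) -> edited image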


This sounds like yet another appearance of the tank-sky problem.


There is no mechanism in this paper (or in a standard VAE, or GAN) to encourage that a single human-understandable semantic quantity should be captured in a single dimension of the latent code. So, in general, it won't happen. "Blondeness" is spread out over all dimensions of the latent code. Therefore the method in this paper (summarized by matheist) is totally reasonable. There are other autoencoder-type generative models that try to concentrate each attribute in one dimension, usually by using the class labels as an additional input. But that is not the focus of this paper.


Do you have a link to a paper that does this by any chance? I'm interested in learning more but not sure what to search for.


Sounds like the approach used here : https://arxiv.org/abs/1609.04468


Increasing blondness made a photo of me look like Draco Malfoy -- and I have dark brown hair! Everyone in our lab is trying it.

Our negative so far is that most of the interpolations really seem more like two existing pictures were photoshopped together than like a new face was generated from a latent space and knowledge of faces. Sorry, I don't have the vocabulary and concepts of visual composition to say why. It just looks "shooped".


Code is now available: https://github.com/openai/glow


Glow is also the name of the machine learning compiler that's built inside PyTorch: https://github.com/pytorch/glow/


I mean, it has potential. But lots of things have potential. I think I'll be more interested in a few years, but right now it mostly just looks too weird (not even uncanny valley) to feel like it could be used for anything.


Reminds me of the Ocean's 12 (13?) scene where the brothers are in the boring machine and are manipulating the stolen FBI mugshots to disguise the group members before they get found out.



