Abstract: Given a state-of-the-art deep neural network classifier, we show the existence of a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability. We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images.
Super interesting. I'm on mobile and haven't had time to read the whole paper yet - would it be feasible to continuously compute these perturbation vectors during training and include them as part of a larger heuristic? For instance, to incorporate the objective of maximizing the size of the perturbation vector necessary for misclassification? The goal being to end up with a net that is more resistant to such perturbations.
Short answer: No. Computing these perturbations requires an expensive optimization with multiple passes through the dataset, which would be prohibitively expensive to do in the innermost loop of training.
There is other work in the literature describing faster algorithms to compute these perturbations, which makes it possible to use them during training. See, e.g.: https://arxiv.org/abs/1412.6572
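For anyone wondering what a "faster algorithm" can look like: below is a minimal sketch of a one-step gradient-sign perturbation, which is my understanding of the kind of method the linked paper describes. It is not the universal-perturbation algorithm from the original post. The model here is a toy logistic regression, and the weights, input, and `eps` value are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_sign_perturb(x, y, w, b, eps):
    """One gradient-sign step against a toy logistic-regression classifier.

    x: input vector, y: true label (0 or 1), w/b: model parameters (hypothetical),
    eps: maximum change allowed per input component.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w              # gradient of the cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)  # step in the direction that increases the loss

# Toy values, chosen only for illustration.
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.2, -0.1, 0.4])
x_adv = gradient_sign_perturb(x, y=1.0, w=w, b=b, eps=0.3)

print(sigmoid(np.dot(w, x) + b) > 0.5)      # original input is classified as class 1
print(sigmoid(np.dot(w, x_adv) + b) > 0.5)  # perturbed input is not
```

It needs a single gradient evaluation per example, which is why this style of perturbation is cheap enough to generate inside a training loop.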
IMO, (at least) two pieces of research on the subject means that the short answer really is "yes". Maybe not the exact technique used in the paper in the original post, but conceptually similar techniques.
It's easier to find a fooling perturbation for one image than for the whole dataset. I assumed the question was whether we can use the universal perturbations for robust training. The answer to that is still "no", I think.
This seems to imply the features learned by neural networks are very different from the features humans use to distinguish the same objects, because they are affected by distortions that barely interfere with the features used by humans at all.
One thing is that neural networks are much smaller than human brains and most likely have far fewer overlapping redundant systems. If you had three separate neural networks that had to reach a consensus, you might find it much harder to find adversarial inputs.
This reminds me of signal attenuation/gain/feedback (by neurotransmitter release [I want to say dopamine...]) due to error in the visual cortex. Hopefully someone who's studied that might have something to share.
I believe the parent was referring to how an ensemble of differently trained networks can reduce variance and perhaps avoid issues like these.
Attention, localized gain, etc. would not have this effect, but they tend to allow a smaller network to perform more sophisticated tasks.
We have two 'cameras', and they scan the image they are looking at by jumping around it at 20–200 ms intervals. The perceived image is an integration of many of these jumps, and it's constantly changing.
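A rough sketch of the consensus idea, assuming each network simply outputs a class label and the majority wins. The three prediction rows are invented for illustration; no real models are involved.

```python
import numpy as np

def majority_vote(predictions):
    """Combine per-model class labels into one consensus label per sample.

    predictions: integer array of shape (n_models, n_samples).
    Ties go to the lowest class index (a property of argmax).
    """
    predictions = np.asarray(predictions)
    n_models, n_samples = predictions.shape
    n_classes = predictions.max() + 1
    votes = np.zeros((n_classes, n_samples), dtype=int)
    for model_preds in predictions:
        # Each model casts one vote per sample for the class it predicted.
        votes[model_preds, np.arange(n_samples)] += 1
    return votes.argmax(axis=0)

# Hypothetical outputs of three separately trained networks on four images.
preds = [
    [0, 1, 2, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
]
consensus = majority_vote(preds)
print(consensus)
```

Worth keeping in mind that the abstract says these perturbations generalize well across networks, so a vote only helps to the extent that the members fail on different inputs.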
Several of the universal perturbation vectors in Figure 4 remind me a lot of Deep Dream's textures.
I wonder what it is about these high-saturation, stripy-spiraly bits that these networks are responding to.
Is it something inherent in natural images? In the training algorithm? In our image compression algorithms? Presumably, the networks would work better if they weren't so hypersensitive to these patterns, so finding a way to dial that down seems like it could be pretty fruitful.
My intuition is that these patterns "hijack" the ReLU activations in the lower levels, causing either important features to not fire or features that shouldn't fire to do so. Usually the lower layers learn very primitive shapes like lines and curves, and I think (although I'd need to double check) that they usually pass through entire color channels rather than nuanced mixings of colors. (So one feature would either pass through all of red or all of blue or all of both, rather than pass just 66% red, 47% blue, and 33% green -- if it did the latter it wouldn't be able to generalize well.) This propagates the error through the network, where the later activations start firing in the wrong places, causing the misclassification.
> The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images.
The paper unpacks that explanation pretty well along with actual pictures and how they are related to the classification boundary.
This is really great research and interesting: (very roughly) how to compute a very small mask which, when applied to any image, makes the neural network misclassify it, whereas humans would notice no essential difference.
I'm not an expert, but it seems these perturbations are fiddling with the fundamental approach used by an NN, namely that an NN works in layers. So these perturbations must be messing up the lowest layers, and then the higher layers end up generating the wrong features and ultimately the model misclassifies. See http://i.stack.imgur.com/jpYdN.png
This is why I'm never driving a car that is classifying stuff with neural networks. Some dust, some shitty weather conditions and that pigeon becomes a green light.
Ok, so some guy invents a device that tricks every car at an intersection into seeing a green light, and maybe blinds them to the presence of other cars.
In signal processing you often have to pass the data through some sort of low-pass filter before attempting your analysis. I would be surprised if that isn't one of the methods being tried to protect deep neural nets from some of these attacks. Obviously there are some issues (needing to train on similar data, and such blurring interfering with first-level features that emulate edge-detection and so on).
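To make the low-pass idea concrete, here is a minimal sketch of a mean (box) filter applied to an image before it would reach the classifier. The 3x3 window and the checkerboard test pattern are arbitrary choices for illustration, and nothing here shows that blurring actually defeats these particular perturbations.

```python
import numpy as np

def box_blur(image, k=3):
    """Mean filter over a k x k window (k odd), a crude low-pass filter.

    image: 2-D grayscale array. Borders are handled by repeating edge pixels.
    """
    pad = k // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    # Sum the k*k shifted copies of the image, then divide to get the mean.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

# A +1/-1 checkerboard stands in for a high-frequency perturbation.
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
blurred = box_blur(checker)

print(np.abs(checker).max())  # amplitude before filtering
print(np.abs(blurred).max())  # much smaller after filtering
```

The same averaging also smears the sharp edges that first-layer features tend to rely on, which is the trade-off mentioned above.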
If my understanding is correct, the perturbations are inherent in the model, not the data. It's a vulnerability in the high-dimensional decision boundary of neural nets.
"Snow Crash" is the more realistic Neal Stephenson version, where it gets at the eye-brain-embedded hardware. And of course the original, "the joke so funny that if read or heard would make you laugh yourself to death".
Humans seem really good at being impervious to these, due to millions of years of ignoring things.
I'm guessing it won't be long until someone uses this technique to compute and apply perturbation masks to pornographic imagery and make NN-based porn detectors/filters (like the one Yahoo recently open-sourced) a lot less effective.
Is there reason to think the human visual system is sufficiently well modeled by deep neural nets that our brains might exhibit this same behavior? My first thought was the perturbation images would need to be distinct per person, but photosensitive epilepsy like the Pokémon event [0] might suggest the possibility of shared perturbation vectors.
That image also seems very black (the dog takes up most of the image), so the perturbations probably didn't have much to perturb. Also, the perturbation is "universal", so it could have simply landed on the same classification.
My science-fiction brain is, of course, interested in this as a method to defeat face-detection in a way humans can't see. I'd like to think that the crew of the Firefly used this technology to avoid detection when they did jobs in the heart of Alliance territory.
Can someone help with a notation question? In section 4 of the paper, the norm of the perturbation is constrained to a maximum of 2'000, which presumably is "small", but I don't know how to parse an apostrophe like that.
Update: later in the paper, the authors mention that 2x10^4 is an order of magnitude larger than 2'000 so perhaps this is just a way of introducing a thousands separator without introducing cultural ambiguity over whether it's a thousands separator or a decimal separator?
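On whether 2'000 counts as "small": a quick sketch of what capping a perturbation's norm at 2000 means, assuming the Euclidean norm and a 224x224 RGB image with pixel values in 0..255. Both assumptions are mine for illustration, not something stated in this thread.

```python
import numpy as np

def project_l2(v, radius):
    """Scale v down so its Euclidean norm is at most radius; leave it alone otherwise."""
    norm = np.linalg.norm(v)
    if norm <= radius:
        return v
    return v * (radius / norm)

# Hypothetical perturbation: every one of 224*224*3 values is off by 20.
v = np.full(224 * 224 * 3, 20.0)
clipped = project_l2(v, 2000.0)

print(np.linalg.norm(v))        # well above the cap
print(np.linalg.norm(clipped))  # rescaled to the cap
print(2000.0 / np.sqrt(v.size)) # per-value RMS change allowed by the cap
```

Under those assumptions the cap works out to a root-mean-square change of roughly 5 intensity levels per value, which fits the "quasi-imperceptible" description in the abstract.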
My intuition is that the existence of adversarial images with barely perceptible differences but a high-confidence misclassification will lead to a new NN architecture for image classification.