One very interesting thing about differential privacy is its connection to generalizability in machine learning. In particular, a model can only provide differential privacy if that model is generalizable. Generalizability is the ability of a model to do well on data outside its training set. Non-generalizable models violate differential privacy because you can find out whether someone is in the training set simply by measuring the error rate on that person: a low error rate provides evidence that that particular person was in the training set.
This connection is interesting because it implies that differential privacy techniques applied to machine learning must in turn result in more generalizable models, which is very useful in general. It also gives a simple "upper bound" on how differentially private your model is: just compare performance on the training and test sets and see how much they differ.
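For instance, a crude version of that check (my own sketch, nothing rigorous; it ignores sampling error and assumes you already have per-example losses on training and held-out data):

    # Rough sanity check (illustrative only): if training were eps-DP, the
    # expected train and test losses should differ by at most a factor of
    # about exp(eps), so a large observed gap rules out a small eps.
    import numpy as np

    def crude_eps_lower_bound(train_losses, test_losses):
        ratio = np.mean(test_losses) / np.mean(train_losses)
        return max(0.0, float(np.log(ratio)))

If that number comes out at, say, 3, then whatever the training procedure was, it was not 0.1-DP (modulo noise in the estimate).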
Fun coincidence (or maybe not?): I just stumbled upon the following paper on arXiv, titled "Differentially Private Generative Adversarial Network": https://arxiv.org/pdf/1802.06739v1.pdf
Its Related Work section also references some works on "differentially private learning in neural networks".
While the paper does not directly discuss the OP's remark about how differential privacy could help generalization in machine learning, it shows that at least some connections exist between the two domains.
I think what they are getting at is the following, which isn't a proof or anything (if you are looking for rules to follow, the much simpler "your model isn't differentially private" is probably even more accurate):
Let X be a set of iid samples drawn from a distribution D, let M be a model trained with eps-DP on the set X, and let x' be a fresh sample from D.
Let's think about the random variable M(x'), which represents your test loss or whatever. Differential privacy says that the density of this random variable is pointwise within a factor of exp(eps) of that of M'(x'), where M' is the random model resulting from training on X \cup { x' }. But M'(x') (with x' random, X random, and M' trained on both) is distributed just like your training loss or whatever.
You can push linearity of expectations through all of this to show that your expected loss on test and training should vary by at most a factor of exp(eps), but I'm not aware of a pointwise test for a specific model (and it may not make sense, as DP isn't a property of specific models, but rather distributions of them).
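To spell that step out (my paraphrase, not a full proof; A is the eps-DP training algorithm and l a nonnegative per-example loss):

    % eps-DP gives a pointwise density-ratio bound of exp(eps) between the two
    % model distributions, and that bound survives taking expectations:
    \[
      \mathbb{E}_{X,\,x'}\;\mathbb{E}_{M \sim A(X)}\big[\ell(M, x')\big]
      \;\le\; e^{\varepsilon}\;
      \mathbb{E}_{X,\,x'}\;\mathbb{E}_{M' \sim A(X \cup \{x'\})}\big[\ell(M', x')\big]
    \]
    % The left side is the expected test loss; the right side is exp(eps) times
    % the expected loss on a point that was included in training, i.e. the
    % expected training loss (up to the n vs. n+1 training-set size).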
Thanks, using DP to bound deviations of arbitrary functions of a model is a neat idea.
I wonder if it makes sense to go from generalization error to an estimate of something similar in spirit to (but not as strong as) differential privacy, as the top-level comment suggested?
For example I want to empirically argue that a particular function of my GMM, let's say the log-likelihood of x_i, is "private." To do so I form C iid train/test splits. For each split I estimate the density of the likelihood for train and test samples, and estimate an upper bound on their ratio. As a result I get C samples of epsilon, and I can use some kind of simple bound (Chebyshev?) on the probability of epsilon being within an interval.
The idea is that we already have some "privacy" [1] from the data-sampling distribution. So we don't necessarily need to add noise to our algorithm. And it would be interesting to measure this privacy (at least for a particular function of the model) empirically.
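Roughly what I have in mind, as a sketch (my own code, nothing principled; X is an (n_samples, n_features) array and every parameter choice below is arbitrary):

    # Estimate a crude per-split "epsilon" for one function of the model -- the
    # GMM log-likelihood -- by comparing its density on train vs. test samples.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import ShuffleSplit

    def empirical_eps_samples(X, n_splits=20, n_components=3, bins=30):
        eps_samples = []
        splitter = ShuffleSplit(n_splits=n_splits, test_size=0.5, random_state=0)
        for train_idx, test_idx in splitter.split(X):
            gmm = GaussianMixture(n_components=n_components, random_state=0)
            gmm.fit(X[train_idx])
            ll_train = gmm.score_samples(X[train_idx])
            ll_test = gmm.score_samples(X[test_idx])
            # shared bins so the two histogram densities are comparable
            edges = np.histogram_bin_edges(np.r_[ll_train, ll_test], bins=bins)
            p_train, _ = np.histogram(ll_train, bins=edges, density=True)
            p_test, _ = np.histogram(ll_test, bins=edges, density=True)
            ok = (p_train > 0) & (p_test > 0)  # crude: skips empty bins
            eps_samples.append(np.max(np.abs(np.log(p_train[ok] / p_test[ok]))))
        return np.array(eps_samples)  # C samples of "epsilon" to bound, e.g. via Chebyshev

The histogram-density ratio is a very crude stand-in for the pointwise density bound, so this is at best a heuristic estimate, not any kind of guarantee.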
> you can find out if someone is in the training set by simply measuring the error rate on that person.
I don't think it's that easy to detect whether a sample exists in the training set. Error rates are usually informative when measured across multiple predictions. With a single prediction you're most likely looking at the estimated probability / confidence (with Logistic Regression, for example). In that case models might assign high probability to unseen inputs as well.
> This connection is interesting because it implies that differential privacy techniques applied to machine learning must in turn result in more generalizable models, which is very useful in general.
I'm confused. Wouldn't a model that produces completely random results be differentially private but not generalizable? Or is giving more random results considered generalization?
Technically, the output quality of a completely random model stays the same when you try to generalize. It wasn't actually good to begin with, but it doesn't get worse.
An upcoming talk [1] entitled "From Differential Privacy to Generative Adversarial Privacy" implies we can do better than DP:
> ...Our results also show that the strong privacy guarantees of differential privacy often come at a significant loss in utility.
> The second part of my talk is motivated by the following question: can we exploit data statistics to achieve a better privacy-utility tradeoff? To address this question, I will present a novel context-aware notion of privacy called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to arrive to a unified framework for data-driven privacy that has deep game-theoretic and information-theoretic roots. I will conclude my talk by showcasing the performance of GAP on real life datasets.
Looks like a very nice introductory article. One warning is that implementing DP algorithms naively can lead to privacy violations. For example, the python code in the article:
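(The snippet isn't reproduced here; a naive Laplace mechanism usually looks something like the sketch below -- my illustration, not the article's exact code -- which adds double-precision Laplace noise directly to the exact answer, and that is where the floating-point trouble comes from.)

    # Illustration only, not the article's actual code: the textbook Laplace
    # mechanism implemented directly on floating-point numbers.
    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, eps):
        return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / eps)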
I've seen this limitation brought up before, but I don't have an intuitive understanding of when it would apply. Is the essential issue spacing between floats?
The scenario I'm thinking of is: each user's entry is a real number from U[1e15, 1e15+1], and we want to output the sum. The sensitivity is still 1. So the Laplace mechanism outputs something like sum(DB) + laplace(0, 1/eps), which will be exactly sum(DB), because the laplace(0, 1/eps) noise is smaller than the gap between adjacent floats at magnitudes >= 1e15.
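Concretely (my own quick check, assuming eps = 1 so the noise scale is 1, and a database of roughly a thousand such users so the true sum is around 1e18):

    # At that magnitude the gap between adjacent doubles is ~128, so the
    # Laplace(0, 1) noise is rounded away and the "noised" output is
    # bit-for-bit equal to the exact sum.
    import numpy as np

    true_sum = 1e18                                   # ~1000 users near 1e15 each
    noised = true_sum + np.random.laplace(0.0, 1.0, size=100_000)
    print(np.spacing(true_sum))                       # 128.0
    print(np.all(noised == true_sum))                 # True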
For this to be an issue, the actual sum(DB) needs to be computed in full precision, right? Otherwise it seems like the sensitivity isn't necessarily 1.
It points to an ethical problem with how Differential Privacy is expected to be used. You have to get everything 100% right or the secrets in your dataset will eventually be compromised. As someone publishing works incorporating differential privacy techniques, do you have enough confidence in your implementation to risk those secrets? What level of consent is required from those who are impacted if your implementation fails?
Not sure if those questions are rhetorical -- I'm on the theory side and don't write these implementations myself, and I share your concerns. But they also apply to crypto in general.
We do need a way to get sensitive data sets to researchers, there is a lot of public good and discovery to be had when we do.
Cynically, when we were finding that de-identification of data was neither cryptographically secure nor a viable solution for providing data sets to researchers who couldn't be trusted to protect them, a nagging voice would suggest that k-anonymity, and now DP, were a way to obfuscate the distribution of risk in these solutions through layered niche abstractions, and to use the ensuing confusion to get the data toothpaste out of the tube.
Today, I like the idea that DP provides a set of information theoretic criteria for candidate algorithms for protecting privacy and anonymity. It also appears to provide practical tools for technologists to reason about information theory problems in their day to day work.
It's hard to imagine it was acceptable to say, "sure you can do cancer/autism/health research, but first we need a viable generalized homomorphic encryption solution before we share our data with you," but that was the state of health information security up to even recently.
The need for a class of non-cryptological (i.e. not certified algorithms) and non-zero-sum information privacy protection tools reflects how we actually use data today.
I am still wary that project types will wave DP around like a talisman to ward off security analysts threatening their deadlines, but that's not DP's fault.
There is a joke in here about how the response of security people to DP and privacy is often snark, but I will leave that as an exercise to the reader.
FHE and DP aren't concerned with the same definition of privacy. FHE aims to guarantee exactly what is revealed. DP tries to guarantee what can be inferred from what is revealed.
Indeed, and I think the definition of privacy that security technologists were applying to health data (in my example) implied FHE or other solutions that were out of reach, as a barrier to doing important research.
It was worth emphasizing these are very different concepts. My association of them was because they are posed as potential solutions in the same business problem domains.
I wonder if Differential Privacy will be a solution to GDPR restrictions. If the user's data is perturbed enough that it can't be used to identify the user, are some of the restrictions the law places on its use relaxed?
> Differential privacy formalizes the idea that a query should not reveal whether any one person is present in a dataset, much less what their data are.
I see one problem here: you almost never know all the possible queries in advance.