Differential Privacy (seas.harvard.edu)
136 points by sr2 on June 23, 2017 | 34 comments



This[0] video from Apple's WWDC gives a nice overview of how Differential Privacy is being used in iOS. Basically, Apple can collect and store its users’ data in a format that lets it glean useful info about what people do, say, like and want. But it can't extract anything about a single specific one of those people that might represent a privacy violation. And neither can hackers or intelligence agencies.

[0] https://developer.apple.com/videos/play/wwdc2016/709/?time=8... (the "Transcript" tab has the text of the video if you want to read instead of watch.)


It's cool that they're using DP for some analytics, but it's not quite the holy grail that Apple and its fans have been selling it as: any analytics campaign using DP will eventually either average out to pure noise or end up being non-anonymous.

Here's a great interview with the Microsoft researcher who invented the technique: http://www.sciencefriday.com/segments/crowdsourcing-data-whi...

One of the quotes I always liked from it is "any overly accurate estimates of too many statistics is blatantly non-private"
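A rough sketch of why (my own addition, using only the standard sequential composition theorem): if you answer k queries with mechanisms that are ε_1-, ..., ε_k-differentially private, the combined release is in general only

    \varepsilon_{\mathrm{total}} = \varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_k

differentially private. So under a fixed total privacy budget, every additional statistic you publish either forces more noise into each answer (until the answers average out to noise) or exhausts the budget (after which the release is no longer meaningfully anonymous).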


I like https://blog.cryptographyengineering.com/2016/06/15/what-is-... as an introduction.

Differential privacy is cool. However, I looked at Google's RAPPOR algorithm (deployed in Chrome, and clearly designed with real-world considerations in mind) in some depth, and I found that RAPPOR needs millions to billions of measurements to become useful, even while exposing users to potentially serious security risks (epsilon = ln(3), so "bad things become at most 3x more likely"). Much better than doing nothing, but we'll continue to need non-cryptographic solutions (NDAs, etc.) for many cases.
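To give a feel for what epsilon = ln(3) means in a local setting like this, here is a minimal Python sketch of plain binary randomized response (an illustration of the underlying idea, not Google's actual RAPPOR algorithm): each user reports the truth with probability 3/4, so the probability of any given report changes by at most a factor of e^epsilon = 3 depending on the user's true value, and the aggregator needs a lot of reports before the debiased estimate is useful.

    import random

    E_EPS = 3.0                      # e^epsilon, with epsilon = ln(3)
    P_TRUTH = E_EPS / (E_EPS + 1.0)  # report the true bit with probability 0.75

    def randomized_response(true_bit):
        """Report the true bit with probability 3/4, the flipped bit otherwise."""
        return true_bit if random.random() < P_TRUTH else not true_bit

    def estimate_rate(reports):
        """Invert the bias: observed = p*true_rate + (1-p)*(1-true_rate)."""
        observed = sum(reports) / len(reports)
        return (observed - (1.0 - P_TRUTH)) / (2.0 * P_TRUTH - 1.0)

    # Hypothetical population of 1,000,000 users, 10% with the sensitive bit set.
    reports = [randomized_response(random.random() < 0.10) for _ in range(1_000_000)]
    print(estimate_rate(reports))  # close to 0.10, but only because the sample is huge

With only thousands of reports the debiased estimate swings wildly, which is the "millions to billions of measurements" problem.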


The coolest part about differential privacy is its guarantees against overfitting.


We have some notes on using differential privacy to slow down over-fitting here: http://www.win-vector.com/blog/2015/10/a-simpler-explanation...


Oh I hadn't considered the statistical advantage here.

You do cut out a lot of human bias from the research process, but you also create blind errors that are hard to validate.

I know that in my work there are plenty of times when I run an analysis and then go back and manually check some entries as a sanity check - pros and cons here!


The Thresholdout method [0] for preventing overfitting on a reused test set is an interesting application of this (a rough sketch follows the links below).

Here's a talk on differential privacy applied to the overfitting problem [1].

[0] http://andyljones.tumblr.com/post/127547085623/holdout-reuse

[1] https://www.youtube.com/watch?v=9mqXjdnZA18
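For anyone who wants the gist without clicking through, here is that rough sketch of the Thresholdout idea in Python (simplified from my reading of the write-ups above; the threshold and noise scale are illustrative, not the paper's constants): answer each query from the training set unless it disagrees with the holdout by more than a noisy threshold, and only then return a noised holdout answer.

    import numpy as np

    def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
        """Answer "what is the mean of this statistic?" while reusing the holdout.

        train_vals / holdout_vals: the statistic evaluated on each training /
        holdout example, as arrays of numbers in [0, 1].
        """
        rng = rng or np.random.default_rng()
        train_mean = float(np.mean(train_vals))
        holdout_mean = float(np.mean(holdout_vals))
        # If training and holdout agree to within a noisy threshold, answer from
        # the training set; the holdout leaks almost nothing.
        if abs(train_mean - holdout_mean) <= threshold + rng.laplace(0.0, sigma):
            return train_mean
        # Otherwise answer from the holdout, plus noise, spending some budget.
        return holdout_mean + rng.laplace(0.0, sigma)

Because most answers end up being functions of the training set alone, the analyst can adaptively reuse the holdout far more times than naive reuse would allow before overfitting to it.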


I think this is the canonical review article: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

(No, I haven't read it...)


Aaron Roth was my professor at Penn. He's definitely the expert on differential privacy. Fun fact: his dad won the Nobel Prize in Economics a few years ago.


I don't like differential privacy very much.

Take GPS data, for example: NYC has released a taxicab dataset showing the "anonymized" location of every pickup and dropoff.

This is bad for privacy. One attack is that now if you know when and where someone got in a cab (perhaps because you were with them when they got in), you can find out if they were telling the truth to you about where they were going -- if there are no hits in the dataset showing a trip from the starting location that you know to the ending location that they claimed, then they didn't go where they said they did.

Differential privacy researchers claim to help fix these problems by making the data less granular, so that you can't unmask specific riders: blurring the datapoints so that each location is at a city block's resolution, say. But that doesn't help in this case -- if no-one near the starting location you know went to the claimed destination, blurring doesn't help to fix the information leak. You didn't need to unmask a specific rider to disprove a claim about the destination of a trip.

I think that flaws like these mean that we should just say that GPS trip data is "un-de-identifiable". I suspect the same is true for all sorts of other data. For example, Y chromosomes are inherited the same way that surnames often are, meaning that you can make a good guess at the surname of a given "deidentified" DNA sequence, and thus unmask its owner from a candidate pool, given a genetic ancestry database of the type that companies are rapidly building.


The attack you suggest is ruled out by differential privacy. The precise guarantee is a bit complicated. The first thing to note is that the output of a differentially private mechanism must be random. Then, the guarantee is that Pr[output] does not change by very much whether or not you are included in the dataset. In other words, even if you were omitted from the dataset, the chance that the algorithm would have produced the same result is very nearly the same.

This definition rules out the attack you suggest. In particular, under the blurring scheme you describe, if your trip is removed from the dataset then the probability of the output "a ride starts in that region" drops from very large to very small. Therefore, the algorithm you describe (i.e., adding noise to the start location) is not actually differentially private.

The confusion arises because oftentimes adding noise is sufficient. For example, the average of n real numbers in [0,1] changes by at most 1/(n-1) if you delete one point from the dataset. Therefore, you can add just a little bit of noise and the released average is differentially private.

For the dataset you describe, a sibling comment proposed the correct mechanism -- you have to add noise to the count returned by the query, not the start location. (Technically I think you could just add noise to the start location like you propose, but the amount of noise would have to be large enough that all the start locations overlap by a sufficient amount.)
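To make the count mechanism concrete, here is a minimal Python sketch (my own illustration, using the standard Laplace mechanism): a count query changes by at most 1 when a single trip is added or removed, so Laplace noise with scale 1/epsilon is enough, and because the reported count is noisy, a zero answer no longer proves that no matching trip exists.

    import numpy as np

    def dp_count(records, predicate, epsilon, rng=None):
        """Differentially private count of records matching `predicate`.

        A count has sensitivity 1 (adding or removing one record changes it by
        at most 1), so Laplace noise with scale 1/epsilon gives
        epsilon-differential privacy.
        """
        rng = rng or np.random.default_rng()
        true_count = sum(1 for r in records if predicate(r))
        return true_count + rng.laplace(0.0, 1.0 / epsilon)

    # Hypothetical trips as (pickup_block, dropoff_block) pairs.
    trips = [("midtown", "jfk"), ("midtown", "soho"), ("harlem", "jfk")]
    print(dp_count(trips, lambda t: t == ("midtown", "jfk"), epsilon=0.5))
    # The answer is noisy; even a true count of zero would often come back nonzero.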


Thank you! Makes sense.


It's been a couple years since I read the literature so I might be wrong, but iirc differential privacy fuzzes the number of matches instead of the data points themselves.

That is, in your example differential privacy would precisely display the start and end points but the number of riders would be fuzzed (perhaps showing 4 departures and 2 arrivals at the respective location/time).

Also, iirc, offline differential privacy can outright remove data points, while online systems can block sufficiently deanonymizing queries.

One of the criticisms of differential privacy is that it can render the data useless... I definitely found that held true for my data. In the end, my company simply decided against releasing (or collecting) any customer data.


Doesn't it also introduce noise into the dataset? It could modify rides at random: say 10% of rides would have the start or end point swapped with some other ride or changed completely. That way the information in aggregate is still useful, but no single measurement is necessarily true, and you also cannot use the absence of a data point as proof of anything.


> That way the information in aggregate is still useful, but no single measurement is necessarily true, and you also cannot use the absence of a data point as proof of anything.

This is a great, succinct description of the goals of differential privacy.

> Doesn't it also introduce noise into the dataset? It could modify rides at random: say 10% of rides would have the start or end point swapped with some other ride or changed completely.

It can. There are variants, and offline/online systems operate differently.


You seem to have an incorrect version of differential privacy in mind. "Blurring the datapoints" as you describe would not satisfy differential privacy at all.

DP would not allow an information leak of this kind, unless the data set was modeled in a very silly way.


> if there are no hits in the dataset showing a trip from the starting location that you know to the ending location that they claimed, then they didn't go where they said they did.

As others have pointed out, differential privacy is not obtained just by blurring the points. But I'd like to point out that what you wrote above is precisely the kind of inference the original definition of DP was designed to rule out.

DP guarantees that you cannot detect a statistically significant difference between the output on a dataset and the output on the same dataset with one element removed (e.g., a trip, in your case). That implies that the difference is also insignificant if you add a point.

In other words DP addresses exactly your concern here.
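For reference, the formal version of that statement (the standard definition, written from memory): a randomized mechanism M is ε-differentially private if, for every pair of datasets D and D' differing in a single record and every set of outputs S,

    \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].

With small ε, the presence or absence of any one trip changes the probability of any observable outcome by at most a factor of e^ε, which is exactly what blocks the "no hits in the dataset, so they lied" inference.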


Amen. The obscuring only works against known techniques.

So long as signal remains in the collective data, individual data streams will inevitably emerge as the state of the art advances.


At one point, I knew someone who wanted to give money to a large medical organization so that they could show their patients the tradeoffs between various interventions (efficacy vs side-effects).

The money was going to be donated to build an app that would belong to the institution.

The institution would not let its own researchers publish the data in the app, even though it was anonymous. They didn't want to take the risk.

It would be great if this led to accepted protocols that made it so that people didn't have to think about it. "Oh yeah, we'll share it using DP" and then people could move ahead using the data.


Shades of the AOL search data leak:

https://en.wikipedia.org/wiki/AOL_search_data_leak

Of course personally identifiable information will be extracted despite this model. "Differential Privacy" is cynical academic malpractice -- selling a reputation so that when individuals are harmed in the course of commercial exploitation of the purportedly anonymized data, the organizations that profited can avoid being held responsible.

We never learn, because there is money to be made if we pretend that anonymization works.


To be clear, I think you're right that no tracking is better than trying to protect data.

However, it's important to understand that 'anonymization' is very different than the practice of "Differential Privacy."

I'm no expert but here is how I understand it as a simplified example:

Imagine your information is stored in a spreadsheet. It is storing your weight, height, zipcode, age and name.

The 'anonymized' spreadsheet would still have a unique row dedicated to you (just like the original), and it may replace your name with an ID number or an encrypted string. Now, just like in the AOL data leak, that information being stored as a single line item is still easy to backtrack, since there is likely no one else with your weight, height, and age combination in your zipcode. So a hacker can identify a single person.

Differential Privacy would store the information differently, perhaps in separate spreadsheets: one that is a list of heights, one that is a list of weights, and so on. No two spreadsheets would store the information in the same order (#3 on the height list would not be #3 on the weight list), and they may even contain some incorrect dummy information.

There would, however, be some algorithmic relation that allows a system to create outputs in which the data carries meaningful information (trends, means, standard deviations, etc.) but cannot be back-tracked to identify any single unique row.

Differential Privacy allows us to see the trend "Males age 45 are taller on average than Females age 45" but not say "User #155083 is age 45, weighs 195lbs, and lives in zipcode 10001"

That's a big difference in privacy, and while it isn't perfect, it is a step in the right direction. While I wish more companies would adopt a no-data policy, it is at least better that they are as responsible as they can be with the data they do have.


Such obscuring is vulnerable to side-channel attacks that re-link the record fragments. For example, misspellings, incidental geographic information, topics -- any unusual pattern that the modeler did not anticipate and deliberately obliterate.

What links the AOL fiasco and this one is that both believe they have thought of everything important. They're wrong -- and there will always be a future attacker to prove it. You can't fight information theory.

Differential Privacy is an excuse to get around sensible no-data policies -- by making irresponsible promises, it will result in more privacy violations, not less.


> You can't fight information theory.

I totally agree with this statement, but I think you are confused about differential privacy. Its guarantees are information theoretic (specifically, a bound on the relative Bayes factors of any conclusion, with and without any individual record).

You are obviously welcome to be skeptical, but much of what you've posted so far is not correct.


Of course I don't dispute the math. I maintain that these guarantees will not be achieved in practice because they rely on impossibly airtight implementation and impossibly omniscient modeling.


Interesting. Do you have similar concerns about cryptography?

Edit: to more strongly bind 'similar': would you also say of cryptography that it is "cynical academic malpractice"?


I fully expect privacy disasters based on imperfect implementations of Differential Privacy. Do you not? Do the researchers not?

The superior alternative is to avoid sharing sensitive datasets and avoid keeping data whenever possible. No such alternative exists for many applications of cryptography.

But we live in an era where organizations find our private data impossibly tempting and are content to sacrifice the rights of individuals so long as they can't fight back. This research gives such entities the excuse to build tools that should not be built and publish data that should not be published. By saying "OK now it's safe (if you did everything right)" rather than "don't do that", it is the enabler of future privacy fiascos.

The answer, if there is one, is probably legislative: hold entities criminally liable for data breach. Should such legislation pass, I wonder how much interest in this research will wane.


If you want to rip in to Apple or Google or Uber for claiming they should have a pass for using privacy tech, feel free. Understand that this is distinct from most research on differential privacy.

The US Census collects demographic data about as much of the population as they can manage, and releases summary data in a large part to support enforcement of the Civil Rights Act. They have a privacy mandate, but also the obligation to provide information in support of the rights of subpopulations (e.g. Equal Protection). So what's your answer here? A large fraction of the population gets disenfranchised if you go with "avoid sharing the datasets".

You end up with similar issues in preventative medicine, epidemiology, public health, where there is a real social benefit to analyzing data, and where withholding data has a cost that hasn't shown up yet in your analysis. Understanding the trade-off is important, and one can come to different conclusions when the subjects are civil rights versus cell phone statistics. But you are wrong to be upset that math allows the trade-off to exist.


"Privacy tech" is a perverse description, since this tech's existence results in a net loss of privacy -- without it, the data-sharing applications it powers would be more obviously irresponsible and more conservative decisions would be forced. A less Orwellian name would be "Anonymization tech".

If it were possible to wish away this tech, I absolutely would -- just like I would wish away advanced weapons technology if I could. In our networked era, the private data of individuals is being captured and abused at an unprecedented, accelerating rate, and whatever good this tech does cannot begin to make up for its role in facilitating and excusing that abuse.


There needs to be a word to describe the practice of arrogantly explaining a researcher's own results back to them on HN. Maybe HNsplaining?


Not the person above, but I do, yes.

Heartbleed, side channel attacks, etc.

Implementation in the real world matters.


You're right that these systems are currently flawed, and they may remain flawed. But the same is true for all systems.

Don't get me wrong: I agree in theory that no company should retain any customer information (I work for one of a few companies that does that), but I also know that many companies will not make that switch (a bank or a medical records system, for example), and in such cases that data should be as secure as possible.

To counter the irresponsible promises you are talking about, it'd be ideal to see compliance and security regulation like we have with HIPAA applied to all customer data. But until we evolve proper trustless identification models and ways for users to self-secure and trustlessly validate their information, some businesses will always collect data.


I'm confused: you sound like you're against the idea of differential privacy, even though the foundation of differential privacy is that anonymization DOESN'T work to protect people's privacy. In fact, the canonical example of failed anonymization used in the differential privacy field is the AOL fiasco.


I share your cynicism.

I did a lot of work in healthcare IT and have also obsessed about voter privacy (protecting the secret ballot).

I've seen de-anonymization in practice. I also know there's a huge chasm between best available science (and practices) vs the real world. (Dumb example: mis-redacting PDFs with opaque boxes instead of removing the text.)

#1 At best, like crypto, differential privacy may offer temporary protection to re-used data (shared, in transit), if the given assumptions hold.

#2 Also, like crypto, I have no confidence that anyone, anywhere, will implement DP correctly, or even be able to prove they've done it correctly.

#3 The original data is stored somewhere. There is no DP story for mitigating leaks.

Given my disappointment, I believe (but cannot prove) there are two strategies worth exploring.

First, contracts that use-case-box and time-box the data, stating how and when shared data may be used, with a drop-dead time by which shared data must be destroyed. Part of this contract could be expanded to include parameters for differential privacy. One org I work with has these policies. Alas, Thatcherite "trust, but verify" is tough. We add fake data (honeypot-esque) and have caught cheaters.

Second, I'm keen to further explore translucent databases, where data in situ (at rest) is encrypted.

Lastly, I'm always looking to see who is working in this space, and what they're doing. I'd like to believe that someone will crack this nut.


That data (the AOL search data) wasn't intentionally released, so while pseudonyms had been used, no serious effort at fully (or more fully) anonymizing or protecting users had occurred. Here the intent is to be able to produce data sets where individual identification will be statistically unlikely if not impossible (by fuzzing the data), or where individuals can refute the data because there's a statistical chance the data is a lie (with the probability biased in favor of truth so that the aggregate data is still useful).




