Plausible Analytics Isn't GDPR Compliant (paranoidpenguin.net)
54 points by ramboram on Oct 23, 2020 | 75 comments



I think the article might be reading too much into it

Is Plausible actually tracking users? I mean actually allowing you to get a user's history (or IPaddr history) on your website across multiple days? (or a subset of this?)

If it does, then yes, it is not compliant without the user agreeing. If it doesn't, then no.


Everything is isolated. There's no way for us or for our customers to get visitor history across days, across websites or across devices. See https://plausible.io/privacy-focused-web-analytics


Thanks for clarifying


Plausible Analytics is GDPR compliant - with one possible exception - the IP address, where dropping the last 3 digits would probably be enough.

The blog post conflates general data points with PII. The IP address is considered PII.

While other info can be used for fingerprinting, it’s ok to use in some capacity as long as you don’t use it for that.

For background, I’ve done GDPR implementations in the past, am a privacy advocate in that sense, and have spent more time with lawyers on this subject than I’d care to admit.

(Pardon brevity/typos, on phone with unreliable connection)


The IP address, on its own, should not be considered PII.

There was a ruling in Breyer vs. Germany that IP addresses can be considered PII – in certain circumstances.

The case was brought against an ISP, and the court ruled that the company had enough correlating data at its disposal to make an IP address de facto PII for any of its customers. The court limited its ruling, saying that with just an IP address alone, the protections associated with the directive wouldn’t apply.


GDPR simply classifies "personal data" as any piece of information that can be used to identify an individual. A static IP used by one person could therefore be considered personal data while a public IP shared between thousands of people behind carrier-grade NAT would not.

The problem is that you can't tell the two apart and decide when it's safe to handle the IP.


Indeed. My dynamically allocated public IPv4 address, given to me by my cable company, has been the same for as long as I've lived here, over four years now.

Ironically, my IPv6 prefix can change several times a day...


IP addresses are never PII. PII means information about a person who can be identified. In that context, IP addresses are an identifier, not the information itself.

If you store IP addresses in your customer database, the information is that a person with that IP is one of your customers. This information is considered PII if it's possible to use the IP to identify the person the information is about, e.g. using a government database of everyone's IP address. If the data never reaches someone with access to such a database, it's not PII.

(This is a somewhat pedantic distinction, but it matters legally. Data protection law doesn't care about which identifiers are being used, but about the data associated with them and whether it tells you something about a specific identifiable person.)


I was under the impression that they did not store IP addresses, though I could be incorrect.

Their docs suggest as much https://docs.plausible.io/excluding/

"Most web analytics tools do this by excluding certain IP addresses from being counted. However, we do not store the visitors’ IP addresses in our database for privacy reasons"


We never store IP addresses in our database or logs. See the full details of our data policy: https://plausible.io/data-policy


GDPR doesn't care about storage. Even if you just acquire personal information without storing it, you still have to be GDPR compliant.

In fact, the solution suggested above (only using a truncated IP address) would still require you to acquire and process the IP address and thus be subject to GDPR.


Thanks for clearing this up. The general data points and metrics we store are not personal data.

IP address is the only piece of data that we touch that is considered PII under some regulations including GDPR.

The IP address is fully anonymized by hashing it together with a daily changing salt. Old salts are deleted so as to prevent re-identification: https://github.com/plausible/analytics/blob/master/lib/plaus...

According to GDPR Recital 26, anonymized data does not fall within the GDPR at all because data is no longer considered “personal data” following anonymization:

> The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.


GDPR states “For data to be truly anonymised, the anonymisation must be irreversible”. So dropping 3 digits is clearly not enough to anonymize PII, it’s more pseudonymization.


How can an IP address without the last 3 digits possibly ever identify someone? That surface area is just way too large.


By using other information to narrow the pool of possible people.


Aren't the biggest corporations doing the same on orders of magnitude larger datasets? They get away very well with merging data from quite a few acquired companies.

If small companies are called upon compliance with such vehemence, the big ones who know so much of us should be brought up, at least 100x times more.


> Aren't the biggest corporations doing the same on orders of magnitude larger datasets? They get away very well with merging data from quite a few acquired companies.

Yes, and it's worth noting how few data points one needs to identify an individual.

>If small companies are called upon compliance with such vehemence, the big ones who know so much of us should be brought up, at least 100x times more.

Absolutely, no argument from me here.


I am curious, how are you going to unanonymise an IP to something that could have 256 combinations (and that's just if you drop that last part on an IPv4)? Never mind that an IP alone is not PII. How can you reverse something that has many possibilities?
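
To put a number on the truncation being discussed, here is a minimal Python sketch, assuming "dropping the last part" means zeroing the final octet of an IPv4 address (keeping a /24 prefix). The function names are made up for illustration and are not taken from any analytics product.

```python
import ipaddress


def truncate_ipv4(ip: str, keep_bits: int = 24) -> str:
    """Zero the host bits of an IPv4 address, keeping only the first keep_bits bits."""
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


def candidate_count(truncated: str, keep_bits: int = 24) -> int:
    """Number of distinct addresses that collapse onto the same truncated value."""
    return ipaddress.ip_network(f"{truncated}/{keep_bits}").num_addresses


print(truncate_ipv4("203.0.113.77"))    # 203.0.113.0
print(candidate_count("203.0.113.0"))   # 256
```

Truncation alone leaves 256 candidates per stored value; whether that is "anonymous enough" depends on what other data sits next to it, which is the point argued in the replies.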


>> IP alone is not PII

It is in Europe, despite some regional rulings (Germany?). It is not considered PII in the USA.


IP addresses are also explicitly considered PII by California’s CCPA.

https://leginfo.legislature.ca.gov/faces/billTextClient.xhtm...

(o) (1) “Personal information” means information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household. Personal information includes, but is not limited to, the following: (A) Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier, Internet Protocol address, email address, account name, social security number, driver’s license number, passport number, or other similar identifiers.


That was true once. Longer answer "it depends":

“[I]f a business collects the IP addresses of visitors to its websites but does not link the IP address to any particular consumer or household, and could not reasonably link the IP address with a particular consumer or household, then the IP address would not be ‘personal information.’”

Source: https://iapp.org/news/a/are-ip-addresses-personal-informatio...


You missed the paragraph:

"However, when the attorney general revised its draft regulations for a second time March 11, the guidance was struck without explanation."


Just to be that guy. There is a slight difference between Personal Identifying Information and Personal Information.


GDPR is EU law. So the regional rulings are extremely important for deciding what you think you can and can't do.

And I think we're missing the main point. How can it be reversed if there are hundreds of possibilities?


True. I was thinking more about how it drops some location level information.

I can't presume what Plausible does (I have not read their docs in a while), but they have commented here to provide more specific clarification that addresses IP usage (TLDR: what they do is fine and compliant).


Actually, with CGNAT (and arguably before then), IP addresses aren't personally identifiable information.

That said, the GDPR is deranged and might define things differently. Blocking the EU is safer.

Of course there are research exceptions that you could drive a truck through, and logging is still valid, so none of this matters.


I've been looking into GDPR and when cookie consent is needed. In fact, there's no such thing as "cookie consent". If you track a user, you have to get his consent before doing it, whether you use cookie consent or not. Ever since I joined HN, there's been a lot of marketing going on here from privacy-first Google Analytics alternatives. I found this review showing Plausible and similar products using browser fingerprints and CNAME cloaking for user tracking, and they still promote those features.

I'd like to know your opinion on this. Do I still need to use a consent banner if I use these services?

Thanks.


> If you track a user, you have to get his consent before doing it

This would mean any server-side analytics (looking at access logs, which include IP address and user-agent) cannot be used for analytics or tracking, since there is no way for a user to give/deny consent to a page that already has logged information on them.


You obtain consent and then you log only if consent was provided. You can essentially use two logs, one for technical purposes (under legitimate interests you should be fine logging as long as those logs are only used for technical/debugging/abuse prevention purposes and the data is not kept for longer than necessary) and one for marketing/analytics purposes. You only log to the second one if consent has been given, and you only ever do your analytics on that second log and not the first one.
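
The two-log idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, the logs are plain in-memory lists, and retention limits for the technical log are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestLogs:
    """Two separate logs: one for technical purposes, one for analytics."""
    technical: List[str] = field(default_factory=list)
    analytics: List[str] = field(default_factory=list)

    def record(self, ip: str, path: str, user_agent: str, consent_given: bool) -> None:
        # Always written; meant only for debugging and abuse prevention.
        self.technical.append(f"{ip} {path}")
        # Written only when the visitor has consented to analytics.
        if consent_given:
            self.analytics.append(f"{ip} {path} {user_agent}")


logs = RequestLogs()
logs.record("203.0.113.7", "/pricing", "Mozilla/5.0", consent_given=False)
logs.record("198.51.100.9", "/pricing", "Mozilla/5.0", consent_given=True)
print(len(logs.technical), len(logs.analytics))  # 2 1
```

Analytics would then only ever read from the second list, never from the first.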


It's also probably a legitimate interest to retain data for marketing and analytics purposes, so long as that retention meets the same sort of guidelines. Marketing is explicitly highlighted as one of the applicable uses for legitimate interest.


Have you any specific document or decision in mind ?


Recital 47 (https://gdpr-info.eu/recitals/no-47/) explicitly states:

"The processing of personal data for direct marketing purposes may be regarded as carried out for a legitimate interest."

It's also mentioned in Article 21 describing the right to object to processing using legitimate/public interest:

"Where personal data are processed for direct marketing purposes, the data subject shall have the right to object at any time… etc."

The ICO has some useful guidance on when it is an appropriate basis: https://ico.org.uk/for-organisations/guide-to-data-protectio...


One could argue that analytics is not a direct marketing purpose. My understanding is that since analytics can be considered a usual/expected business process, it may rely on legitimate interests as long as it fulfills the requirements (information about the processing, the right to opt out, ...). However, the problem is that analytics may be advanced analytics. Is the retrieval of AdWords parameters from a gclid allowed/expected? Is the injection of historical behaviour or marketing segments allowed/expected?


I would like to see more software having the option of just logging the user's country and not the IP, and perhaps just as generic a user agent as possible (just: is this Chrome, Firefox, Edge, whatever, but nothing else).

for example for Nginx something like:

log_format logfmt '$remote_country - [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_generic_user_agent" "$gzip_ratio"';

That would assume access to a GeoIP database, but it would be helpful.
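
The same idea can be sketched outside Nginx. The Python below is an assumption-heavy illustration: the country code is taken as already resolved (by a GeoIP lookup or a proxy header), and the substring matching is a deliberately crude stand-in for real user-agent parsing, which will misclassify some clients.

```python
def generic_user_agent(user_agent: str) -> str:
    """Reduce a full User-Agent string to a coarse browser family."""
    ua = user_agent.lower()
    # Order matters: Edge strings also contain "chrome" and "safari",
    # and Chrome strings also contain "safari".
    if "edg" in ua:
        return "Edge"
    if "firefox" in ua:
        return "Firefox"
    if "chrome" in ua or "chromium" in ua:
        return "Chrome"
    if "safari" in ua:
        return "Safari"
    return "Other"


def anonymized_log_line(country: str, request: str, status: int, user_agent: str) -> str:
    """Build a log line with a country code instead of an IP and a generic browser name."""
    return f'{country} "{request}" {status} "{generic_user_agent(user_agent)}"'


firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0"
print(anonymized_log_line("DK", "GET / HTTP/1.1", 200, firefox_ua))
```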


$remote_country is an interesting idea: you classify visitors into per-country "buckets", although the buckets would not be of equal size. If you have a single regular visitor from a tiny country, $remote_country could uniquely identify them.

A similar idea would be to have built-in $remote_addr_hash8, $remote_addr_hash16 variables which hash IPv4 and IPv6 addresses down to 8-bit or 16-bit numbers.

There are hacky ways you can do some forms of anonymization already:

https://www.supertechcrew.com/anonymizing-logs-nginx-apache/
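
As a rough illustration of the proposed hash variables (which are a suggestion in this thread, not existing Nginx built-ins), here is a Python sketch that folds an IPv4 or IPv6 address into an 8-bit or 16-bit bucket. SHA-256 is an arbitrary choice made for the example.

```python
import hashlib
import ipaddress


def remote_addr_hash(ip: str, bits: int = 8) -> int:
    """Map an IPv4 or IPv6 address onto one of 2**bits buckets."""
    packed = ipaddress.ip_address(ip).packed        # 4 bytes for IPv4, 16 for IPv6
    digest = hashlib.sha256(packed).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


print(remote_addr_hash("203.0.113.7", bits=8))
print(remote_addr_hash("2001:db8::1", bits=16))
```

With 8 bits, roughly 16 million IPv4 addresses share each bucket, so a bucket value on its own narrows very little; as with per-country buckets, a site with few visitors could still end up with one visitor per bucket.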


FWIW, CloudFlare can inject a cf_ipcountry header that does that. User-agent parsing is unfortunately more complex, with lots of false readings (not counting bots & crawlers).


The reality is that GDPR is not strongly enforced at the moment. This is not uncommon for Europe and may be a cultural difference from other places.

Those who have the intent to comply and are at least complying in spirit are not at any legal risk. Attitude matters.

And the spirit is obvious: get consent if you enable a third party to uniquely identify a user in reality, i.e. if it's private data or if you enable correlation across websites.

It's correlating and sharing you need consent for. Don't worry about a server log.

It is not about what you make possible. It's about what you do. Technically any sysadmin can access some information they should not. It's unavoidable.

But that's quite a far way from commercially exploiting databases of people without their consent.

Honestly they should just ban the sale of personal information. Most internet marketing vendors are not actually in the business of selling personal data.

Now the good ones suffer because of the bad ones. And the bad ones either pretend they have consent or find a way to get it.


I think that overall the GDPR law was good for privacy but a disaster for usability.

It was good for privacy, not because it's enforced or not and not because sites are showing cookie consents, but because it made the public more aware of centralization/privacy issues on the internet and companies a bit more careful with data processing. This law also resulted in many "privacy-friendly" alternatives for various services, which in the end led to a healthier market and improved data decentralization.


If you're tracking an amorphous profile, how do you match the right person to the right data? Do you have to match the data to a unique person?


I don't have the answer, but the consent banners are interesting.

I have two browser plugins: "I don't care about cookies" and "Never Consent". I'm not sure what Never Consent does technically, but the other one just hides the DOM element with the cookie thingy.

That means that I never see the "consent" banners, so I can't click the "Okay" buttons. I should test to see how many sites just assume OK to cookies because I didn't click "No".

On a positive note, I do see more and more sites making it just as easy to say no to tracking as to say yes. Though sites are better at remembering a yes to tracking than a no.


Not sure whether you mixed up I Don't Care About Cookies and the other one, but IDCAC does not just hide the DOM elements - it always gives full consent.

From their website [1]: By using it, you explicitly allow websites to do whatever they want with cookies they set on your computer (which they mostly do anyway, whether you allow them or not).

Which is fine for me, I use it with Cookie Autodelete, but if you don't, you should be aware of that.

[1] https://www.i-dont-care-about-cookies.eu/


Thanks, I used one at some point that just hides the element... Now I just use I Don't Care About Cookies and flush cookies when I close the browser.

But yes, something I need to be aware of.


Just FYI, tracking is so much more advanced than just cookies. Using IDCAC means you consent to them using any method of tracking you.


I think a lot of the confusion around the consent banner stuff arises from the 2002 EU ePrivacy Directive (ePD)[0] which long predates GDPR.

ePD introduced the idea of the cookie consent banners we see today.

While it was enacted in 2002, ePD didn't really start to come into broad legal force in many member states until ~2010ish (EU Directives are not like federal laws; instead they're implemented & enforced by individual member states separately).

GDPR's focus on prior consent makes consent banners in their popular format largely useless, but when GDPR came along, the intent was that ePD should have been replaced by the accompanying EU ePrivacy Regulation (ePR)[1] to clarify this. ePR has been delayed, so we're in this ambiguous place.

[0] https://en.wikipedia.org/wiki/Privacy_and_Electronic_Communi...

[1] https://en.wikipedia.org/wiki/EPrivacy_Regulation


Not a lawyer, but you do not need a consent banner with their services.

This is as much about what information is available as about what you do with it. Browsers send information whether you ask for/use it or not.

At a high-level (and not necessarily speaking about Plausible here cos I don't know the inner workings), it is ok for a service to use personal information (looking at the IP address here) if in a form that is not traceable back to a user, and not used for tracking individuals.

In this case the use of CNAME is fine; it's just to stop the blunt blocking of JS etc. that happens as a reaction. It's worth noting that GDPR does permit data collection for essential services and (there is some dispute/debate on this) basic site analytics can be considered essential services.

In regards to Plausible, they are commenting directly here and seem to be addressing all these concerns.

IMHO the blog post author sees a problem at the surface level but is not an expert - but for those of us more familiar with the legal framework behind this, the exceptions, and the distinctions of how information is used (and supporters of GDPR), what Plausible is doing is good and compliant.

(To be clear; I'm not affiliated with them - am just supportive of GDPR friendly alternatives like this one)


Cookies aren't regulated by the GDPR[0] but instead by the ePrivacy Directive.[1]

Article 5(3) of that directive states that

"Member States shall ensure that the use of electronic communications networks to store information or to gain access to information stored in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned is provided with clear and comprehensive information in accordance with Directive 95/46/EC, inter alia about the purposes of the processing, and is offered the right to refuse such processing by the data controller. This shall not prevent any technical storage or access for the sole purpose of carrying out or facilitating the transmission of a communication over an electronic communications network, or as strictly necessary in order to provide an information society service explicitly requested by the subscriber or user."

In other words, unless the cookies are strictly necessary to providing you with the service then you must provide users information about what the cookies are used for, and you must offer an opt-out.

(It's also worth pointing out the generality of this Directive, too: It doesn't only apply to cookies, but also to things like localStorage).

The ePrivacy Directive is, as its name suggests, a Directive which is addressed to member states of the European Union which have all written it in to domestic law.

In the UK, for example, it was implemented as PECR[2].

[0] The ePrivacy Directive does reference the old legislation that the GDPR replaces, so you should consider the reference in the ePD to Directive 95/46/EC as a reference to the GDPR. This means the standard of "consent" is the GDPR's standard now.

[1] https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A...

[2] https://ico.org.uk/for-organisations/guide-to-pecr/what-are-...


Cookie consent is (mainly) a different EU directive and not part of GDPR. It will be newly regulated by the - long delayed - ePrivacy Regulation.

"Cookies are an important tool that can give businesses a great deal of insight into their users’ online activity. Despite their importance, the regulations governing cookies are split between the GDPR and the ePrivacy Directive." https://gdpr.eu/cookies/


The cookie banners come from the ePrivacy Directive and are supposed to inform you that the website is storing data on your device and that you can opt out (not in) of it.

Consent is required by GDPR, not for the technical fact that you store a cookie, but for using it for profiling. Some lawyers argue that basic web performance is a legitimate interest, especially in e-commerce; others don’t risk it and ask for consent (which is strictly opt in).


If you're tracking a user in the EU, you need consent. The GDPR doesn't cover the 'how' -- just that it needs to be done. So, if there's tracking of any kind, you'll need consent.

Applies off site as well -- pretty much every cold email tracking software, like Yesware, is in violation of GDPR, since you didn't get the recipient's consent to track their opens and clicks.


Consent is one of the legal bases for processing personally identifiable information[1]. There are five more, among which "legitimate interest" can cover a variety of cases.

[1] https://ico.org.uk/for-organisations/guide-to-data-protectio...


Yeah, but "legitimate interest" implies that the processing is necessary (because it overrides your consent). In which context, and what kind of analytics, is really necessary? Analysis of the incoming channels? Understanding whether there are technical problems? Comparing engagement across different marketing solutions?

I work in that market and find that interpretation gets quite difficult as soon as you have multiple actors around the table. For example, because recommendations from DPAs are not exactly the same, you may get different requirements within the same company from the legal departments of different countries in the EU.


One interesting thing about consent under the GDPR is that users can later withdraw consent, and if that is your only legal basis, then you have to get rid of all the related data. It's best if you can show that there are multiple legal bases.


Doesn't the GDPR protect against storing "Personally identifiable information"? Plausible does use the visitor's IP address to create a unique visitor ID, but it does not store it, so I am not sure how you can use that information to link it to an individual.


The GDPR regulates the use of "personal data", which is broader in scope than "personally identifiable information":

"‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person"


If the algorithm for turning an IP address into a visitor ID is reversible then that ID is equivalent to the IP address as far as the GDPR is concerned.


I could not easily find it on the website, but I remember reading about how they do it, basically the ID is generated by hashing the IP + user-agent + a salt key that is changing on a daily basis.

So, no, I do not think it is deterministic.


We generate a daily changing identifier using the visitor’s IP address and User Agent. To anonymize these datapoints, we run them through a hash function with a rotating salt.

hash(daily_salt + website_domain + ip_address + user_agent)

This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. Old salts are deleted to avoid the possibility of linking visitor information from one day to the next.

Full details are here: https://plausible.io/data-policy
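
A minimal Python sketch of the formula above, for readers who want to see its shape. SHA-256 and the 16-byte random salt are assumptions made for illustration, not a statement of which hash or salt size Plausible uses, and the function names are hypothetical.

```python
import hashlib
import secrets


def new_daily_salt() -> bytes:
    """Generate a fresh random salt; the previous day's salt is meant to be thrown away."""
    return secrets.token_bytes(16)


def visitor_id(daily_salt: bytes, website_domain: str, ip_address: str, user_agent: str) -> str:
    """hash(daily_salt + website_domain + ip_address + user_agent), using SHA-256 as a stand-in."""
    payload = daily_salt + (website_domain + ip_address + user_agent).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


salt_today = new_daily_salt()
salt_tomorrow = new_daily_salt()
today = visitor_id(salt_today, "example.com", "203.0.113.7", "Mozilla/5.0")
tomorrow = visitor_id(salt_tomorrow, "example.com", "203.0.113.7", "Mozilla/5.0")
print(today != tomorrow)  # same visitor, different salt, different identifier
```

Within one day the same inputs give the same identifier, which is what makes unique-visitor counts possible; once the salt is replaced, the old identifiers can no longer be recomputed from an IP address and user agent.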


It depends on whether they retain or can reproduce the salt for a given date.

The rule in effect is: a person knows the IPs their ISP assigned them and the dates on which they were assigned. They ask: do you have any records of me from these IPs on these dates?

Assuming Plausible keeps the record of salt by date, the answer is yes, we have records of you, because they can retrieve the salt, recreate the ID, and locate the records.

If they do not retain the salt, in contrast, they cannot respond to individual requests for their records and that would also imply they are not able to do day over day returning visitor calculations.


Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. So yes, there's no way for us to know whether the same person returns to a website on another day. See https://plausible.io/data-policy


That is deterministic, but the key thing is that it is not reversible


Technically, you could enumerate all four billion IP addresses (multiplied by all common user agents) to reverse it. This is, however, prohibitively expensive for tracking, so I think it does the job.


Not without the salt, which they delete every day. Pretty much impossible.


Is the salt key stored, or is it discarded?


Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. See https://plausible.io/data-policy


Note that anything deterministic on IPs is reversible. There are only 4 billion IPv4 addresses so brute forcing is trivial.

It is more complicated for IPv6 but enough of the internet is IPv4 that you can't ignore that case.
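
The brute-force argument can be shown with a short Python sketch. It assumes a SHA-256 hash over salt plus IP and, importantly, that the salt is known to whoever runs the search; the search is limited to a /24 so it finishes instantly, but the full IPv4 space is the same loop over about 4 billion candidates.

```python
import hashlib
import ipaddress
from typing import Optional


def salted_hash(salt: bytes, ip: str) -> str:
    """Deterministic hash of an IP address with a salt (SHA-256 chosen for illustration)."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()


def brute_force_ip(target_hash: str, salt: bytes, network: str) -> Optional[str]:
    """Try every address in `network` until one hashes to `target_hash`."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if salted_hash(salt, candidate) == target_hash:
            return candidate
    return None


salt = b"known-salt"
stored = salted_hash(salt, "198.51.100.42")
print(brute_force_ip(stored, salt, "198.51.100.0/24"))  # 198.51.100.42
```

If the salt is unknown and long enough, the search would also have to guess the salt, which is what the replies about deleted daily salts are getting at.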


Nitpick: if it's reversible, determinism doesn't matter.


Yep indeed, deterministic isn't really the right word here. Reversibility is all that matters, although am I correct in saying that it would imply determinism?


> am I correct in saying that it would imply determinism?

I don't know, because neither "reversibility" nor "determinism" are precisely defined (this is not criticism of your comment in any way).

Here's one semi-reasonable interpretation of the two words for which reversibility would not imply determinism: Imagine a "process" (I, too, am being imprecise and calling this a "process" instead of a function) that takes as input an integer between 1 and 6 inclusive. Its output for the input n is a roll of a die that is biased in favor of n, but is otherwise fair. Now, this is not a deterministic process, but if you are allowed to feed it the same input multiple times, you can probabilistically reverse it.
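
The biased-die thought experiment can be simulated in Python. This sketch assumes one particular reading of "biased in favor of n": the die shows n half of the time and each of the other five faces equally often otherwise. The 0.5 bias and the function names are choices made for the example.

```python
import random
from collections import Counter


def biased_roll(n: int, rng: random.Random, bias: float = 0.5) -> int:
    """Non-deterministic process: returns n with probability `bias`, else another face uniformly."""
    if rng.random() < bias:
        return n
    return rng.choice([face for face in range(1, 7) if face != n])


def guess_input(hidden_n: int, rolls: int, rng: random.Random) -> int:
    """Feed the same hidden input many times and guess it from the most frequent output."""
    counts = Counter(biased_roll(hidden_n, rng) for _ in range(rolls))
    return counts.most_common(1)[0][0]


rng = random.Random(0)
print(guess_input(4, 2000, rng))
```

A single roll says little about the input, but with enough repeated rolls the most frequent face is almost always the hidden input, so the process is reversible in a statistical sense without being deterministic.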

Anyway, sorry for the tangent – your original point was the important one.


The point to note here is "if". Happily, they (Plausible) don't.


It's not reversible, it's hashed with a daily salt.


I have the feeling that GDPR and cookie consent laws themselves, ironically, make it harder for services to provide privacy.


How so?

Storing a "user has opted out from tracking cookies" binary flag in a cookie is not the same as storing a unique identifier in a cookie.
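
To make the distinction concrete, here is a small Python sketch using the standard library's http.cookies module. The cookie name tracking_opt_out and the one-year lifetime are assumptions for illustration only.

```python
from http.cookies import SimpleCookie


def opt_out_cookie_header() -> str:
    """Build a Set-Cookie header carrying only a binary flag, identical for every visitor."""
    cookie = SimpleCookie()
    cookie["tracking_opt_out"] = "1"
    cookie["tracking_opt_out"]["path"] = "/"
    cookie["tracking_opt_out"]["max-age"] = 60 * 60 * 24 * 365
    return cookie.output(header="Set-Cookie:")


def has_opted_out(cookie_header: str) -> bool:
    """Check an incoming Cookie header for the opt-out flag."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("tracking_opt_out")
    return morsel is not None and morsel.value == "1"


print(opt_out_cookie_header())
print(has_opted_out("theme=dark; tracking_opt_out=1"))  # True
```

Because every opted-out visitor carries the same value, the flag cannot tell two visitors apart, unlike a cookie holding a per-visitor identifier.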


Most websites are not GDPR compliant, if you don't like it then lodge a complaint with the relevant regulator.


a.) The term "GDPR Compliant" does not exist. Any software can be "GDPR Compliant" and still do fingerprinting if there is consent or necessity (hard to do). What they mean is that you do not need to get consent from your users to use Plausible.

b.) They don't store IP addresses. The information they gather is not stored in a way that allows building user profiles or fingerprinting.

It doesn't look like the article's author took a look at the Plausible documentation or source code.


I was implementation lead for several GDPR implementations in Germany. Only on HN would a comment with facts that clarify a subject where a lot of misinformation exists get downvoted.

If you've downvoted that comment you have done the community a disservice.



