Unsanctioned Web Tracking (w3.org)
86 points by cpeterso on July 17, 2015 | 36 comments



I work for an analytics company that doesn't share data with third parties and doesn't use any questionable persistence techniques (we set a domain cookie, that's it), and I can say that unsanctioned tracking hurts everyone.

We get lumped in with super questionable networks in tools like Adblock and Ghostery, and when I tell people that I work in analytics I have to explain that we aren't evil, because the default assumption is more and more becoming that analytics and ads are all doing shady things. That sucks. I wish there were a way to combat unsanctioned tracking without diminishing the value that legitimate, user-controlled tracking and attribution tools provide.

As long as people are willing to do anything to squeeze out more money or beat their competition, questionable techniques will find their way into the market, and that's bad for us all :(


It's extremely difficult to determine good vs. bad analytics companies. Every analytics company is tracking data for multiple websites and therefore can track people across the web. How is the user supposed to know what you are doing with their data? Even if you aren't doing anything now, how can they be sure that won't change, especially if the company is sold?


This is sort of true, but it depends. We set a first-party, domain-specific cookie. We can't track a user across different domains, or customers. Technically we could correlate based on IP and activity times, but it's not the same as setting a super cookie that is shared between sites.

You are still right, though: how is the user supposed to know if one tool is reputable and another isn't? Worse than that, one may be fine today, but then it gets acquired and someone starts putting the pieces together and uses your historical data in good_tool to link you in bad_tool.

I don't have a good answer. As the other commenter mentioned, regulation may help.

Beyond that, some kind of standardized policy that can be checked and tested would be nice.


    We can't track a user across different domains, or customers.
To be super clear: yes, you can. You don't. That's very different. With full JS access on a site you have the ability to collect a lot of information. As another poster mentioned, it only takes ~30 bits of entropy to identify all 3 billion internet users.


You're right. My bad. We don't. Not we can't.

Technically, though, we can't, since we'd have to dedicate engineering time to making the changes necessary to do that kind of tracking, and we're not going to :P


There are technical approaches to the problem, it's just that they're not ready yet.

However (and without revealing too much about specific IP), there are various mathematical techniques that can be used to reason about the privacy-disclosing properties of queries against a set of data, and ways in which you can force the data to be both fuzzy and self-destructing.

The reality is that customers (rightly) won't see a difference between analytics companies until one of them actually puts its money where its mouth is on privacy and stops treating personal data as an asset that can be sold or combined later (both by limiting the number of queries to prevent reconstruction and by applying fuzzing to prevent over-targeting).

For those looking for some technical details: I work at a marketing company (we're an early-stage start-up, so I don't want to name names, as we're not really ready for attention), and we're currently developing a technology which fuses the research into how many questions you can ask of a database (and which ones) before its privacy is depleted with non-fully-homomorphic encryption that destabilizes after a period of time. This approach lets us build containers which homomorphically process data and decay after a number of questions smaller than the number that would deplete the privacy of the data in the container. (For various reasons, operating in homomorphic space gives you increasingly fuzzy answers the more questions you ask, so you cross a probability bound where your answer becomes useless around the time the privacy is depleted.)
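
To make that concrete without revealing anything proprietary, here is a toy sketch (TypeScript, purely illustrative, not our actual system) of an aggregate store whose answers get fuzzier as its question budget is spent and which refuses to answer once the budget is gone:

    // Illustrative only: a store with a fixed question budget whose answers
    // get noisier as the budget depletes and which stops answering after it.
    class DecayingAggregateStore {
      private budget: number;                         // questions remaining
      constructor(private values: number[], totalQuestions: number) {
        this.budget = totalQuestions;
      }
      // Laplace-style noise via inverse-CDF sampling.
      private noise(scale: number): number {
        const u = Math.random() - 0.5;
        return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
      }
      count(predicate: (v: number) => boolean): number | null {
        if (this.budget <= 0) return null;            // privacy "depleted"
        const exact = this.values.filter(predicate).length;
        const scale = this.values.length / (100 * this.budget); // arbitrary schedule
        this.budget -= 1;
        return exact + this.noise(scale);
      }
    }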

While this isn't perfect, since you still need to trust that we're a) not storing the base keys and b) did our math/implementation correctly, it's a substantial improvement over the trust asked for by marketing companies now. (Especially since contracts could require us to do those things fairly easily, thus exposing us to breach of contract if we failed to. Once those actions are properly completed, there's no taking them back, so our technology mainly protects against the case of "We're the good guys now, but tomorrow is a different day".)


Interesting. I'd love to chat more about this or hear when you're further along and can share more!


I work in advertising. The biggest problem is that bad actors have ruined any and all trust for the last decade. What we really need is regulation and a good auditing process (technical, not administrative) to certify that companies are actually doing the right thing.

That still doesn't solve all the political dealings that happen but it would be a good strong start to cleaning up a lot.


What would you suggest as a good starting point for regulation?


We already have the Do Not Track header which most big and legally compliant data management platforms adhere to. The Content Security Policy header can be improved to add extensions for data handling.
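
Honoring DNT really is simple; a sketch (Express-style middleware, names and setup assumed) of the check a compliant platform performs before doing any tracking work:

    // Sketch: skip all tracking work when the user sends DNT: 1.
    import express from "express";

    const app = express();
    app.use((req, res, next) => {
      const trackingAllowed = req.headers["dnt"] !== "1";
      res.locals.trackingAllowed = trackingAllowed;   // downstream handlers check this
      next();
    });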

Follow this up with certification by a main industry body and legal foundation for how companies in the sector should operate with data and we can finally make progress. This is no different than how finance companies or health companies go through audits and such with both process and data handling.

While digital advertising doesn't need to be as onerous and involved as that, the same workflow would help along with clear rules for collecting, analyzing, acting on, purging and securing data as part of consumer wishes and best practices. The next step would be to make sure transacting with a non-certified company carries risks and perhaps fines so that media buyers actually work with good vendors based on merit instead of inside deals.

There should also be technical certifications so that companies can actually prove they know what they're doing and don't just cram megabytes of crap on a page to serve a single ad. It's just too easy to whip up some crappy ad server software these days, which ends up bloating sites and ruining trust and user experience.

Unfortunately there are issues with regional laws applying to a global business, and with the large number of shady networks headquartered internationally, but it's a good first step.


Make it unlawful for javascript source to contain nothing but whitespace.


How is that helpful?


Browsers should (as a default setting) disallow third-party JS, problem solved!
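
For reference, sites can already opt in to roughly this per origin with CSP; a minimal sketch (Express-style, setup assumed) that limits script execution to first-party sources:

    // Sketch: only scripts served from the site's own origin may run.
    import express from "express";

    const app = express();
    app.use((_req, res, next) => {
      res.setHeader("Content-Security-Policy", "script-src 'self'");
      next();
    });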

Analytics and ad-network companies could create on-premise products that can be installed on the website's own server, problem solved.

Win-Win.


Tracking started with single-site tracking and worked well with a simple cookie. There was no overhead and no bloat like heatmaps and five-level nested scripts that display an invisible pixel. It was not evil, except that users did not know they were being tracked.

Then the evil started with multi-site tracking. Trickery was required to implement cross-site tracking, and the advertisers became so obsessed with 'knowing the user' that they overdid it. I would like to see proof that today's excessive tracking really pays off with higher click rates.

A lot of trickery uses JavaScript, and only a few users disable JavaScript, since almost all sites use it and sites get (seriously) crippled if it is disabled. I am therefore waiting for a browser vendor who recognises this problem and comes up with a 'JavaScript light' where only a small subset of JavaScript -- just enough to build great responsive websites -- is supported. JavaScript light would not allow generating heatmaps, invisible pixels, or uploads of system information. Surely you will say 'web programmers need system information', but this can be provided in a different way, e.g. browsers could send a header with only the information that they want to give.

Another evil is the social networks, since they have their 'like' button on almost every website that you visit and hence know exactly what a user does 24 hours per day. Just like the 'do not track me' feature, the web needs a new 'no like buttons for me' feature.


What features would this light version of JavaScript support, and what features would it not support? Obviously `new Image()` is out, as is any kind of cookie support, right? So that means no ability to log in? I don't think any browser vendor would bother spending developer time on a project that would severely break pretty much every site out there.

Not to mention, you must realize that any and all methods that were available would eventually be exploited for tracking users in the absence of the usual methods.

This is a situation that is only going to be "fixed" with legislation and regulation.


I am not against legislation but also not optimistic about politicians being able to make a proper law.

You are correct, 'new Image' in JavaScript would be out. And the question is whether that really hurts. Do you want to dynamically create an image in JavaScript and let the decision-making happen inside the browser of an end user? Or can the 'new Image' functionality of JavaScript be considered bloat, because the decision to put an image somewhere on the page can be made perfectly well on the web server?
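
For anyone wondering what is being argued against, the pattern looks roughly like this (illustrative sketch; the URL and parameters are made up):

    // Sketch of a typical tracking-pixel beacon built with new Image().
    const visitorId = "abc123";                        // hypothetical ID
    const beacon = new Image();
    beacon.src = "https://tracker.example/pixel.gif" +
      "?uid=" + encodeURIComponent(visitorId) +
      "&page=" + encodeURIComponent(location.pathname);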

And we can go on: should it be possible to do a POST from JavaScript code, or should a browser only allow/do a POST when a user presses a button?


> Encourages browser vendors to expose appropriate controls to users who wish to minimize their fingerprinting surface area.

This is something everyone should be encouraging - unfortunately, browser vendors seem to be slowly removing and/or making more opaque any configurability, in the name of "simplicity". I agree it certainly is simpler to not think about web tracking or privacy at all, but perhaps these are things worth thinking about.


It's really difficult, though. There are dozens of different ways someone can be fingerprinted [1], and you may have to make serious or even crippling changes to browser features to mitigate some techniques.

It's not fair to say "privacy is dead", but I think it's safe to say "browser fingerprint evasion is dead". Solutions to easily and selectively block unsavory companies and networks (ad or otherwise) while letting users allow some things, like uBlock and uMatrix, are probably the only feasible solutions.

[1] https://www.chromium.org/Home/chromium-security/client-ident... (This still isn't a totally comprehensive list)
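
Just to give a flavor of how cheap it is to collect signals (a tiny, illustrative subset of what the linked page covers):

    // A handful of the signals that add up to a fingerprint.
    const fingerprint = {
      userAgent: navigator.userAgent,
      language: navigator.language,
      screen: `${screen.width}x${screen.height}x${screen.colorDepth}`,
      timezoneOffset: new Date().getTimezoneOffset(),
      plugins: Array.from(navigator.plugins).map(p => p.name),
    };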


[deleted]


Do you have an example of this working? I thought that browser vendors were having the :visited selector lie to you when you call getComputedStyle on them? Also, how would you work around the need for JS? I understand that you can do something similar with tracking pixels, but I'm under the impression that Ghostery blocks them.


> I thought that browser vendors were having the :visited selector lie to you when you call getComputedStyle on them?

It's possible to probe the user's history even though getComputedStyle doesn't give it away anymore.

See page 6 of this article: http://www.contextis.com/documents/2/Browser_Timing_Attacks....

Obviously turning off Javascript prevents these types of things to some extent, but even then there are ways: https://www.nds.rub.de/media/nds/veroeffentlichungen/2014/07...

The web is just not designed with preventing information leaks in mind.


Interesting papers, thanks for calling my attention to them. They made me paranoid enough to disable the styling of visited links in Firefox to prevent the large number of timing attacks that are possible.


Returning bogus information to trackers is one way to fight back. If the tracking data quality can be destroyed, it will become useless to advertisers and tracking companies will go broke.

Meanwhile, use Ghostery and block everything. A few sites won't work, but there are better alternatives for most of them. With all tracking blocked, you can't watch ABC-TV, but in exchange, commercials are skipped on CBS-TV shows.


Can someone tell me how these super cookies work? Cookies can only be read on the domain/origin they were set for, so how is a cookie passing data off to other domains?

Or is it just as simple as two companies working together, fingerprinting and matching users, and that being what's called a super cookie?


Supercookies are generally considered to be cookies that recreate themselves after deletion. A variety of techniques are used for this, ranging from cooperating domains to Flash cookies to local storage.
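
A rough sketch of the recreate-after-deletion trick (real implementations use many more stores, e.g. Flash LSOs and ETag caches):

    // Write the same ID everywhere; whichever copy survives a clear restores the rest.
    function resurrectId(): string {
      const id =
        document.cookie.match(/(?:^|; )uid=([^;]+)/)?.[1] ||
        localStorage.getItem("uid") ||
        window.name ||                                 // often survives cookie clears
        Math.random().toString(36).slice(2);           // nothing survived: mint a new ID
      document.cookie = "uid=" + id + "; max-age=31536000; path=/";
      localStorage.setItem("uid", id);
      window.name = id;
      return id;
    }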

Information is passed between sites using JavaScript and backend cooperation on ID matching.


I agree that supercookies and header enrichment should be prevented whenever and however possible (e.g. header enrichment will be solved by requiring encryption à la Let's Encrypt), but fingerprinting is a lost battle that we should all give up on.

We will never be able to solve fingerprinting without upending the entire web platform as we know it. So many web APIs are simply not possible without exposing some UA capability and configuration variance. For example, it is impossible to support WebGL without exposing additional UA variance for fingerprinting.

Computers will always have varying capabilities and configurations, and developers will always need to consider some of them. A world without fingerprinting is a world without the modern web.

It only takes ~30 bits of entropy to uniquely fingerprint all ~3 billion internet users. We already expose this much variance entropy, and it is only going to increase as the web gets new features.
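
(The arithmetic behind that figure -- distinguishing N users takes about log2(N) bits:)

    Math.log2(3e9);   // ≈ 31.5 bits for ~3 billion users, i.e. roughly 30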

I implore you all to simply give up on fighting fingerprinting. Try to stop worrying about it, as there's almost nothing we can do short of the nuclear option of removing every API that exposes UA variance (which will make the web less useful).

We have already lost, and every new feature makes the hole a little deeper. The hole is already too deep to escape, so accept that you will be tracked by colluding websites whenever you browse the web.

---

The W3C page comes to roughly the same conclusion, but recommends a very drastic and dangerous solution: legislation. I fear that legislating this issue will legitimize only select pre-approved uses of UA variance entropy, and will hinder developer innovation in the long run. I'd rather be fingerprinted than be held back by legislation as to what browser data I'm allowed to read, and in what manner I can act on that data.

Please do not lobby for legislation in order to fix this problem. W3C's proposed solution will most likely only cause more harm than good. I would be deeply upset if "intent to fingerprint" became an actual crime.

One example of something useful that such legislation may make illegal is my navigator.hardwareConcurrency polyfill[1] that runs a timing attack on your CPU (not unlike "The Spy in the Sandbox" linked to in the W3C page) to figure out how many cores you have. This information is actually useful for optimizing heavy multi-threaded webapps, but it is also directly useful for fingerprinting. Future legislation could make it so that using my polyfill, even for benign purposes, counts as "intent to fingerprint".
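
The benign use looks roughly like this (sketch; the worker script name is made up):

    // Size a worker pool to the machine's core count.
    const cores = navigator.hardwareConcurrency || 4;   // 4 is an arbitrary fallback
    const pool = Array.from({ length: cores }, () => new Worker("crunch.js"));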

People do not deserve jail time or fines based on whether a tech-illiterate jury judges them to harbor "intent to fingerprint". The future will be a very scary place for developers if you actually have to worry about this.

[1]: http://wg.oftn.org/projects/core-estimator/demo/


An alternative to the DNT header might be a "Do Track" header: the browser generates a unique user ID instead of saving server-generated cookies. Users could control which sites receive their ID, manage multiple IDs, or reset their ID (like clearing cookies). Sites that still insist on active fingerprinting could be penalized by (opinionated) browsers with scary icons in the address bar (like mixed-content warnings).
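
One (purely illustrative) way such an ID could be derived: hash a browser-local secret with the site's origin, so each site sees a stable ID that can't be linked across sites, and "reset" just means rotating the secret:

    // Per-site, user-resettable ID: SHA-256(secret + origin), truncated.
    async function doTrackId(secret: string, origin: string): Promise<string> {
      const data = new TextEncoder().encode(secret + "|" + origin);
      const digest = await crypto.subtle.digest("SHA-256", data);
      return Array.from(new Uint8Array(digest).slice(0, 8))
        .map(b => b.toString(16).padStart(2, "0"))
        .join("");
    }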

I think there is still value in minimizing passive fingerprinting because it allows servers that don't serve active content (e.g. third-party image servers) to track users.


I think it is entirely reasonable to treat fingerprinting as ipso facto user hostile; that it is critical to the continued success of certain business models is reason enough to force abandonment of those models.

If we decide that preventing fingerprinting is a good, then yeah, certain technologies will be out of bounds. So? This is exactly the way a society works.


> header enrichment will be solved by requiring encryption à la Let's Encrypt

I doubt this is the case. How many bits can be manipulated in client TLS headers by a MITM without causing breakage? I find it hard to believe TLS leaves no room for malleability early on in the handshake.


This defeatist attitude is dangerous, especially when the solution is simple: just stop browsers from leaking >= ~30 bits of entropy.

Deprecate HTTP headers that leak entropy (like the user agent). Rewrite fields like If-Modified-Since so they can only express values quantized to no finer than whole days. Remove JS APIs that leak information (like the ability to read CSS attributes). Impose stricter same-origin policies to eliminate third-party cookies and JavaScript. Some people will complain that this breaks some use cases. Just as Dan Geer put it when discussing software liability, "Yes, please! That was exactly the idea."
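
Concretely, the header side of this is not hard; a sketch of the kind of rewrite a browser or privacy proxy could apply to outgoing requests:

    // Drop the biggest single entropy source and coarsen conditional-request dates.
    function scrubHeaders(headers: Map<string, string>): void {
      headers.delete("user-agent");
      const ims = headers.get("if-modified-since");
      if (ims) {
        const d = new Date(ims);
        d.setUTCHours(0, 0, 0, 0);                     // day granularity only
        headers.set("if-modified-since", d.toUTCString());
      }
    }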

Unless a platform puts user safety first - without exception - it inevitably creates moral hazard. If for some reason this does not entirely fix the problem, then we apply the force of law - just like we do in every other area of society. If this concerns you, you should encourage self-policing and the removal of business models based on any kind of fingerprinting, so that no legal remedy is necessary.

People may indeed deserve jail time (or other legal remedy) for stalking. Technical literacy does not exempt you from social responsibility. As for your concerns about a jury: the problems with our legal system are far broader than your concerns over "technical literacy". A lot of work is needed in that area, with great urgency. That aside, a jury is also not expected to be expert in advanced kinematics when it hears a case involving cars that crashed into each other at an intersection. It is the responsibility of the lawyers involved to explain such technical details to the jury. My grandfather - a physicist who reconstructed accidents and a frequent expert witness - has given quite a few remedial lessons in physics from the witness box.

I understand the concern about having to worry about this kind of legal threat. It is scary, but you would learn to live with it, just like surgeons learn to live with the possibility of malpractice charges, or civil engineers who could be liable if a building they design falls down. Really, the concerns of a developer shouldn't be that bad compared to those of the doctor or civil engineer, who have to worry about people dying if they make certain kinds of mistakes.

What I find a far scarier future would be the future where people are not only afraid to speak their mind out of fear of being recorded, but where they are afraid to even seek out knowledge because of the trail it leaves. Our judicial system certainly has problems, but I'll take it over de facto feudalism, where the only people that can freely speak their mind are the lords that control the aggregate databases of everything their peasants do.

By the way - while it certainly isn't perfect, the EFF's Panopticlick tool reports my browser as leaking only 14.03 bits of entropy. The user agent accounts for ~9 of those bits, and ~4 more bits come from the HTTP Accept headers. Both of those are trivially removable, and the remaining entropy would not be enough to fingerprint easily. I'm sure this analysis misses some entropy sources, but it should be sufficient to show that it is possible to fix this problem.
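
(To put 14.03 bits in perspective, it is the size of the crowd sharing my fingerprint:)

    2 ** 14.03;   // ≈ 16,700 browsers look identical to mine on these signals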


Panopticlick doesn't use everything available. WebGL alone adds an additional 5.11 bits of entropy[1]. Other things such as your local network address from WebRTC, core count, etc. all can add a lot more entropy for fingerprinting.

[1]: http://arxiv.org/pdf/1503.01408.pdf


The easiest way to spot web bugs is to use a very old build of Safari. There are many other ways, but if you have an old Safari, select Window -> Activity and leave it open as you browse a few different domains.

You will see some 43-byte documents with huge long URLs full of query parameters, as well as one-byte JavaScript sources. I block the ones I find in my hosts file:

   127.0.0.1 www.hosted-pixel.com
On some operating systems it may be better to use 0.0.0.0, but I am not completely clear on that. Alternatively, block them in your firewall.

Among the reasons I don't install any mobile apps other than those I absolutely require - not even free ones - is mobile analytics. The developer SDKs are all free as in beer, but I have seen a photo of one of their data centers. Data centers are expensive; someone must be paying for all that.

You can edit the hosts file on iOS with iFile from the Cydia app store if you jailbreak; alternatively, you can maintain the hosts file on your box and then install it on your device with scp.

Similarly for Android, but I don't know what text editors you can use to edit system files. Some Android devices let you install your own firmware build, but I don't have a current list.

I am concerned about the impact analytics may be having on democracy. It's not really a secret ballot if the candidates all know what pages I read.


"Among the reasons I dont install any other mobile apps other than those I absolutely require - not even free ones"

This. People are up in arms over the web. If they only knew what each app in their pocket was sending.



Just last night it occurred to me to write browser add-ons that would scramble those query parameters and also send bogus user-agent headers, but only for the web bugs.

That is, the add-on would discover one-pixel transparent gifs, take note of what query parameters they used, then every time it found that same gif in the future it would issue a GET with randomly selected parameters drawn from the instances of that same gif it had seen in the past.

For extra crispy electronic warfare, that same add-on could issue GET requests at randomly selected intervals. If those intervals were reasonably far apart (say one hour apart for any one gif) then it would not obviously be an attack on the analytics server.

What it would do is to make it completely useless to correlate your visits to different domains.
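
A rough sketch of what the add-on's core could look like as a Chrome/WebExtensions background script (the pixel host reuses the example domain from upthread; everything here is illustrative, and the manifest would need webRequest/webRequestBlocking permissions):

    // Rewrite matching beacon requests with randomized query values.
    chrome.webRequest.onBeforeRequest.addListener(
      (details) => {
        const url = new URL(details.url);
        for (const key of Array.from(url.searchParams.keys())) {
          url.searchParams.set(key, Math.random().toString(36).slice(2));
        }
        return { redirectUrl: url.toString() };
      },
      { urls: ["*://www.hosted-pixel.com/*"] },
      ["blocking"]
    );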


Why bother randomizing the query parameters, when you could just permanently cache the image?


If I permanently cache the image or block the server with my hosts file then the web analytics services will not track me.

If I randomize the query parameters then they will not track anyone, because they won't provide useful information to those who presently purchase it. Of course that will require far more people than me to also randomize their query parameters.




