Eric Lawrence (PM on Edge, previously on Chrome, and originally the author of Fiddler) showing screenshots of documentation that explains all of this, and saying it's been documented this way for 14 years:
I am all for bashing bad privacy options, but from the linked tweet at least, this IS off by default. I can certainly see enterprise scenarios where this feature is very useful for security reasons. Did I miss some context?
Edit: oh, this is in addition to the Mirror stuff. Yeah, this should never be a thing.
I think you're making a big assumption that ordinary users care if Microsoft knows what URLs they visit--how many of them use blockers or another technology to opt out of e.g. Facebook tracking?
Ordinary users don't know what a URL even is. But they do assume every site they visit isn't being logged by Microsoft, much like they assume their car's manufacturer isn't recording the places they drive to using GPS.
"Users don't care" is such an intellectually dishonest argument when users aren't technical enough to understand the subject.
Maybe we should differentiate "users understand and don't care" against "users don't understand and don't care".
On one hand, a TOS usually describes privacy implications. On the other hand, there's no accountability that a user has understood or even read the TOS. Is it unreasonable to assume people will read the fine print?
> Is it unreasonable to assume people will read the fine print?
Unequivocally, yes. Even if someone hypothetically had enough time to read every TOS they've agreed to I would venture to guess the average user wouldn't be able to comprehend most of the terms[1] anyway. Disclosure documents are written by lawyers (corporate CYA) for lawyers (regulators, plaintiff's counsel, judges).
The law, however, disagrees with me entirely.
[1] Edit to clarify I mean "terms" as in the terms of the agreement
Delegating privacy rights feels like delegating monetary rights, as in negotiating contracts for payments. This is coming from the perspective of a simple person trying to understand this complexity. Any time I sign complex and impactful contracts, it's usually done in person, on physical paper, with a general understanding between both parties. An online agreement feels like a scam artist's work, and only after the fact have we realized the value of our personal data. Maybe we're giving too much away at the notice of typed words? We're essentially giving away a resource because we didn't realize its importance.
Most users probably don't assume anything, it doesn't even enter the lists of concerns or cares they have or ever would. It's not that they don't understand the subject, they simply do not give a crap one way or the other.
Random family in the US or elsewhere simply does not give a crap about things like this, no matter how much you want them to.
It's not ignorance, and the truly intellectually dishonest argument is really the one being made here: "if people knew more about the subject, or could know more, they'd care!"
If they knew as much as you did, they still wouldn't care.
I disagree in many ways. If someone knew exactly what type of information can be inferred from the information collected, and all the many ways it can be used against them, they would care.
Let's say they knew Microsoft can share this information with third parties, which might affect their creditworthiness, employability, and other very real and immediate aspects of their lives. Let's say the quality and availability of goods and services might change if MS or a third party used this information against them. What if MS can use ML to infer very personal and private aspects of their lives?
By your logic 4th amendment search and seizure restrictions don't matter either because most people assume the government can search them anyways.
There is a reason you have the right to legal representation when accused before the law: you don't understand the law well enough to answer questions and argue on your own. Tech isn't that different; people don't understand it well enough without relying on the informed opinions of professionals in the field.
You are completely wrong, in my experience. I just recently had a chat with my 80-year-old mum, and SHE was shocked and unhappy to find that Google tracks her search history, let alone everything else.
I found myself in the unexpected position of trying to explain why (IMO) it was OK and not a big deal that they did that. I don't know whether I convinced her.
People assume they have privacy when they are alone. It's that simple.
IMO it's not OK for any company to change that basic private-when-alone assumption without a very simple, very clear explanation.
Sorry, but repeated studies on the issue basically find people simply don't care:
"Only 33.3% were concerned about their personal contact information like email address and phone number being shared with third parties, while 10.2% were worried about their friends and contacts being shared."
Can you find me one that says most did?
Because I can't.
Truthfully, if people cared half as much as hacker news thinks they do/should, it would already be a solved issue.
I dunno, maybe that's asking the wrong question? I'm not sure why you are focused on contact information.
I can easily imagine people not caring about third parties knowing their phone number and email; that's already a lost battle. Everyone is used to giving those out.
Their search history, though? Their private chats with their friends? The websites they visit when they are alone? Their medical history?
The things they have bought themselves in the last 12 months? Their favorite magazines?
Take some time and talk to people. I agree with the studies you found that said people don't care about their contact information, but ask people about other information... search history is expected to be private, IME, by people who are not tech savvy. The same goes for their other private actions.
"93% of adults say that being in control of who can get information about them is important; 74% feel this is “very important,” while 19% say it is “somewhat important.”
90% say that controlling what information is collected about them is important—65% think it is “very important” and 25% say it is “somewhat important.”"
Let’s take a step back here, and ask a few questions.
1) Is it possible to care about something that you don’t understand?
2) Is it possible to care about something whose existence you aren’t aware of?
I personally don’t think anyone can have a meaningful discussion about this issue without answering and agreeing on the answers to at least those two questions I ask above.
Agreed, the average user simply does not care about privacy to that degree. FaceApp is yet another example. On the bright side, most of the tools we need to "cover up" are available to us. You can't ask for more than freedom of choice, I suppose.
Telemetry is also documented to collect your browsing history at the Full level (which is the default, I think), but everyone here focuses on SmartScreen instead:
I'm guessing a single hash could be problematic for detections based on the domain for example. But this could be circumvented by sending hashed parts of the URL.
For example they could hash the domain, path and query separately.
Sending a hash of the URL or individual parts of the URL is problematic because the server can easily execute a dictionary attack to identify those parts in most cases.
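The dictionary attack is easy to sketch (hypothetical scheme and example domains, not Microsoft's actual protocol): a server receiving per-part hashes can precompute the hashes of known domains and simply look up what it receives.

```python
import hashlib

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# What the browser would send under the proposed per-part scheme:
sent = {"domain": h("example.com"), "path": h("/login")}

# What the server can do: precompute hashes of known domains.
# A real attacker would use a crawl of millions of domains.
rainbow = {h(d): d for d in ["example.com", "example.org", "example.net"]}

recovered = rainbow.get(sent["domain"])  # "example.com"
```

Paths and query strings are harder to enumerate exhaustively, but popular ones (`/login`, `/index.html`, ...) fall to the same table lookup.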
The "Google Safe Browsing Update API" (used by Firefox, Chrome, Safari and others) solved this a long time ago. In that protocol, the browser hashes the URL, sends a short prefix of the hash to the server, and receives a list of hashes for the URLs that should be blocked. A huge number of valid URLs all hash to each prefix and the server does not know which one the user has visited. Also, the client caches the list of hash prefixes for which the block list is non-empty, to avoid unnecessary fetches of empty lists, which further improves privacy and reduces response time.
Also, the client doesn't send any kind of user ID token to the server.
Is there even a "pure" browser anymore? By that, I mean one which by default will do this: if I enter a URL in the address bar, it will fetch that page and its associated resources. If I click a link on a page, it will fetch its destination (this is obviously a simplified view, ignoring things like JS on the page etc.) No other network activity, no "recommendations", no other attempts to be "smart" or "helping" by doing anything beyond "pure" browsing.
I see a lot of this extra functionality being justified in the name of "security", but I don't think the gradual erosion of personal responsibility and agency that results is something which should continue.
I believe that Gnome's Epiphany (now confusingly called "Web") and KDE's Falkon are still "pure" browsers. Epiphany can integrate with a Firefox account, I think? But it's not on by default.
Epiphany only works on Linux but Falkon is available for Windows as well. Neither works on Macs but Macs have Safari which probably counts as "pure" if you aren't signed into iCloud. iCloud is pretty inoffensive even if you are.
Dillo has all the modern-web functionality of lynx, all the keyboard-driven capacity of Safari, and all the UI/UX appeal of an abandoned brutalist shopping mall.
1. Firefox downloads a list of partial hashes every 30 minutes from Google. It's only the first 32 bits of each hash. This can easily be turned off in about:preferences.
2. When you visit a website, Firefox hashes the URL and checks the beginning of that hash against the partial hashes stored locally.
3. If a partial hash matches, Firefox downloads the list of all full hashes beginning with that partial hash.
4. Firefox checks the full hash of the URL against that new list of full hashes. If it matches, the site is blocked. If not, you continue on browsing.
When I first found out how it worked a while ago, I was pretty amazed. It goes to great lengths to not know what sites you're visiting.
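The steps above can be sketched in a few lines of Python. This is a toy model: the real protocol canonicalizes each URL into several host/path expressions before hashing, and `fetch_full_hashes` stands in for the network round trip.

```python
import hashlib

def full_hash(url: str) -> bytes:
    # Real clients hash several canonicalized host/path combinations;
    # a single SHA-256 of the raw URL stands in for that here.
    return hashlib.sha256(url.encode()).digest()

def check(url, local_prefixes, fetch_full_hashes):
    """local_prefixes: set of 4-byte prefixes synced periodically.
    fetch_full_hashes: callable standing in for the server query."""
    fh = full_hash(url)
    if fh[:4] not in local_prefixes:        # steps 1-2: purely local check
        return "safe"
    candidates = fetch_full_hashes(fh[:4])  # step 3: only the prefix leaves
    return "blocked" if fh in candidates else "safe"  # step 4: local compare
```

The key property: for the overwhelming majority of page loads, nothing is sent anywhere; when something is sent, it's a 4-byte prefix shared by a huge number of URLs.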
[Disclaimer: I'm the TL of Safe Browsing in Chrome, and I worked directly with the author of the linked article for Firefox v4 support.]
FWIW, Google Chrome does the same because both Google Chrome and Firefox use the same Safe Browsing protocol v4.
The linked post was written a while ago when Firefox still used the protocol v2 but the post is still largely accurate.
Somewhat related, I recently realized that Microsoft Outlook on Android fetches _all mail_ on Microsoft servers, then sends it over to the app. If this is mentioned anywhere when setting up the account, I totally missed it.
This led me to look for a "pure" email client that would _only_ contact my server. Popular apps like Blue Mail and myMail did not work at all behind a firewall that blocked random internet access.
I am happy to report that K-9 mail (open source) was what I was looking for: it never tried to access anything other than my mail server.
I wholeheartedly recommend using NoRoot firewall (or similar) to see just how many random servers your (flashlight) apps talk to (and watch them fail miserably when you block that).
I think Waterfox meets your requirements. The only telemetry it sends is the version number so it can self-update (I haven't personally verified this).
No one should be surprised by this. Microsoft is Google, just with a price tag attached. (Google services are usually free of charge because they mine your data and target you with ads; Microsoft has discovered the concept of double-dipping and has now joined the party, while still charging for Windows.)
Getting you to sign into your computer with a Microsoft account is all about tracking and monitoring everything you do all of the time and selling a record of it to the highest bidder.
Doesn't Chrome send your entire browsing history to Google as well?
Edit: I thought this was a well-known fact, and if it isn't, I might have been too harsh on Google and Chrome.
Edit 2: Thinking about it and searching a bit, I conclude that, IIRC, Google at least used to have access to your browsing history as part of syncing it unencrypted.
Only through sync, which you can encrypt with a passphrase separate from your Google account password.
My sync is encrypted so I can't test this for you but I believe if it's not, you can check (and clear) your history here: https://myactivity.google.com/myactivity
Ah I was wrong about this!!!
Safe Browsing uses hashes.
Edit: never be afraid to admit you were wrong, folks :) So it only gets a “potential list” of sites you visited. I wonder if the operators could aggregate the data enough to deanonymize things.
> Doesn't Chrome send your entire browsing history to Google as well?
Most browsers/users do, essentially. When a person searches Google for example, results are links to Google that redirect to the target sites. Try it, mouse over a result, and look at the URL. Then a user clicks on one and finds a page that likely has doubleclick.net/adsense/analytics/fonts, which all feed back to Google. Or buttons/pixels/whatever for Facebook. Or both. Or both and 10 more organizations. Then since all mainstream browsers by default send referrer info, and since tracker code is so pervasive, and since browsers are easily fingerprintable, trackers follow you along as you click links going from page to page. Trackers are getting redundant high quality data. They're right there with you as you browse; they see what you see. Although some browsers are easier to configure for privacy, IMO the browser you use is less important than how you use it.
Multiple organizations have the potential to possess a near-complete view of your browsing history.
Everything you type into the address bar gets considered for possible completion, right? And part of that quite possibly entails sending it off to Google servers which take a stab at finding completions for it. Your claim seems plausible at least.
Also note there is a “roulette” feature in most address bars, where the browser starts loading and rendering the site before you ever hit submit. Very handy for sites wanting to pursue persistence
My activity[0] is Google's official way of telling you what it has collected about you from sources like Chrome. Things there will be used to personalize your experience.
If you are logged into Chrome your history is synchronized across platforms, tied to your Google account. Same as with Firefox Sync. Not sure about behavior when not logged in, or when incognito.
It is not so easy to do this MITM trick with Chrome; it has Google's certs pinned.
"For the transparent proxy to work, it needs .google.com to be added to the URL whitelist to allow all traffic to .google.com. This configuration is not supported because of Chrome security features that are in place, and we recommend that you avoid the use of transparent proxies." https://support.google.com/chrome/a/answer/3504942?hl=en
If you sign in it syncs history across devices, so I assume it does. Although it would technically be possible that they encrypt it using a key Google doesn't have, I wouldn't assume they implemented that.
So stop using a browser from a company that has a source of income besides selling your info to install a browser from a company to which you are the product. Ironic, no?
Not saying MS is not doing anything wrong, but the lesser evil I think.
You shouldn't need Edge, Chrome, or Firefox to install each other. If you know the URL of the file you want to download, you could use curl or wget. (Of course, that won't work if you do not know the URL. But hopefully someone will tell you if you are unable to find it on your own computer.)
As of some recent Windows 10 build (edit: 1803 or insider 17063+), curl.exe is shipped by default as well (you need to specify the full executable name, not just "curl", which PowerShell aliases to Invoke-WebRequest by default). I believe it came around the same time as OpenSSH. You'll also find a tar.exe binary.
No; to get it without going through a browser you'd have to install Windows Subsystem for Linux as an optional feature, then download a Linux distro from the Microsoft Store - arguable whether that doesn't involve a browser, really.
Well, by default, Chrome sends whatever you're typing in the URL bar to Google for auto-completion. No need to catch Google in such situation, it's a feature of Chrome.
The browser sends a prefix (first few bytes) of the hash [1], and gets back a list of full hashes to compare locally. This is called k-anonymity, and it's good enough for even services like Have I Been Pwned to use it for password lookups [2]
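For reference, the Have I Been Pwned variant of this idea looks roughly like the following (SHA-1 and a 5-hex-character prefix are HIBP's actual parameters; the network call is omitted here):

```python
import hashlib

def hibp_split(password: str) -> tuple[str, str]:
    # Only `prefix` would ever leave the machine; the server replies
    # with every breached-hash suffix under that prefix, and the
    # final comparison happens locally.
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    return digest[:5], digest[5:]

prefix, suffix = hibp_split("password123")
# A real client would now GET .../range/<prefix> and scan the
# response for `suffix` -- the server never sees the full hash.
```

Hundreds of unrelated passwords share each 5-character prefix, which is exactly the k-anonymity property the parent describes.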
Other browsers use a _truncated_ hash. Easy enough to match to a known list of bad sites, but exceptionally difficult to reverse engineer a list of sites the user has visited.
Those hashes have to be essentially unique to prevent erroneously blocking good sites, and as such are easily associated with actual sites when you're at the scale of MS or Google.
Here's what all the browsers do (Firefox, Chrome and Safari all use Google's Safe Browsing service).
1. The truncated hash is checked against a local database.
2. If there's a hit, send the truncated hash to Google, and receive a list of full hashes for that truncated hash.
3. Check if the full hash is in that list. If so, warn.
I'm not certain that the hash is truncated enough, but the general idea is a pretty straightforward trade-off between the amount of information Google needs and the amount of information your computer needs.
Not if there's another layer of verification locally. Then you can allow a few false positives with the hash on the remote side, and they'll get discarded when the full hash is compared locally. That being said, 32 bits of information seems like quite a lot and can probably be used to match the original URL fairly accurately.
You also need to consider the greater context. Suppose my history shows I sent a hash that could be one of 100 different known URLs. One of them is about model airplanes, but there is no way to know I picked that one. Chance I like model airplanes? All else being equal... call it 1%.
But what if over the next hour I continue to send hashes whose candidate sets each happen to include model airplanes? The intersection of the possible subject matters shrinks with each additional hash, and quite possibly ends up containing only model airplanes.
Say I visited 100 URLs in one hour. For each URL there are 100 others with the same hash but independent topics. Let's say there are 10,000 known topics, but every hash I sent has model airplanes associated with it. Now what are the chances I like model planes?
It seems clear to me that this scheme, with logging, reveals a lot in theory. But maybe solving this problem in practice would cost more than the data is worth. For now.
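The narrowing effect described above can be simulated (all numbers are made up: 10,000 topics, 100 candidate URLs per observed hash):

```python
import random

random.seed(0)

TOPICS = list(range(10_000))   # hypothetical universe of topics
TRUE_TOPIC = 42                # the user's actual interest
PER_HASH = 100                 # topics consistent with each hash

possible = set(TOPICS)
for _ in range(5):             # five observed hashes
    # Each hash is consistent with 99 random decoy topics plus the
    # true one; intersecting keeps only topics seen every time.
    decoys = random.sample([t for t in TOPICS if t != TRUE_TOPIC],
                           PER_HASH - 1)
    possible &= set(decoys) | {TRUE_TOPIC}

# After a handful of observations the candidate set has almost
# surely collapsed to just the true topic.
```

Each intersection keeps roughly 1% of the surviving candidates, so the anonymity set shrinks geometrically, which is the intuition behind the parent's concern.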
You skipped a factor: in the Safe Browsing Update API (the mode actual web browsers like Firefox use), the step where you send a 4-byte prefix only happens when the prefix already matched a local hotlist.
So if you visit 100 URLs in an hour, and the local hotlist allows your browser to discard all but 1 of those URLs as definitely not bad, that's only one URL prefix checked against Google, not 100, so there's no triangulation.
And the actual number isn't 1-in-100, that's why they picked 4 byte prefixes. I haven't actually checked, but by eyeball from playing with this data for work I would guess 1-in-a-million.
So, you visit 100 URLs about model planes, one of them happens to be a 1-in-a-million match to a possible badware site, your browser sends the 4-byte hash prefix to Google, it gives back the 4-byte prefix of the badware site it was worried about which is different, your browser goes "Phew, good" and nothing happens.
There's just not really an opportunity for tracking here.
Ugh, editing goof - that last bit of my explanation is wrong, Google sends the whole SHA256 hash of the bad site's URL and your browser compares -that- to the full hash it calculates for the model plane site. The four byte prefix will be the same of course, if it wasn't we wouldn't get this far.
There’s a super-fast, no bullshit, mobile-friendly, absolutely wonderful version of twitter you can get if you disable javascript...
(i’ve only used it through tor tho)
I'm not sure about "super-fast" since, every time I visit twitter w/o JavaScript, I'm forced to click thru the message:
We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?
And this annoyance persists. Click on a twitter link within the twitter website and you get the message again. And again. It's definitely a dark pattern meant to annoy you into enabling JavaScript.
no, it's likely you're either on a mobile device or your cookies are set so that it's redirecting you to it. the non-mobile version works fine and is clearly different.
There is a first step which uses bloom filters, and if the filter gives a result, it is hashed and the hash is sent to the google service to double-check.
That being said, a four-byte / 32-bit hash is enough to almost uniquely identify a website. The number of 32-bit values and the number of websites are roughly the same order of magnitude. It's a problem without a good solution, because if you create many collisions then you also get plenty of sites falsely reported as phishing, and which admin would want that to happen on their site?
If you avoid creating collisions, you have this identifiability problem.
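Rough numbers behind that trade-off (the blocklist size here is a guess):

```python
HASH_SPACE = 2 ** 32      # ~4.3 billion possible 4-byte prefixes
BLOCKLIST = 1_000_000     # hypothetical number of flagged URLs

# Chance a random benign URL's prefix collides with some blocklist
# entry -- roughly one spurious hit per ~4,300 page loads:
false_hit_rate = BLOCKLIST / HASH_SPACE   # ~0.00023
```

Those spurious hits are exactly what the follow-up full-hash comparison is there to resolve locally.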
There is this CRLite proposal [1] using layered bloom filters to stop reliance on web services. Maybe it can be adopted for phishing sites, as well.
The safe browsing protocol computes (up to) 30 hashes of each URL, most of which include path components of the URL. So the space of URLs subject to hashing is far greater than 2^32, and the hash prefix alone tells you practically nothing.
Furthermore, the protocol does not declare a URL as blocked just because the 32-bit hash prefix matches. If the prefix matches, the browser downloads the list of full hashes and checks the full hash against that list locally.
It's frustrating to see people jumping to erroneous conclusions about how Safe Browsing works when the spec is publicly available and quite clear. https://developers.google.com/safe-browsing/v4/
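A simplified illustration of that host-suffix/path-prefix expansion (the real spec adds canonicalization rules and caps the counts at 5 hosts and 6 paths, hence "up to 30"):

```python
from urllib.parse import urlsplit

def url_expressions(url: str) -> list[str]:
    # Simplified take on Safe Browsing's expansion: every host suffix
    # is combined with every path prefix, so the space being hashed is
    # far larger than the space of bare hostnames.
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    hosts = [".".join(labels[i:]) for i in range(max(1, len(labels) - 1))]
    segs = [s for s in parts.path.split("/") if s]
    paths = ["/"] + ["/" + "/".join(segs[:i]) + "/"
                     for i in range(1, len(segs))]
    paths.append(parts.path)
    if parts.query:
        paths.append(parts.path + "?" + parts.query)
    return [h + p for h in hosts for p in paths]

exprs = url_expressions("http://a.b.example.com/1/2.html?x=1")
# 3 host suffixes x 4 path prefixes = 12 expressions to hash
```

Because every one of these expressions gets its own hash, a single 32-bit prefix says even less about which page was actually visited.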
Out of the box, Bloom filters only let you drive the false not-in-set (false negative) probability to zero. The false in-set (false positive) probability varies, though it can be low if you choose parameters wisely.
What CRLite does achieves zero for both, BUT at quite a price: you need to know absolutely all the things that might ever be in the set before you start.
For CRLite they can almost wave that away by declaring that the set of things that might ever be in the CRL set is the set of logged certificates, so we can get that set from the log servers within 24 hours (the "Maximum Merge Delay" in public certificate transparency logs).
But you can't do that for URLs. The set of possible future phishing URLs has infinite size.
>That being said, a four-byte / 32-bit hash is enough to almost uniquely identify a website. The number of 32-bit values and the number of websites are roughly the same order of magnitude.
The utility of the hash is the same for a user and for a potential attacker. Yes, passing 8 fewer bits means they have 256 times more possible sites you might have visited. But it also means you get 256 times as many false positives of websites being labeled as phishing. This is what GP meant by it being a problem without a good solution.
One possible option is to do what haveibeenpwned does, where you give fewer bits and then locally check. That would be a good improvement to the system's privacy, but you probably want to avoid downloading the hashes of every malicious website that starts with the given 3 bytes (I'd assume the list is quite large) for every page load.
My browser computes [among other things] SHA256('fakebank.example') and then it snips off the first four bytes and compares that to a large dataset it got from Google. It fetches updates to this dataset every few hours. Sure enough the four byte prefix is present in the dataset.
So, we've got an alarm - it calls Google, but it doesn't tell them it's thinking about https://fakebank.example/security/login at all, it just tells them the 4 byte prefix. Google responds with a list of full SHA256 hashes beginning with that prefix that it considers _right now_ to be phishing. The list might be empty (maybe fakebank.example was actually a Greek yoghurt company subject to a PHP 4.x attack, and they upgraded PHP and removed the phishing site so now it's fine) but if it has the entire SHA256 hash we calculated then I get an alert telling me that my browser thinks this is a phishing site and I might want to not visit.
Why can’t the browser request the mapping of the hash to the malicious website that is known and then check if that mapping is indeed what it is seeing?
Can you save text files that you're looking at in Edge yet? Like if you navigate to a URL that goes https://foo.com/blah.txt can you hit Ctrl+S or right-click on the text page and save the document? The last time I looked at Edge the answer was no.
I guess that most people who know that you can save web pages to your disk (and make use of that feature) also know that you can disable telemetry (and make use of that feature).
Not to defend Microsoft, but SIDs are non-unique. I think they're only guaranteed to be unique per AD forest with the addition of the RID subauthority. Lots of domains are domain.local or other such commonalities and so may have duplicates. Microsoft would know this, so the intent does not appear to be surveillance or data collection.
The 3 long numbers in the middle are called "Domain or local computer identifier". Further down it says that each of those numbers are encoded as a 32 bit integer. Assuming they're randomly generated, that's 96 bits of entropy, which is more than enough to uniquely identify every computer on the planet[1].
[1] There might be a few duplicates due to the birthday paradox.
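The birthday-paradox footnote can be quantified with the standard birthday bound (the machine count is a deliberately generous guess):

```python
BITS = 96
machines = 10 ** 10            # far more PCs than exist today

# Birthday bound: P(any collision) ~= n^2 / 2^(bits + 1)
p_collision = machines ** 2 / 2 ** (BITS + 1)
# Around 6e-10: even at that scale a duplicate is vanishingly
# unlikely, so the value effectively identifies a single machine.
```

So in practice the "few duplicates" caveat doesn't save it: 96 random bits behave as a unique identifier.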
This is one of the reasons I don't use chrome. Google tracks which search links I click on any browser but I assume they simply track everything else done with Chrome.
Every time I see people on HN campaigning for Microsoft as some kind of reformed tech company that has seen the errors of its past and turned into a force for good, something like this comes up.
Same old Microsoft. Second verse same as the first.
Yes. I have absolutely no problem with the way that works. It’s clearly spelled out how it works and how it’s configured. Seems few who actually use these tools have problems with this...
Code and .NET Core both send the same hash of the MAC (which at 6 bytes is hardly worth hashing) in their telemetry in order to correlate across applications.
Not sure if it's bought or organic, but no other company gets the response Microsoft gets. The smallest triviality gets praised, the biggest flaws get rationalized away. Like, people here were going wild when they added UNIX line-ending support to Notepad.
They employ a lot of people directly, and a lot of people are employed in their ecosystem. I'd deny that this is exclusively for MS, though. Every large company has an army of employees that come in to explain how things actually are. On HN, this happens just as often for every medium-sized company; the only companies who absolutely get bulldozed when they do something worth getting bulldozed for are tiny, because their defenses just get shouted down by the employees of competitors.
This is "the same verse" as what? Bundling IE with Windows? The same as building proprietary technologies on top of open ones? The same as strongarming Gary Kildall? No, no it isn't "the same". The anti-M$ comment is just whiny.
SmartScreen for Windows Defender will send filenames, hashes and download locations to Microsoft as well.
This is less bad than "like" buttons tracking you around the web in an undocumented and not-optional way with no user benefit.
Is this so different to Google quietly signing you into Chrome when you sign into a Google website, and Chrome sync'ing your browser history to Google by default and sending URLs of pages you visit from the Omnibox?[1] Or "Sends URLs of some pages that you visit to Google, when your security is at risk" from the Chrome options? To Facebook scraping call records and SMS texts from Android devices for years? [2] Ubuntu with opt-out system data gathering in 18.04, and wants to gather data to track "relative popularity of apps"[3], or Steam gathering data about what games you play[4]. Would you believe, Wonderful helpful Google has every email you send or receive through a GMail account or to one, but Evil Microsoft has every email you send through/to an Office 365 account. Wonderful Google and Amazon could access any VM on their clouds, so could evil Microsoft on Azure - except Microsoft built Shielded VMs into Hyper-V to protect the VM from compromised hosts, and to keep secrets in the VM which host administrators cannot access. [5]
"Tech companies spy on people for money" isn't as anti-Microsoft an idea though, is it?
It's so weird to come in with a defense of "this is no worse than this long list of horrible things other companies have done, and other horrible things this company has done before!"
I completely agree with you, this is no worse than many of those things.
That's not what I'm defending. I'm attacking "same old Microsoft" as a content-free, /inaccurate/, /ill targetted/ and /boring/ comment.
"Downvoters should explain" well that "defense" is my explanation. It's not the "same old" behaviour, it's different, what it is the same as .. is other current day, much loved technology companies.
Except for specific ways it's better in Microsoft's favour - this is more explicit and easier to opt-out of than Google Analytics, Doubleclick (owned by Google) or Google+ social media tracking beacons, for one, or "talking to someone who uses gmail" which is super frustrating, for another.
I would never assume Google doesn't know every site I visit with Chrome. Let's scrutinize all of them rather than hold them to different standards based on our personal narrative.
https://twitter.com/ericlaw/status/1152933704198758401