These aren't sorted by number of visits, but by the number of rows each contributes to the list of most visited sites. Essentially, it shows which sites have the greatest number of frequently visited subdomains.
Wikipedia uses a .org domain, so it won't show up on "most popular .com domains" lists. (And I think the parent comment is searching for domains with lots of subdomains, which is why providers like Blogspot and Fandom show up.)
Curious where PornHub and other such sites rank. I always hear that porn sites are in the top X by traffic, but people don't talk about it due to its nature.
I’m always amazed that they have a data science team. It’s not something many would expect from the porn industry. I certainly didn’t expect it.
"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."
does it though? Pretty sure adult results are filtered out by a Google tool named "SafeSearch". It removes anything adult from the SERP and is on by default.
This appears to have some unintuitive consequences. When I searched for "porn" in a cookie-less session just now, there were still porn results, but no well-known sites (at least I didn't recognize the names). Searching for literally "pornhub", the first result is "porhub.com" without the "n".
Seems like the "SafeSearch" filter is based on a list of "adult domains" instead of the indexed content at the URL.
If you ignore the content, large-scale adult sites are just like any other high traffic (bandwidth, RPS) site out there. A lot of planning goes into where their content delivery PoPs should be placed.
"New Year’s Eve kicked holiday ass with a massive –40% drop in worldwide traffic from 6pm to Midnight on December 31st." It's Dec/31, 1pm in New York right now.
I remember reading about their experience with Redis: https://groups.google.com/g/redis-db/c/d4QcWV0p-YM There is something funny about reading engineering insights from a porn co, but they do deal with scale that not many others do!
> One would think the download page is blocked as well
Contrary to popular belief, Google only pulled its Search business out of China. The rest of its services are still hosted on Google.cn inside China. To download Chrome:
* Connected to www.google.cn (180.163.150.34) port 443 (#0)
However the "Make searches and browsing better (Sends URLs of pages you visit to Google)" data won't be collected, because the connection would be blocked.
I assume the data is aggregated across all devices. Chrome has 60% of desktop usage in China, but less than 10% on mobile.
But in a market of nearly 1B internet users, not having a single site in the top 1K suggests something is wrong with the stats. I wonder what we are missing from those numbers.
Wow, I'm kinda surprised to find my site in the top million worldwide. I have about 100k monthly visits as measured by Cloudflare web analytics, I guess that's all it takes.
If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.
It contains data about ~7 million top websites, and for every website it also contains:
- the full content of the main page;
- the verbose output of curl, containing various timing info, the HTTP headers, protocol info...
Using this dataset, you can build a service similar to https://builtwith.com/ for your research.
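As a sketch of the builtwith-style idea: scan each site's stored HTTP headers and main-page HTML for known fingerprints. The signature table below is illustrative only, and the function name is my own; a real fingerprint database (like Wappalyzer's) is far larger:

```python
import re

# Hypothetical signature table: maps a technology name to patterns that
# may appear in the HTTP response headers or in the page HTML.
SIGNATURES = {
    "nginx":      {"header": re.compile(r"^server:\s*nginx", re.I | re.M)},
    "cloudflare": {"header": re.compile(r"^server:\s*cloudflare", re.I | re.M)},
    "WordPress":  {"html":   re.compile(r'<meta name="generator" content="WordPress', re.I)},
    "PHP":        {"header": re.compile(r"^x-powered-by:\s*php", re.I | re.M)},
}

def detect_technologies(headers: str, html: str) -> set[str]:
    """Return the set of technologies whose fingerprints match."""
    found = set()
    for tech, patterns in SIGNATURES.items():
        if "header" in patterns and patterns["header"].search(headers):
            found.add(tech)
        if "html" in patterns and patterns["html"].search(html):
            found.add(tech)
    return found
```

Run over all ~7 million rows, this gives a rough technology-adoption census of the top of the web.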
It looks similar at a high level to Parquet: binary, columnar and has metadata that permits requesting a subset of the data.
Looking at:
> Processed 4.60 thousand rows, 273.86 MB
I'd guess it's chunking the rows into groups of ~4,000.
The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86MB is the uncompressed size after zstd compression, or perhaps there were other parts of the session that caused that chunk to get cached, and it was elided from what was pasted in to HN.)
EDIT: I was curious, so I ran the tool and watched bandwidth on iftop. It uses about ~50MB each time I run the query. From this, I conclude: it does not cache things, the 273.86MB is the uncompressed size, and OP has a much better internet connection than me. :)
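A quick sanity check on that conclusion: if roughly 50 MB crosses the wire for a chunk the client reports as 273.86 MB, the implied compression ratio is plausible for zstd on text-heavy crawl data (the on-wire figure is my rough iftop reading, not an exact measurement):

```python
reported_mb = 273.86   # size the client reports (apparently uncompressed)
observed_mb = 50.0     # approximate bytes actually seen on the wire

ratio = reported_mb / observed_mb
print(f"implied compression ratio: {ratio:.1f}x")  # ~5.5x
```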
How about websites that are browsed http first and then redirected? People might browse for a domain without the https prefix for convenience (or old links) and the browser defaults to http.
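A ranking pipeline can sidestep this by normalizing URLs before counting, so an http-first visit and its https redirect target are attributed to the same site. A minimal sketch (the normalization policy here is an assumption, not what any published ranking actually does):

```python
from collections import Counter
from urllib.parse import urlsplit

def canonical_origin(url: str) -> str:
    """Collapse the scheme and a leading 'www.' so that http://example.com,
    its https redirect target, and the www variant all count as one site."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

visits = [
    "http://example.com/",       # typed without a scheme; browser defaulted to http
    "https://example.com/page",  # same site after the https redirect
    "https://www.example.com/",  # www variant
    "https://other.org/",
]
counts = Counter(canonical_origin(u) for u in visits)
```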
The accessibility challenges are all the extra different failure modes HTTPS presents, such as client date offset, older devices, expired certificates, hostname mismatches, and many others.
Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.
> Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.
I suppose you'd be fine if your government started replacing the content of Wikipedia with their own propaganda/removing critical information about themselves from traffic?
how? Edits on Wikipedia are public, including historical monthly backups available over bt all the way back to 2006, and I can verify that Wikipedia's servers are serving it correctly by cross-referencing those backups and the edit history. With http, any ISP (whose operators all tend to favor government cooperation) or switch in the middle could sed the content to remove or slightly alter known-critical passages.
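The cross-referencing idea can be sketched offline: diff the text of a page as served against the same revision from a public dump, and flag the lines that differ. A toy example using difflib (the sample strings stand in for a live fetch and a dump extract):

```python
import difflib

def tampered_lines(dump_text: str, served_text: str) -> list[str]:
    """Return lines present in the public dump but missing or changed in
    the served copy -- the kind of silent edit an on-path ISP could make
    to plain-HTTP traffic."""
    diff = difflib.unified_diff(
        dump_text.splitlines(), served_text.splitlines(), lineterm="")
    return [line[1:] for line in diff
            if line.startswith("-") and not line.startswith("---")]

dump   = "The 2019 protests drew large crowds.\nOfficials disputed the counts."
served = "The 2019 protests drew small crowds.\nOfficials disputed the counts."
removed = tampered_lines(dump, served)
```

With https, this check is rarely needed; without it, it is the only defense.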
yeah... i haven't gotten a good response on why localhost should scream "insecure" or why wikipedia should fail if my rtc clock is wonky.
i am not denying "security from snoops while paying with credit cards" and all that banking shit, or messaging. heck, email is sent in the clear but we are told to use https to connect to the website (for webmail) for "security"...
sure, security is all good and snazzy, but i regularly come across websites whose certs have expired, and the site makes it appear as if the sky will fall if i click on continue.
then we have ISPs who use DPI (my current ISP, reliance jio, has been doing it from day 1), so what's the point of pretending anyway?
This is very ethically dubious. Google is collecting raw URLs from Chrome users who turned on history syncing across their own devices, then reusing the data and funneling it through Stanford. No way Chrome users understand or approve of this.
The paper tries to justify its ethics with Google's privacy policy, which is laughable. There are so many papers about how meaningless privacy policies are. If Apple or Mozilla did anything remotely like this, Hacker News would riot.
Edit: I don't want to be a conspiracy theorist, but this post suddenly got a bunch of downvotes at the same time as defensive comments from a current Googler and recent ex-Googler. Then one of my responses below to a Chrome developer got flagged for no obvious reason. Hmm.
Can you please make your substantive points without breaking the site guidelines? You did that here with your last paragraph, and worse at https://news.ycombinator.com/item?id=34197958.
Hi dang, I'm new here so I'd appreciate clarification.
Someone defending this privacy debacle on Hacker News is a Google employee on the Chrome team and was a business cofounder with the Stanford collaborator. That person not only failed to disclose how close they are to the topic, but also phrased their comment in a way that falsely implied distance from it. It seems to me essential, for understanding their misleading comment, to be aware of that factual context.
I thought I had phrased this factual correction in a way that was neutral and not a personal attack. My assumption was that the commenter may have violated Hacker News guidelines by being so misleading. What did I do wrong?
As for the downvotes, I see that I should have emailed you rather than adding a note in the comment. Nonetheless, could you see what's going on?
The commenter publicly identifies themselves in their HN profile and you're using that to attack them. It's completely backwards to say they've misrepresented anything. The essential thing is to assume good faith and not go on weird innuendo-laden witch-hunts.
It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but the default impact of doing this is to poison discussion so badly that the default bias has to be "don't do it". Certainly you should not be doing it as part of a flamewar post, which your comments in this thread have been. We want curious conversation here, not people cross examining each other.
I'm not disagreeing with you about the underlying issue—there's an argument to be made that the kind of "publishing" that Google/Chrome does here is really a way of obscuring it from the majority of users, and so on. HN commenters are certainly welcome to make that kind of argument. But we need you to err on the side of not posting in the flamewar style. If I see a commenter posting in the flamewar style and then also bringing in someone's personal details as ammunition, it's no longer a tough-thing-to-balance, it's just out of line.
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.
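The "minimum number of users" criterion is essentially a k-anonymity-style threshold. A minimal sketch of how such a filter might work (MIN_USERS and the function name are hypothetical; Google doesn't publish the real cutoff):

```python
from collections import defaultdict

MIN_USERS = 3  # hypothetical threshold; the real value is not published

def publishable_pages(visits):
    """visits: iterable of (user_id, url) pairs from opted-in clients.
    Keep only URLs seen by at least MIN_USERS distinct users, so no
    single user's browsing is identifiable in the published list."""
    users_per_url = defaultdict(set)
    for user, url in visits:
        users_per_url[url].add(user)
    return {url for url, users in users_per_url.items()
            if len(users) >= MIN_USERS}

visits = [
    ("u1", "https://example.com/"), ("u2", "https://example.com/"),
    ("u3", "https://example.com/"), ("u1", "https://rare.example/secret"),
]
public = publishable_pages(visits)
```

The page visited by only one user never makes it into the published data.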
> Google has written publicly about how this system works
If this is news to Hacker News, there is no way that regular Chrome users are aware of it. Saying something in a privacy policy or on a developer website just can't be enough to justify analyzing a person's URL data.
> This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.
Since when does aggregating this type of data make it fair game? This is analyzing a person's URL data from their own devices. There has always been a big bright red line for browsers touching a user's browsing history. Google crossed that line.
Also, I just checked on a fresh Chrome install. The "Make searches and browsing better" option is enabled by default and buried in Chrome settings. How is that acceptable consent for analyzing a person's URL data?
Edit: you've unfortunately been breaking the site guidelines a ton lately. Seriously not cool, and well past the line at which we start banning the account.
> This might be new to you, but that does not mean it's some new information that's been hidden.
I downloaded Chrome on a new laptop an hour ago (at my employer's request, I'd use Firefox myself) and was certainly not aware of this.
This information was not on any screen at any point. There was a default-checked checkmark for some general statistics sharing which I only noticed after clicking download (because it was small and below the download button), but didn't click through to the privacy policy to learn more.
Guess I should have read the privacy policy. I'm trying to find what it said now, but I can't see it anymore because different terms apply to Linux downloads and there's no button to download the Windows version. Basically, visiting the same page in Firefox on Linux (instead of Edge on Windows, which I don't have access to atm) gives me different content and no checkmark.
Is it opt-in or opt-out? And if it's opt-in, does it come with infinite nagging until you opt-in?
I know logging in and syncing your data are "opt-in" options that come with infinite nagging (so, effectively required options). The information that there are different levels of syncing is news to me.
What is it you are proposing? If it were every institution's obligation to make sure that all of its instrumental functions were obvious to every potential user, and to keep any user from engaging with the institution under false assumptions, nothing in our society would work.
That is not to say that scrutiny is unimportant. You should certainly be allowed to point at any individual function and demand more upfront transparency than what is currently being offered. But be aware of the massive additional cognitive load you create, for everyone, when you demand not just that information be available, but that it be delivered to anyone it might concern. Any individual's preference not to care about a function would have to take a backseat to the opinion that they have to at least somewhat consider the function before engaging.
Considering how expensive this process is, "Google Chrome CrUX" would probably be pretty far down on the list for me personally, as "crucial things everyone should definitely know about before possibly engaging" goes, but to each their own.
I could see two main arguments for this not being okay:
* Chrome is secretly collecting data.
* Chrome is doing something users would object to if they knew and understood it.
I don't think either of these is the case here: they share data about what sites people generally visit in an aggregated form that doesn't reveal any individual's browsing (what's to object to?), and they talk about it in the place people would go to learn about what data they collect.
> It's fine that this is all new to you, but it's not new to you because anyone has kept this secret. At this point, you've chosen to remain ignorant.
Ah yes. Blame the user for not understanding yet another piece in Google's gargantuan data collecting machinery.
Recent court cases revealed that Google's own employees don't know what's tracked and how to turn it off. But I'm sure it's only ignorance that keeps users uninformed.
I very much agree with you. This type of data collection MUST be opt-in to be ethical, and in Chrome it’s enabled by default and buried. The VAST majority of users have no idea this is even happening. It is grossly unethical and it is obvious that it is so, but unsurprisingly folks at Google are happy to do things like this given their salaries.
Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.
There's also just not writing in the high-dudgeon flamewar style which helps with the downvotes.
I've noticed similar behavior in HN voting. Downvote spikes, but few if any comments in line with the voting. Not sure if it's bots, human-based click farms, or people who just don't understand that disagreement is not grounds for downvoting.
The Guidelines are clear about why we're here and expectations. The emphasis is on discussion, learning and objectivity. Yes, disagreement is mentioned (i.e., allowed) but even that needs to be constructive, yes?
A down vote - with no discussion - well, frankly in the context of the Guidelines, is:
1) Not in the spirit of the guidelines;
2) Perhaps redundant to 1, but lazy;
3) At best, small-minded and childish.
If people want to pout about reading something they don't like, this isn't the place for them.
Yeah, I see who you are. And I'm ok w/ pushing back. That's what makes HN what it is ;)
How does that feel? What value does it add? (Sweet FA, eh.)
Sure, you might be right. But that doesn't make it right. I get zero satisfaction from context-less downvotes. I don't give them. I ignore them when I get them (i.e., they have zero influence on my HN behavior). If I'm changing my mind over some lazy a-hole's click, I'm losing. Big time.
I can't imagine why anyone feels any differently. The reality is, they are pointless noise. There's not enough context to drive anything actionable for anybody.
But while I have your attention: how about a feature request: Karma points that consider the discussion below a top-parent comment.
> "If Apple or Mozilla did anything remotely like this, Hacker News would riot."
My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla. I mean, much more. That last accusation sounded bizarre to me.
Just suggesting that prior browser and OS privacy blowups involving those companies have been over less worrisome things, not that those companies are subject to more or less criticism. Looking back on outraged discussions of Mozilla's telemetry is kinda quaint in comparison.
Because Google is a web advertisement company that dominates many large spheres: search, browsers (including standards committees), email, mobile (Android is 77% market share) etc. All are things that we've come to view as crucial to modern life.
And time and again they've shown that they only view that dominance as a funnel for ad revenue, data collection, and whatever benefits them at this particular moment.