Cached Chrome Top Million Websites (github.com/zakird)
261 points by edent on Dec 31, 2022 | 97 comments



Top level domains by popularity:

    grep -oP '\.[a-z]+(?=,)' current.csv | sort | uniq -c | sort -n

  ...
  15840 .pl
  17914 .it
  20182 .de
  21690 .in
  27812 .ru
  29194 .jp
  30359 .org
  35741 .br
  36675 .net
 406052 .com
.com domains by popularity:

    grep -oP '[a-z0-9-]+\.com(?=,)' current.csv | sort | uniq -c | sort -n

    ...
    365 tistory.com
    370 fc2.com
    408 skipthegames.com
    489 online.com
    515 wordpress.com
    707 uptodown.com
    880 schoology.com
   2570 fandom.com
   2651 instructure.com
   3244 blogspot.com


It might be worth updating this comment to explain your second query.

People seem to think it is somehow measuring visits to those origins, but it's actually measuring how many unique subdomains are listed for each of those domains.
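
For instance, a rough DuckDB equivalent of that grep (a sketch, assuming the CSV is loaded into the origin/rank "current" table shown in the DuckDB comment further down) makes the counting explicit: each CSV row is one origin, so a domain with many popular subdomains appears many times.

    -- sketch: count distinct listed origins (i.e. subdomains) per .com domain
    select regexp_extract(origin, '([a-z0-9-]+\.com)$', 1) as domain,
           count(*) as listed_origins
    from current
    where origin like '%.com'
    group by domain
    order by listed_origins desc
    limit 10;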


Rather amazing to see the almost-abandoned blogspot.com there at the top.

Also interesting that I haven't heard of half of them. Some are NSFW, apparently.


These aren't sorted by number of visits, but by the number of rows in the list of most visited sites. Essentially which sites have the greatest number of frequently visited subdomains.


Loading the data into the duckdb cli [0] and doing the first query:

    create table current as select * from '202211.csv';
    select * from current;
    ┌────────────────────────────────────┬─────────┐
    │               origin               │  rank   │
    │              varchar               │  int32  │
    ├────────────────────────────────────┼─────────┤
    │ https://hochi.news                 │    1000 │
    │ https://www.xnxx.xxx               │    1000 │
    │ https://www.wordreference.com      │    1000 │
    │ https://finance.naver.com          │    1000 │
    │ https://www.macys.com              │    1000 │
    │ https://www.xv-videos1.com         │    1000 │
    │ https://fr.xhamster.com            │    1000 │
    │ https://poki.com                   │    1000 │
    │ https://salonboard.com             │    1000 │
    │ https://clgt.one                   │    1000 │


    select tld, count(*) 
    from (select reverse(substr(reverse(origin),1, position('.' in reverse(origin))-1)) tld 
            from current) 
    group by tld 
    order by count(*) desc;
    ┌───────────┬──────────────┐
    │    tld    │ count_star() │
    │  varchar  │    int64     │
    ├───────────┼──────────────┤
    │ com       │       406052 │
    │ net       │        36675 │
    │ br        │        35741 │
    │ org       │        30359 │
    │ jp        │        29194 │
    │ ru        │        27812 │
    │ in        │        21690 │
    │ de        │        20182 │
    │ it        │        17914 │
    │ pl        │        15840 │
    │ ·         │            · │
    │ ·         │            · │
    │ ·         │            · │
    │ za:5002   │            1 │
    │ lk:8090   │            1 │
    │ org:1445  │            1 │
    │ co:14443  │            1 │
    │ ar:3016   │            1 │
    │ net:8001  │            1 │
    │ care:9624 │            1 │
    │ au:8443   │            1 │
    │ com:333   │            1 │
    │ edu:9016  │            1 │
    ├───────────┴──────────────┤
    │   2076 rows (20 shown)   │
    └──────────────────────────┘
[0] https://duckdb.org/docs/installation/


Are the bottom entries domain:port? I feel like the table would be more interesting excluding those.
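
Assuming they are origins listed with an explicit port, an untested sketch reusing the parent's tld expression could strip anything after the ':' before grouping, folding those entries back into their plain TLDs:

    -- sketch: fold "tld:port" entries (origins with an explicit port) into the plain tld
    select split_part(raw, ':', 1) as tld, count(*) as cnt
    from (select reverse(substr(reverse(origin), 1, position('.' in reverse(origin)) - 1)) as raw
            from current)
    group by tld
    order by cnt desc;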


sort -r will reverse the order, from most to least popular.


uniq -c is in the pipeline because it counts the occurrences of each unique line.


Ah, right, missed that.


It's amazing that Fandom is number 3 and Wikipedia is not even there.


Wikipedia uses a .org domain, so it won't show up on "most popular .com domains" lists. (And I think the parent comment is searching for domains with lots of subdomains, which is why providers like Blogspot and Fandom show up.)
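
As a quick sanity check, a sketch against the current table from the DuckDB comment above could compare the two:

    -- sketch: distinct wikipedia.org origins (mostly language subdomains) vs. fandom.com origins in the list
    select count(*) filter (where origin like '%.wikipedia.org') as wikipedia_origins,
           count(*) filter (where origin like '%.fandom.com') as fandom_origins
    from current;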


Curious where Pornhub and other such sites rank. I always hear that porn sites are in the top X of all traffic, but people don't talk about it due to its nature.

I'm always amazed that they have a data science team. It's not something many would expect from the porn industry. I certainly didn't expect it.

https://www.pornhub.com/insights/2022-year-in-review


"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."


A quick grep shows that there are almost 2.5K domains in the 1M list with "porn" in their name.

The data science teams likely provide a considerable ROI in that industry.
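
(A DuckDB equivalent of that grep, as a sketch against the same current table; the ~2.5K figure above is from the grep, not re-verified here:)

    -- sketch: origins whose URL contains "porn" anywhere, case-insensitive
    select count(*) from current where origin ilike '%porn%';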


The majority of porn sites don't have the word "porn" in their names, though. I wish there were a categorization of these.


Does it though? Pretty sure adult results are filtered out by a Google tool named "SafeSearch". It removes anything adult from the SERP, and it is on by default.


This appears to have some unintuitive consequences. When I searched for "porn" in a cookie-less session just now, there were still porn results, but no well-known sites (at least I didn't recognize the names). Searching for literally "pornhub", the first result is "porhub.com" without the "n".

Seems like the "SafeSearch" filter is based on a list of "adult domains" instead of the indexed content at the URL.


If you ignore the content, large-scale adult sites are just like any other high traffic (bandwidth, RPS) site out there. A lot of planning goes into where their content delivery PoPs should be placed.


"New Year’s Eve kicked holiday ass with a massive –40% drop in worldwide traffic from 6pm to Midnight on December 31st." It's Dec/31, 1pm in New York right now.


I remember reading about their experience with Redis (https://groups.google.com/g/redis-db/c/d4QcWV0p-YM). There is something funny about reading engineering insights from a porn company, but they do deal with scale that not many others do!


"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."


Looks like not a single Chinese site made it into the top 1k. I guess that's reasonable, because all Google services are blocked, so CrUX can't gather any data.


Do Chinese people use Chrome? One would think the download page is blocked as well, so the demographic of Chrome users should be much smaller.

Also to consider: China uses in-app browsing a lot, with interactive experiences very similar to websites built right into the bilibili/Ali/WeChat apps.


> One would think the download page is blocked as well

Contrary to popular belief, Google only pulled its Search business out of China. The rest of the services are still hosted on Google.cn inside China. To download Chrome:

    $ curl -svk 'https://www.google.cn/intl/zh-CN/chrome/'
    *   Trying 180.163.150.34...
    * TCP_NODELAY set
    * Connected to www.google.cn (180.163.150.34) port 443 (#0)

However the "Make searches and browsing better (Sends URLs of pages you visit to Google)" data won't be collected, because the connection would be blocked.


> in-app browsing

But that's also just Chromium, isn't it, much like a PWA? Unless they made something of their own.


But does Chromium phone home with this data? Surely not, I would have thought.


I assume the data is aggregated across all devices. Chrome has 60% of desktop usage in China, but less than 10% on mobile.

Still, in a market of nearly 1B internet users, not having a single site in the top 1K suggests something is wrong with the stats. I wonder what we are missing from those numbers.


> The CrUX dataset is based on data collected from Google Chrome and is thus biased away from countries with limited Chrome usage (e.g., China).


Chrome is the dominant browser in China. And even where it's not Chrome, it's a Chromium-based browser with an alternative UI shell.


It's fascinating (and says a lot) that Google's internet monopoly persists even in places where it's outright banned.


Wow, I'm kinda surprised to find my site in the top million worldwide. I have about 100k monthly visits as measured by Cloudflare web analytics, I guess that's all it takes.


If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

It contains data about ~7 million top websites, and for every website it also contains:

- the full content of the main page;
- the verbose output of curl, containing various timing info, the HTTP headers, protocol info...

Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

Description: https://github.com/ClickHouse/ClickHouse/issues/18842

You can easily try it with clickhouse-local without downloading:

  $ curl https://clickhouse.com/ | sh

  $ ./clickhouse local 
    ClickHouse local version 22.13.1.294 (official build).

    milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc

    ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
    │ rank    │ UInt32 │              │                    │         │                  │                │
    │ domain  │ String │              │                    │         │                  │                │
    │ log     │ String │              │                    │         │                  │                │
    │ content │ String │              │                    │         │                  │                │
    └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

    4 rows in set. Elapsed: 1.390 sec. 

    milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical

    SELECT
        rank,
        domain,
        log,
        substringUTF8(content, 1, 100)
    FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
    LIMIT 1
    FORMAT Vertical

    Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175

    Row 1:
    ──────
    rank:                           1907977
    domain:                         0--0.uk
    log:                            *   Trying 213.32.47.30:80...
    * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 302 Moved Temporarily
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:14 GMT
    < Content-Type: text/html
    < Content-Length: 154
    < Connection: keep-alive
    < Location: https://0--0.uk/
    < 
    * Ignoring the response-body
    { [154 bytes data]
    * Connection #0 to host 0--0.uk left intact
    * Issue another request to this URL: 'https://0--0.uk/'
    *   Trying 213.32.47.30:443...
    * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    *  CAfile: /etc/ssl/certs/ca-certificates.crt
    *  CApath: /etc/ssl/certs
    * TLSv1.0 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    } [512 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    { [108 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    { [4150 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    { [333 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    { [4 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    } [70 bytes data]
    * TLSv1.2 (OUT), TLS header, Finished (20):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    } [1 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    } [16 bytes data]
    * TLSv1.2 (IN), TLS header, Finished (20):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    { [16 bytes data]
    * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: CN=mail.htservices.co.uk
    *  start date: May 15 18:36:37 2022 GMT
    *  expire date: Aug 13 18:36:36 2022 GMT
    *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
    *  issuer: C=US; O=Let's Encrypt; CN=R3
    *  SSL certificate verify ok.
    * TLSv1.2 (OUT), TLS header, Supplemental data (23):
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * TLSv1.2 (IN), TLS header, Supplemental data (23):
    { [5 bytes data]
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:15 GMT
    < Content-Type: text/html;charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < X-Frame-Options: SAMEORIGIN
    < Expires: -1
    < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
    < Pragma: no-cache
    < Content-Language: en-US
    < Set-Cookie: ZM_TEST=true;Secure
    < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
    < Vary: User-Agent
    < X-UA-Compatible: IE=edge
    < Vary: Accept-Encoding, User-Agent
    < 
    { [13068 bytes data]
    * Connection #1 to host 0--0.uk left intact

    substringUTF8(content, 1, 100): <!DOCTYPE html>
    <!-- set this class so CSS definitions that now use REM size, would work relative to

    1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)


How does that work? How can clickehouse-local run queries against a 129 GB file hosted on S3 without downloading the whole thing?

Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html

If so, what's the data.native.zst file format? Is it similar to Parquet?


Yes, the native format is very similar to Parquet.

It works for Parquet as well:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1
And for CSV or TSV:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1
And for ndJSON:

  SELECT repo_name, created_at, event_type FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz', JSONLines, 'repo_name String, actor_login String, created_at String, event_type String') WHERE actor_login = 'simonw' LIMIT 10  
Note: the query above is kind of slow. Here is the query from preloaded data - your activity in GitHub issues:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...


Another question about that demo.

https://clickhouse.com/docs/en/getting-started/example-datas... says "Dataset contains all events on GitHub from 2011 to Dec 6 2020" - but I'm seeing results in there from a couple of hours ago.

Do you know if that's continually updated and, if so, is that documented anywhere?


Yes, it's continuously updated.

The source code is here: https://github.com/ClickHouse/github-explorer

This shell scripts updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...


Thanks for the info! I wrote this up as a TIL: https://til.simonwillison.net/clickhouse/github-explorer


> How does that work?

Disclaimer: I'm not a ClickHouse user, but I have a bit of experience with Parquet.

It looks like the native format is (very briefly) described here: https://clickhouse.com/docs/en/interfaces/formats/#native

It looks similar at a high level to Parquet: binary, columnar and has metadata that permits requesting a subset of the data.

Looking at:

> Processed 4.60 thousand rows, 273.86 MB

I'd guess it's chunking the rows into groups of ~4,000.

The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86 MB is the uncompressed size after zstd compression, or perhaps there were other parts of the session that caused that chunk to get cached, and it was elided from what was pasted into HN.)

EDIT: I was curious, so I ran the tool and watched bandwidth on iftop. It uses about ~50MB each time I run the query. From this, I conclude: it does not cache things, the 273.86MB is the uncompressed size, and OP has a much better internet connection than me. :)


    grep http: current.csv | wc -l
    54679

So over 5% of the top 1M sites still don't use HTTPS.
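
(The same check as a DuckDB sketch against the current table from above:)

    -- sketch: share of origins listed with a plain http:// scheme
    select round(100.0 * count(*) filter (where origin like 'http://%') / count(*), 2) as pct_http
    from current;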


I have prepared a nice report: the websites grouped by rank (1..10, 11..100, ...), the percentage using TLS, and an example of a non-TLS website in each group:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...

    SELECT
        floor(log10(rank)) AS r,
        count() AS total,
        sum(log LIKE '%TLS%') AS tls,
        round(tls / total, 2) AS ratio,
        anyIf(domain, log NOT LIKE '%TLS%')
    FROM minicrawl
    WHERE log LIKE '%Content-Length:%'
    GROUP BY r
    ORDER BY r

    ┌─r─┬───total─┬─────tls─┬─ratio─┬─anyIf(domain, notLike(log, '%TLS%'))─┐
    │ 0 │       6 │       6 │     1 │                                      │
    │ 1 │      61 │      58 │  0.95 │ baidu.com                            │
    │ 2 │     599 │     562 │  0.94 │ google.cn                            │
    │ 3 │    5591 │    5057 │   0.9 │ volganet.ru                          │
    │ 4 │   51279 │   44291 │  0.86 │ furbo.co                             │
    │ 5 │  476181 │  361910 │  0.76 │ funygold.com                         │
    │ 6 │ 3797023 │ 2927052 │  0.77 │ funyo.vip                            │
    └───┴─────────┴─────────┴───────┴──────────────────────────────────────┘

    7 rows in set. Elapsed: 0.844 sec. Processed 7.59 million rows, 43.74 GB (8.99 million rows/s., 51.83 GB/s.)


Excuse my ignorance, what CLI tool did you use to execute this query? Thanks!


clickhouse-client

Download it as:

  curl https://clickhouse.com/ | sh
Connect to the demo service:

  clickhouse-client --host play.clickhouse.com --user play --secure


If I am not mistaken, 8310 sites offer http and https:

    grep -o -E "://.*?," current.csv | sort | uniq -c | grep -v "1 ://" | wc -l
    8310
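
A rough DuckDB version of the same idea, as a sketch: count hosts that appear under both schemes.

    -- sketch: hosts listed with both an http:// and an https:// origin
    select count(*) as dual_scheme_hosts
    from (
        select split_part(origin, '://', 2) as host
        from current
        group by host
        having count(distinct split_part(origin, '://', 1)) = 2
    );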


How about websites that are browsed http first and then redirected? People might browse for a domain without the https prefix for convenience (or old links) and the browser defaults to http.


“The 5% rule”


[flagged]


What accessibility challenges does https pose?


The accessibility challenges are all the extra failure modes HTTPS introduces, such as client clock offset, older devices, expired certificates, hostname mismatches, and many others.

Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.


> Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.

I suppose you'd be fine if your government started replacing the content of Wikipedia with their own propaganda, or removing critical information about themselves from the traffic?


So far they haven't. Meanwhile, millions of people with older devices cannot use them to access Wikipedia.


You're aware that your government is already doing this, right? (Your argument is invalid.)


How? Edits on Wikipedia are public, including historical monthly dumps available over BitTorrent all the way back to 2006, and I can verify that Wikipedia's servers are serving it correctly by cross-referencing those dumps and the edits. With HTTP, any ISP (whose operators all tend to favor government cooperation) or switch in the middle could sed the content to remove or slightly alter known-critical content.


Yeah... I haven't gotten a good response for why localhost should scream "insecure", or why Wikipedia should fail if my RTC clock is wonky.

I am not denying "security from snoops while paying with credit cards" and all that banking shit, or messaging. Heck, email is sent in the clear, but we are told to connect to the website (for webmail) over HTTPS for "security"...

Sure, sure, security is all good and snazzy, but I regularly come across websites whose certs have expired, and the browser makes it appear as if the sky will fall if I click on continue.

Then we have ISPs who use DPI (my current ISP, Reliance Jio, has been doing it from day 1), so what's the point of pretending anyway?


> why localhost should scream "insecure"

Localhost, even with HTTP, is a secure context: https://developer.mozilla.org/en-US/docs/Web/Security/Secure...

What tool is screaming at you that localhost is insecure?


They may be using a self-signed cert so it’s https://localhost and the browser is flagging the cert rather than localhost itself.


Browsers. The padlock icon is crossed out.


I just tested this and don't see that. I compared http://neverssl.com to running "python3 -m http.server" and visiting http://localhost:8000

* Chrome: "Not Secure" on neverssl, "i" in a circle on localhost

* Firefox: Padlock with a red line through it on neverssl, page icon on localhost

* Safari: "Not Secure" on neverssl, no message on localhost


> email is sent in the clear, but we are told to connect to the website (for webmail) over HTTPS for "security"

This is not true; most email is encrypted in transit. It is not end-to-end, because your email service stores the messages (perhaps encrypted, perhaps not).

Edit: https://transparencyreport.google.com/safer-email/overview?h...

You can see that 84% of outbound email is encrypted. This is probably a good proxy for the general state of email TLS transport.


This raises the question: how much user telemetry does Chrome send back to Google?


By default, a lot. However, they also are (or at least used to be, it seems to be quite outdated now) really good at documenting their telemetry publicly: https://www.google.com/chrome/privacy/whitepaper.html

(I haven't checked whether the documentation is complete/accurate, of course.)


Thanks for pointing this out. Can definitely put this dataset to use.


This is very ethically dubious. Google is collecting raw URLs from Chrome users who turned on history syncing across their own devices, then reusing the data and funneling it through Stanford. No way Chrome users understand or approve of this.

The paper tries to justify its ethics with Google's privacy policy, which is laughable. There are so many papers about how meaningless privacy policies are. If Apple or Mozilla did anything remotely like this, Hacker News would riot.

Edit: I don't want to be a conspiracy theorist, but this post suddenly got a bunch of downvotes at the same time as defensive comments from a current Googler and recent ex-Googler. Then one of my responses below to a Chrome developer got flagged for no obvious reason. Hmm.


Can you please make your substantive points without breaking the site guidelines? You did that here with your last paragraph, and worse at https://news.ycombinator.com/item?id=34197958.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


Hi dang, I'm new here so I'd appreciate clarification.

Someone defending this privacy debacle on Hacker News is a Google employee on the Chrome team and was a business cofounder with the Stanford collaborator. That person not only failed to identify how very close they are to the topic, but also phrased their comment in a way that falsely represented distance from the topic. It seems to me essential for understanding their misleading comment to be aware of the factual context.

I thought I had phrased this factual correction in a way that was neutral and not a personal attack. My assumption was that the commenter may have violated Hacker News guidelines by being so misleading. What did I do wrong?

As for the downvotes, I see that I should have emailed you rather than adding a note in the comment. Nonetheless, could you see what's going on?


> by being so misleading.

The commenter publicly identifies themselves in their HN profile and you're using that to attack them. It's completely backwards to say they've misrepresented anything. The essential thing is to assume good faith and not go on weird innuendo-laden witch-hunts.


It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but the default impact of doing this is to poison discussion so badly that the default bias has to be "don't do it". Certainly you should not be doing it as part of a flamewar post, which your comments in this thread have been. We want curious conversation here, not people cross examining each other.

I'm not disagreeing with you about the underlying issue; there's an argument to be made that the kind of "publishing" that Google/Chrome does here is really a way of obscuring it from the majority of users, and so on. HN commenters are certainly welcome to make that kind of argument. But we need you to err on the side of not posting in the flamewar style. If I see a commenter posting in the flamewar style and then also bringing in someone's personal details as ammunition, it's no longer a tough-thing-to-balance, it's just out of line.

"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."

https://news.ycombinator.com/newsguidelines.html


Google has written publicly about how this system works: https://developer.chrome.com/docs/crux/methodology/ https://www.google.com/chrome/privacy/whitepaper.html#usages...

This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.


> Google has written publicly about how this system works

If this is news to Hacker News, there is no way that regular Chrome users are aware of it. Saying something in a privacy policy or on a developer website just can't be enough for analyzing a person's URL data.

> This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.

Since when does aggregating this type of data make it fair game? This is analyzing a person's URL data from their own devices. There has always been a big bright red line for browsers touching a user's browsing history. Google crossed that line.

Also, I just checked on a fresh Chrome install. The "Make searches and browsing better" option is enabled by default and buried in Chrome settings. How is that acceptable consent for analyzing a person's URL data?


[flagged]


Please don't cross into personal attack.

https://news.ycombinator.com/newsguidelines.html

Edit: you've unfortunately been breaking the site guidelines a ton lately. Seriously not cool, and well past the line at which we start banning the account.

I don't want to ban you, but if you keep this up, we'll have to. If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules, we'd appreciate it.


> This might be new to you, but that does not mean it's some new information that's been hidden.

I downloaded Chrome on a new laptop an hour ago (at my employer's request, I'd use Firefox myself) and was certainly not aware of this.

This information was not on any screen at any point. There was a default-checked checkmark for some general statistics sharing which I only noticed after clicking download (because it was small and below the download button), but didn't click through to the privacy policy to learn more.

Guess I should have read the privacy policy. I'm trying to find what it said now, but I can't see it anymore because different terms apply to Linux downloads and there's no button to download the Windows version. Basically, visiting the same page in Firefox on Linux (instead of Edge on Windows, which I don't have access to atm) gives me different contents and no checkmark.


Is it opt-in or opt-out? And if it's opt-in, does it come with infinite nagging until you opt in?

I know logging in and syncing your data are "opt-in" options that come with infinite nagging (so, effectively required options). The information that there are different levels of syncing is news to me.


If a user can somehow someway somewhere learn about what a company is doing, then it’s OK? Really?


What is it you are proposing? If it were every institution's obligation to make sure that all its instrumental functions were obvious to every potential user, and to keep any user from engaging with the institution under any false assumptions, nothing in our society would work.

That is not to say that scrutiny is not important. You should certainly be allowed to point at any individual function and demand more upfront transparency than what is currently being offered. But be aware of the massive additional cognitive load you create, for everyone, when you are demanding not just information availability, but that this information be delivered to anyone it might concern. Any individual's preference not to care about a function would have to take a back seat to the opinion that they have to at least somewhat consider the function before engaging.

Considering how expensive this process is, "Google Chrome CrUX" would be pretty far down on my personal list of "crucial things everyone should definitely know about before possibly engaging", but to each their own.


I could see two main arguments for this not being okay:

* Chrome is secretly collecting data.

* Chrome is doing something users would object to if they knew and understood it.

I don't think either of these is the case here: they share data about what sites people generally visit in an aggregated form that doesn't reveal any individual's browsing (what's to object to?), and they talk about it in the place people would go to learn about what data they collect.


> It's fine that this is all new to you, but it's not new to you because anyone has kept this secret. At this point, you've chosen to remain ignorant.

Ah yes. Blame the user for not understanding yet another piece in Google's gargantuan data collecting machinery.

Recent court cases revealed that Google's own employees don't know what's tracked and how to turn it off. But I'm sure it's only ignorance that keeps users uninformed.


> only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)"

One big problem there is that we don't know for what percentage of users "turned on" is a euphemism for "didn't notice."


"Users who have turned on" implies that they opted in. Is this behavior opt-in or opt-out?


If CrUX is what Google is willing to make public, it makes one wonder what else is collected and stored for their own use (i.e. their moat).

I’m not using Chrome on all my devices.


Does this setting apply to Android assistant?


1. They're not funneling it through Stanford. They're posting it publicly, but on BigQuery https://developer.chrome.com/docs/crux/

2. Chrome prompts you to opt-out of metrics collection on install.

None of the reasons you've listed for this being ethically dubious are true.


This is the ethically dubious part: "Chrome prompts you to opt-out of metrics collection on install."

Because the default should be opted out, letting the user opt in if they so wish.


[flagged]


Please don't cross into personal attack.

https://news.ycombinator.com/newsguidelines.html


Is pointing out a conflict of interest a "personal attack"?


I don't have a good general answer to that.


I very much agree with you. This type of data collection MUST be opt-in to be ethical, and in Chrome it’s enabled by default and buried. The VAST majority of users have no idea this is even happening. It is grossly unethical and it is obvious that it is so, but unsurprisingly folks at Google are happy to do things like this given their salaries.


> Edit: I don't want to be a conspiracy theorist

That's not merely a good idea but also one of the site guidelines:

https://news.ycombinator.com/newsguidelines.html

Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

There's also just not writing in the high-dudgeon flamewar style, which helps with the downvotes.


Maybe your posts would get better votes if you made any effort at all to back up your claim of unethical behavior. You provided nothing.


Re: the edit.

I've noticed similar behavior in HN voting: downvote spikes, but few if any comments in line with the voting. Not sure if it's bots, human-based click farms, or people who just don't understand that disagreement is not grounds for downvoting.

Perhaps a bit of all three?


> just don't understand that disagreement is not grounds for downvoting.

It is perfectly fine on HN and always has been.


Nah. I see it differently.

The guidelines are clear about why we're here and what's expected. The emphasis is on discussion, learning, and objectivity. Yes, disagreement is mentioned (i.e., allowed), but even that needs to be constructive, yes?

A downvote, with no discussion, is frankly, in the context of the guidelines:

1) Not in the spirit of the guidelines; 2) Perhaps redundant to 1, but lazy; 3) At best, small-minded and childish.

If people want to pout about reading something they don't like, this isn't the place for them.

Yeah, I see who you are. And I'm OK w/ pushing back. That's what makes HN what it is ;)


You see it differently, but it's just not accurate; it's not in the guidelines super-explicitly, so it's easy to miss.

https://news.ycombinator.com/item?id=22910444

https://news.ycombinator.com/item?id=16131314 and there are many many others

> Yeah, I see who you are.

I'm literally a random scold on the internet, I just happen to be right about this.


I disagree.

I'm not going to explain why.

How does that feel? What value does it add? (Sweet FA, eh.)

You're right, you might be right. But that doesn't make it right. I get zero satisfaction from context-less downvotes. I don't do them. I ignore them when I get them (i.e., they have zero influence on my HN behavior). If I'm changing my mind over some lazy a-hole's click, I'm losing. Big time.

I can't imagine why anyone feels any differently. The reality is, they are pointless noise. There's not enough context to drive anything actionable for anybody.

But while I have your attention, how about a feature request: karma points that consider the discussion below a top-parent comment.


> ” If Apple or Mozilla did anything remotely like this, Hacker News would riot.”

My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla. I mean, much more. The accusation in that last sentence sounded bizarre to me.


Just suggesting that prior browser and OS privacy blowups involving those companies have been over less worrisome things, not that those companies are subject to more or less criticism. Looking back on outraged discussions of Mozilla's telemetry is kinda quaint in comparison.


> My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla.

Not the entirety of HN. As I have more than once delicately pointed out[1], Mozilla is Google's bitch.

[1] https://news.ycombinator.com/item?id=30732539


> HN hates and criticizes Google much more than Apple and Mozilla.

That is mostly because Apple almost never does something like this, and Mozilla literally never does.


> I mean, much more.

Because Google is a web advertisement company that dominates many large spheres: search, browsers (including standards committees), email, mobile (Android has a 77% market share), etc. All are things that we've come to view as crucial to modern life.

And time and again they've shown that they only view that dominance as a funnel for ad revenue, data collection, and whatever benefits them at this particular moment.



