Cached Chrome Top Million Websites (github.com/zakird)
261 points by edent on Dec 31, 2022 | 97 comments



Top level domains by popularity:

    grep -oP '\.[a-z]+(?=,)' current.csv | sort | uniq -c | sort -n

  ...
  15840 .pl
  17914 .it
  20182 .de
  21690 .in
  27812 .ru
  29194 .jp
  30359 .org
  35741 .br
  36675 .net
 406052 .com
.com domains by popularity:

    grep -oP '[a-z0-9-]+\.com(?=,)' current.csv | sort | uniq -c | sort -n

    ...
    365 tistory.com
    370 fc2.com
    408 skipthegames.com
    489 online.com
    515 wordpress.com
    707 uptodown.com
    880 schoology.com
   2570 fandom.com
   2651 instructure.com
   3244 blogspot.com


It might be worth updating this comment to explain your second query.

People seem to think it is somehow measuring visits to those origins, but it's actually measuring how many unique subdomains are listed for each of those domains.
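
For instance, a rough DuckDB equivalent of that grep (a sketch, assuming the CSV is loaded into the origin/rank "current" table shown in the DuckDB comment further down) makes the counting explicit: each CSV row is one origin, so a domain with many popular subdomains appears many times.

    -- sketch: count distinct listed origins (i.e. subdomains) per .com domain
    select regexp_extract(origin, '([a-z0-9-]+\.com)$', 1) as domain,
           count(*) as listed_origins
    from current
    where origin like '%.com'
    group by domain
    order by listed_origins desc
    limit 10;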


Rather amazing to see the almost-abandoned blogspot.com there at the top.

Also interesting that I haven't heard of half of them. Some are NSFW, apparently.


These aren't sorted by number of visits, but by the number of rows in the list of most visited sites. Essentially which sites have the greatest number of frequently visited subdomains.


Loading the data into the duckdb cli [0] and doing the first query:

    create table current as select * from '202211.csv';
    select * from current;
    ┌────────────────────────────────────┬─────────┐
    │               origin               │  rank   │
    │              varchar               │  int32  │
    ├────────────────────────────────────┼─────────┤
    │ https://hochi.news                 │    1000 │
    │ https://www.xnxx.xxx               │    1000 │
    │ https://www.wordreference.com      │    1000 │
    │ https://finance.naver.com          │    1000 │
    │ https://www.macys.com              │    1000 │
    │ https://www.xv-videos1.com         │    1000 │
    │ https://fr.xhamster.com            │    1000 │
    │ https://poki.com                   │    1000 │
    │ https://salonboard.com             │    1000 │
    │ https://clgt.one                   │    1000 │


    select tld, count(*) 
    from (select reverse(substr(reverse(origin),1, position('.' in reverse(origin))-1)) tld 
            from current) 
    group by tld 
    order by count(*) desc;
    ┌───────────┬──────────────┐
    │    tld    │ count_star() │
    │  varchar  │    int64     │
    ├───────────┼──────────────┤
    │ com       │       406052 │
    │ net       │        36675 │
    │ br        │        35741 │
    │ org       │        30359 │
    │ jp        │        29194 │
    │ ru        │        27812 │
    │ in        │        21690 │
    │ de        │        20182 │
    │ it        │        17914 │
    │ pl        │        15840 │
    │ ·         │            · │
    │ ·         │            · │
    │ ·         │            · │
    │ za:5002   │            1 │
    │ lk:8090   │            1 │
    │ org:1445  │            1 │
    │ co:14443  │            1 │
    │ ar:3016   │            1 │
    │ net:8001  │            1 │
    │ care:9624 │            1 │
    │ au:8443   │            1 │
    │ com:333   │            1 │
    │ edu:9016  │            1 │
    ├───────────┴──────────────┤
    │   2076 rows (20 shown)   │
    └──────────────────────────┘
[0] https://duckdb.org/docs/installation/


Are the bottom entries domain:port? I feel like the table would be more interesting excluding those.
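
Assuming they are origins listed with an explicit port, an untested sketch reusing the parent's tld expression could strip anything after the ':' before grouping, folding those entries back into their plain TLDs:

    -- sketch: fold "tld:port" entries (origins with an explicit port) into the plain tld
    select split_part(raw, ':', 1) as tld, count(*) as cnt
    from (select reverse(substr(reverse(origin), 1, position('.' in reverse(origin)) - 1)) as raw
            from current)
    group by tld
    order by cnt desc;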


sort -r will reverse the order, from most to least popular.


uniq -c is in the pipeline because it counts the occurrences of each unique line.


Ah, right, missed that.


It's amazing that Fandom is number 3 and Wikipedia is not even there.


Wikipedia uses a .org domain, so it won't show up on "most popular .com domains" lists. (And I think the parent comment is searching for domains with lots of subdomains, which is why providers like Blogspot and Fandom show up.)
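
As a quick sanity check, a sketch against the current table from the DuckDB comment above could compare the two:

    -- sketch: distinct wikipedia.org origins (mostly language subdomains) vs. fandom.com origins in the list
    select count(*) filter (where origin like '%.wikipedia.org') as wikipedia_origins,
           count(*) filter (where origin like '%.fandom.com') as fandom_origins
    from current;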


Curious where Pornhub and other such sites rank. I always hear that porn sites are in the top X of all traffic, but people don't talk about it due to its nature.

I'm always amazed that they have a data science team. It's not something many would expect from the porn industry. I certainly didn't expect it.

https://www.pornhub.com/insights/2022-year-in-review


"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."


A quick grep shows that there are almost 2.5K domains in the 1M list with "porn" in their name.

The data science teams likely provide a considerable ROI in that industry.
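
(A DuckDB equivalent of that grep, as a sketch against the same current table; the ~2.5K figure above is from the grep, not re-verified here:)

    -- sketch: origins whose URL contains "porn" anywhere, case-insensitive
    select count(*) from current where origin ilike '%porn%';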


The majority of porn sites don't have the word "porn" in their names, though. I wish there were a categorization of these.


Does it though? Pretty sure adult results are filtered out by a Google tool named "SafeSearch". It removes anything adult from the SERP, and it is on by default.


This appears to have some unintuitive consequences. When I searched for "porn" in a cookie-less session just now, there were still porn results, but no well-known sites (at least I didn't recognize the names). Searching for literally "pornhub", the first result is "porhub.com" without the "n".

Seems like the "SafeSearch" filter is based on a list of "adult domains" instead of the indexed content at the URL.


If you ignore the content, large-scale adult sites are just like any other high traffic (bandwidth, RPS) site out there. A lot of planning goes into where their content delivery PoPs should be placed.


"New Year’s Eve kicked holiday ass with a massive –40% drop in worldwide traffic from 6pm to Midnight on December 31st." It's Dec/31, 1pm in New York right now.


I remember reading about their experience with Redis (https://groups.google.com/g/redis-db/c/d4QcWV0p-YM). There is something funny about reading engineering insights from a porn company, but they do deal with scale that not many others do!


"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."


Looks like not a single Chinese site made it into the top 1k. I guess that's reasonable, because all Google services are blocked, so CrUX can't gather any data.


Do Chinese people use Chrome? One would think the download page is blocked as well, so the demographic of Chrome users should be much smaller.

Also to consider: China uses in-app browsing a lot, with interactive experiences very similar to websites built right into the bilibili/Ali/WeChat apps.


> One would think the download page is blocked as well

Contrary to popular belief, Google only pulled its Search business out of China. The rest of the services are still hosted on Google.cn inside China. To download Chrome:

    $ curl -svk 'https://www.google.cn/intl/zh-CN/chrome/'
    *   Trying 180.163.150.34...
    * TCP_NODELAY set
    * Connected to www.google.cn (180.163.150.34) port 443 (#0)

However the "Make searches and browsing better (Sends URLs of pages you visit to Google)" data won't be collected, because the connection would be blocked.


> in-app browsing

But that's also just Chromium, isn't it, much like a PWA? Unless they made something of their own.


But does Chromium phone home with this data? Surely not, I would have thought.


I assume the data is aggregated across all devices. Chrome has 60% of desktop usage in China, but less than 10% on mobile.

Still, in a market of nearly 1B internet users, not having a single site in the top 1K suggests something is wrong with the stats. I wonder what we are missing from those numbers.


> The CrUX dataset is based on data collected from Google Chrome and is thus biased away from countries with limited Chrome usage (e.g., China).


Chrome is the dominant browser in China. And even where it's not Chrome, it's a Chromium-based browser with an alternative UI shell.


It's fascinating (and says a lot) that Google's internet monopoly persists even in places where it's outright banned.


Wow, I'm kinda surprised to find my site in the top million worldwide. I have about 100k monthly visits as measured by Cloudflare web analytics, I guess that's all it takes.


If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

It contains data about ~7 million top websites, and for every website it also contains:

- the full content of the main page;
- the verbose output of curl, containing various timing info, the HTTP headers, protocol info...

Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

Description: https://github.com/ClickHouse/ClickHouse/issues/18842

You can easily try it with clickhouse-local without downloading:

  $ curl https://clickhouse.com/ | sh

  $ ./clickhouse local 
    ClickHouse local version 22.13.1.294 (official build).

    milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc

    ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
    │ rank    │ UInt32 │              │                    │         │                  │                │
    │ domain  │ String │              │                    │         │                  │                │
    │ log     │ String │              │                    │         │                  │                │
    │ content │ String │              │                    │         │                  │                │
    └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

    4 rows in set. Elapsed: 1.390 sec. 

    milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical

    SELECT
        rank,
        domain,
        log,
        substringUTF8(content, 1, 100)
    FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
    LIMIT 1
    FORMAT Vertical

    Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175

    Row 1:
    ──────
    rank:                           1907977
    domain:                         0--0.uk
    log:                            *   Trying 213.32.47.30:80...
    * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 302 Moved Temporarily
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:14 GMT
    < Content-Type: text/html
    < Content-Length: 154
    < Connection: keep-alive
    < Location: https://0--0.uk/
    < 
    * Ignoring the response-body
    { [154 bytes data]
    * Connection #0 to host 0--0.uk left intact
    * Issue another request to this URL: 'https://0--0.uk/'
    *   Trying 213.32.47.30:443...
    * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    *  CAfile: /etc/ssl/certs/ca-certificates.crt
    *  CApath: /etc/ssl/certs
    * TLSv1.0 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    } [512 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    { [108 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    { [4150 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    { [333 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    { [4 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    } [70 bytes data]
    * TLSv1.2 (OUT), TLS header, Finished (20):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    } [1 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    } [16 bytes data]
    * TLSv1.2 (IN), TLS header, Finished (20):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    { [16 bytes data]
    * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: CN=mail.htservices.co.uk
    *  start date: May 15 18:36:37 2022 GMT
    *  expire date: Aug 13 18:36:36 2022 GMT
    *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
    *  issuer: C=US; O=Let's Encrypt; CN=R3
    *  SSL certificate verify ok.
    * TLSv1.2 (OUT), TLS header, Supplemental data (23):
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * TLSv1.2 (IN), TLS header, Supplemental data (23):
    { [5 bytes data]
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:15 GMT
    < Content-Type: text/html;charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < X-Frame-Options: SAMEORIGIN
    < Expires: -1
    < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
    < Pragma: no-cache
    < Content-Language: en-US
    < Set-Cookie: ZM_TEST=true;Secure
    < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
    < Vary: User-Agent
    < X-UA-Compatible: IE=edge
    < Vary: Accept-Encoding, User-Agent
    < 
    { [13068 bytes data]
    * Connection #1 to host 0--0.uk left intact

    substringUTF8(content, 1, 100): <!DOCTYPE html>
    <!-- set this class so CSS definitions that now use REM size, would work relative to

    1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)


How does that work? How can clickehouse-local run queries against a 129 GB file hosted on S3 without downloading the whole thing?

Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html

If so, what's the data.native.zst file format? Is it similar to Parquet?


Yes, the native format is very similar to Parquet.

It works for Parquet as well:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1
And for CSV or TSV:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1
And for ndJSON:

  SELECT repo_name, created_at, event_type FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz', JSONLines, 'repo_name String, actor_login String, created_at String, event_type String') WHERE actor_login = 'simonw' LIMIT 10  
Note: the query above is kind of slow. Here is the query from preloaded data - your activity in GitHub issues:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...


Another question about that demo.

https://clickhouse.com/docs/en/getting-started/example-datas... says "Dataset contains all events on GitHub from 2011 to Dec 6 2020" - but I'm seeing results in there from a couple of hours ago.

Do you know if that's continually updated and, if so, is that documented anywhere?


Yes, it's continuously updated.

The source code is here: https://github.com/ClickHouse/github-explorer

This shell scripts updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...


Thanks for the info! I wrote this up as a TIL: https://til.simonwillison.net/clickhouse/github-explorer


> How does that work?

Disclaimer: I'm not a ClickHouse user, but I have a bit of experience with Parquet.

It looks like the native format is (very briefly) described here: https://clickhouse.com/docs/en/interfaces/formats/#native

It looks similar at a high level to Parquet: binary, columnar and has metadata that permits requesting a subset of the data.

Looking at:

> Processed 4.60 thousand rows, 273.86 MB

I'd guess it's chunking the rows into groups of ~4,000.

The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86 MB is the uncompressed size after zstd compression, or perhaps there were other parts of the session that caused that chunk to get cached, and it was elided from what was pasted into HN.)

EDIT: I was curious, so I ran the tool and watched bandwidth on iftop. It uses about ~50MB each time I run the query. From this, I conclude: it does not cache things, the 273.86MB is the uncompressed size, and OP has a much better internet connection than me. :)


    grep http: current.csv | wc -l
    54679

So over 5% of the top 1M sites still don't use HTTPS.
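
(The same check as a DuckDB sketch against the current table from above:)

    -- sketch: share of origins listed with a plain http:// scheme
    select round(100.0 * count(*) filter (where origin like 'http://%') / count(*), 2) as pct_http
    from current;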


I have prepared a nice report: the websites grouped by rank (1..10, 11..100, ...), the percentage using TLS, and an example of a non-TLS website in each group:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...

    SELECT
        floor(log10(rank)) AS r,
        count() AS total,
        sum(log LIKE '%TLS%') AS tls,
        round(tls / total, 2) AS ratio,
        anyIf(domain, log NOT LIKE '%TLS%')
    FROM minicrawl
    WHERE log LIKE '%Content-Length:%'
    GROUP BY r
    ORDER BY r

    ┌─r─┬───total─┬─────tls─┬─ratio─┬─anyIf(domain, notLike(log, '%TLS%'))─┐
    │ 0 │       6 │       6 │     1 │                                      │
    │ 1 │      61 │      58 │  0.95 │ baidu.com                            │
    │ 2 │     599 │     562 │  0.94 │ google.cn                            │
    │ 3 │    5591 │    5057 │   0.9 │ volganet.ru                          │
    │ 4 │   51279 │   44291 │  0.86 │ furbo.co                             │
    │ 5 │  476181 │  361910 │  0.76 │ funygold.com                         │
    │ 6 │ 3797023 │ 2927052 │  0.77 │ funyo.vip                            │
    └───┴─────────┴─────────┴───────┴──────────────────────────────────────┘

    7 rows in set. Elapsed: 0.844 sec. Processed 7.59 million rows, 43.74 GB (8.99 million rows/s., 51.83 GB/s.)


Excuse my ignorance, what CLI tool did you use to execute this query? Thanks!


clickhouse-client

Download it as:

  curl https://clickhouse.com/ | sh
Connect to the demo service:

  clickhouse-client --host play.clickhouse.com --user play --secure


If I am not mistaken, 8310 sites offer http and https:

    grep -o -E "://.*?," current.csv | sort | uniq -c | grep -v "1 ://" | wc -l
    8310
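
A rough DuckDB version of the same idea, as a sketch: count hosts that appear under both schemes.

    -- sketch: hosts listed with both an http:// and an https:// origin
    select count(*) as dual_scheme_hosts
    from (
        select split_part(origin, '://', 2) as host
        from current
        group by host
        having count(distinct split_part(origin, '://', 1)) = 2
    );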


How about websites that are browsed http first and then redirected? People might browse for a domain without the https prefix for convenience (or old links) and the browser defaults to http.


“The 5% rule”


[flagged]


What accessibility challenges does https pose?


The accessibility challenges are all the extra failure modes HTTPS introduces, such as client clock offset, older devices, expired certificates, hostname mismatches, and many others.

Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.


> Security is not the only priority in existence. Sometimes people just want to access the information. And when that is the case, HTTPS can be a huge impediment.

I suppose you'd be fine if your government started replacing the content of Wikipedia with their own propaganda, or removing critical information about themselves from the traffic?


So far they haven't. Meanwhile, millions of people with older devices cannot use them to access Wikipedia.


You're aware that your government is already doing this, right? (Your argument is invalid.)


How? Edits on Wikipedia are public, including historical monthly dumps available over BitTorrent all the way back to 2006, and I can verify that Wikipedia's servers are serving it correctly by cross-referencing those dumps and the edits. With HTTP, any ISP (whose operators all tend to favor government cooperation) or switch in the middle could sed the content to remove or slightly alter known-critical content.


Yeah... I haven't gotten a good response for why localhost should scream "insecure", or why Wikipedia should fail if my RTC clock is wonky.

I am not denying "security from snoops while paying with credit cards" and all that banking shit, or messaging. Heck, email is sent in the clear, but we are told to connect to the website (for webmail) over HTTPS for "security"...

Sure, sure, security is all good and snazzy, but I regularly come across websites whose certs have expired, and the browser makes it appear as if the sky will fall if I click on continue.

Then we have ISPs who use DPI (my current ISP, Reliance Jio, has been doing it from day 1), so what's the point of pretending anyway?


> why localhost should scream "insecure"

Localhost, even with HTTP, is a secure context: https://developer.mozilla.org/en-US/docs/Web/Security/Secure...

What tool is screaming at you that localhost is insecure?


They may be using a self-signed cert so it’s https://localhost and the browser is flagging the cert rather than localhost itself.


Browsers. The padlock icon is crossed out.


I just tested this and don't see that. I compared http://neverssl.com to running "python3 -m http.server" and visiting http://localhost:8000

* Chrome: "Not Secure" on neverssl, "i" in a circle on localhost

* Firefox: Padlock with a red line through it on neverssl, page icon on localhost

* Safari: "Not Secure" on neverssl, no message on localhost


> email is sent in the clear, but we are told to connect to the website (for webmail) over HTTPS for "security"

This is not true; most email is encrypted in transit. It is not end-to-end, because your email service stores the messages (perhaps encrypted, perhaps not).

Edit: https://transparencyreport.google.com/safer-email/overview?h...

You can see that 84% of outbound email is encrypted. This is probably a good proxy for the general state of email TLS transport.


This raises the question: how much user telemetry does Chrome send back to Google?


By default, a lot. However, they also are (or at least used to be, it seems to be quite outdated now) really good at documenting their telemetry publicly: https://www.google.com/chrome/privacy/whitepaper.html

(I haven't checked whether the documentation is complete/accurate, of course.)


Thanks for pointing this out. Can definitely put this dataset to use.


This is very ethically dubious. Google is collecting raw URLs from Chrome users who turned on history syncing across their own devices, then reusing the data and funneling it through Stanford. No way Chrome users understand or approve of this.

The paper tries to justify its ethics with Google's privacy policy, which is laughable. There are so many papers about how meaningless privacy policies are. If Apple or Mozilla did anything remotely like this, Hacker News would riot.

Edit: I don't want to be a conspiracy theorist, but this post suddenly got a bunch of downvotes at the same time as defensive comments from a current Googler and recent ex-Googler. Then one of my responses below to a Chrome developer got flagged for no obvious reason. Hmm.


Can you please make your substantive points without breaking the site guidelines? You did that here with your last paragraph, and worse at https://news.ycombinator.com/item?id=34197958.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


Hi dang, I'm new here so I'd appreciate clarification.

Someone defending this privacy debacle on Hacker News is a Google employee on the Chrome team and was a business cofounder with the Stanford collaborator. That person not only failed to identify how very close they are to the topic, but also phrased their comment in a way that falsely represented distance from the topic. It seems to me essential for understanding their misleading comment to be aware of the factual context.

I thought I had phrased this factual correction in a way that was neutral and not a personal attack. My assumption was that the commenter may have violated Hacker News guidelines by being so misleading. What did I do wrong?

As for the downvotes, I see that I should have emailed you rather than adding a note in the comment. Nonetheless, could you see what's going on?


> by being so misleading.

The commenter publicly identifies themselves in their HN profile and you're using that to attack them. It's completely backwards to say they've misrepresented anything. The essential thing is to assume good faith and not go on weird innuendo-laden witch-hunts.


It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but the default impact of doing this is to poison discussion so badly that the default bias has to be "don't do it". Certainly you should not be doing it as part of a flamewar post, which your comments in this thread have been. We want curious conversation here, not people cross examining each other.

I'm not disagreeing with you about the underlying issue; there's an argument to be made that the kind of "publishing" that Google/Chrome does here is really a way of obscuring it from the majority of users, and so on. HN commenters are certainly welcome to make that kind of argument. But we need you to err on the side of not posting in the flamewar style. If I see a commenter posting in the flamewar style and then also bringing in someone's personal details as ammunition, it's no longer a tough-thing-to-balance, it's just out of line.

"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."

https://news.ycombinator.com/newsguidelines.html


Google has written publicly about how this system works: https://developer.chrome.com/docs/crux/methodology/ https://www.google.com/chrome/privacy/whitepaper.html#usages...

This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.


> Google has written publicly about how this system works

If this is news to Hacker News, there is no way that regular Chrome users are aware of it. Saying something in a privacy policy or on a developer website just can't be enough for analyzing a person's URL data.

> This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.

Since when does aggregating this type of data make it fair game? This is analyzing a person's URL data from their own devices. There has always been a big bright red line for browsers touching a user's browsing history. Google crossed that line.

Also, I just checked on a fresh Chrome install. The "Make searches and browsing better" option is enabled by default and buried in Chrome settings. How is that acceptable consent for analyzing a person's URL data?


[flagged]


Please don't cross into personal attack.

https://news.ycombinator.com/newsguidelines.html

Edit: you've unfortunately been breaking the site guidelines a ton lately. Seriously not cool, and well past the line at which we start banning the account.

I don't want to ban you, but if you keep this up, we'll have to. If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules, we'd appreciate it.


> This might be new to you, but that does not mean it's some new information that's been hidden.

I downloaded Chrome on a new laptop an hour ago (at my employer's request, I'd use Firefox myself) and was certainly not aware of this.

This information was not on any screen at any point. There was a default-checked checkmark for some general statistics sharing which I only noticed after clicking download (because it was small and below the download button), but didn't click through to the privacy policy to learn more.

Guess I should have read the privacy policy. I'm trying to find what it said now, but I can't see it anymore because different terms apply to Linux downloads and there's no button to download the Windows version. Basically, visiting the same page in Firefox on Linux (instead of Edge on Windows, which I don't have access to atm) gives me different contents and no checkmark.


Is it opt-in or opt-out? And if it's opt-in, does it come with infinite nagging until you opt in?

I know logging in and syncing your data are "opt-in" options that come with infinite nagging (so, effectively required options). The information that there are different levels of syncing is news to me.


If a user can somehow someway somewhere learn about what a company is doing, then it’s OK? Really?


What is it you are proposing? If it were every institution's obligation to make sure that all its instrumental functions were obvious to every potential user, and to keep any user from engaging with the institution under any false assumptions, nothing in our society would work.

That is not to say that scrutiny is not important. You should certainly be allowed to point at any individual function and demand more upfront transparency than what is currently being offered. But be aware of the massive additional cognitive load you create, for everyone, when you are demanding not just information availability, but that this information be delivered to anyone it might concern. Any individual's preference not to care about a function would have to take a back seat to the opinion that they have to at least somewhat consider the function before engaging.

Considering how expensive this process is, "Google Chrome CrUX" would be pretty far down on my personal list of "crucial things everyone should definitely know about before possibly engaging", but to each their own.


I could see two main arguments for this not being okay:

* Chrome is secretly collecting data.

* Chrome is doing something users would object to if they knew and understood it.

I don't think either of these is the case here: they share data about what sites people generally visit in an aggregated form that doesn't reveal any individual's browsing (what's to object to?), and they talk about it in the place people would go to learn about what data they collect.


> It's fine that this is all new to you, but it's not new to you because anyone has kept this secret. At this point, you've chosen to remain ignorant.

Ah yes. Blame the user for not understanding yet another piece in Google's gargantuan data collecting machinery.

Recent court cases revealed that Google's own employees don't know what's tracked and how to turn it off. But I'm sure it's only ignorance that keeps users uninformed.


> only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)"

One big problem there is that we don't know for what percentage of users "turned on" is a euphemism for "didn't notice."


"Users who have turned on" implies that they opted in. Is this behavior opt-in or opt-out?


If CrUX is what Google is willing to make public, it makes one wonder what else is collected and stored for their own use (i.e. their moat).

I’m not using Chrome on all my devices.


Does this setting apply to Android assistant?


1. They're not funneling it through Stanford. They're posting it publicly, but on BigQuery https://developer.chrome.com/docs/crux/

2. Chrome prompts you to opt-out of metrics collection on install.

None of the reasons you've listed for this being ethically dubious are true.


This is the ethically dubious part: "Chrome prompts you to opt-out of metrics collection on install."

Because the default should be opted out, letting the user opt in if they so wish.


[flagged]


Please don't cross into personal attack.

https://news.ycombinator.com/newsguidelines.html


Is pointing out a conflict of interest a "personal attack"?


I don't have a good general answer to that.


I very much agree with you. This type of data collection MUST be opt-in to be ethical, and in Chrome it’s enabled by default and buried. The VAST majority of users have no idea this is even happening. It is grossly unethical and it is obvious that it is so, but unsurprisingly folks at Google are happy to do things like this given their salaries.


> Edit: I don't want to be a conspiracy theorist

That's not merely a good idea but also one of the site guidelines:

https://news.ycombinator.com/newsguidelines.html

Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

There's also just not writing in the high-dudgeon flamewar style, which helps with the downvotes.


Maybe your posts would get better votes if you made any effort at all to back up your claim of unethical behavior. You provided nothing.


Re: the edit.

I've noticed similar behavior in HN voting: downvote spikes, but few if any comments in line with the voting. Not sure if it's bots, human-based click farms, or people who just don't understand that disagreement is not grounds for downvoting.

Perhaps a bit of all three?


> just don't understand that disagreement is not grounds for downvoting.

It is perfectly fine on HN and always has been.


Nah. I see it differently.

The guidelines are clear about why we're here and what's expected. The emphasis is on discussion, learning, and objectivity. Yes, disagreement is mentioned (i.e., allowed), but even that needs to be constructive, yes?

A downvote, with no discussion, is frankly, in the context of the guidelines:

1) Not in the spirit of the guidelines; 2) Perhaps redundant to 1, but lazy; 3) At best, small-minded and childish.

If people want to pout about reading something they don't like, this isn't the place for them.

Yeah, I see who you are. And I'm OK w/ pushing back. That's what makes HN what it is ;)


You see it differently, but it's just not accurate; it's not in the guidelines super-explicitly, so it's easy to miss.

https://news.ycombinator.com/item?id=22910444

https://news.ycombinator.com/item?id=16131314 and there are many many others

> Yeah, I see who you are.

I'm literally a random scold on the internet, I just happen to be right about this.


I disagree.

I'm not going to explain why.

How does that feel? What value does it add? (Sweet FA, eh.)

You're right, you might be right. But that doesn't make it right. I get zero satisfaction from context-less downvotes. I don't do them. I ignore them when I get them (i.e., they have zero influence on my HN behavior). If I'm changing my mind over some lazy a-hole's click, I'm losing. Big time.

I can't imagine why anyone feels any differently. The reality is, they are pointless noise. There's not enough context to drive anything actionable for anybody.

But while I have your attention, how about a feature request: karma points that consider the discussion below a top-parent comment.


> ” If Apple or Mozilla did anything remotely like this, Hacker News would riot.”

My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla. I mean, much more. The accusation in that last sentence sounded bizarre to me.


Just suggesting that prior browser and OS privacy blowups involving those companies have been over less worrisome things, not that those companies are subject to more or less criticism. Looking back on outraged discussions of Mozilla's telemetry is kinda quaint in comparison.


> My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla.

Not the entirety of HN. As I have more than once delicately pointed out[1], Mozilla is Google's bitch.

[1] https://news.ycombinator.com/item?id=30732539


> HN hates and criticizes Google much more than Apple and Mozilla.

That is mostly because Apple almost never does something like this, and Mozilla literally never does.


> I mean, much more.

Because Google is a web advertisement company that dominates many large spheres: search, browsers (including standards committees), email, mobile (Android has a 77% market share), etc. All are things that we've come to view as crucial to modern life.

And time and again they've shown that they only view that dominance as a funnel for ad revenue, data collection, and whatever benefits them at this particular moment.



