
If you are interested in researching the technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

It contains data about ~7 million top websites, and for every website it also contains:

- the full content of the main page;

- the verbose output of curl, containing various timing info, the HTTP headers, protocol info, etc.

Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

Description: https://github.com/ClickHouse/ClickHouse/issues/18842

You can easily try it with clickhouse-local without downloading:

  $ curl https://clickhouse.com/ | sh

  $ ./clickhouse local 
    ClickHouse local version 22.13.1.294 (official build).

    milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc

    ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
    │ rank    │ UInt32 │              │                    │         │                  │                │
    │ domain  │ String │              │                    │         │                  │                │
    │ log     │ String │              │                    │         │                  │                │
    │ content │ String │              │                    │         │                  │                │
    └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

    4 rows in set. Elapsed: 1.390 sec. 

    milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical

    SELECT
        rank,
        domain,
        log,
        substringUTF8(content, 1, 100)
    FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
    LIMIT 1
    FORMAT Vertical

    Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175

    Row 1:
    ──────
    rank:                           1907977
    domain:                         0--0.uk
    log:                            *   Trying 213.32.47.30:80...
    * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 302 Moved Temporarily
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:14 GMT
    < Content-Type: text/html
    < Content-Length: 154
    < Connection: keep-alive
    < Location: https://0--0.uk/
    < 
    * Ignoring the response-body
    { [154 bytes data]
    * Connection #0 to host 0--0.uk left intact
    * Issue another request to this URL: 'https://0--0.uk/'
    *   Trying 213.32.47.30:443...
    * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    *  CAfile: /etc/ssl/certs/ca-certificates.crt
    *  CApath: /etc/ssl/certs
    * TLSv1.0 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    } [512 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    { [108 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    { [4150 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    { [333 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    { [4 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    } [70 bytes data]
    * TLSv1.2 (OUT), TLS header, Finished (20):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    } [1 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    } [16 bytes data]
    * TLSv1.2 (IN), TLS header, Finished (20):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    { [16 bytes data]
    * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: CN=mail.htservices.co.uk
    *  start date: May 15 18:36:37 2022 GMT
    *  expire date: Aug 13 18:36:36 2022 GMT
    *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
    *  issuer: C=US; O=Let's Encrypt; CN=R3
    *  SSL certificate verify ok.
    * TLSv1.2 (OUT), TLS header, Supplemental data (23):
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * TLSv1.2 (IN), TLS header, Supplemental data (23):
    { [5 bytes data]
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:15 GMT
    < Content-Type: text/html;charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < X-Frame-Options: SAMEORIGIN
    < Expires: -1
    < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
    < Pragma: no-cache
    < Content-Language: en-US
    < Set-Cookie: ZM_TEST=true;Secure
    < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
    < Vary: User-Agent
    < X-UA-Compatible: IE=edge
    < Vary: Accept-Encoding, User-Agent
    < 
    { [13068 bytes data]
    * Connection #1 to host 0--0.uk left intact

    substringUTF8(content, 1, 100): <!DOCTYPE html>
    <!-- set this class so CSS definitions that now use REM size, would work relative to

    1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)



How does that work? How can clickhouse-local run queries against a 129 GB file hosted on S3 without downloading the whole thing?

Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html

If so, what's the data.native.zst file format? Is it similar to Parquet?


Yes, the native format is very similar to Parquet.

It works for Parquet as well:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1

And for CSV or TSV:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1

And for NDJSON:

  SELECT repo_name, created_at, event_type FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz', JSONLines, 'repo_name String, actor_login String, created_at String, event_type String') WHERE actor_login = 'simonw' LIMIT 10  

Note: the query above is kind of slow. Here is the same query against preloaded data - your activity in GitHub issues:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...


Another question about that demo.

https://clickhouse.com/docs/en/getting-started/example-datas... says "Dataset contains all events on GitHub from 2011 to Dec 6 2020" - but I'm seeing results in there from a couple of hours ago.

Do you know if that's continually updated and, if so, is that documented anywhere?


Yes, it's continuously updated.

The source code is here: https://github.com/ClickHouse/github-explorer

This shell script updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...


Thanks for the info! I wrote this up as a TIL: https://til.simonwillison.net/clickhouse/github-explorer


> How does that work?

Disclaimer: I'm not a ClickHouse user, but I have a bit of experience with Parquet.

It looks like the native format is (very briefly) described here: https://clickhouse.com/docs/en/interfaces/formats/#native

It looks similar to Parquet at a high level: binary, columnar, and with metadata that permits requesting a subset of the data.
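For anyone unfamiliar with the trick being discussed: "requesting a subset" over HTTP generally means byte-range requests (the same mechanism the DuckDB link above describes). A minimal sketch of the mechanics, with made-up helper names - this is a generic illustration, not ClickHouse's actual reader:

```python
# Generic sketch of HTTP range reads: ask the server for only a byte window
# of a remote file instead of downloading the whole thing.
# Helper names here are hypothetical, purely for illustration.

def range_header(start: int, end: int) -> dict:
    """Build the request header for a partial read of bytes [start, end]."""
    return {"Range": f"bytes={start}-{end}"}

def parse_content_range(value: str):
    """Parse a 206 response's Content-Range, e.g. 'bytes 0-99/138000000000'.

    Returns (first_byte, last_byte, total_size) - the total size is how a
    client can learn the file length without fetching the body.
    """
    _unit, rest = value.split(" ", 1)
    span, total = rest.split("/")
    lo, hi = span.split("-")
    return int(lo), int(hi), int(total)

# With urllib this would look like (network call, so not executed here):
#   req = urllib.request.Request(url, headers=range_header(0, 99))
#   resp = urllib.request.urlopen(req)   # expect HTTP 206 Partial Content
#   total = parse_content_range(resp.headers["Content-Range"])[2]

print(range_header(0, 99))
print(parse_content_range("bytes 0-99/1000"))
```

A reader then only needs to fetch the footer/metadata region first, figure out which column chunks a query touches, and issue range requests for just those.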

Looking at:

> Processed 4.60 thousand rows, 273.86 MB

I'd guess it's chunking the rows into groups of ~4,000.

The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86 MB is the uncompressed size after zstd decompression, or perhaps other parts of the session caused that chunk to get cached and were elided from what was pasted into HN.)

EDIT: I was curious, so I ran the tool and watched bandwidth in iftop. It transfers about 50 MB each time I run the query. From this, I conclude: it does not cache things, the 273.86 MB is the uncompressed size, and OP has a much better internet connection than me. :)
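For what it's worth, the figures quoted in this thread are self-consistent. A quick back-of-envelope check (treating the ~50 MB on-the-wire figure as the compressed size, which is only an approximation):

```python
# Back-of-envelope math using the numbers quoted above (all approximate).
processed_mb = 273.86    # uncompressed bytes ClickHouse reports processing
transferred_mb = 50.0    # rough on-the-wire bytes observed via iftop
rows = 4600              # "Processed 4.60 thousand rows"

ratio = processed_mb / transferred_mb        # implied zstd ratio, ~5.5x
per_row_kb = processed_mb * 1000 / rows      # ~60 KB per row on average

print(f"implied zstd ratio: ~{ratio:.1f}x")
print(f"avg uncompressed row: ~{per_row_kb:.0f} KB")
```

~60 KB per row is plausible given each row carries a full HTML page plus a verbose curl log, and a ~5.5x zstd ratio on HTML/text is in the normal range.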



