
If you are interested in researching the technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

It contains data about ~7 million top websites, and for every website it also contains:

- the full content of the main page;

- the verbose output of curl, containing various timing info, the HTTP headers, protocol info, etc.

Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

Description: https://github.com/ClickHouse/ClickHouse/issues/18842

You can easily try it with clickhouse-local without downloading:

  $ curl https://clickhouse.com/ | sh

  $ ./clickhouse local 
    ClickHouse local version 22.13.1.294 (official build).

    milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc

    ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
    │ rank    │ UInt32 │              │                    │         │                  │                │
    │ domain  │ String │              │                    │         │                  │                │
    │ log     │ String │              │                    │         │                  │                │
    │ content │ String │              │                    │         │                  │                │
    └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

    4 rows in set. Elapsed: 1.390 sec. 

    milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical

    SELECT
        rank,
        domain,
        log,
        substringUTF8(content, 1, 100)
    FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
    LIMIT 1
    FORMAT Vertical

    Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175

    Row 1:
    ──────
    rank:                           1907977
    domain:                         0--0.uk
    log:                            *   Trying 213.32.47.30:80...
    * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 302 Moved Temporarily
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:14 GMT
    < Content-Type: text/html
    < Content-Length: 154
    < Connection: keep-alive
    < Location: https://0--0.uk/
    < 
    * Ignoring the response-body
    { [154 bytes data]
    * Connection #0 to host 0--0.uk left intact
    * Issue another request to this URL: 'https://0--0.uk/'
    *   Trying 213.32.47.30:443...
    * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    *  CAfile: /etc/ssl/certs/ca-certificates.crt
    *  CApath: /etc/ssl/certs
    * TLSv1.0 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    } [512 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    { [108 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    { [4150 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    { [333 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    { [4 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    } [70 bytes data]
    * TLSv1.2 (OUT), TLS header, Finished (20):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    } [1 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    } [16 bytes data]
    * TLSv1.2 (IN), TLS header, Finished (20):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    { [16 bytes data]
    * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: CN=mail.htservices.co.uk
    *  start date: May 15 18:36:37 2022 GMT
    *  expire date: Aug 13 18:36:36 2022 GMT
    *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
    *  issuer: C=US; O=Let's Encrypt; CN=R3
    *  SSL certificate verify ok.
    * TLSv1.2 (OUT), TLS header, Supplemental data (23):
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * TLSv1.2 (IN), TLS header, Supplemental data (23):
    { [5 bytes data]
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:15 GMT
    < Content-Type: text/html;charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < X-Frame-Options: SAMEORIGIN
    < Expires: -1
    < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
    < Pragma: no-cache
    < Content-Language: en-US
    < Set-Cookie: ZM_TEST=true;Secure
    < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
    < Vary: User-Agent
    < X-UA-Compatible: IE=edge
    < Vary: Accept-Encoding, User-Agent
    < 
    { [13068 bytes data]
    * Connection #1 to host 0--0.uk left intact

    substringUTF8(content, 1, 100): <!DOCTYPE html>
    <!-- set this class so CSS definitions that now use REM size, would work relative to

    1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)



How does that work? How can clickhouse-local run queries against a 129 GB file hosted on S3 without downloading the whole thing?

Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html

If so, what's the data.native.zst file format? Is it similar to Parquet?


Yes, the native format is very similar to Parquet.

It works for Parquet as well:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1

And for CSV or TSV:

  SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1

And for NDJSON:

  SELECT repo_name, created_at, event_type FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz', JSONLines, 'repo_name String, actor_login String, created_at String, event_type String') WHERE actor_login = 'simonw' LIMIT 10  

Note: the query above is kind of slow. Here is the same query against preloaded data - your activity in GitHub issues:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...


Another question about that demo.

https://clickhouse.com/docs/en/getting-started/example-datas... says "Dataset contains all events on GitHub from 2011 to Dec 6 2020" - but I'm seeing results in there from a couple of hours ago.

Do you know if that's continually updated and, if so, is that documented anywhere?


Yes, it's continuously updated.

The source code is here: https://github.com/ClickHouse/github-explorer

This shell script updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...


Thanks for the info! I wrote this up as a TIL: https://til.simonwillison.net/clickhouse/github-explorer


> How does that work?

Disclaimer: I'm not a ClickHouse user, but I have a bit of experience with Parquet.

It looks like the native format is (very briefly) described here: https://clickhouse.com/docs/en/interfaces/formats/#native

It looks similar to Parquet at a high level: binary, columnar, and with metadata that permits requesting a subset of the data.
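For anyone unfamiliar with the trick being discussed: "requesting a subset" over HTTP generally means byte-range requests (the same mechanism the DuckDB link above describes). A minimal sketch of the mechanics, with made-up helper names - this is a generic illustration, not ClickHouse's actual reader:

```python
# Generic sketch of HTTP range reads: ask the server for only a byte window
# of a remote file instead of downloading the whole thing.
# Helper names here are hypothetical, purely for illustration.

def range_header(start: int, end: int) -> dict:
    """Build the request header for a partial read of bytes [start, end]."""
    return {"Range": f"bytes={start}-{end}"}

def parse_content_range(value: str):
    """Parse a 206 response's Content-Range, e.g. 'bytes 0-99/138000000000'.

    Returns (first_byte, last_byte, total_size) - the total size is how a
    client can learn the file length without fetching the body.
    """
    _unit, rest = value.split(" ", 1)
    span, total = rest.split("/")
    lo, hi = span.split("-")
    return int(lo), int(hi), int(total)

# With urllib this would look like (network call, so not executed here):
#   req = urllib.request.Request(url, headers=range_header(0, 99))
#   resp = urllib.request.urlopen(req)   # expect HTTP 206 Partial Content
#   total = parse_content_range(resp.headers["Content-Range"])[2]

print(range_header(0, 99))
print(parse_content_range("bytes 0-99/1000"))
```

A reader then only needs to fetch the footer/metadata region first, figure out which column chunks a query touches, and issue range requests for just those.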

Looking at:

> Processed 4.60 thousand rows, 273.86 MB

I'd guess it's chunking the rows into groups of ~4,000.

The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86 MB is the uncompressed size after zstd decompression, or perhaps other parts of the session caused that chunk to get cached and were elided from what was pasted into HN.)

EDIT: I was curious, so I ran the tool and watched bandwidth in iftop. It transfers about 50 MB each time I run the query. From this, I conclude: it does not cache things, the 273.86 MB is the uncompressed size, and OP has a much better internet connection than me. :)
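For what it's worth, the figures quoted in this thread are self-consistent. A quick back-of-envelope check (treating the ~50 MB on-the-wire figure as the compressed size, which is only an approximation):

```python
# Back-of-envelope math using the numbers quoted above (all approximate).
processed_mb = 273.86    # uncompressed bytes ClickHouse reports processing
transferred_mb = 50.0    # rough on-the-wire bytes observed via iftop
rows = 4600              # "Processed 4.60 thousand rows"

ratio = processed_mb / transferred_mb        # implied zstd ratio, ~5.5x
per_row_kb = processed_mb * 1000 / rows      # ~60 KB per row on average

print(f"implied zstd ratio: ~{ratio:.1f}x")
print(f"avg uncompressed row: ~{per_row_kb:.0f} KB")
```

~60 KB per row is plausible given each row carries a full HTML page plus a verbose curl log, and a ~5.5x zstd ratio on HTML/text is in the normal range.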



