If you are interested in researching the technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.
It contains data about ~7 million top websites, and for every website it also includes:

- the full content of the main page;

- the verbose output of curl, containing various timing info, the HTTP headers, protocol info...
Using this dataset, you can build a service similar to https://builtwith.com/ for your research.
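For example, detecting which sites run WordPress could be a single query. A minimal sketch, assuming the data has been loaded into a table named minicrawl with a content column holding the page HTML (both names are placeholders, not the dataset's actual schema):

    -- Count sites whose main page references WordPress assets.
    -- Table and column names are assumed for illustration.
    SELECT count()
    FROM minicrawl
    WHERE content LIKE '%wp-content%';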
At a high level it looks similar to Parquet: binary, columnar, and with metadata that permits requesting a subset of the data.
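If that's right, a practical consequence is Parquet-style column pruning: a query that touches only a small column can skip the bulk of the data. A sketch of the idea, again with placeholder table and column names (minicrawl, domain, content), not the real schema:

    -- Touches only the small 'domain' column; page bodies are never read.
    SELECT count(DISTINCT domain) FROM minicrawl;

    -- Forces reading the large 'content' column, so far more data is scanned.
    SELECT count() FROM minicrawl WHERE length(content) > 100000;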
Looking at:
> Processed 4.60 thousand rows, 273.86 MB
I'd guess it's chunking the rows into groups of ~4,000.
The OP must have a nice connection if that completed in 0.5 seconds! (Or perhaps the 273.86 MB is the uncompressed size and the transfer itself was zstd-compressed, or perhaps other parts of the session caused that chunk to get cached and were elided from what was pasted into HN.)
EDIT: I was curious, so I ran the tool and watched bandwidth on iftop. It uses ~50 MB each time I run the query; 50 MB on the wire vs. the reported 273.86 MB is roughly a 5.5x ratio, which is plausible for zstd on mostly-text data. From this, I conclude: it does not cache things, the 273.86 MB is the uncompressed size, and OP has a much better internet connection than mine. :)
Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).
Description: https://github.com/ClickHouse/ClickHouse/issues/18842
You can easily try it with clickhouse-local without downloading:
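The exact command isn't reproduced here, but a minimal sketch of the idea, with <DATA_URL> standing in for the full S3 path truncated above and the file format left to ClickHouse's schema inference:

    # Inspect the inferred schema without downloading the whole file:
    clickhouse-local --query "DESCRIBE s3('<DATA_URL>')"

    # Then run ad-hoc queries directly against S3:
    clickhouse-local --query "SELECT count() FROM s3('<DATA_URL>')"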