The traditional solution to high latency for random reads is to build an index. Sorting the data by a good key also helps a lot.
I typically architect systems so that the index is either a side file (i.e., a .idx file next to the shards) or in a database (Postgres, SQLite, whatever), and I'll typically have large jobs that go through and update the index periodically.
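To make the side-file idea concrete, here's a minimal sketch, assuming newline-delimited JSON shards and a made-up ".idx" naming convention (the file names and key field are just for illustration):

```python
import json

def build_side_index(shard_path: str, key_field: str = "id") -> str:
    """Scan a newline-delimited JSON shard and write a .idx side file
    mapping each record's key to its byte offset in the shard."""
    index = {}
    offset = 0
    with open(shard_path, "rb") as f:
        for line in f:
            record = json.loads(line)
            index[record[key_field]] = offset
            offset += len(line)
    idx_path = shard_path + ".idx"
    with open(idx_path, "w") as out:
        json.dump(index, out)
    return idx_path

def read_record(shard_path: str, key: str) -> dict:
    """Use the side index to seek straight to a single record."""
    with open(shard_path + ".idx") as f:
        index = json.load(f)
    with open(shard_path, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())
```

A periodic job just re-runs build_side_index over any shards that changed; lookups then cost one seek instead of a scan.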
As I understand Parquet, indexing won't help all that much. The whole point of the format is to lay out all of column 1 in the file, then column 2, then column 3, and so on for an arbitrary number of columns. So if you want to read a particular row, you're doing a random access per column in the file, and if you want to read all the rows out at once to do an import, you basically have a "file cursor" per column. (Modulo whether you can bundle columns together or do other fancy things I'm not going to look up, because I'm going to guess most people splatting these out because "Parquet is faster and better than X" are not doing those fancy things.)
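You can see both sides of that access pattern with pyarrow; this is a sketch, and "events.parquet" plus the column names are made up:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")

# The file is split into row groups; within each, every column lives in
# its own contiguous chunk, so each column you touch is a separate read.
print(pf.metadata.num_row_groups, pf.metadata.num_columns)

# Reading just a few columns skips the rest of the file entirely --
# the case Parquet is actually good at.
narrow = pq.read_table("events.parquet", columns=["user_id", "ts"])

# Reconstructing a handful of full rows means touching every column
# chunk in the row group that holds them: one access per column.
one_group = pf.read_row_group(0)
rows = one_group.slice(0, 5).to_pylist()
```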
The point of a Parquet file is that column access. Even on a local disk that access pattern is suboptimal at the disk level, though you may make up for it with improved compression; over a network it's not great. It's a format designed for data that can be queried reasonably, or for when you have a few hundred columns and frequently want just a few of them. It's not really a great DB dump format under any circumstances, but it can be a "good" one, and that's often good enough. It could get back to "great" if you're dumping into a column-oriented DB that's smart enough to do a column-oriented import even though it may only offer a row-based API to users otherwise; I don't know if anything is doing that yet. (I really would have thought importing a Parquet file into ClickHouse would be fast and easy at effectively arbitrary sizes, because ClickHouse should be able to do exactly this.)
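For the ClickHouse case specifically, the client does accept Parquet as an input format on an INSERT, so a dump-and-load can look roughly like this. A sketch only: the table name and file are made up, and it assumes clickhouse-client is installed, pointed at your server, and that the table's schema already matches the file:

```python
import subprocess

# ClickHouse accepts Parquet as an INSERT input format, so the dump can
# be loaded without first converting it to a row-based wire format.
with open("events.parquet", "rb") as f:
    subprocess.run(
        ["clickhouse-client", "--query", "INSERT INTO events FORMAT Parquet"],
        stdin=f,
        check=True,
    )
```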
I've never really worked with Parquet or column-based systems, but just searching for "parquet index" turns up https://github.com/apache/parquet-format/blob/master/PageInd... which describes exactly what I would expect. Maybe it doesn't work well, I don't know; I typically work with structured bundles of data rather than purely row or column formats (I have a patent on using bundled streams for this purpose).
Yes. Hence the attached NVMe, which is the best thing available for random reads. EFS also sucks for high-throughput IO. The aforementioned cache obviously won't work for scanning petabytes of data, since it relies on having at least some locality, but that wasn't an issue for this customer.
S3 is great for bulk reads but very high latency for random reads. It’s not really designed for data querying use cases.
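To make the latency point concrete, a "random read" against S3 is a byte-range GET, roughly like this with boto3 (bucket, key, and offsets are placeholders). Each call is its own HTTPS request, so the cost is dominated by S3's time to first byte, typically tens of milliseconds, rather than by how much data you pull:

```python
import boto3

s3 = boto3.client("s3")

def read_range(bucket: str, key: str, start: int, length: int) -> bytes:
    """Fetch one byte range from an object -- a single 'random read'.
    Every call is a separate request, so doing thousands of small
    seeks this way is where the latency really hurts."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{start + length - 1}",
    )
    return resp["Body"].read()

# Placeholder values for illustration.
chunk = read_range("my-bucket", "dumps/events.parquet", 1_000_000, 64 * 1024)
```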
There are two ways around this: Parquet on S3 Express One Zone (a new low-latency S3 storage class) or EFS.
It’s not Parquet. It’s S3.