Hacker News new | past | comments | ask | show | jobs | submit login
Buckets of Parquet Files Are Awful (scratchdata.com)
20 points by memset 8 months ago | hide | past | favorite | 41 comments



Everything in this article implies a lack of understanding of the more common use cases for Parquet - data sets that exceed typical single node RAM sizes. The article would make more sense titles "Large Buckets of Parquet Files Are Awful on a Laptop".

Don't read Parquet into memory in one pass. Use an engine that can scan it, ex. Spark.

PostgreSQL and Clickhouse have their own storage engines, and bulk importing _large_ volumes of data no matter the format (Parquet, CSV, JSON) will present challenges.

DuckDB is meant to be a single node engine. If your data size exceeds RAM, its going to be a problem.

If your data is not partitioned appropriately for its size, that is a problem with the people who created it.


(OP here) I actually quite agree with you on all of these points. There are tools, and programming techniques, for dealing with everything in my rant. And I also agree that many times people are "holding it wrong" - be it data that has not been partitioned, or people trying to do a "SELECT *" against a terabyte of parquet files.

My central point is that the "bucket of parquet" can be wrangled with by an expert, who knows the tools, their limitations, and programming techniques, in order to get it into a usable state.

But my post is a frustration with the sheer amount of tooling and knowledge required to get started - for example, the case where non-partitioned data is foisted upon non-experts.


> But my post is a frustration with the sheer amount of tooling and knowledge required to get started - for example, the case where non-partitioned data is foisted upon non-experts.

How is that different from throwing a massive relational database at someone who doesn't know how to manage indexes and other optimizations?


Not OP, but I'd guess there's greater industry awareness of relational DBs than there are of parquet files. I've been on the receiving end of a Parquet file that I didn't know how to crack open the ambiguity on how to proceed was frustrating.


This is true. There are two tools you need to know for this: duckdb and visidata. With these tools, Parquet is almost as easy as CSVs (but a few orders of magnitude more powerful and faster)

Parquet is also usable in polars and pandas, and Apache Spark too but that’s getting into complicated territory.

DuckDB it’s literally just

   Select * from ‘s3://bucket/*.parquet’


Relational database technology is more highly proven, stable, documented, and consistent than the constellation of big data solutions. Learning about indexes etc. 20 years ago would still help you today. Learning about this year's big data stack may not even help you next year.


If you’re dealing with non trivial data sizes and if you have an analytics use case and you’re using CSV, you’re already doing it wrong.

You bring up partitioning — that can help performance but with Parquet, you can get performance they is far better than CSV even without partitioning because it only has to read the headers. And no, no one really needs to know about row groups (sure you can eke out more performance but few people do). All this is to say all your points are non starters. There’s no need to know all this and even the least optimized parquet dataset is better than CSV in every way.

The only use case CSV might be good for is ETL into an actual database like Postgres which is what you’re doing in your article but that’s actually only a very small part of the analytics pipeline.

Parquet is actually not that complicated but I always meet people who only know CSV and they feel anything more complicated is beyond them. Don’t be that guy.

Shoehorning everything into CSV creates costs downstream like high storage costs (CSVs are much bigger), and no type safety means you have to validate types with a bunch of non standard rules creating a lot of technical debt.

I’ve always managed to work with parquet with either DuckDB or Visidata (and occasionally Pandas or Polars). Visidata lets you peek into parquet like Vim lets you look into CSVs.

I don’t miss CSVs at all.


The article doesn't claim that CSV is superior, though. It only mentions CSV once, as a workaround for the problem that relational databases mostly don't support importing from Parquet directly.

It does rather feel like at least that part of the rant would be better addressed by adding such support, though. If Parquet is already a de facto standard in the industry, I'm sure there already are Postgres extensions that add such support; how hard would it be to convince the maintainers to take the best-written one and use it to add support directly to core Postgres, just as it already has for CSV and JSON?


buckets of shard files are awesome. many massive pipelines doing important work use that model. it's simple and easy to program around. Inspecting data and sampling it is straightforward. Even record files (with no index) can be easily seeked and sampled. They are extremely efficient by reducing the number of inodes. Tools like LexicographicRangeSharding can be used to make the bucket object sizes more balanced.

Use sampling to make the data size more reasonable for doing local work. But, every tool I've used to deal with large data has handled large data the way I expect: either loading whole files into RAM, mapping whole files into RAM, or using fixed buffers. Typically when my work exceeds RAM, I just move to a machine with bigger RAM but still work with large shard files.


I would say on the contrary, please stop dumping CSV data just because that’s all you know all to deal with.

CSV is full of gotchas which is why CSV readers have so many switches.

They are huge (and hugely inefficient at scale), have no type safety, don’t support predicate push down and are only meant to be read linearly.

Most of the OPs complaints are about loading and memory, not about actually using the data. OP is more concerned about infra and memory usage than actually the end goal: using the the data well.

I’m not saying Parquet is the be all and end all of data formats but it’s a darned sight better than CSVs for almost all analytics use cases.

People who use CSV because it’s readable and understandable either have non analytics use cases or just don’t want to learn anything new.


People who [disagree with my analysis] just don’t want to learn anything new.

CSV is a well-discussed topic on HN. If you'd like to explore other points of view in regard to these issues, you could start with this fine recent thread:

https://news.ycombinator.com/item?id=39814334


I appreciate the caveats.

Notice I mentioned analytics use cases which is a subset of all the software engineering use cases. CSVs have their place as a simple interchange format. They’ve been overextended to use cases they were never mean to support — I’m speaking out against these.

If you’re using CSVs for data work (outside of simple ETL) it a bit like saying let me write a relational database on top of flat text files as a data store.

Meanwhile the world has moved on to engines like Postgres. But no I’m going stick with my flat text format.


Ok, I guess I am repeating myself but so does this post since I assume it spawned from this discussion https://news.ycombinator.com/item?id=40284225

Why does no-one mention Trino or the AWS branded Trino - Athena?

This will solve basically all of the problems listed.

In the case of Athena and provided you already have your data in s3, without having to set up a new service and very little work.

Using Athena is not like setting up a new database. You just point Athena to the files and it will use resources provided by AWS to query. No import required, no service required.

I think Athena is an amazing tool. I can borrow like 120 CPUs (my tasks often parallelize to that) for free during 5 minutes or even an hour to join / group / modify all my data.


Athena is great. It’s essentially querying your data using lambdas on your S3 data.

DuckDB outperforms it slightly on a single node but Athena is a good choice if you need elastic scaling.


I would not compare it to lambda. It is backed by Trino https://trino.io/ and can do map/reduce kind of jobs that lambdas can not.


I just built an NVMe cache when a customer of mine had the problem of reading parquet from S3. Basically first hit goes through the ram, but then it also writes it to NVMe. The next hit just checks etag and if it hasn’t changed, mmaps the file directly from the cached version for faster access. Eviction is typical LRU, but with leases. When disk cache is full (and sufficient storage can’t be released because files are in use) it only uses DRAM. When DRAM is full, which is very unlikely, the cache blocks and waits until stuff gets released.

I’d think twice before abandoning a more or less standard, open format for some brand new shiny thing.


The problem with Parquet on S3 is not parquet but S3.

S3 is great for bulk reads but very high latency for random reads. It’s not really designed for data querying use cases.

There are two ways around this: parquet on S3 Express Zone (a new low latency S3) or EFS.

It’s not Parquet. It’s S3.


The tradition solution to high latency for random reads is to build an index. Sorting data by a good key is also very helpful.

I typically architect systems so taht the index is either a side-file (IE, a .idx file next to to shards), or in a database (postgres, sqlite, whatever). And I will typically have large jobs that go through and update the index periodically.


As I understand Parquet, indexing won't help all that much. The whole point of the format is to have all of column 1 in the file, then column 2, then column 3, and so on for an arbitrary number of columns. So if you want to read a particular row, you're doing random accesses per column in the DB, and if you want to read all the rows out at once to do an import you basically have a "file cursor" per column. (Modulo whether you can bundle cols together or do other fancy things I'm not going to look up because I'm going to guess most people splatting these out because "Parquet is faster and better than X" are not doing those fancy things.)

The point of a Parquet file is that column access. Even on a local disk that access pattern is going to be suboptimal at the disk level, though you may make up for it on improved compression. But over a network that's not great. It's a format designed for being something that can be queried reasonably, or when you have a few hundred columns and frequently just want a few of them, it's not really a great DB dump format under any circumstances. But it can be a "good" one and that's often good enough. It can also go back to great if your dumping into a column-oriented DB that's smart enough to do a column-oriented import even though it may only offer a row-based sort of API to users otherwise, I don't know if anything is doing that yet. (I really would have thought importing a Parquet into Clickhouse would be fast and easy at effectively arbitrary sizes because Clickhouse would be able to do this.)


I've never really worked with parquet or column based systems but just searching for parquet index shows https://github.com/apache/parquet-format/blob/master/PageInd... which describes exactly what I would expect. Maybe it doesn't work well, I don't know- I typically work with structured bundles of data, rather than purely row or column formats (I have a patent on using bundled streams for this purpose).


Yes. Hence the attached NVMe, which is the best thing available for random reads. EFS also sucks for high throughput IO. The aforementioned cache obviously won’t work for scanning petabytes of data, as it relies on having at least some locality. But that wasn’t an issue for this customer


I've recently played with the Overture Maps datasets which are in .parquet and I agree that there is a learning curve.

One thing that's nice about .parquet is typical Macs come with 128/256GB SSDs and you can query 300GB of parquet in S3 without needing all those files on your local hard drive. Some of these queries are also surprisingly fast but I haven't looked into how it works.

Edit: Having used MySQL and Postgres, I actually think DuckDB is better than either.


I don’t think buckets of csv would be better, in fact I know it would be worse because you lose data types and in memory representation.


Can a CSV contain binary? I'm just thinking of the massive file size explosion of stringy'ing integers and floats (like those used for lat/lng coordinates for example)


yes, as long as you can "quote" binary data, you can put it in a CSV. Terrible idea though.


Postgres CAN read from parquet using parquet_fdw, and then you can import directly using "create table x as select" from the parquet foreign tables.

>Postgres can't read from S3, load from Parquet, or query parquet files.


This is true, but how many hosted postgres providers support it?


Surely that problem is best solved by pushing for support in base Postgres?


> Surely

I think it would be great if postgres supported parquet out of the box. However I am not sure that this is the "best" solution - it would take years for the data load to finish if we had to wait for that!


A big part of my job is to ingest data from hundreds of providers to process.

A provider that will push data to my S3 bucket, as parquet files, is the best I can think of.

I don't want no provider managed database to host, stupidly huge zipped CSV dumps, or whatever weird technology the provider can think of.

Seriously reading a file from S3 and stream processing the parquet is not the peak of engineering. What is there to complain about...

Some of the insanity I had to deal with:

- 350TB of zipped CSVs which columns and delimiter vary across time, shipped on HDD by mail

- Self hosting a MSSQL server for the provider to update

- Good old FTP polling at more or less predictable times

- Fancy you-name-it-shiny-new-tech remote database accounts

- The crappy client http website with custom SQL inspired query language to download zipped extracts

- The web 2.0 rest API, billed per request

- Snowflake data warehouse data lake big data

Please, don't be that guy, just push parquet flat files in my bucket.


Unfortunately if "that guy" is your customer then your real job is to make them happy. Which means, by definition, putting up with their crazy-ass formats, services, and data stores.

Or they're some company / agency providing the data for free essentially, but whom you in any case have precisely zero hope of bending them toward your preferences on this matter on any foreseeable timeline. Or perchance they're internal - you're stuck with lobbying and politics (again, to get them to bend your way). Meanwhile you've got stuff to build and such.

Given the associated costs and benefits; and the fact that it's just a day job, after all -- stupidly huge CSV files may not be so bad.


> Unfortunately if "that guy" is your customer then your real job is to make them happy.

Usually the power relationship is not layout like that.

Data providers rarely compete on technology, because their differentiating factor is the uniqueness of the data itself. The clients will put up with whatever insanity the provider will deliver the data with anyway.

The history repeats itself:

The older a provider, the more involved and complex its data offering. Once they have dozens of datasets, packages, clustering, a fleet of sales in the wild negociating with clients and a whole billing system on top of that, they just won't change it. That's how you end up with these insanities.


The arguments presented by the author reminds me of the saying "Incompetent dancer complained that the stage was crooked"


This article would be greatly improved by at least a single sentence saying what your proposed alternative is. Otherwise it sounds like you're saying "having to eat is really annoying. let's stop eating".


It's quite clear that this blog post is in direct response to this comment: https://news.ycombinator.com/item?id=40299364

Nothing wrong with that, but good for folks to know the context.


That is what tipped me over the edge for this little rant. I've talked to a lot of folks who don't have a background in the data ecosystem and want to do fairly mundane things ("copy data out of a parquet file into a regular text file") and just hit this wall where they have to learn to deal with edge case after edge case.

The person in that comment, they didn't choose to get this dump of data in this format, but now they have to contend with it, and it sucks that he - someone who's a very capable dev - had to spend to much time on this problem.


I think there’s a difference between saying “ok I don’t know how to do this, what’s an easier way?”

Rather than going on a rant that “bunch of parquet files sucks!” When it’s clearly untrue and all that was needed was knowledge of the tools.

Reading a ton of data from an unfamiliar format (but one that is specialized toward solving a problem) is not mundane. It’s not meant to be easy. Parquet feels like it should be easy and many tools have been built to make it easy (how far we’ve come since the Java parquet reader!) but it still requires knowing what to do.

Not knowing at first doesn’t mean the right thing to do is to fall back to the wrong but simple solution of CSV. I’ve seen so many people do that in my company’s data pipelines and it causes so many downstream issues. This is the wrong message to send that will incur so much cost for so many companies if believed.

It’s cool that you’re making tools to make things easier. you can do that without ranting about parquet which just seems misdirected.


The article isn't particularly about Parquet or S3. They are both great technologies and I use them all the time.

I mentioned CSV because it is the only format supported by Postgres for ingest (besides their binary protocol) but that is only an instance of the general problem of interoperability.

I'm not sure I agree that "reading a ton of data is not meant to be easy." I'd say it is not easy today, and there is a constellation of tools and programming techniques that can let you perform the task, but it is often a distraction from the end result you want to achieve.

The essence of my rant is that there are many steps distracting steps between "a data dump" and "the thing I want to build" in a way that has been solved in many other parts of the stack.


Things become popular because they are good/decent/better than the alternatives.

Then, popular things become the default choice even when they don't necessarily make sense because they are not solving the problems they originally meant to solve

Then people complain about it


This is the one thing that Azure does better than AWS. With Azure Synapse, you can upload parquet files to a storage account (similar to an AWS bucket), create a view in a serverless SQL instance, and interact with it via any SQL Server client.


They are not awful if you use right tools. Welcome to the world of Spark, Apache Drill, Trino/Presto, etc.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: