Hacker News

Is DuckDB a better tool for your purposes?

Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.



I've been testing Polars and DuckDB recently. Polars is an excellent dataframe package: extremely memory efficient and fast. That said, I've hit some issues with hive partitioning of a large S3 dataset in Polars that DuckDB doesn't have. I've scanned multi-TB S3 parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other parquet datasets just as large, etc. Very impressive software. I haven't done the same kinds of things in Polars yet (just simple selects).


Hmm, how are you getting acceptable performance doing that? Is your data in a delta table or something?

I've personally found DuckDB tremendously slow against cloud storage, though perhaps I'm going through the wrong API?

I'm using https://duckdb.org/docs/guides/import/s3_import.

My data is hive partitioned. When I monitor my network throughput, I only get a few MB/s with DuckDB but can reach 1-2 GB/s with Polars.

Very possible it's a case of PEBKAC though.


Also, DuckDB lets you convert scan results directly to both pandas and Polars dataframes, so you can mix and match based on need.




