
I have not worked with Spark, but I have used Athena/Trino and BigQuery extensively.

I don't really understand the hype around Polars, other than that it fixes some annoying issues with the Pandas API at the cost of backwards compatibility.

With a single-node engine you have a ceiling on how good it can get.

With Spark/Athena/BigQuery the sky is the limit. It is such a freedom to not be limited by available RAM or CPU. They just scale to what they need. Some queries squeeze CPU-days of work into just a few minutes.




I'm using both Spark and Polars. To me the additional appeal of Polars is that it is much faster and easier to set up.

Spark is great if you have large datasets, since you can easily scale as you said. But if the dataset is small-ish (<50 million rows) you hit a lower bound in Spark on how fast the job can run. Even if the job is super simple it takes 1-2 minutes. Polars, on the other hand, is almost instantaneous (< 1 second). That doesn't sound like much, but it makes a huge difference when iterating on solutions.
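
For a sense of what that iteration loop looks like, here's a minimal Polars sketch (the file name and column names are made up); a small aggregation like this typically returns in well under a second on one machine:

    import polars as pl

    # Hypothetical input: a small-ish CSV (up to tens of millions of rows).
    # scan_csv builds a lazy query; nothing is read until .collect().
    result = (
        pl.scan_csv("events.csv")
        .filter(pl.col("status") == "ok")
        .group_by("user_id")
        .agg(pl.col("duration_ms").mean().alias("avg_duration_ms"))
        .collect()
    )

    print(result)

No cluster, no session startup, no job submission overhead; you just run it again after each change.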


>With a single-node engine you have a ceiling on how good it can get.

Well, you are a lot closer to that with Polars than with Pandas at least.



