
I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark, and I am very happy about that.

DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.




I had a similar experience with Spark. The Scala API especially felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.

There are some frustrations in Spark, however. I remember getting stuck on winsorizing over groups. Hilariously, there are two functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs whether they were the same or at least did the same thing.
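For anyone unfamiliar with the term, here is a rough plain-Python sketch of what winsorizing over groups means: each value is clamped to a lower and upper percentile computed within its own group. The function name `winsorize_by_group`, the 5th/95th percentile defaults, and the `(group, value)` row shape are all made up for illustration; this is not the Spark code I was fighting with.

```python
from collections import defaultdict
from statistics import quantiles


def winsorize_by_group(rows, lower_pct=5, upper_pct=95):
    """Clamp each value to its group's [lower_pct, upper_pct] percentile range.

    rows is a list of (group_key, value) pairs (a hypothetical layout).
    Each group needs at least two values for statistics.quantiles to work.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    bounds = {}
    for key, values in groups.items():
        # n=100 yields 99 cut points; index p - 1 holds the p-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        bounds[key] = (cuts[lower_pct - 1], cuts[upper_pct - 1])

    # Clamp every value to the bounds of the group it belongs to.
    return [
        (key, min(max(value, bounds[key][0]), bounds[key][1]))
        for key, value in rows
    ]


rows = [("a", v) for v in range(101)] + [("b", 10), ("b", 20)]
clamped = winsorize_by_group(rows)
```

In Spark the per-group bounds would presumably come from something like `percentile_approx` rather than exact quantiles, which is where I got stuck.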

Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand. And making updates with new, more type-generic functions is a breeze. Very enjoyable.


Spark docs are way too minimal for my taste, at least the API docs.


Yeah, I couldn't get it done in the Spark API alone. I had to combine Spark and Spark SQL because the window function I needed was (probably) not available in Spark. I thought it was inelegant.


I have not worked with Spark, but I have used Athena/Trino and BigQuery extensively.

For me I don't really understand the hype around Polars, other than that it fixes some annoying issues with the Pandas API by sacrificing backwards compatibility.

With a single node engine you have a ceiling how good it can get.

With Spark/Athena/BigQuery the sky is the limit. It is such a freedom to not be limited by available RAM or CPU. They just scale to what they need. Some queries squeeze CPU-days of work into just a few minutes.


I'm using both Spark and Polars. To me, the additional appeal of Polars is that it is much faster and easier to set up.

Spark is great if you have large datasets, since you can easily scale as you said. But if the dataset is small-ish (<50 million rows), you hit a lower bound in Spark in terms of how fast the job can run. Even if the job is super simple, it takes 1-2 minutes. Polars, on the other hand, is almost instantaneous (< 1 second). That doesn't sound like much, but to me it makes a huge difference when iterating on solutions.


>With a single node engine you have a ceiling how good it can get.

Well, you are a lot closer to that with Polars than with Pandas at least.


I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/Pandas performance for transactional queries (from start to finish, get me a result in less than a second, as long as the number of underlying rows is not greater than 20k-ish). PySpark will never be super great for that.


I thought the same thing about Spark, coming from R and later Pandas.



