I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark, and I am very happy about that.
DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.
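For reference, that clone currently lives behind an experimental module. Here's a minimal sketch of what using it looks like, assuming a recent DuckDB release; the module path and exact surface are marked experimental, so details may shift between versions:

```python
import pandas as pd

# Experimental DuckDB module that mimics the PySpark API; path may change.
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# Build a "Spark" DataFrame from a pandas DataFrame, then use familiar
# PySpark-style transformations backed by DuckDB.
df = spark.createDataFrame(
    pd.DataFrame({"age": [34, 45, 23], "name": ["Joan", "Peter", "John"]})
)
df = df.withColumn("location", lit("Seattle"))

rows = df.select(col("age"), col("location")).collect()
print(rows)
```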
I had a similar experience with Spark; especially in the Scala API, it felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.
There are some frustrations in Spark, however. I remember getting stuck on Winsorizing over groups. Hilariously, there are identical functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs whether they were the same or at least did the same thing.
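For what it's worth, one way that kind of group-wise Winsorizing can be done without window functions at all is to compute the per-group bounds with a groupBy and join them back. A rough sketch with made-up column names (`grp`, `x`), assuming Spark 3.1+ where `percentile_approx` is exposed in `pyspark.sql.functions`:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 100.0), ("b", 3.0), ("b", 4.0), ("b", -50.0)],
    ["grp", "x"],
)

# Per-group 5th/95th percentile bounds via percentile_approx.
bounds = df.groupBy("grp").agg(
    F.percentile_approx("x", 0.05).alias("lo"),
    F.percentile_approx("x", 0.95).alias("hi"),
)

# Clamp each value to its own group's bounds.
winsorized = df.join(bounds, "grp").withColumn(
    "x_wins", F.least(F.greatest(F.col("x"), F.col("lo")), F.col("hi"))
)
winsorized.show()
```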
Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched, IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand. And making updates with new, more type-generic functions is a breeze. Very enjoyable.
Yeah, I couldn't get it done in the Spark API; I had to combine Spark and Spark SQL because the window function I needed was (probably) not available in the DataFrame API. It was inelegant, I thought.
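For anyone hitting the same wall, the workaround usually looks something like the sketch below: register the DataFrame as a temp view and write the window expression in Spark SQL, which accepts functions that aren't (obviously) exposed in the DataFrame API. Column and view names are made up, and it assumes a Spark version that allows `percentile_approx` as a window aggregate:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 100.0), ("b", 3.0), ("b", 4.0)],
    ["grp", "x"],
)
df.createOrReplaceTempView("t")

# Drop into SQL for the window expression, then keep working with the
# result as a normal DataFrame.
result = spark.sql("""
    SELECT grp, x,
           percentile_approx(x, 0.5) OVER (PARTITION BY grp) AS grp_median
    FROM t
""")
result.show()
```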
I have not worked with Spark, but I have used Athena/Trino and BigQuery extensively.
For me, I don't really understand the hype around Polars, other than that it fixes some annoying issues with the Pandas API at the cost of backwards compatibility.
With a single-node engine, there's a ceiling on how good it can get.
With Spark/Athena/BigQuery, the sky is the limit. It is such a freedom not to be limited by available RAM or CPU. They just scale to what they need. Some queries squeeze CPU-days of work into just a few minutes.
I'm using both Spark and Polars; to me, the additional appeal of Polars is that it is also much faster and easier to set up.
Spark is great if you have large datasets, since you can easily scale, as you said. But if the dataset is small-ish (<50 million rows), you hit a lower bound in Spark in terms of how fast the job can run. Even if the job is super simple, it takes 1-2 minutes. Polars, on the other hand, is almost instantaneous (<1 second). Doesn't sound like much, but to me it makes a huge difference when iterating on solutions.
I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/Pandas performance for transactional queries: from start to finish, get me a result in less than a second, as long as the number of underlying rows is not greater than 20k-ish. PySpark will never be super great for that.
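To make the pushdown point concrete, here's a hedged sketch with a JDBC source; the connection string, table, and column names are placeholders. Filter pushdown has been supported for a long time, while aggregate pushdown is newer and depends on the Spark version and data source, so the honest check is to look at the physical plan rather than assume it happened:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # placeholder connection
    .option("dbtable", "events")                      # placeholder table
    .load()
)

daily = (
    df.filter(F.col("event_date") >= "2024-01-01")
      .groupBy("event_date")
      .count()
)

# "PushedFilters" (and, where supported, pushed aggregates) show up in the
# plan when Spark managed to push work into the database.
daily.explain()
```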
I had just been playing the splendid voxel game Teardown recently, and noticed on reading this headline that characters in the game are named after Amanatides and Woo! What a fun Easter egg.
This looks pretty good! Results of roughly this caliber are already really common with local, freely usable tools and models, though. Picking one randomly: https://github.com/jtscmw01/ComfyUI-DiffBIR
The Reddit StableDiffusion and related groups have a ton of upscaling workflows that use diffusion models, GANs, and the like to dream up the additional pixels for extreme zoom-and-enhance use cases.
I'm reminded fondly of the opening chapter of QNTM's Fine Structure, which sets up the story with a similarly baffling (yet self-consistent, in my opinion) but wildly evocative narration of a blazing dogfight across a multiverse of hyperspaces.
One of the ways I failed to enjoy Max Gladstone's Empress of Forever was that it seemed hellbent on coining a significant glossary's worth of new vocabulary along the way. I had to be well rested and attentive to muddle through with some marginal comprehension of any given chapter's plot; I certainly couldn't read it while drowsy or distracted.
I hear what you're saying - though I've had good luck (i.e., I haven't had to give it a thought in two years) with Syncthing pumping my vaults (and all business files, for that matter) to phone, laptop, and backup NAS.