Hacker News new | past | comments | ask | show | jobs | submit login
Non-elementary group-by aggregations in Polars vs pandas (quansight.org)
135 points by rbanffy 46 days ago | hide | past | favorite | 74 comments



I did non trivial work with apache spark dataframes and came to appreciate them before ever being exposed to Pandas. After spark, pandas just seemed frustrating and incomprehensible. Polars is much more like spark and I am very happy about that.

DuckDb even goes so far as to include a clone of the pyspark dataframe API, so somebody there must like it too.


I had a similar experience with spark, especially in the Scala API it felt very expressive and concise once I got used to certain idioms. Also +1 on duckdb which is excellent.

There are some frustrations in spark however, I remember getting stuck on Winsorizing over groups. Hilariously there are identical functions called `percentile_approx` and `approx_percentile` and it wasn't clear from the docs they were the same or at least did the same thing.

Given all that, the ergonomics of Julia for general purpose data handling is really unmatched IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple years later and could easily understand. And making updates with new more type-generic functions is a breeze. Very enjoyable.


Spark docs are way too minimal for my taste, at least the API docs.


yeah i couldnt get it done in spark api had to combine spark and spark sql bc the window function i needed was (probably) not available in spark. it was inelegant i thought.


I have not worked with Spark, but I have used Athena/Trino and BigQuery extensively.

For me I don't really understand the hype around Polars, other than that it fixes some annoying issues with the Pandas API by sacrificing backwards compatibility.

With a single node engine you have a ceiling how good it can get.

With Spark/Athena/BigQuery the sky is the limit. It is such a freedom to not be limited by available RAM or CPU. They just scale to what they need. Some queryies squeeze in CPU-days in just a few minutes.


I'm using both Spark and polars, to me the appeal of polars is additionally it is also much faster and easier to set up.

Spark is great if you have large datasets since you can easily scale as you said. But if the dataset is small-ish (<50 million rows) you hit a lower bound in Spark in terms of how fast the job can run. Even if the job is super simple it take 1-2 minutes. Polars on the other hand is almost instantaneous (< 1 second). Doesn't sound like much but to me makes a huge difference when iterating on solutions.


>With a single node engine you have a ceiling how good it can get.

Well, you are a lot closer to that with Polars than with Pandas at least.


I don't know how well the polars implementation works, but what I love about PySpark is that sometimes spark is able to push those groupings down to the database. Not always, but sometimes. However I imagine that many people love polars/pandas performance for transactional queries (from start to finish get me a result in less than a second (as long as the number of underlying rows is not greater than 20k-ish). Pyspark will never be super great for that.


I thought the same thing about Spark, coming from R and later Pandas.


Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).

And I say this as someone who makes much of their living from Pandas.


I think pandas is well aware of some of the unfortunate legacy API decisions without Polars. They are trapped by backwards compatibility. Wes’ “Things I Hate About Pandas” post covers the highlights. Most of which boils down to having not put a layer between numpy and pandas. Which is why they were stuck with the unfortunate integer null situation.


Which is all stuff they could fix, if they'd be willing to, with a major version bump. They'd need a killer feature to encourage that migration though.


The really brutal thing is all of the code using Pandas written by researchers and non-software engineers running quietly in lab environments. Difficult to reproduce environments, small or non-existent test suites, code written by grad students long gone. If the Pandas interface breaks for installs done via `pip install pandas` it will cause a lot of pain.

With that acknowledged, it'll make life a lot easier on everyone if the "fix the API" Pandas 3 had a different package name. Polars and others seem like exactly that solution, even if not literally Pandas.


Yes, but the interesting thing is that pandas developers are moving towards adopting polars-like advances now. Which wasn't the case previously.


The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.

On a tangent, i wonder what this op would look like in SQL? Probably would need support for filtering in a window function, which I'm not sure is standardized?


Props to Ritchie Vink for designing polars.

But also props to Wes McKinney for giving us a dataframe library during a time when we had none. Java still doesn’t have a decent dataframe library so we mustn’t take these things for granted.

The Pandas API is no longer the way things should be done today nor should it be in new tutorials. Pandas was the jquery of its time —- great but no longer the state of the art. But I have much gratitude for it being around when it was needed.


I think it's worth giving some credit to prior art here. Polars was heavily inspired by the "Spark SQL" API, and Pandas by R data frames.

I disagree that Pandas is no longer state of the art: its interface is optimized for a different use case compared to Polars.


I think a lot of the weirdness in pandas comes from firstly the original focus on time series, and secondly the attempt to make logical indexing work like it did in R.

I'm not a big fan of the pandas API but it's a super useful tool nonetheless.


Well said.


Without having checked, maybe something like:

  select id, max(views) from <tbl>
  where sales > avg(sales) over (partition by id) group by 1
In dplyr, there is an ‘old style’ method which works on an intermediate ‘grouped data frame’ and a new style which doesn’t. In the old style:

  df |> group_by(id) |>
    filter(sales > mean(sales)) |>
    summarize(max(views))
In the new style, either:

  df |> filter(.by=id, sales>mean(sales)) |> summarize(.by=id,max(views))
Or:

  df |> summarize(.by=id, max(views[sales>mean(sales)]))


I understand that is the new style, but I'm not aware of the rationale of the switch to this method. The old style seems must clearer and easier to read.


Trying to find a good reference for you. There’s this from the author of dplyr:

https://news.ycombinator.com/item?id=30067406

And here in the docs:

https://dplyr.tidyverse.org/reference/dplyr_by.html

And maybe also:

https://www.tidyverse.org/blog/2023/02/dplyr-1-1-0-per-opera...

I guess part of it is that there’s some ‘non-locality’ in the pipeline where the grouping could be relatively distant from the operation acting on the grouped data. Similarly, you get to worry about eg grouping data that is already grouped.

I quite like the prql solution which is to have a ‘structured grouping’ where you have to delimit the pipeline that operates on grouped data, but maybe it can still lead to bad edits for complex queries.


Here's an example implementation in MSSQL - https://data.stackexchange.com/stackoverflow/query/edit/1873...

No need to filter within the window function if you use subquery or CTE, which is supported everywhere.


  -- "find the maximum value of 'views',
  -- where 'sales' is greater than its mean, per 'id'".

  select max(views), id  -- "find the maximum value of   'views',
  from example_table as et 
  where exists 
  (
    SELECT *
      FROM 
    ( 
      SELECT id, avg(sales) as mean_sales 
      FROM example_table
      GROUP by id 
    ) as f --  
    where et.sales  > f.mean_sales -- where 'sales' is greater than its mean
    and et.id = f.id 
  )
  group by id; -- per 'id'".


If I understand what you mean, it's supported everywhere that has windowing at all afaik.

If you mean filtering the rows in the window, you can do 'sum(case when condition then value else null end) over (window)'; if you mean selecting rows based on the value of a window function you use 'qualify' where supported or a trivial subquery or CTE and 'where' (which qualify is just shorthand for)


https://en.wikipedia.org/wiki/SQL?useskin=vector#Standardiza...

According to wikipedia, windowing was standardized back in 2003.


I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand to involve less of a speed difference compared to the grouping operations mentioned in the post, but still substantial). Anyone had success doing this and found it to be worth the effort?


It really depends on what you are doing.

My book, Effective Polars, has a whole chapter devoted to the question of whether porting from pandas makes sense.


I converted to DuckDB and Polars. It’s worth it for the speed improvement.

However there are subtle differences between Pandas and Polars behaviors so regression testing is your friend. It’s not 1:1 mapping.


There's been so many subtle changes in pandas to pandas upgrades (especially groupby is somehow always hit), so regression tests are always needed...


Which things did you decide to move to duckdb?


A lot of aggregation, pivot and join logic, like:

    df = duckdb.query("from tbl1 join tbl2 using (id) where id is not null ").pl()
The .pl() is a Arrow-based conversation from DuckDB to Polars. It's in-memory and I believe zero-copy, so it happens almost instantaneously.

I go back and forth between DuckDB and Polars functions in the same scope because it's so cheap to convert between the two.


I'm genuinely curious, if you are already using (and assuming that you like) dataframe APIs, why you would use SQL?


It's the opposite; I prefer DuckDB and generally work with DuckDB's friendly SQL interface. SQL is declarative and is (for me) more intuitive than method-chaining -- especially for complex analytic operations that happen in one go.

(software people might beg to differ about the intuitive bit because they are more used to an imperative style, and to my surprise, even the best software engineers struggle with SQL because it requires one to think in set and relation operations rather than function calls, which many software folks are not used to)

I actually don't use the Polars dataframe APIs much except for some operations which are easier to do in dataframe form, like applying a Python function as UDF, or transposing (not pivoting) a dataframe.

Also Polars is good for materializing the query into a dataframe rapidly, which can then be passed into methods/functions. It's also a lot easier to unit test dataframes than SQL tables. There's a lot more tooling for that.


This is a prime use case for LLMs


It’s a very bad use case for an LLM because they don’t have as much polars code in their training data and the polars API has changed a lot in the past couple of years.

Copilot is still suggesting out of date polars code to me. ChatGPT and Claude regularly suggest operations that are valid in pandas and not in polars.


Can confirm this, LLMs write almost exclusively in broken polars. However I convert pandas files to polars semi-regularly, and it's still a hell of a lot faster to get an LLM to write the first draft and correct the 4 or 5 attribute errors it makes


Ah yes, just let the LLM refactor your numeric methods, #yolo.

What could go wrong?


The power of having an API that allows usage of the Free monad. And in less-funny-FP-speak, the power of allowing the user write a program (expressions), that the sufficiently-smart backend later compiles/interprets.

Awesome! Didn't expect such a vast difference in usability at first.


I've always liked scatter solutions for these kind of problems:

  import numpy as np
  
  def scatter_mean(index, value):
      sums = np.zeros(max(index)+1)
      counts = np.zeros(max(index)+1)
      for i in range(len(index)):
          j = index[i]
          sums[j] += value[i]
          counts[j] += 1
      return sums / counts
  
  def scatter_max(index, value):
      maxs = -np.inf * np.ones(max(index)+1)
      for i in range(len(index)):
          j = index[i]
          maxs[j] = max(maxs[j], value[i])
      return maxs
  
  def scatter_count(index):
      counts = np.zeros(max(index)+1, dtype=np.int32)
      for i in range(len(index)):
          counts[index[i]] += 1
      return counts
  
  id = np.array([1, 1, 1, 2, 2, 2]) - 1
  sales = np.array([4, 1, 2, 7, 6, 7])
  views = np.array([3, 1, 2, 8, 6, 7])
  means = scatter_mean(id, sales).repeat(scatter_count(id))
  print(views[sales > means].max())
Obviously you'd need good implementations of the scatter operations, not these naive python for-loops. But once you have them the solution is a pretty readable two-liner.


I’ve moved mostly to polars. I still have some frameworks that demand pandas and pandas is still a very solid dataframe, but when I need to interpolate months in millions of lines of quarterly data, polars just blows it away.

Even better is using tools like Narwhals and Ibis which can convert back and forth to any frames you want.


Data point: I have a medium-complexity data transformation use case that I still prefer pandas for.

Reason: I can speed things up fairly easily with Cython functions, and do multithreading using the Python module. With polars I would have to learn Rust for that.


If I'm doing some data science just for fun and personal projects, is there any reason to not go with Polars?

I took some data science classes in grad school, but basically haven't had any reason to touch pandas since I graduated. But, did like the ecosystem of tools, learning materials, and other libraries surrounding it when I was working with it. I recently just started a new project and am quickly going through my old notes to refamiliarize myself with pandas, but maybe I should just go and learn Polars?


Pandas can be easier for interactive poking around on smaller data sets. It feels more like Excel, whereas Polars feels more like SQL. It's no surprise that on a forum full of programmers Polars is widely preferred, but IMO it's worth at least trying both.


I adore polars for its speed and I find the interface easier than pandas. But pandas still has a richer ecosystem of stuff built around it. I try to use polars in greenfield things but occasionally get yoinked back.


Go with polars, you can always convert a polars data frame back when needed.


Does anyone have a good heuristic for when a dataframe library is a good tool choice? I work on a team that has a lot of data scientists and a few engineers (including myself) and I often see the data scientists using dataframes when simple python classes would be much more appropriate so that you have a better sense of the object you're working with. I'm been having a hard time getting this idea across to people though.


Frankly, the heuristic I've developed over the past few years working on a team that sounds like yours is: The data scientists are probably right.

If you're actually operating on an object, ie. the equivalent to a single row in a dataframe, then yeah, it's silly to use a dataframe library. But if you're operating on N value objects ... yeah you probably want a dataframe with N rows and a column for each field in your object.

Your mileage may vary I guess, but I resisted this for quite awhile and I now think I was the one who was wrong.


I appreciate this.

Most software devs are used to working 1-dimensional collections like lists, or tree-like abstractions like dicts, or some combination of those. This is why most abstractions and complex data types are built on these. Objects are a natural progression.

But in the data world, high-dimensions are modeled using dataframes (analogously, tables). This is a paradigm shift for most pure software people because manipulating tables typically require manipulating sets and relations. Joining two tables requires knowing how the join-columns relate (inner, full outer, left, right, cross). Aggregation and window functions require thinking in terms of sub-groupings and subsets.

It's not super intuitive unless you have to manipulate and reshape large amounts of data every day, which data scientists and data engineers have to do, and which software engineers typically don't do very often. It's just a kind of muscle memory that gets developed.

I definitely had trouble at first getting software engineers to buy into DuckDB because they were unfamiliar and uncomfortable with SQL. Fortunately some of them had a growth mindset and were willing to learn, and those folks now have now acquired a very powerful tool (DuckDB) and a new way of thinking about large-data manipulation. When data is a certain size, iterative constructs like "for" loops become impractical.


Nailed it.


DataFrames are easy to use, everyone knows how to use them, you can move fast, and it's easy to iterate and test differences between things, and reviewing the code is a breeze.

That said, my team moved to polars about a year ago and we haven't looked back.


> reviewing the code is a breeze

I have the opposite opinion. In a previous codebase I fought hard to use dataclasses & type hinting where possible over dictionaries, because with dictionaries you'd never know what type anything was, or what keys were present. That worked nicely and it was much easier to understand the codebase.

Now I've been put on a Pandas project and it's full of mysterious

    df = df[df["thing"] == "value"]
I just feel like we've gone back to the unreadability of dictionaries.

Everything's just called "df", you never know what type anything is without going in and checking, the structure of the frames is completely opaque, they change the structure of the dataframe halfway through the program. Type hinting these things is much harder than TypedDict/dataclass, at least doing it correctly & unambiguously is. It's practically a requirement to shove this stuff in a debugger/REPL because you'd have no chance otherwise.

Sure, the argument is that I'm just in a bad Pandas codebase, and it can be done much better. However what I take issue with is that this seems to be the overwhelming "culture" of Pandas. All Pandas code I've ever read is like this. If you look at tutorials, examples online, you see the same stuff. They all call everything the same name and program in the most dynamic & opaque fashion possible. Sure it's quick to write, and if you love Pandas you're used to it, but personally I wince every time I look in a method and see this stuff instead of normal code.

Personally I only use Pandas if I absolutely need it for performance, as a last resort.


Is it just the generic, non-descriptive naming, or what do you think is the root of your distaste for pandas?

Like if we have a dataclass:

    obj.thing == value
Or SQL:

    SELECT * FROM table WHERE thing = ‘value’
We don’t know what the types are, either, without looking it up.

The fact the dataframe often changes halfway through the program is, I think, more to do with the task at hand, that often pandas is being used to perform data transformation (the T in ETL), where some raw data is read in, and the goal is literally to change the structure to clean it up and normalize it, so the data can be ingested into a SQL table in a consistent form with other data points.

But if transformation is not what you are doing, then yes, that might not be the right use of dataframes.


With the dataclass I can look at the class definition, with SQL I can look at the database schema in git, or at the very least log in and `DESCRIBE table`. With Pandas I can find where the dataframe is defined, but then I need to walk through any structural transformations made to it and keep track of its structure in my head. Alternatively I can run the Pandas program in a debugger, set a breakpoint and inspect the dataframe.

With all you need to do some work, but I find the Pandas one more involved because you don't have an authoritative "reference", just an initial state then some transformations. With the Pandas example I have to run the program (in my head or actually). The program might need to pull in test data (hopefully some has been provided). The worst is when the structure of the DF is derived from the source data rather than stated in code (e.g. reading in a CSV). It's much more to do than looking at a class definition or declarative database schema; there's a "sequence" to it, there are transformation steps happening to the DF that I need to keep track of.


As for the transformation thing, I'm totally on board with the need to transform data. What I'm specifically objecting to is the pattern of changing a variable's type during the program, which is extremely common in Pandas code. That is, reassigning the same variable with a value that has a different structure or types.

Here's a really common example where we select a subset of fields:

    df = ...
    df = df[["field1", "field2"]]
The DF has been transformed to have fewer columns than it did previously. Representing as types, it went from List[Dict[Literal["field1","field2","field3"], int]] to List[Dict[Literal["field1","field2"], int]]. We can no longer rely on field3 existing in the DF. Because this one variable has two possible types depending on where you are in the program, it's much harder to reason about.

This is a totally valid way to transform the data, but the manner in which the transformation is happening, I find, makes the code harder to reason about. And this is the manner of transformation I find most commonplace in Pandas practice. We could instead do the following, but I don't see it much:

    df = ...
    df_limited_fieldset = df[["field1", "field2"]]
And even in this case, to infer the structure of df_limited_fieldset, you need to determine the structure of df and then apply a transformation to it, unless you explicitly document its new structure somehow. With dataclasses, df_limited_fieldset would contain instances of an entirely new dataclass, stating its new type.

None of this is to say that abuse of dynamic types doesn't happen in normal Python, it totally does, but I've found these patterns to be so ingrained in Pandas culture / common practice that I'm comfortable characterising them as part of the nature of the tool.


Do we work at the same company?

You put it much better than I could have. Do you know if polars at all solves the problem of having opaque, mutable objects everywhere? I feel like there's a good market for having a dataframe library that's easier to reason about in your editor. It could even be a wrapper around pandas that adds rich typing sort of the way FastAPI does with Pydantic for Starlette.


With Polars you use `df.select()` or `df.with_columns()` which return "new" DataFrames - so you don't have mutable objects everywhere.

There is an SO answer[1] by the Polars author which may have some relevance.

[1]: https://stackoverflow.com/questions/73934129/


Why not sql?


Editor completion. Programmability "out-of-the-box" (rather than having to generate SQL using another programming language).


Why would you generate SQL using another programming language? To me that sounds like something you'd only do if you're deep in an ORM with no escape hatch. For data analysis tasks, that's extremely unergonomic and you should definitely just write normal SQL. Use the ORM for CRUD. I've never seen an ORM that won't let you drop down to regular SQL for ad-hoc queries.

Editor completion is an extremely low ranking aspect for choosing technologies for data analysis. If SQL is the better tool but you're not using it because it doesn't have editor completion, then you need a better editor. It pains me when people prioritise "developer experience in VS Code" over "actually the correct technological choice".


Can I ask what you do for version control of SQL?

I ask this sincerely, as I’ve seen many scenarios over the years where a tool like Django is used to manage a Postgres database, solely that the schema migrations are captured in version control. A .sql file can be in version control, but rolling back a git commit is not the same as being able to roll back a schema migration in the actual database.


Sure, happy to go into it. Firstly we need to distinguish between version control of DQL and DDL.

In the context of comparing SQL and Pandas, we're mostly talking about DQL, so version control for this is exactly the same as any other query code.

For the DDL side that you're asking about, indeed just version controlling SQL files won't work, you need to use a proper migration tool. There are many available that do pure SQL migrations in the same way Django works (a sequence of up/down operations to perform or revert a migration). Personally I use migrate[1].

So you really can achieve the same thing. Personally I like this approach because it can do anything Django can, but it can also do more complicated migrations that can't be expressed by an ORM's DB-agnostic abstract layer. Pure SQL migrations also decouple your database schema from your ORM/framework, which I think is sensible anyway; databases & data tend to be much stickier than apps & frameworks.

A downside here is that you need to do some extra work to keep models in sync with the schema. But the point I was making with my original post is that you can totally use ORMs for CRUD and auto-generate away all that boilerplate, _in addition to_ using raw SQL for complicated data analysis. My point is it's not binary, you can mix them, and there's nothing forcing you to generate _all_ your queries if you use an ORM for most of them. Use the ORM to manage schema, then SQL for a one-off query - still works.

[1]: https://github.com/golang-migrate/migrate


You roll back a schema migration on a dev/test database by nuking it and regenerating it, probably with the same tools you use to manage the rest of your environment.

You don't rollback a schema migration to a production database, because that is impossible. What has been done cannot be undone.


What's the "right editor" for SQL?

"Correct technological choice": I think relational algebra style APIs (a la Polars) are the "correct technological choice" here. SQL is just a tool to express relational algebra, and I'm not sure it's a good one.


I personally love any of the IDEA platform products, such as IntelliJ Ultimate, DataGrip, and also DataSpell.

For strictly analysis, I’d recommend DataSpell as it’s more oriented towards that. Additionally, it has built in functionality to do sql queries in notebook cells and will save to a variable which is a data frame.


Performance is my heuristic. I can't make it quantitative, because 100M records in 1 minute might be considered fast for some use cases, but slow for others. For me it's the qualitative "is this thing too slow?".

Personally, I see a dataframe library as a last resort. I prefer to improve the algorithm, or push more things into the database, or even scale up the hardware in some cases. If I've exhausted all other options and it's still too slow, then I use a dataframe library.

But then I'm not a data scientist. I've found that data scientists have a hammer that they really really like, so they'll use it everywhere.


I go through this all the time with my clients. Short answer: it really depends.

(I spend a good deal of my time and helping client use pandas and Polars.)


I’m just using Ibis: https://ibis-project.org/ They provide a nice backend agnostic API. For most backends it will just compile to SQL and act as a query builder. SQL basically has solved the problem of providing a declarative data transformation syntax, why reinvent the wheel?


I' m tired of remembering of all these library invented concepts and prefer doing brainless for loops to process data in Julia.


Outside relatively straightforward operations working with groups with "expressions" gets really painful. And Python gets too slow fast. And Pandas (multi-)indexing madness is a constant source of bugs and frustration.

Sad that Julia is unusable for my workflows, and will likely remain so.


You’re tired, but I don’t know if I have enough brains to rewrite the API “query” from this article using brainless loops.


You can do brainless for loops in Python really easily.

Oh yeah, toss numba or cython on top and you are back to numpy speed...


Nope they are still libraries with their own feature set that need remembering.


Ok, if you consider @jit a feature set...

(Admittedly, there are more features, but in my book, I demonstrate pure Python code that runs as fast as numpy and cython with this decorator added.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: