
Does anyone have a good heuristic for when a dataframe library is a good tool choice? I work on a team that has a lot of data scientists and a few engineers (including myself), and I often see the data scientists using dataframes when simple Python classes would be much more appropriate, so that you have a better sense of the object you're working with. I've been having a hard time getting this idea across to people, though.



Frankly, the heuristic I've developed over the past few years working on a team that sounds like yours is: The data scientists are probably right.

If you're actually operating on an object, i.e. the equivalent of a single row in a dataframe, then yeah, it's silly to use a dataframe library. But if you're operating on N value objects ... yeah, you probably want a dataframe with N rows and a column for each field in your object.
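
To make it concrete, here's a minimal sketch of the trade-off (class and field names are made up): the same N value objects held as a list of dataclasses versus as a dataframe with one row per object.

    from dataclasses import asdict, dataclass

    import pandas as pd

    @dataclass
    class Reading:
        sensor_id: str
        value: float

    readings = [Reading("a", 1.0), Reading("b", 2.5), Reading("a", 3.1)]

    # One row per object, one column per field.
    df = pd.DataFrame([asdict(r) for r in readings])

    # Column-wise operations replace an explicit loop over the objects.
    mean_per_sensor = df.groupby("sensor_id")["value"].mean()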

Your mileage may vary I guess, but I resisted this for quite a while and I now think I was the one who was wrong.


I appreciate this.

Most software devs are used to working with one-dimensional collections like lists, tree-like abstractions like dicts, or some combination of the two. This is why most abstractions and complex data types are built on these. Objects are a natural progression.

But in the data world, higher-dimensional data is modeled using dataframes (analogously, tables). This is a paradigm shift for most pure software people because manipulating tables typically requires manipulating sets and relations. Joining two tables requires knowing how the join columns relate (inner, full outer, left, right, cross). Aggregation and window functions require thinking in terms of sub-groupings and subsets.
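
As a rough pandas illustration of that mindset (table and column names invented), a join and a grouped aggregation each take a line or two:

    import pandas as pd

    orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 5.0, 7.5]})
    customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["EU", "US", "EU"]})

    # The join type matters: "left" keeps every order, "inner" drops unmatched rows.
    joined = orders.merge(customers, on="customer_id", how="left")

    # Aggregation is expressed over sub-groups rather than with an explicit loop.
    spend_by_region = joined.groupby("region")["amount"].sum()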

It's not super intuitive unless you have to manipulate and reshape large amounts of data every day, which data scientists and data engineers have to do, and which software engineers typically don't do very often. It's just a kind of muscle memory that gets developed.

I definitely had trouble at first getting software engineers to buy into DuckDB because they were unfamiliar and uncomfortable with SQL. Fortunately some of them had a growth mindset and were willing to learn, and those folks have now acquired a very powerful tool (DuckDB) and a new way of thinking about large-data manipulation. When data is a certain size, iterative constructs like "for" loops become impractical.
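
For anyone curious, a sketch of what that looks like with DuckDB's Python API (dataframe and column names made up; as I understand it, duckdb.sql() can scan an in-memory pandas dataframe by its variable name):

    import duckdb
    import pandas as pd

    events = pd.DataFrame({"user_id": ["a", "a", "b"], "latency_ms": [120, 80, 300]})

    # SQL over the in-memory dataframe, no for loop in sight;
    # the result comes back as another dataframe.
    slow_users = duckdb.sql(
        "SELECT user_id, avg(latency_ms) AS avg_ms "
        "FROM events GROUP BY user_id HAVING avg(latency_ms) > 100"
    ).df()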


Nailed it.


DataFrames are easy to use, everyone knows how to use them, you can move fast, it's easy to iterate and test differences between things, and reviewing the code is a breeze.

That said, my team moved to polars about a year ago and we haven't looked back.


> reviewing the code is a breeze

I have the opposite opinion. In a previous codebase I fought hard to use dataclasses & type hinting where possible over dictionaries, because with dictionaries you'd never know what type anything was, or what keys were present. That worked nicely and it was much easier to understand the codebase.
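
For what it's worth, the contrast I mean is roughly this (names invented for illustration):

    from dataclasses import dataclass

    def total_from_dict(order: dict) -> float:
        # Which keys exist? What types are the values? You have to go and check.
        return order["quantity"] * order["unit_price"]

    @dataclass
    class Order:
        quantity: int
        unit_price: float

    def total_from_order(order: Order) -> float:
        # The shape of the input is explicit, discoverable and checkable.
        return order.quantity * order.unit_price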

Now I've been put on a Pandas project and it's full of mysterious

    df = df[df["thing"] == "value"]
I just feel like we've gone back to the unreadability of dictionaries.

Everything's just called "df", you never know what type anything is without going in and checking, the structure of the frames is completely opaque, and the structure of the dataframe changes halfway through the program. Type hinting these things is much harder than TypedDict/dataclass, at least doing it correctly & unambiguously is. It's practically a requirement to shove this stuff in a debugger/REPL because you'd have no chance otherwise.

Sure, the argument is that I'm just in a bad Pandas codebase, and it can be done much better. However, what I take issue with is that this seems to be the overwhelming "culture" of Pandas. All the Pandas code I've ever read is like this. If you look at tutorials and examples online, you see the same stuff. They all call everything the same name and program in the most dynamic & opaque fashion possible. Sure, it's quick to write, and if you love Pandas you're used to it, but personally I wince every time I look in a method and see this stuff instead of normal code.

Personally I only use Pandas if I absolutely need it for performance, as a last resort.


Is it just the generic, non-descriptive naming, or what do you think is the root of your distaste for pandas?

Like if we have a dataclass:

    obj.thing == value
Or SQL:

    SELECT * FROM table WHERE thing = 'value'
We don’t know what the types are, either, without looking it up.

The fact that the dataframe often changes halfway through the program has, I think, more to do with the task at hand: often pandas is being used to perform data transformation (the T in ETL), where some raw data is read in and the goal is literally to change the structure to clean it up and normalize it, so the data can be ingested into a SQL table in a consistent form with other data points.

But if transformation is not what you are doing, then yes, that might not be the right use of dataframes.


With the dataclass I can look at the class definition, with SQL I can look at the database schema in git, or at the very least log in and `DESCRIBE table`. With Pandas I can find where the dataframe is defined, but then I need to walk through any structural transformations made to it and keep track of its structure in my head. Alternatively I can run the Pandas program in a debugger, set a breakpoint and inspect the dataframe.

With all of these you need to do some work, but I find the Pandas one more involved because you don't have an authoritative "reference", just an initial state and then some transformations. With the Pandas example I have to run the program (in my head or actually). The program might need to pull in test data (hopefully some has been provided). The worst is when the structure of the DF is derived from the source data rather than stated in code (e.g. reading in a CSV). It's much more to do than looking at a class definition or declarative database schema; there's a "sequence" to it, there are transformation steps happening to the DF that I need to keep track of.
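
(To be clear, "inspect the dataframe" just means something like the following at a breakpoint or in a REPL, and it only tells you the structure at that particular point in the program; the frame here is a made-up stand-in.)

    import pandas as pd

    df = pd.DataFrame({"thing": ["value", "other"], "count": [1, 2]})  # stand-in

    print(df.columns.tolist())  # what columns exist right now
    print(df.dtypes)            # and what their types happen to be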


As for the transformation thing, I'm totally on board with the need to transform data. What I'm specifically objecting to is the pattern of changing a variable's type during the program, which is extremely common in Pandas code. That is, reassigning the same variable with a value that has a different structure or types.

Here's a really common example where we select a subset of fields:

    df = ...
    df = df[["field1", "field2"]]
The DF has been transformed to have fewer columns than it did previously. Representing as types, it went from List[Dict[Literal["field1","field2","field3"], int]] to List[Dict[Literal["field1","field2"], int]]. We can no longer rely on field3 existing in the DF. Because this one variable has two possible types depending on where you are in the program, it's much harder to reason about.

This is a totally valid way to transform the data, but the manner in which the transformation is happening, I find, makes the code harder to reason about. And this is the manner of transformation I find most commonplace in Pandas practice. We could instead do the following, but I don't see it much:

    df = ...
    df_limited_fieldset = df[["field1", "field2"]]
And even in this case, to infer the structure of df_limited_fieldset, you need to determine the structure of df and then apply a transformation to it, unless you explicitly document its new structure somehow. With dataclasses, df_limited_fieldset would contain instances of an entirely new dataclass, stating its new type.
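
A rough sketch of what I mean, with invented names: the narrowing step produces a value of a different, explicitly declared type.

    from dataclasses import dataclass

    @dataclass
    class FullRecord:
        field1: int
        field2: int
        field3: int

    @dataclass
    class LimitedRecord:
        field1: int
        field2: int

    def limit_fields(records: list[FullRecord]) -> list[LimitedRecord]:
        # The narrowed structure is stated in the return type rather than
        # implied by a transformation you have to replay in your head.
        return [LimitedRecord(r.field1, r.field2) for r in records]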

None of this is to say that abuse of dynamic types doesn't happen in normal Python, it totally does, but I've found these patterns to be so ingrained in Pandas culture / common practice that I'm comfortable characterising them as part of the nature of the tool.


Do we work at the same company?

You put it much better than I could have. Do you know if polars at all solves the problem of having opaque, mutable objects everywhere? I feel like there's a good market for having a dataframe library that's easier to reason about in your editor. It could even be a wrapper around pandas that adds rich typing sort of the way FastAPI does with Pydantic for Starlette.


With Polars you use `df.select()` or `df.with_columns()` which return "new" DataFrames - so you don't have mutable objects everywhere.
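
A minimal sketch (column names made up):

    import polars as pl

    df = pl.DataFrame({"field1": [1, 2], "field2": [3, 4], "field3": [5, 6]})

    # select() and with_columns() each return a new DataFrame;
    # the original `df` keeps its shape.
    limited = df.select("field1", "field2")
    derived = df.with_columns((pl.col("field1") + pl.col("field2")).alias("total"))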

There is an SO answer[1] by the Polars author which may have some relevance.

[1]: https://stackoverflow.com/questions/73934129/


Why not sql?


Editor completion. Programmability "out-of-the-box" (rather than having to generate SQL using another programming language).


Why would you generate SQL using another programming language? To me that sounds like something you'd only do if you're deep in an ORM with no escape hatch. For data analysis tasks, that's extremely unergonomic and you should definitely just write normal SQL. Use the ORM for CRUD. I've never seen an ORM that won't let you drop down to regular SQL for ad-hoc queries.
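
For example, with SQLAlchemy (connection string and table names made up), dropping down to raw SQL for an ad-hoc query is a single call:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://localhost/analytics")  # made-up DSN

    # Use the ORM for CRUD elsewhere; for ad-hoc analysis, just hand it SQL.
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT region, sum(amount) AS total FROM orders GROUP BY region")
        ).fetchall()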

Editor completion is an extremely low ranking aspect for choosing technologies for data analysis. If SQL is the better tool but you're not using it because it doesn't have editor completion, then you need a better editor. It pains me when people prioritise "developer experience in VS Code" over "actually the correct technological choice".


Can I ask what you do for version control of SQL?

I ask this sincerely, as I’ve seen many scenarios over the years where a tool like Django is used to manage a Postgres database solely so that the schema migrations are captured in version control. A .sql file can be in version control, but rolling back a git commit is not the same as being able to roll back a schema migration in the actual database.


Sure, happy to go into it. Firstly we need to distinguish between version control of DQL and DDL.

In the context of comparing SQL and Pandas, we're mostly talking about DQL, so version control for this is exactly the same as any other query code.

For the DDL side that you're asking about, indeed just version controlling SQL files won't work, you need to use a proper migration tool. There are many available that do pure SQL migrations in the same way Django works (a sequence of up/down operations to perform or revert a migration). Personally I use migrate[1].

So you really can achieve the same thing. Personally I like this approach because it can do anything Django can, but it can also do more complicated migrations that can't be expressed by an ORM's DB-agnostic abstract layer. Pure SQL migrations also decouple your database schema from your ORM/framework, which I think is sensible anyway; databases & data tend to be much stickier than apps & frameworks.

A downside here is that you need to do some extra work to keep models in sync with the schema. But the point I was making with my original post is that you can totally use ORMs for CRUD and auto-generate away all that boilerplate, _in addition to_ using raw SQL for complicated data analysis. My point is it's not binary, you can mix them, and there's nothing forcing you to generate _all_ your queries if you use an ORM for most of them. Use the ORM to manage schema, then SQL for a one-off query - still works.

[1]: https://github.com/golang-migrate/migrate


You roll back a schema migration on a dev/test database by nuking it and regenerating it, probably with the same tools you use to manage the rest of your environment.

You don't roll back a schema migration on a production database, because that is impossible. What has been done cannot be undone.


What's the "right editor" for SQL?

"Correct technological choice": I think relational algebra style APIs (a la Polars) are the "correct technological choice" here. SQL is just a tool to express relational algebra, and I'm not sure it's a good one.


I personally love any of the IDEA platform products, such as IntelliJ Ultimate, DataGrip, and also DataSpell.

For strictly analysis work, I’d recommend DataSpell as it’s more oriented towards that. Additionally, it has built-in functionality to run SQL queries in notebook cells and save the result to a variable as a dataframe.


Performance is my heuristic. I can't make it quantitative, because 100M records in 1 minute might be considered fast for some use cases, but slow for others. For me it's the qualitative "is this thing too slow?".

Personally, I see a dataframe library as a last resort. I prefer to improve the algorithm, or push more things into the database, or even scale up the hardware in some cases. If I've exhausted all other options and it's still too slow, then I use a dataframe library.

But then I'm not a data scientist. I've found that data scientists have a hammer that they really really like, so they'll use it everywhere.


I go through this all the time with my clients. Short answer: it really depends.

(I spend a good deal of my time helping clients use pandas and Polars.)




