Farewell Pandas, and thanks for all the fish (ibis-project.org)
228 points by nojito 74 days ago | 138 comments



> NULL indicates a missing value, and NaN is Not a Number.

That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.

There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
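To make the contrast concrete, here is a minimal Python/NumPy sketch (not from the article): NaN is a floating-point value that arises from operations like 0/0 and compares unequal even to itself, while None is just an ordinary singleton object.

    import numpy as np

    nan = np.float64(0.0) / np.float64(0.0)  # "invalid value" warning, result is nan
    print(np.isnan(nan))   # True
    print(nan == nan)      # False: IEEE 754 NaN compares unequal even to itself
    print(None == None)    # True: Python's None is just an ordinary singleton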


Related to this, someone below posted a link to the blog post of Wes McKinney where he discusses pandas' limitations and how PyArrow works around them

> 4. Doing missing data right
>
> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.

> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.

> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.

https://wesmckinney.com/blog/apache-arrow-pandas-internals/
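This is easy to poke at from Python with pyarrow; a minimal sketch of the validity bitmap and metadata null count described above:

    import pyarrow as pa

    arr = pa.array([1, None, 3])
    print(arr.null_count)                    # 1, tracked in the array's metadata
    print(arr.buffers()[0])                  # the packed validity bitmap, stored apart from the values
    print(pa.array([1, 2, 3]).buffers()[0])  # None: no bitmap is allocated when there are no nulls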


Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes. So there has been movement to address the critiques presented there.


Yes, I think that early post discussed the then-upcoming pandas 2.0, which was finally realized 7 years later.


100% this.

We'd love to have a docs contribution that lays this out in detail if you'd be up for it!

Disclaimer: I lead the project and work on it full time.


Why don't we build a portable numeric system that just has these mathematical constants and definitions built in, in a portable and performant manner?


We just don’t have the technology for that, else there’d be libraries floating around that let you do that.


> One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it’s certainly is something out in the real world, we just don’t know what.

I'm not sure I understand this. Can you explain it with an example?

I'm thinking something like:

    let (x, y) = NULL;
But then I'm stuck.


What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Franklin was born on 19 August 1923 and died on 18 April 2003.

One well-established approach[1] to NULLs is that they represent the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and lives to the present day.

We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.

[1] http://www.esp.org/foundations/database-theory/holdings/codd...
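For a quick feel of that three-valued logic, here is a minimal sketch using DuckDB from Python (any SQL engine would behave similarly):

    import duckdb

    duckdb.sql("SELECT NULL = NULL AS eq, NULL AND TRUE AS conj").show()
    # both columns come back NULL: "don't know" propagates through the expressions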


s/Franklin/Codd/

s/represent the above/relax the above/

(Sorry!)


In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).

In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.

Edit:

A related aside is that Django's ORM's handling of NULL gives me freaking hives. From https://docs.djangoproject.com/en/5.1/ref/models/fields/#nul...:

> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.

I've been around numerous Django devs when they found out that other DBs and ORMs do not remotely consider NULL and empty string to be the same thing.


In Oracle, "" is converted to NULL so maybe that's where they get that from.


Whoa, you're right.[0] I wouldn't have expected that at all.

[0] https://docs.oracle.com/en/database/oracle/oracle-database/1...


You can almost count on Oracle doing the wrong thing…

It never ceases to amaze me they got themselves into a position where they had to introduce a VARCHAR2 type.


Not surprising. There are much better compute engines than pandas.

Folks then ask why not jump from pandas to [insert favorite tool]?

- Existing codebases. Lots of legacy pandas floating about.

- Third party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas.

- YAGNI - For lots of small data tasks, pandas is perfectly fine.


Even for medium tasks, the slow step is rarely the computation, but the squishy human entering it.

When I work on a decent 50GB+ dataset, I have to be doing something fairly naive before I get frustrated at computation time.

Edit: pandas is now Boring Technology (not a criticism). It is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or less RAM pressure, which is really low on my list of problems.


All great reasons to stick with Pandas. If a thing works for you, then that's one less thing to manage.

Fully agree that if pandas is working for you, then you're better off sticking with it!


> YAGNI - For lots of small data tasks, pandas is perfectly fine.

There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.

It’s a real question as to whether big data systems become more accessible before single node machines become massive enough to consume almost every data set.

(Also love your work, thank you)


But... but we have to have big data and a cluster...

The hoops folks will jump through...

(Thanks!)


If your task is too big for pandas, you should probably skip right over dask and polars for a better compute engine.


Jump straight to cluster and skip Dask?

Not sure what "big" means here, but a combination of .pipe, pyarrow, and polars can speed up many slow Pandas operations.

Polars streaming is surprisingly good for larger than RAM. I get that clusters are cool, but I prefer to keep it on a single machine if possible.

Also, libraries like cudf can greatly speed up Pandas code on a single machine, while Snowpark can scale Pandas code to Snowflake scale.
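For concreteness, a rough sketch of two of the cheaper speedups mentioned above (file and column names are placeholders): reading straight into Arrow-backed pandas columns, and streaming a larger-than-RAM scan with Polars.

    import pandas as pd
    import polars as pl

    # pandas >= 2.0 can read directly into Arrow-backed columns
    df = pd.read_csv("events.csv", engine="pyarrow", dtype_backend="pyarrow")

    # Polars can stream a larger-than-RAM scan instead of materializing everything
    out = (
        pl.scan_csv("events.csv")
          .group_by("user_id")
          .agg(pl.len())
          .collect(streaming=True)
    )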


In my experience, Polars streaming runs out of memory at much smaller scales than both DuckDB and DataFusion and tends to use much more memory for the same workload when it doesn't outright segfault.

Polars is faster than those two once you get to less than a few GB, but beyond that you're better off with DuckDB or DataFusion.

I would love for this to improve in Polars, and I'm sure it will!


Do you mean segfault or OOM? I am not aware of Polars segfaulting on high memory pressure.

If it does segfault, would you mind opening an issue?

Some context: Polars is building a new streaming engine that will eventually be ready to run the whole Polars API (also the hard stuff) in a streaming fashion. We expect the initial release end of this year/early next year.

Our in-memory engine isn't designed for out-of-core processing and thus if you benchmark it on restricted RAM, it will perform poorly as data is swapped or you go OOM. If you have a machine with enough RAM, Polars is very competitive in performance. And in our experience it is tough to beat in time-series/window functions.


Segmentation violations are often the result of different underlying problems, one of which can be running out of memory.

We (the Ibis team) have opened related issues and the usual response is to not use streaming until it's ready, or to fix the problem if it can be fixed.

Not sure what else there is to do, seems like things are working as expected/intended for the moment!

We'll definitely be the first to try out any improvements to the streaming engine.


They have different implications for us. An abort due to an OOM isn't a bug in our program, whereas a SEGFAULT is a serious bug we want to fix.


when you say "our in-memory engine", are you talking about dataframe or the lazyframe?


A DataFrame is our in memory table. A LazyFrame is a compute plan that can have DataFrames as source.

The engine is what executes our plans and materializes a result. This is plural as we are building a new one.
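A tiny sketch of that distinction (made-up data):

    import polars as pl

    df = pl.DataFrame({"x": [1, 2, 3]})        # an in-memory table
    lf = df.lazy().filter(pl.col("x") > 1)     # a plan; nothing is computed yet
    out = lf.collect()                         # an engine materializes the result here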


My understanding is that the Polars team is working on a new streaming engine. It looks like you will get your wish.


Nice to see! Over the past months I've replaced pandas with Ibis in all new projects and I am a huge fan!

- Syntax in general feels more fluid than pandas

- Chaining operations with deferred expressions makes code snippets very portable

- Duckdb backend is super fast

- Community is very active, friendly and responsive

I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)


Pandas has been working fine for me. The most powerful feature that makes me stick to it is the multi-index (hierarchical indexes) [1]. Can be used for columns too. Not sure how the cool new kids like polars or ibis would fare in that category.

[1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...
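For readers who haven't used them, a small sketch of the kind of hierarchical indexing being described (the device/time names are made up):

    import pandas as pd

    idx = pd.MultiIndex.from_product(
        [["device_a", "device_b"], pd.date_range("2024-01-01", periods=3)],
        names=["device", "time"],
    )
    df = pd.DataFrame({"value": range(6)}, index=idx)
    print(df.loc["device_a"])   # slice out one level of the hierarchy
    print(df.xs("device_b"))    # or take a cross-section by label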


Multi-indexes definitely have their place. In fact, I got involved in pandas development in 2013 as part of some work I was doing in graduate school, and I was a heavy user of multi-indexed columns. I loved them.

Over time, and after working on a variety of use cases, I personally have come to believe the baggage introduced by these data structures wasn't worth it. Take a look at the indexing code in pandas, and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning. The maintenance cost alone is quite high.

We don't plan to ever support multi-indexed rows or columns in Ibis. I don't think we'd fare well _at all_ there, intentionally so.


> Take a look at the indexing code in pandas

As the end-user, not quite my concern.

> and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning

I might not be aware of everything that's possible -- the usage I have of it doesn't give me an impression of staggering complexity. In fact I've found the functionality quite basic, and have been using pd.MultiIndex.from_* quite extensively for anything slightly more advanced than selecting a bunch of values at some level of the index.


> As the end-user, not quite my concern.

Complicated code is (probabilistically) slow, buggy, infrequently updated code. By all means, if it looks like a good enough tool for the job (especially if the alternatives don't) then use it anyway, but that's slightly different from it not being your concern.

I've seen enough projects need "surprise" major revisions because some team tried to sneak a dataframe into a 10M QPS service that my default is keeping pandas far away from anything close to a user-facing product.

I've also seen costs balloon as the data's scale grows beyond what pandas can handle, but basically all the alternatives suck for myriad reasons, so I don't try to push "not pandas" in the data backend. People can figure out what works for themselves, and I kind of like just writing it from scratch in a performant language when I personally hit that bottleneck.


I work a lot with IoT data, where basically everything is multi-variate time-series from multiple devices (at different physical locations and logical groupings). Pandas multi index is very nice for this, at least having time+space in the index.


Is your workload mostly single-threaded? If so, is that due to dataset size, or machine core count?


Sorry I don't know what to answer. I don't think what I do qualifies as "workload".

I have a process that generates lots of data. I put it in a huge multi-indexed dataframe that luckily fits in RAM. I then slice out the part I need and pass it on to some computation (at which point the data usually becomes a numpy array or a torch tensor). Core-count is not really a concern as there's not much going on other than slicing in memory.

The main gain I get of this approach is prototyping velocity and flexibility. Certainly sub-optimal in terms of performance.


Curious if you considered Polars. That's become the de facto standard in my group as we all dislike pandas.


Polars is a great choice but, like pandas, locks you into its execution engine(s). Polars is a supported backend for Ibis, though depending on your use case DuckDB or DataFusion may scale better. We did some benchmarking recently: https://ibis-project.org/posts/1tbc/
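A rough sketch of what swapping engines looks like (the parquet path is a placeholder):

    import ibis

    ibis.set_backend("polars")                 # or "duckdb" / "datafusion"
    t = ibis.read_parquet("events.parquet")
    expr = t.group_by("user_id").agg(n=t.count())
    print(expr.to_pandas())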


Also depending on your use case, Ibis may not support the part of the Polars API you actually need

(to be clear, I'm a fan of "DuckDB via Ibis", I'm much less a fan of statements to the effect of "use Polars via Ibis instead of Polars directly", as I think they miss a lot :wink:)


We support Polars as a backend.

DuckDB was chosen as the default because Polars was too unstable to be the default backend two years ago.


Good riddance.

Pandas turns 10x programmers into 1x programmers.


I've been using pandas for over a decade and I'm very efficient with it.

The Modern Pandas blog series from Tom Augspurger is excellent, and the old Wes McKinney book (creator of pandas) gave a good glimpse into how to effectively use it.

I don't think many people learn it correctly so it gets used in a very inefficient and spaghetti script style.

It's definitely a tool that's showing its age, but it's still a very effective swiss army knife for data processing.

I am glad we're getting better alternatives; Hadley Wickham's tidyverse really showed what data processing can be.


I also recommend Matt Harrison's talks ("Idiomatic Pandas") and book ("Effective Pandas").

Once I adopted method chaining, a lot of the issues that I had with pandas in the past due to poor style (e.g. SettingWithCopyWarning) pretty much disappeared.


Just imagine how much more efficient you would be if you were using R's DataTable.

Look, I applaud your skill, but at some point even a master craftsman realizes that the swiss army knife may not be the best tool, and a leatherman offers certain advantages.


I really like R’s libraries and I’ll use them any chance I get (libraries like lmer are still orders of magnitude more efficient than the same model in Statsmodels).

From my experience the biggest impediment to using R in production is many orgs don’t have a blessed way to run it.

R is my favourite language for data processing; the manual section Computing on the Language[1] is why R is such an ergonomic tool. I had hoped Julia would catch up, but Julia’s macros are not comparable in their depth.

I think pandas is probably the data equivalent of editing files using default vim or processing data with awk.

[1] https://rstudio.github.io/r-manuals/r-lang/Computing-on-the-...


As a joke, I wrote an Ibis backend (https://github.com/cpcloud/ibish) that processes expressions using shell commands strung together with named pipes. It supports joins using the coreutils join command, projections, filters and some aggregations with awk.

It's faster than pandas in some cases and folks should put it into production immediately!


Sometimes I write data analysis code in R to make myself aware of how it is occasionally possible to have nice things.


Fun fact, if you weren't aware: Ibis was also originally created by Wes McKinney and takes a lot of inspiration from the tidyverse


I was not aware of that, the influence from the tidyverse is very clear

I’ve tried to make mockups of a similarly inspired API, and I think they did a great job with clever tricks like the `_` reference (does that clash with IPython/Jupyter’s use?).

I do wonder how they manage compatibility with so many backends, it seems like many features will be directly tied to your backend (e.g. Trino using Java Pattern for regex vs BigQuery using Re2) in hard to explain ways. But maybe that’s not a big concern, because very often you’re only going to be using a couple of backends.

I’ll have to try Ibis out for myself, it looks like it can unify a lot of the work I have to do. I’ve moved away from pandas for all but the last mile of computation, so this might be a good direct SQL replacement.


`_` definitely clashes with IPython and Jupyter's use of it.

However, if you import it from Ibis then it ceases to be used as "most recent result" and remains the ibis underscore object (unless of course you explicitly assign it to something else).
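A minimal sketch of the deferred underscore in practice (the column names are made up):

    import ibis
    from ibis import _

    t = ibis.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    expr = t.filter(_.a > 1).select(_.b, doubled=_.a * 2)
    print(expr.to_pandas())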

Regarding backend compatibility, there are definitely a few kinds of things that we don't currently abstract over. One is regular expression syntax and another is floating point math (e.g., various algebraic properties that are violated that result in slightly different outputs).

Hope you give it a go, and please report issues at https://github.com/ibis-project/ibis.


Thanks for the reply!

I think that’s the right policy to take, and I did notice the support matrix on your website which addresses my earlier question:

https://ibis-project.org/backends/support/matrix


Curious if you could expand on this?


Lol. As if different tools for filtering rows/columns determines a programmer's capability.


What is the difference, other than parallel execution?


- Much faster and more memory efficient.

- Consistent Expression and SQL-like API.

- Lazy execution mode, where queries are compiled and optimized before running.

- Sane NaN and null handling.

- Much faster.

- Much more memory efficient.


Do you use it over many machines (RAMs)?

I dont care so much about the memory and CPU stuff, I mostly leave the heavy lifting to an SQL engine.

Although the Null handling seems very compelling, I guess it comes at a cost of incompatibility with existing libraries, otherwise Pandas would have implemented it as well?

I am curious about the SQL api though.


Polars' support for SQL is pretty nascent and missing a lot of functionality.

If it were better, we'd use it internally in Ibis for the Polars backend implementation.

If you're going down the mixed SQL, DataFrame API route then Ibis is probably the best solution out there for that.

I work on Ibis, so take what I say with a grain of salt. There may yet be other libraries out there that have similar functionality.


> Do you use it over many machines (RAMs)?

If you mean whether I run it distributedly a la Spark then no. If you mean whether I test it on various machines with different RAM sizes then yes.

> I dont care so much about the memory and CPU stuff, I mostly leave the heavy lifting to an SQL engine.

Well, I care. Both pandas and polars are, in my view, single-machine dataframe libraries, so the memory and CPU constraints are rather stringent.

My comparison is based solely on my experience: reading csv files that are 20% to 50% the size of RAM, pandas takes (or errors out after) 2 to 10 minutes, while polars finishes in 20 seconds. Queries in pandas are almost always slower than polars.

But reading your comment, it seems you and I have different use cases for dataframe libraries, which is fine. I mostly use them for exploratory analysis, so the SQL api is not that much of a plus to me, but the performance is.


My point is that it is still not a magnitude change. And it (probably?) introduces bugs and incompatibilities.

Many cloud providers now offer serverless SQL and Spark capacities (serverless = no setup for you). This is the magnitude change for me.

With pandas you can maybe process 10 million rows, with polars maybe 50 million. But with a distributed service maybe 100 times more?


When using Pandas appropriately, that is with method chaining, lambda expressions (instead of intermediate assignments) and pyarrow datatypes, you also get much faster speed and null values handling.
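For anyone curious what that style looks like, a small sketch (file and column names are made up):

    import pandas as pd

    out = (
        pd.read_csv("orders.csv", dtype_backend="pyarrow")       # Arrow-backed dtypes
          .assign(total=lambda d: d["price"] * d["qty"])         # no intermediate assignments
          .query("total > 0")
          .groupby("customer", as_index=False)["total"].sum()
    )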


I know.

And by now I know that very well.

Like someone-screaming-in-my-ears-know.

I am starting to think that Polars is showing all the signs of a hype or a cult.

I am still not convinced, particularly since the community feels more like a marketing department than someone who wants to genuinely help.

I can do that thing you describe with SQL.


Did it get more memory efficient during the time between authoring point 1 and authoring the last point?


You may think this was clever, but it is a common literary technique to emphasize a point.

I'd suggest less snark, you're not doing yourself any favors.


>I'd suggest less snark, you're not doing yourself any favors.

Why do people feel the need to jump in and police tone like this? Who are you? You're not doing yourself any favours, either.

>common literary technique to emphasize a point.

"Common" is a stretch, and who cares.


> Why do people feel the need to jump in and police tone like this?

From the community guidelines (https://news.ycombinator.com/newsguidelines.html): "Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes."

> Who are you?

A member of the community.


> Why do people feel the need to jump in and police tone like this?

Did you know that more information is communicated via tone than words?


This irritates me. What was the point of THEIR comment? To be cunty? It absolutely was NOT a productive comment. They were being an ass. And you’re defending them being an ass, asking me who I am like I’m somehow not allowed to point out someone being a cunt unless I have some type of status symbol of which you need to approve. At this point I can only fathom you’re a coworker or friend of theirs, hence the defense. Nothing else makes sense.


No idea who the person is who defended what I said. If they're a co-worker of mine, I don't know who it is.

The original purpose of my comment was to, in a light-hearted way, point out redundant information, not to incite a flame war about tone on HN.

This thread reminds me of why I don't come here very often.


I wouldn't switch to Polars as a replacement to Pandas, given that Polars also seems likely to die soon. I'd also recommend avoiding Manatees.


On the flip side, Ibises (Ibisii ??) have been around for years and aren't going anywhere.


Why do you think polars is likely to die soon?


Global warming...


Sadly, I don't think many caught the joke :)


One thing I do like about pandas is it’s pretty extensible to columns of new types. Maybe I’m missing something, but does Polars allow this? Last time I checked there wasn’t a clear path forward.


I'm curious, when do you want to add a new column type? It's not a situation I've been in.


GeoPandas is one popular library (https://geopandas.org/en/stable/)

I added a column type for full text search, searching tokenized text (https://github.com/softwaredoug/searcharray)

The pandas support for extensions is very robust, with extensive testing to make sure you can do all the data aggregation functionality you expect in a pandas column.


OTOH the list and struct types allow lots of complex operations, e.g. with `.eval`, which run blazingly fast.


Ibis allows this with custom datatypes.


What is this exactly?

The front page doesn't make it clear to me what Ibis is besides being an alternative to pandas/polars.


Check the project’s left nav, https://ibis-project.org/why

I didn’t know either. It’s a python dataframe library/api.


Never heard of Ibis before, but the front page gives a pretty good overview of what it is, to me at least. Looks like a dataframes api that can be executed on a variety of backend engines, both distributed and local.

Somewhat analogous to how the pandas api can be used in pyspark via pyspark pandas, but from the api -> implementation direction rather than (pandas) implementation -> api -> implementation maybe?


As far as I can tell, it is an ORM for data engineers. They can write Python code that gets translated to either SQL or some other language understood by the DB engine that actually runs it.


Ibis is not an ORM. It has no opinion of or knowledge about your data model, or relationships between tables and their columns.

Check out SQLAlchemy's ORM (https://docs.sqlalchemy.org/en/20/orm/) for what that looks like.


Huge fan of Ibis, the value isn't that you can now use DuckDB... it's that your syntax will work when the next cool thing arrives too


Personally, I tend to use pandas because it is integrated everywhere, because of the ecosystem that uses it. Let's say I want to read data from a json file (csv file, python dict, etc.) and I want to plot it using plotly. If Ibis is compatible with whatever a pandas dataframe is compatible with, then for most of my usage I don't really care much about the "backend".


About time. It always surprises me how long pandas has been able to hold on. Wes McKinney talked about pandas' limitations back in 2017 https://wesmckinney.com/blog/apache-arrow-pandas-internals/


In my experience, the best thing about Pandas is how much it made me appreciate using dplyr and the tidyverse. If it wasn't for Pandas, I may not be the avid R user I am today.


I will try it out, but honestly the pain point for me is that the pandas API is not always intuitive or natural for how things should be done in Python. The NaN/None bit is annoying, but I find that to be a minor annoyance.


I've only heard about Ibis maybe three times in the past two years and I pay pretty close attention to the space. If Ibis moves away from pandas, then it just means that I am less likely to try Ibis because there is no bridge.

Sure, hip new frameworks are moving away from pandas/numpy, but I'll wait 5 years for the dust to settle here while the compatibility and edge cases sort themselves out. The pydata/numfocus ecosystem is extensive.

It's just tabular data. So what if I have to wait a few more milliseconds to get my result.


> If Ibis moves away from pandas, then it just means that I am less likely to try Ibis because there is no bridge.

the bridge is that Ibis accepts pandas as input and has a `to_pandas()` method as output
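Concretely, the round trip looks something like this (a minimal sketch):

    import pandas as pd
    import ibis

    df = pd.DataFrame({"x": [1, 2, 3]})
    t = ibis.memtable(df)                        # pandas in
    out = t.filter(t.x > 1).to_pandas()          # pandas out, computed by the default backend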

Ibis also still depends on pandas (and thus numpy) internally. also, Ibis was created by the creator of pandas, and the lead developer (and other contributors) have also worked on pandas. the Ibis team understands pandas and the broader Python data ecosystem very well

> It's just tabular data. So what if I have to wait a few more milliseconds to get my result.

usually as scale grows, milliseconds -> seconds -> minutes -> hours -> days. at some point along the way having a dataframe library that can scale up without rewriting your code might be useful. but if you're dealing with small tabular data and pandas meets your needs, it's a great library to use and stick with!


Sounds like a solid plan. No reason to incur switching costs if things are working!

I think you're misunderstanding what "removing pandas" means.

You can still compute on DataFrames, it'll just be with something other than Pandas itself, probably DuckDB, Polars, or DataFusion.

So, the bridge was there, is there, and isn't going anywhere.


> I think you're misunderstanding what "removing pandas" means.

Maybe it’s in the messaging, like the “Farewell Pandas” part?


Feel free to put up a pull request with an alternate title that you feel communicates the message in a more accurate way!


Pandas made me think I hated python.


Funny, the same) When I was switching from R (data.table) to Python, it was painful. Not only because it was slow, but because of the API. At that time, I thought that maybe it's because of switching to something new. Several years later, switching from Pandas to Polars API was a real joy (Ibis is also good is that sense). So, I learned that it had been Pandas fault all along))


It's great to have a single entrypoint for multiple backends. What I am trying to understand, and couldn't find much information on: how does the use of multiple engines in Ibis impact the consistency of results for the same input and query, particularly in relation to semantic differences among the engines?


Frankly, pandas/dask/polars are all trying to recreate something that has existed for years (i.e. SQL, Spark), and with a terrible API.

I truly thought I was terrible at python for a long time because of pandas - turns out it just has an absolutely terrible API surface


I would not say pandas is a recreation of SQL/Spark; they have very different use cases in my experience. SQL/Spark is like a bulk data management tool: I use it if I need to load massive data from some remote store, perform light preprocessing, join up a couple dimension tables, etc. Having normalized and joined my data, pandas then enters as a 'last-mile' processing engine, particularly when paired with e.g. SKLearn or whatever other inference lib you're using. Pandas is awesome if you need to, say, apply string manipulation to a data table or daisy-chain some complicated computations together. Honestly, in my opinion the API is really nice; I came over from R tidyverse, and the 'chained methods' approach to pandas lets me carry over all my old patterns and paradigms. I find it far easier to use that approach than having to write like 20 dependent subqueries or staging tables.


True, Spark has existed for years and is a great toolset... I use it every day. It's also a huge hassle spinning clusters up and down, and configuration is complex.

I can execute some pretty hairy scans against a huge s3 parquet dataset in DuckDB that I would typically have to run in either Spark or Athena... it's a little slower, but not ridiculously slower. And it does all of that from my desktop: no clusters, no mem or task configs, just run the query. Being able to integrate all of the expensive historical scanning and knitting that back into an ML pipeline with desktop python is pretty nice.
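A rough sketch of that workflow (bucket and column names are placeholders; the httpfs extension is what enables s3:// paths):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    df = con.sql("""
        SELECT user_id, count(*) AS n
        FROM read_parquet('s3://my-bucket/events/*.parquet')
        GROUP BY user_id
    """).df()    # hand the result to the rest of the pipeline as a pandas DataFrame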


How does Ibis manage ordering of the dataframe for SQL-based engines?

One big gap that I noticed is that pandas retains ordering between operations, but most SQL-based engines don't give you that guarantee. Looking at DuckDB, it looks like they only do that for csv.


this is a huge difference between Ibis (and really, any dataframe library that works with larger-than-RAM or distributed engines) and pandas. PySpark, Polars, and Snowpark all to some extent also do not implicitly order your data

so in Ibis, you need to explicitly manage order with an `order_by` call at the end. there are a lot of technical reasons for why maintaining order is inefficient as you scale up
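In practice that means putting an explicit sort at the end of the expression; a small sketch (table and column names are hypothetical):

    import ibis

    t = ibis.read_parquet("events.parquet")
    expr = (
        t.group_by("user_id")
         .agg(n=t.count())
         .order_by(ibis.desc("n"))   # order is only guaranteed because we ask for it
    )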


curious to know what they use instead of Dask?


DuckDB. It's faster on a single computer than Dask.


Ibis still supports over a dozen backends, including DuckDB and DataFusion and Polars (all great single-node OLAP query engine options that are often faster than trying something distributed) and PySpark, Snowflake, BigQuery, Trino, ClickHouse, Druid, and more

plenty of options for fast and scalable computation. in the fullness of time, it would be great to have a Dask backend but it'd need to be rewritten to use dask-expr, which isn't something the core Ibis team has the capacity for


As a potential user of Ibis, how do I know they won't eventually drop support for other backends until it just becomes a frontend for DuckDB?


Great question, a bunch of reasons:

1. The value prop of Ibis is to avoid vendor lock-in and allow users to more easily swap out execution engines that meet their needs. So it's valuable to the project to have "competing" backends supported (e.g. DuckDB, DataFusion, and Polars for local, and Snowflake, ClickHouse, PySpark, Databricks, etc. for distributed) -- the idea is a standard like Apache Arrow that everyone adopts to stop this cycle of tightly coupling a dataframe interface to an execution engine.

2. Ibis is primarily supported by Voltron Data (my employer), which is itself a vendor, so the real worry would be that Ibis drops all backends until it only supports Voltron Data's engine (Theseus)...

3. ...but Voltron Data cannot do that for a number of reasons. First, Ibis is an independently governed open source project. Second, it would not make any business sense. You can read some more on that here: https://ibis-project.org/posts/why-voda-supports-ibis.

The general idea is to add more backends, not remove them, but pandas is a special case where it introduces a lot of technical complexity with (arguably) negative benefit to potential users -- the DuckDB backend can do everything the pandas backend can do, but better


> the DuckDB backend can do everything the pandas backend can do, but better

Maybe this should be a focus. Putting together killer demos that show this to people. Because I don’t see it.

Maybe I’m thinking about it in the wrong terms. I think of DuckDB in terms of what we’re doing with pandas (which isn’t perfect but works), and also as an embedded database like a SQLite alternative.

In both aspects it seems like we have more friction working in DuckDB than in pandas and SQLite. Most common problem is getting errors about the DuckDB file being locked or in use, by the process that’s trying to read from it (which just spawned). I thought maybe it was me, but even Jetbrains tooling had the same weird issues with DuckDB.

Compared to SQLite, which just works, with virtually any language or tooling, I don’t see the benefit to DuckDB. I see some people say it’s better and more efficient for larger data, but I couldn’t get it to work reliably with tiny data on a single machine.


I use both DuckDB and SQLite at work. SQLite is better when the database gets lots of little writes, when the database is going to be stored long term as the primary source of truth, when you don't need to do lots of complicated analytical queries etc. Very useful for storing both real and simulated data long term.

DuckDB is much nicer in terms of types, built in functions, and syntax extensions for analytical queries and also happens to be faster for big analytical queries although most of our data is small enough that it doesn't make a big difference over SQLite (although it still is a big improvement over using Pandas even for small data). DuckDB only just released versions with backwards compatible data storage though, so we don't yet use it as a source of truth. DuckDB has really improved in general over the last couple years as well and finally hit 1.0 3 months ago. So depending what version you tried it out on tiny data, it may be better now. It's also possible to use DuckDB to read and write from SQLite if you're concerned about interop and long term storage although I haven't done that myself so don't know what the rough edges are.


I think you're trying to use DuckDB incorrectly. If you have a writer connection open to the DB, you can't open reader connections as well (including from IntelliJ). You'll get file lock errors, as you noticed. You can either have a single writer connection at once, or multiple reader connections at once. DuckDB is significantly better than SQLite for analytical queries on big data sets, but it's not a replacement for a transactional database such as SQLite.
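A sketch of that concurrency model (the file name is a placeholder):

    import duckdb

    # one read-write connection at a time...
    writer = duckdb.connect("analytics.duckdb")
    writer.execute("CREATE OR REPLACE TABLE t AS SELECT 1 AS x")
    writer.close()                                # release the write lock first

    # ...or any number of read-only connections once no writer holds the file
    reader = duckdb.connect("analytics.duckdb", read_only=True)
    print(reader.sql("SELECT * FROM t").df())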


> how do I know…

You don’t. Follow the market share. That’s as close as you’ll get to a guarantee.

Ibis announcing they’re dropping support for pandas would be like Astro announcing they’re dropping support for React.

It only means fewer people adopt the new thing, not that the thing with the most market share dies off (looking at you, PHP and jQuery).


> Ibis announcing they’re dropping support for pandas would be like Astro announcing they’re dropping support for React.

This is extremely misleading and a false analogy. Read the blog post.

If you still come to the same conclusion after reading the blog post, feel free to reach out on Zulip, more than happy to chat about it.


even vanilla pandas is faster than dask on a single computer ... _OFFICIALLY_ ... 'cause dask is, and only is, for clusters


if you’re at cluster scale, i would skip right over dask and go to spark


or hadoop


I'll also suggest inscribing stone tablets with a chunked version of your computation and distributing the described chunked computation to your friends, and then gathering the tablets and finishing up the compute.

It's scalable to ~billions of nodes.


but that exists already and it's called databricks


can you expand or give more context on this ?


Dask is designed for scaling to multiple compute nodes, and is often slower than vanilla pandas on a single machine because of the overhead.

If your data is larger than memory, though, at a certain point Dask is faster than pandas (which will often just fail) and DuckDB (which is super fast for in-memory data, and still quite fast for out-of-memory data).

Rule of thumb these days though is that if your data isn't in the terabytes per run, a single processing node is probably best.


Dask was really cool back in the day. Farewell, friend.


Back in the day? What are people using now instead?


Ray seems to be a popular choice now if you need multi node. But Polars is really nice if you're okay with just vertically scaling on one machine.


I've been working with pencil and paper recently. Thinking about starting an Ibis backend for it.


Abacuses?


Man, these frontend people are constantly churning through dependencies, a new framework every year. Don't reinvent the wheel, just be like backend folks and pick a dependency and stick with it. Oh wait...


The ETL and data analysis “notebook” world has been warming up for years and is now red-hot due to every company down to Grandma’s Cookie Bakery LLC wanting to “do AI”. And most of the standard tools are surprisingly terrible, so churn is increasing for reasons similar to why that happened in frontend (the underlying tools are ass, the existing interfaces bad, almost all of it’s being overcomplicated for bad reasons, and a million more developers just jumped into the pool).


Yes, and so really what we are seeing is just the effects of a high quality bar. If you want high-quality software, you do more iterations. And iterations produce breaking changes.


Mmm. A lot of it’s motivated by pain, with the output filtered through a web of organizational friction, principal agent problems, self-promotional nonsense, and other harmful factors. If the web’s any indication, there’s no guarantee this activity is efficient at getting us closer to meeting a high quality bar.


Just as an FYI, Ibis has been around for approximately 10 years...


"TL; DR: we are deprecating the pandas and dask backends and will be removing them in version 10.0.

There is no feature gap between the pandas backend and our default DuckDB backend"

There is an important feature gap: pandas has all the users and no one knows you guys. I was trying to pitch this to my company, but I guess it's goodbye Ibis and thanks for all the fish.


were you using the pandas backend in Ibis and pitching that to your company? you'd have much better performance and efficiency using the default backend, even if you used pandas for data input and output (though the DuckDB data readers are generally more efficient too). I'd be happy to help you on the pitch!

and if you were using the pandas backend or know of many users doing so, please reach out! we love feedback from our users and would be sad to see y'all go, we're open to reconsidering any decisions based on feedback


Yes, I agree and have tested it; my point is that marketing-wise it is a shame. I would even think that when looking into new tooling, "pandas" compatibility as a search term and incentive would be far ahead of the word "dataframe".


Pandas compatibility as input and output will remain. Not sure what's in doubt here.


we will take this excellent marketing advice under advisement, thank you for your consideration


you could completely respect the decision to deprecate some support, but the marketing speak comes across as distasteful.

from a user's perspective, selecting a tool should be about what's "best" for you and not what is the "best" out there (from some arbitrary measure nonetheless).

pandas made sense and might still make sense for many, especially those who are migrating from R and are already comfortable with the dataframe paradigm. using the same paradigm and just saying "we run faster" does not solve the ergonomics or the reason why people adopted this way of processing data.

there are plenty of players in the space going about this from the stack. would be nice to see more discussion on alternative generic data processing paradigms instead.


this is a blog for the Ibis project, directly from one of the main developers/maintainers of the project, about deprecating two (of 20+) backends the project supports and giving the justification

as a project, we're big on using the right tool for the job -- of course we often think that's Ibis. a lot of effort has been put into the ergonomics, taking learnings from pandas and dplyr and SQL and various other Python data projects the team has been instrumental in. if you weren't aware, the creator of pandas also created Ibis and the current lead developer (Phillip Cloud) was also very involved in pandas. we have plenty of other blogs and things on the documentation about the project's ergonomics and thoughts on data processing paradigms

this specific blog isn't intended as "marketing speak" or to somehow be distasteful, it's intended to inform the Ibis community that we're deprecating a backend and justifying why


Pandas support as input and output isn't going anywhere. People can continue to bring their pandas DataFrames to Ibis and run Ibis expressions against them.

> pandas made sense and might still make sense for many, especially who are migrating from R and are already comfortable with dataframe paradigm.

Yes, and nothing is different about the paradigm after the changes outlined in the blog post, it's still a dataframe paradigm.

> using the same paradigm and just saying "we run faster" does not solve the ergonomics or the reason why people adopted this way of processing data.

This seems to indicate you _do_ understand that the paradigm isn't changing but then does a complete 180 into complaining about Ibis not solving pandas ergonomics issues. Which is it?


Related to this pandas?: https://pandas.pydata.org/


On the contrary, very related to that pandas!



