> NULL indicates a missing value, and NaN is Not a Number.
That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.
There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
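To make the comparison shenanigans concrete, here is a minimal Python sketch (using NumPy and pandas) of how NaN and missing values behave:

```python
import numpy as np
import pandas as pd

# IEEE 754 NaN: the result of an indeterminate form such as 0/0.
nan = np.float64(0.0) / np.float64(0.0)   # emits a RuntimeWarning, yields nan
print(nan == nan)        # False -- NaN compares unequal even to itself
print(np.nan != np.nan)  # True

# pandas treats None, NaN, and pd.NA as "missing" in many contexts,
# which is exactly the conflation being discussed here.
s = pd.Series([1.0, None, float("nan")])
print(s.isna().tolist())  # [False, True, True]
```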
Related to this, someone below posted a link to a blog post by Wes McKinney where he discussed pandas' limitations and how PyArrow works around them:
> 4. Doing missing data right
> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.
> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.
> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.
Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes. So there has been movement to address the critiques presented there.
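For illustration, a small PyArrow sketch of the validity-bitmap design described in the quote:

```python
import pyarrow as pa

# Arrow stores validity as a separate packed bit buffer, plus an explicit null count.
arr = pa.array([1, None, 3, None])
print(arr.null_count)                   # 2
validity, values = arr.buffers()        # validity bitmap first, then the data buffer
print(validity.to_pybytes())            # packed bits, 1 bit per value

# An array with no nulls has null_count == 0, so engines can skip null checks.
print(pa.array([1, 2, 3]).null_count)   # 0
```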
Why don't we build a portable numeric system that just has these mathematical constants and definitions built in, in a portable and performant manner?
> One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it’s certainly is something out in the real world, we just don’t know what.
I'm not sure I understand this. Can you explain it with an example?
What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Codd was born on 19 August 1923 and died on 18 April 2003.
One well-established approach[1] to NULLs is that they extend the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and lives to the present day.
We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.
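A toy sketch of that propagation rule, with Python's None standing in for SQL's NULL, showing why even NULL = NULL is unknown rather than true:

```python
# Three-valued logic: None plays the role of SQL's NULL ("unknown").
def tv_eq(a, b):
    if a is None or b is None:
        return None   # NULL = anything  ->  NULL
    return a == b

def tv_and(p, q):
    if p is False or q is False:
        return False  # FALSE AND anything -> FALSE, even if the other side is NULL
    if p is None or q is None:
        return None
    return True

print(tv_eq(None, None))    # None: we can't tell whether two unknowns are equal
print(tv_and(False, None))  # False
print(tv_and(True, None))   # None
```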
In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).
In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.
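The same distinction, sketched in pandas rather than SQL, with hypothetical data:

```python
import pandas as pd

users = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    # None = unknown, "" = known to have no middle name
    "middle_name": [None, "", "Benedict"],
})

# Users whose middle name is unknown -- prod them next time we talk to them.
print(users[users["middle_name"].isna()])

# Users known to have no middle name, should we care.
print(users[users["middle_name"] == ""])
```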
> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.
I've been around numerous Django devs when they found out that other DBs and ORMs do not remotely consider NULL and empty string to be the same thing.
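For reference, the convention the Django docs describe looks roughly like this in a model definition (hypothetical `User` model; assumes a configured Django project):

```python
from django.db import models

class User(models.Model):
    # Django convention for text fields: allow blank values in forms but keep the
    # column NOT NULL, so "no data" has exactly one representation: the empty string.
    middle_name = models.CharField(max_length=100, blank=True, default="")
```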
Even for medium tasks, the slow step is rarely the computations but the squishy human entering them.
When I work on a decent 50GB+ dataset, I have to be doing something fairly naive before I get frustrated at computation time.
Edit: pandas is now Boring Technology (not a criticism). It is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or lower RAM pressure, which is really low on my list of problems.
> YAGNI - For lots of small data tasks, pandas is perfectly fine.
There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.
It’s a real question as to whether big data systems become more accessible before single node machines become massive enough to consume almost every data set.
In my experience, Polars streaming runs out of memory at much smaller scales than both DuckDB and DataFusion and tends to use much more memory for the same workload when it doesn't outright segfault.
Polars is faster than those two below a few GB of data, but beyond that you're better off with DuckDB or DataFusion.
I would love for this to improve in Polars, and I'm sure it will!
Do you mean segfault or OOM? I am not aware of Polars segfaulting on high memory pressure.
If it does segfault, would you mind opening an issue?
Some context: Polars is building a new streaming engine that will eventually be able to run the whole Polars API (also the hard stuff) in a streaming fashion. We expect the initial release at the end of this year or early next year.
Our in-memory engine isn't designed for out-of-core processing and thus if you benchmark it on restricted RAM, it will perform poorly as data is swapped or you go OOM. If you have a machine with enough RAM, Polars is very competitive in performance. And in our experience it is tough to beat in time-series/window functions.
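For anyone unfamiliar, a minimal sketch of the difference: the lazy API with streaming enabled versus the default in-memory collect (hypothetical file and column names):

```python
import polars as pl

lazy = (
    pl.scan_parquet("events.parquet")        # lazy scan, nothing loaded yet
      .filter(pl.col("status") == "ok")
      .group_by("user_id")
      .agg(pl.col("amount").sum())
)

df_in_memory = lazy.collect()                # default in-memory engine
df_streamed = lazy.collect(streaming=True)   # streaming engine, for larger-than-RAM data
```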
Segmentation violations are often the result of different underlying problems, one of which can be running out of memory.
We (the Ibis team) have opened related issues and the usual response is to not use streaming until it's ready, or to fix the problem if it can be fixed.
Not sure what else there is to do, seems like things are working as expected/intended for the moment!
We'll definitely be the first to try out any improvements to the streaming engine.
Nice to see. Over the past months I've replaced pandas with Ibis in all new projects, and I am a huge fan!
- Syntax in general feels more fluid than pandas
- Chaining operations with deferred expressions makes code snippets very portable
- DuckDB backend is super fast
- Community is very active, friendly and responsive
I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)
Pandas has been working fine for me. The most powerful feature that makes me stick to it is the multi-index (hierarchical indexes) [1]. Can be used for columns too. Not sure how the cool new kids like polars or ibis would fare in that category.
Multi-indexes definitely have their place. In fact, I got involved in pandas development in 2013 as part of some work I was doing in graduate school, and I was a heavy user of multi-indexed columns. I loved them.
Over time, and after working on a variety of use cases, I personally have come to believe the baggage introduced by these data structures wasn't worth it. Take a look at the indexing code in pandas, and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning. The maintenance cost alone is quite high.
We don't plan to ever support multi-indexed rows or columns in Ibis. I don't think we'd fare well _at all_ there, intentionally so.
> and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning
I might not be aware of everything that's possible -- the usage I have of it doesn't give me an impression of staggering complexity. In fact I've found the functionality quite basic, and have been using pd.MultiIndex.from_* quite extensively for anything slightly more advanced than selecting a bunch of values at some level of the index.
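A minimal sketch of that kind of usage, with hypothetical data:

```python
import pandas as pd

# A two-level row index built with pd.MultiIndex.from_product.
idx = pd.MultiIndex.from_product(
    [["2024-01", "2024-02"], ["sensor_a", "sensor_b"]],
    names=["month", "device"],
)
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Select a bunch of values at some level of the index.
print(df.loc["2024-01"])                         # all devices for one month
print(df.loc[pd.IndexSlice[:, "sensor_b"], :])   # one device across all months
```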
Complicated code is (probabilistically) slow, buggy, infrequently updated code. By all means, if it looks like a good enough tool for the job (especially if the alternatives don't) then use it anyway, but that's slightly different from it not being your concern.
I've seen enough projects need "surprise" major revisions because some team tried to sneak a dataframe into a 10M QPS service that my default is keeping pandas far away from anything close to a user-facing product.
I've also seen costs balloon as the data's scale grows beyond what pandas can handle, but basically all the alternatives suck for myriad reasons, so I don't try to push "not pandas" in the data backend. People can figure out what works for themselves, and I kind of like just writing it from scratch in a performant language when I personally hit that bottleneck.
I work a lot with IoT data, where basically everything is multi-variate time-series from multiple devices (at different physical locations and logical groupings). Pandas multi index is very nice for this, at least having time+space in the index.
Sorry I don't know what to answer. I don't think what I do qualifies as "workload".
I have a process that generates lots of data. I put it in a huge multi-indexed dataframe that luckily fits in RAM. I then slice out the part I need and pass it on to some computation (at which point the data usually becomes a numpy array or a torch tensor). Core-count is not really a concern as there's not much going on other than slicing in memory.
The main gain I get of this approach is prototyping velocity and flexibility. Certainly sub-optimal in terms of performance.
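A rough sketch of that workflow (hypothetical IoT-style data), slicing on the index and handing the result to numeric code:

```python
import numpy as np
import pandas as pd

# Frame indexed by (time, device), as in the IoT example above.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-01", periods=3, freq="h"), ["dev1", "dev2"]],
    names=["time", "device"],
)
df = pd.DataFrame(np.random.default_rng(0).normal(size=(6, 2)),
                  index=idx, columns=["temp", "humidity"])

# Slice out one device, then pass the values on to numpy/torch.
block = df.xs("dev2", level="device")
array = block.to_numpy()
print(array.shape)   # (3, 2)
```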
Polars is a great choice but, like pandas, locks you into its execution engine(s). Polars is a supported backend for Ibis, though depending on your use case DuckDB or DataFusion may scale better. We did some benchmarking recently: https://ibis-project.org/posts/1tbc/
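For context, a minimal sketch of what swapping engines looks like in Ibis, assuming a hypothetical events.parquet with status, user_id, and amount columns:

```python
import ibis

# The same expression can target different engines; only the connect line changes.
con = ibis.duckdb.connect()   # or ibis.polars.connect() / ibis.datafusion.connect()
t = con.read_parquet("events.parquet")

filtered = t.filter(t.status == "ok")
expr = filtered.group_by("user_id").agg(total=filtered.amount.sum())

df = expr.to_pandas()   # execute on the chosen backend, return a pandas DataFrame
```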
Also depending on your use case, Ibis may not support the part of the Polars API you actually need
(to be clear, I'm a fan of "DuckDB via Ibis", I'm much less a fan of statements to the effect of "use Polars via Ibis instead of Polars directly", as I think they miss a lot :wink:)
I've been using pandas for over a decade and I'm very efficient with it.
The Modern Pandas blog series from Tom Augspurger is excellent, and the old book by Wes McKinney (the creator of pandas) gave a good glimpse into how to use it effectively.
I don't think many people learn it correctly so it gets used in a very inefficient and spaghetti script style.
It's definitely a tool that's showing its age, but it's still a very effective Swiss Army knife for data processing.
I am glad we're getting better alternatives; Hadley Wickham's tidyverse really showed what data processing can be.
I also recommend Matt Harrison's talks ("Idiomatic Pandas") and book ("Effective Pandas").
Once I adopted method chaining, a lot of the issues that I had with pandas in the past due to poor style (e.g. SettingWithCopyWarning) pretty much disappeared.
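A small sketch of the chained style, with made-up data:

```python
import pandas as pd

raw = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "temp_f": [41.0, 68.0, 39.2]})

# Each step returns a new frame, so there is no ambiguous view-vs-copy
# assignment and no SettingWithCopyWarning.
result = (
    raw
    .assign(temp_c=lambda df: (df["temp_f"] - 32) * 5 / 9)
    .query("temp_c < 10")
    .sort_values("temp_c")
    .reset_index(drop=True)
)
print(result)
```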
Just imagine how much more efficient you would be if you were using R's data.table.
Look, I applaud your skill, but at some point even a master craftsman realizes that the Swiss Army knife may not be the best tool, and a Leatherman offers certain advantages.
I really like R’s libraries and I’ll use them any chance I get (libraries like lme4 are still orders of magnitude more efficient than the same model in statsmodels).
From my experience the biggest impediment to using R in production is many orgs don’t have a blessed way to run it.
R is my favourite language for data processing; the manual section Computing on the Language[1] is why R is such an ergonomic tool. I had hoped Julia would catch up, but Julia’s macros are not comparable in their depth.
I think pandas is probably the data equivalent of editing files using default vim or processing data with awk.
As a joke, I wrote an Ibis backend (https://github.com/cpcloud/ibish) that processes expressions using shell commands strung together with named pipes. It supports joins using the coreutils join command, projections, filters and some aggregations with awk.
It's faster than pandas in some cases and folks should put it into production immediately!
I was not aware of that; the influence from the tidyverse is very clear.
I’ve tried to make mockups of similarly inspired APIs, and I think they did a great job with clever tricks like the `_` reference (does that clash with IPython/Jupyter’s use?).
I do wonder how they manage compatibility with so many backends, it seems like many features will be directly tied to your backend (e.g. Trino using Java Pattern for regex vs BigQuery using Re2) in hard to explain ways. But maybe that’s not a big concern, because very often you’re only going to be using a couple of backends.
I’ll have to try Ibis out for myself; it looks like it can unify a lot of the work I have to do. I’ve moved away from pandas for all but the last mile of computation, so this might be a good direct replacement for SQL.
`_` definitely clashes with IPython and Jupyter's use of it.
However, if you import it from Ibis then it ceases to be used as "most recent result" and remains the ibis underscore object (unless of course you explicitly assign it to something else).
Regarding backend compatibility, there are definitely a few kinds of things that we don't currently abstract over. One is regular expression syntax and another is floating point math (e.g., various algebraic properties that are violated that result in slightly different outputs).
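For anyone curious what the deferred `_` looks like in practice, a minimal sketch:

```python
import ibis
from ibis import _   # the deferred expression object discussed above

t = ibis.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})

expr = (
    t.filter(_.a > 1)    # _ stands for "the current table" in the chain
     .select(_.a, _.b)
)
print(expr.to_pandas())
```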
I don't care so much about the memory and CPU stuff, I mostly leave the heavy lifting to an SQL engine.
Although the null handling seems very compelling, I guess it comes at the cost of incompatibility with existing libraries; otherwise pandas would have implemented it as well?
If you mean whether I run it distributedly a la Spark then no. If you mean whether I test it on various machines with different RAM sizes then yes.
> I don't care so much about the memory and CPU stuff, I mostly leave the heavy lifting to an SQL engine.
Well, I care. Both pandas and Polars are, in my view, single-machine dataframe libraries, so the memory and CPU constraints are rather stringent.
My comparison is based solely on my experience: reading CSV files that are 20% to 50% the size of RAM, pandas takes (or errors out after) 2 to 10 minutes, while Polars finishes in 20 seconds. Queries in pandas are almost always slower than in Polars.
But reading your comment, it seems you and I have different use cases for dataframe libraries, which is fine. I mostly use them for exploratory analysis, so the SQL api is not that much of a plus to me, but the performance is.
When using pandas appropriately, that is, with method chaining, lambda expressions (instead of intermediate assignments), and PyArrow data types, you also get much better speed and null-value handling.
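A minimal sketch of opting into PyArrow-backed types in pandas 2.x (hypothetical file name):

```python
import pandas as pd

# Read directly into Arrow-backed dtypes, which use pd.NA for missing values
# instead of overloading NaN.
df = pd.read_csv("data.csv", dtype_backend="pyarrow", engine="pyarrow")

# Or convert an existing frame.
df2 = pd.DataFrame({"x": [1, None, 3]}).convert_dtypes(dtype_backend="pyarrow")
print(df2.dtypes)    # x    int64[pyarrow]
print(df2["x"][1])   # <NA>
```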
This irritates me. What was the point of THEIR comment? To be cunty? It absolutely was NOT a productive comment. They were being an ass. And you’re defending them being an ass, asking me who I am like I’m somehow not allowed to point out someone being a cunt unless I have some type of status symbol of which you need to approve. At this point I can only fathom you’re a coworker or friend of theirs, hence the defense. Nothing else makes sense.
One thing I do like about pandas is it’s pretty extensible to columns of new types. Maybe I’m missing something, but does Polars allow this? Last time I checked there wasn’t a clear path forward.
The pandas support for extension types is very robust, with extensive testing to make sure you can do all the data-aggregation functionality you expect on any pandas column.
Never heard of Ibis before, but the front page gives a pretty good overview of what it is, to me at least. Looks like a dataframe API that can be executed on a variety of backend engines, both distributed and local.
Somewhat analogous to how the pandas API can be used in PySpark via pyspark.pandas, but going from the API -> implementation direction rather than (pandas) implementation -> API -> implementation, maybe?
As far as I can tell, it is an ORM for data engineers. They can write Python code that gets translated to either SQL or some other language understood by the DB engine that actually runs it.
Personally, I tend to use pandas because it is integrated everywhere, because of the ecosystem that uses it. Let's say I want to read data from a JSON file (CSV file, Python dict, etc.) and I want to plot it using Plotly. If Ibis is compatible with whatever a pandas dataframe is compatible with, then for most of my usage I don't really care much about the "backend".
In my experience, the best thing about Pandas is how much it made me appreciate using dplyr and the tidyverse. If it wasn't for Pandas, I may not be the avid R user I am today.
I will try it out, but honestly the pain point for me is that the pandas API is not always intuitive/natural to how things should be done in Python. The NaN/None bit is annoying, but I find that to be a minor annoyance.
I've only heard about Ibis maybe three times in the past two years and I pay pretty close attention to the space. If Ibis moves away from pandas, then it just means that I am less likely to try Ibis because there is no bridge.
Sure, hip new frameworks are moving away from pandas/numpy, but I'll wait 5 years for the dust to settle here while the compatibility and edge cases sort themselves out. The pydata/numfocus ecosystem is extensive.
It's just tabular data. So what if I have to wait a few more milliseconds to get my result.
> If Ibis moves away from pandas, then it just means that I am less likely to try Ibis because there is no bridge.
the bridge is that Ibis accepts pandas as input and has a `to_pandas()` method as output
Ibis also still depends on pandas (and thus numpy) internally. also, Ibis was created by the creator of pandas, and the lead developer (and other contributors) have also worked on pandas. the Ibis team understands pandas and the broader Python data ecosystem very well
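A rough sketch of that bridge, with made-up data:

```python
import ibis
import pandas as pd

df = pd.DataFrame({"species": ["a", "a", "b"], "mass": [1.0, 2.0, 3.0]})

t = ibis.memtable(df)                                      # pandas in
expr = t.group_by("species").agg(mean_mass=t.mass.mean())
out = expr.to_pandas()                                     # pandas out
print(out)
```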
> It's just tabular data. So what if I have to wait a few more milliseconds to get my result.
usually as scale grows, milliseconds -> seconds -> minutes -> hours -> days. at some point along the way having a dataframe library that can scale up without rewriting your code might be useful. but if you're dealing with small tabular data and pandas meets your needs, it's a great library to use and stick with!
Funny, same here :) When I was switching from R (data.table) to Python, it was painful, not only because it was slow but because of the API. At that time, I thought that maybe it was just the pain of switching to something new. Several years later, switching from the pandas API to the Polars API was a real joy (Ibis is also good in that sense). So I learned that it had been pandas' fault all along :)
It's great to have a single entry point for multiple backends. What I am trying to understand, and couldn't find much information on: how does the use of multiple engines in Ibis impact the consistency of results for the same input and query, particularly with regard to semantic differences among the engines?
I would not say pandas is a recreation of SQL/Spark; they have very different use cases in my experience. SQL/Spark is like a bulk data management tool: I use it if I need to load massive data from some remote store, perform light preprocessing, join up a couple of dimension tables, etc. Having normalized and joined my data, pandas then enters as a 'last-mile' processing engine, particularly when paired with e.g. SKLearn or whatever other inference lib you're using. Pandas is awesome if you need to, e.g., apply string manipulation to a data table or daisy-chain some complicated computations together. Honestly, in my opinion the API is really nice; I came over from R tidyverse, and the 'chained methods' approach to pandas lets me carry over all my old patterns and paradigms. I find it far easier to use that approach than having to write like 20 dependent subqueries or staging tables.
True, Spark has existed for years and is a great toolset; I use it every day. It's also a huge hassle spinning clusters up and down, and configuration is complex.
I can execute some pretty hairy scans against a huge S3 Parquet dataset in DuckDB that I would typically have to run in either Spark or Athena. It's a little slower, but not ridiculously slower. And it does all of that from my desktop: no clusters, no memory or task configs, just run the query. Being able to integrate all of that expensive historical scanning, and knit it back into an ML pipeline with desktop Python, is pretty nice.
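A sketch of that kind of query from Python, assuming a hypothetical S3 dataset and configured credentials:

```python
import duckdb

# httpfs provides the S3 filesystem support used by read_parquet below.
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")

result = duckdb.sql("""
    SELECT user_id, count(*) AS n_events
    FROM read_parquet('s3://my-bucket/events/*/*.parquet')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY user_id
""").df()
```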
How does Ibis manage ordering of the dataframe for SQL-based engines?
One big gap that I noticed is that pandas retains ordering between operations, but most SQL-based engines don’t give you that guarantee. Looking at DuckDB, it looks like they only do that for CSV.
this is a huge difference between Ibis (and really, any dataframe library that works with larger-than-RAM or distributed engines) and pandas. PySpark, Polars, and Snowpark all to some extent also do not implicitly order your data
so in Ibis, you need to explicitly manage order with an `order_by` call at the end. there are a lot of technical reasons for why maintaining order is inefficient as you scale up
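A minimal sketch of making the order explicit (made-up data):

```python
import ibis

t = ibis.memtable({"ts": [3, 1, 2], "value": [30.0, 10.0, 20.0]})

# No implicit row order is preserved across operations; ask for it explicitly.
expr = t.filter(t.value > 5).order_by("ts")   # or ibis.desc("ts") for descending
print(expr.to_pandas())
```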
Ibis still supports over a dozen backends, including DuckDB and DataFusion and Polars (all great single-node OLAP query engine options that are often faster than trying something distributed) and PySpark, Snowflake, BigQuery, Trino, ClickHouse, Druid, and more
plenty of options for fast and scalable computation. in the fullness of time, it would be great to have a Dask backend but it'd need to be rewritten to use dask-expr, which isn't something the core Ibis team has the capacity for
1. The value prop of Ibis is to avoid vendor lock-in and allow users to more easily swap out execution engines that meet their needs. So it's valuable to the project to have "competing" backends supported (e.g. DuckDB, DataFusion, and Polars for local, and Snowflake, ClickHouse, PySpark, Databricks, etc. for distributed) -- the idea is a standard like Apache Arrow that everyone adopts to stop this cycle of tightly coupling a dataframe interface to an execution engine
2. Ibis is primarily supported by Voltron Data (my employer), which is itself a vendor, so the real worry would be that Ibis drops all backends until it only supports Voltron Data's engine (Theseus)...
3. ... but Voltron Data cannot do that for a number of reasons. First, Ibis is an independently governed open source project. Second, it would not make any business sense. You can read some more on that here: https://ibis-project.org/posts/why-voda-supports-ibis.
The general idea is to add more backends, not remove them, but pandas is a special case where it introduces a lot of technical complexity with (arguably) negative benefit to potential users -- the DuckDB backend can do everything the pandas backend can do, but better
> the DuckDB backend can do everything the pandas backend can do, but better
Maybe this should be a focus. Putting together killer demos that show this to people. Because I don’t see it.
Maybe I’m thinking about it in the wrong terms. I think of DuckDB in terms of what we’re doing with pandas (which isn’t perfect but works), and also as an embedded database like a SQLite alternative.
In both aspects it seems like we have more friction working in DuckDB than in pandas and SQLite. Most common problem is getting errors about the DuckDB file being locked or in use, by the process that’s trying to read from it (which just spawned). I thought maybe it was me, but even Jetbrains tooling had the same weird issues with DuckDB.
Compared to SQLite, which just works, with virtually any language or tooling, I don’t see the benefit to DuckDB. I see some people say it’s better and more efficient for larger data, but I couldn’t get it to work reliably with tiny data on a single machine.
I use both DuckDB and SQLite at work. SQLite is better when the database gets lots of little writes, when the database is going to be stored long term as the primary source of truth, when you don't need to do lots of complicated analytical queries etc. Very useful for storing both real and simulated data long term.
DuckDB is much nicer in terms of types, built in functions, and syntax extensions for analytical queries and also happens to be faster for big analytical queries although most of our data is small enough that it doesn't make a big difference over SQLite (although it still is a big improvement over using Pandas even for small data). DuckDB only just released versions with backwards compatible data storage though, so we don't yet use it as a source of truth. DuckDB has really improved in general over the last couple years as well and finally hit 1.0 3 months ago. So depending what version you tried it out on tiny data, it may be better now. It's also possible to use DuckDB to read and write from SQLite if you're concerned about interop and long term storage although I haven't done that myself so don't know what the rough edges are.
I think you're trying to use DuckDB incorrectly. If you have a writer connection open to the DB, you can't open reader connections as well (including from IntelliJ). You'll get file lock errors, as you noticed. You can either have a single writer connection at once, or multiple reader connections at once. DuckDB is significantly better than SQLite for analytical queries on big data sets, but it's not a replacement for a transactional database such as SQLite.
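A small sketch of the single-writer / multiple-reader pattern (hypothetical database file):

```python
import duckdb

# One writer connection at a time...
writer = duckdb.connect("analytics.duckdb")
writer.execute("CREATE TABLE IF NOT EXISTS t AS SELECT 1 AS x")
writer.close()

# ...or any number of read-only connections (your script, an IDE, etc.),
# opened with read_only=True so they don't take the write lock.
reader = duckdb.connect("analytics.duckdb", read_only=True)
print(reader.sql("SELECT * FROM t").df())
reader.close()
```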
I'll also suggest inscribing stone tablets with a chunked version of your computation and distributing the described chunked computation to your friends, and then gathering the tablets and finishing up the compute.
Dask is designed for scaling to multiple compute nodes, and is often slower than vanilla pandas on a single machine because of the overhead.
If your data is larger than memory though, at a certain point Dask is faster than pandas (which will often just fail) and DuckDB (which is super fast for in-memory data, and still quite fast for larger-than-memory data).
Rule of thumb these days though is that if your data isn't in the terabytes per run, a single processing node is probably best.
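For reference, a minimal Dask sketch under those assumptions (hypothetical partitioned dataset):

```python
import dask.dataframe as dd

# Dask partitions the work across chunks/nodes, at the cost of per-task
# overhead that often makes it slower than plain pandas on small in-memory data.
ddf = dd.read_parquet("events/*.parquet")
result = ddf.groupby("user_id")["amount"].sum().compute()
print(result.head())
```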
Man, these frontend people are constantly churning through dependencies, a new framework every year. Don't reinvent the wheel, just be like backend folks and pick a dependency and stick with it. Oh wait...
The ETL and data analysis “notebook” world has been warming up for years and is now red-hot, due to every company down to Grandma’s Cookie Bakery LLC wanting to “do AI”. And most of the standard tools are surprisingly terrible, so churn is increasing for reasons similar to why that happened in frontend (the underlying tools are ass, the existing interfaces are bad, almost all of it’s being overcomplicated for bad reasons, and a million more developers just jumped into the pool).
Yes, and so really what we are seeing is just the effects of a high quality bar. If you want high-quality software, you do more iterations. And iterations produce breaking changes.
Mmm. A lot of it’s motivated by pain, with the output filtered through a web of organizational friction, principal-agent problems, self-promotional nonsense, and other harmful factors. If the web’s any indication, there’s no guarantee this activity is efficient at getting us closer to meeting a high quality bar.
"TL; DR: we are deprecating the pandas and dask backends and will be removing them in version 10.0.
There is no feature gap between the pandas backend and our default DuckDB backend"
There is an important feature gap: pandas has all the users and no one knows you guys. I was trying to pitch this to my company, but I guess it's goodbye, Ibis, and thanks for all the fish.
were you using the pandas backend in Ibis and pitching that to your company? you'd have much better performance and efficiency using the default backend, even if you used pandas for data input and output (though the DuckDB data readers are generally more efficient too). I'd be happy to help you on the pitch!
and if you were using the pandas backend or know of many users doing so, please reach out! we love feedback from our users and would be sad to see y'all go, we're open to reconsidering any decisions based on feedback
Yes, I agree and have tested it; my point is that marketing-wise it is a shame. I would even think that, when people look into new tooling, "pandas" compatibility as a search term and incentive would be far ahead of the word "dataframe".
you can completely respect the decision to deprecate some support, but the marketing speak comes across as distasteful.
from a user's perspective, selecting a tool should be about what's "best" for you and not what is the "best" out there (by some arbitrary measure, no less).
pandas made sense and might still make sense for many, especially for those who are migrating from R and are already comfortable with the dataframe paradigm. using the same paradigm and just saying "we run faster" does not address the ergonomics or the reason why people adopted this way of processing data.
there are plenty of players in the space going about this from the stack. would be nice to see more discussion of alternative generic data-processing paradigms instead.
this is a blog for the Ibis project, directly from one of the main developers/maintainers of the project, about deprecating two (of 20+) backends the project supports and giving the justification
as a project, we're big on using the right tool for the job -- of course we often think that's Ibis. a lot of effort has been put into the ergonomics, taking learnings from pandas and dplyr and SQL and various other Python data projects the team has been instrumental in. if you weren't aware, the creator of pandas also created Ibis and the current lead developer (Phillip Cloud) was also very involved in pandas. we have plenty of other blogs and things on the documentation about the project's ergonomics and thoughts on data processing paradigms
this specific blog isn't intended as "marketing speak" or to somehow be distasteful, it's intended to inform the Ibis community that we're deprecating a backend and justifying why
Pandas support as input and output isn't going anywhere. People can continue to bring their pandas DataFrames to Ibis and run Ibis expressions against them.
> pandas made sense and might still make sense for many, especially who are migrating from R and are already comfortable with dataframe paradigm.
Yes, and nothing is different about the paradigm after the changes outlined in the blog post, it's still a dataframe paradigm.
> using the same paradigm and just saying "we run faster" does not solve the ergonomics or the reason why people adopted this way of processing data.
This seems to indicate you _do_ understand that the paradigm isn't changing but then does a complete 180 into complaining about Ibis not solving pandas ergonomics issues. Which is it?