So it's sort of an embedded relational database table, but live, and maybe with some bells and whistles? Sounds like a triumph of marketing. What am I missing?
But articles shouldn't have to teach readers all the background needed. Otherwise even a well-written article would get a lot of bloat. Even linking other guides would require some context, and the total breadth of knowledge could be quite high. I could agree if there were a brief "areas you should know about" with a few keywords.
There are some areas where Hacker News might uniquely shine in answering questions, but "what are dataframes" isn't one of them. If someone sees "dataframe" and is confused, they have all the opportunity to search up "dataframes 101" and get up to speed.
“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”
“SQLite but only for Python and worse. Kind of.”
“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”
(I use this stuff daily)
[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”
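To make that concrete, here's roughly the round-trip I'm complaining about (column names and data are made up):

```python
import io
import pandas as pd

# A made-up CSV with a zero-padded ID column and a blank value.
csv = io.StringIO("zip_code,count\n00501,3\n10001,\n")
df = pd.read_csv(csv)

print(df.dtypes)
# zip_code      int64    <- leading zero gone: "00501" is now the number 501
# count       float64    <- integer column silently became float because of the blank

df.to_csv("out.csv", index=False)  # writes 501 and 3.0, not what you started with
```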
Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.
I've been testing Polars and DuckDB recently. Polars is an excellent dataframe package: extremely memory-efficient and fast. I've experienced some issues with hive partitioning of a large S3 dataset, issues which DuckDB doesn't have. I've scanned multi-TB S3 Parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other Parquet datasets just as large, etc. Very impressive software.
I haven't done the same types of things in Polars yet (simple selects).
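For a flavour of the kind of query I mean (the bucket, paths, and columns are made up, and you'd need S3 access configured, e.g. the httpfs extension and credentials):

```python
import duckdb

con = duckdb.connect()
con.sql("""
    SELECT
        user_id,
        event_ts,
        SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
    FROM read_parquet('s3://my-bucket/events/**/*.parquet', hive_partitioning = true)
    ORDER BY user_id, event_ts
    LIMIT 100
""").show()
```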
I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.
Same with xarray datasets.
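For example, just reading one cell gives you at least four overlapping options (example data made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob"], "score": [91, 78]}, index=["r1", "r2"])

df.loc["r2", "score"]   # label-based
df.iloc[1, 1]           # position-based
df.at["r2", "score"]    # "fast" scalar access by label
df.iat[1, 1]            # "fast" scalar access by position
```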
I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.
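Roughly what that comparison looked like (file name made up): load the same file in both and look at what each library inferred.

```python
import pandas as pd
import polars as pl

pd_df = pd.read_csv("data.csv")
pl_df = pl.read_csv("data.csv")

print(pd_df.dtypes)   # what pandas inferred for each column
print(pl_df.schema)   # what Polars inferred for each column
```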
To me, Polars feels like almost exactly how I would want to redesign the Pandas interfaces for small- to medium-sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi-index nonsense, throw out numpy and handle types properly, memory-map Arrow, focus on a method-chaining interface, do standard stuff like groupby and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.
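A rough sketch of that style, assuming a recent Polars and made-up file and column names:

```python
import polars as pl

result = (
    pl.scan_csv("sales.csv")                  # lazy: nothing is read yet
    .with_columns(
        (pl.col("revenue") / pl.col("revenue").sum().over("region"))
        .alias("share_of_region")             # window function via .over()
    )
    .group_by("region")
    .agg(
        pl.col("revenue").sum().alias("total"),
        pl.col("share_of_region").max().alias("top_share"),
    )
    .sort("total", descending=True)
    .collect()                                # the query optimizer runs the whole plan here
)
print(result)
```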
To be fair, Polars has the benefit of hindsight and of designing its interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for Python feels like it was mostly about timing the market rather than having the best software product.
The whole thing, though, is just shockingly unhelpful and half-baked for something that has taken over so completely in its niche. I guess my perspective is that of someone trying to build reliable automation, though; it's probably really nice if you're just noodlin' in notebooks or the repl or whatever.
Multi-indexes bake my data into the horrible pandas object model as weird tuples. Every gripe I've had with pandas starts with trying to do something “simple”, then following the pandas object model to achieve it and over-complicating things. Polars is awesome; it fits my numpy understanding of dataframes as dicts of labeled arrays. I even like the immutability aspect.
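Something like this is the mental model I mean (data made up):

```python
import polars as pl

# A dataframe as a dict of labeled arrays.
df = pl.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.0, 21.5]})

# Operations hand back a new frame; the original stays untouched.
warmer = df.with_columns((pl.col("temp_c") + 1.0).alias("temp_c_plus_1"))
print(df.columns)      # ['city', 'temp_c']
print(warmer.columns)  # ['city', 'temp_c', 'temp_c_plus_1']
```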
Accessing an individual cell value is slightly clunky, but that's not really what you use a DataFrame for. A DataFrame is an object for when the entirety of the dataset is under study, where you are typically interested in the broad distributions contained within the data.
After you have highlighted trends (the majority of the work), then you might go spelunking at individual examples to see why something is funny.
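The workflow looks roughly like this (file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")

# Whole-dataset view first: distributions and group-level trends.
print(df.describe())
print(df.groupby("sensor")["value"].mean())

# Only then go spelunking at the individual rows that look funny.
outliers = df[df["value"] > df["value"].quantile(0.99)]
print(outliers.head())
```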
> There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so it can't be viewed or edited with standard tools, which is a pain.
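For what it's worth, the trade-off described there looks roughly like this (file name made up): Parquet is compact and keeps column types, but you have to go through a reader to look at it, unlike a plain CSV.

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})
df.write_parquet("example.parquet")        # compressed, columnar, typed
print(pl.read_parquet("example.parquet"))  # can't just open it in a text editor
```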
But it is such a shame that us plebes can't learn about it to maybe find the hammer to this nail that is sticking up in our projects because the product can't tell people what it is. And it's only made worse by arrogant responses like yours.
It is a data structure used by data scientists. Polars fills the same role as the previous go-to library (Pandas), but faster. If you don't analyze data, you can ignore it.
Guess that would help if I already knew what a dataframe is!