Hacker News

> So, what is Polars? A short description would be “a query engine with a DataFrame frontend”.

Guess that would help if I already knew what a dataframe is!



Logically, it's a SQL table or spreadsheet where each column can be of a different type: a column of usernames (strings), their ages (integers), etc.

Every row is a distinct entity. Every column is usually stored / treated as its own distinct array.


So it's a sort of embedded relational database table, but live, and maybe with some bells and whistles? Sounds like a triumph of marketing. What am I missing?


An earlier thread on Polars had this as the #1 thread (so the same complaint) - https://news.ycombinator.com/item?id=38920043


Another bit of context: it's "polars" (like polar bears) because the popular python dataframe library is "pandas".

My first assumption was that it was somehow related to polar coordinates.


If you don't know what a dataframe is, then it doesn't matter what Polars is; it's not for you.


A dataframe is a potentially useful concept even to those who haven't heard of it, and Polars can be a reasonable introduction to it too.


With that attitude I'd never learn anything new


Go learn about Dataframes then. There's no reason an article like this should start with a Dataframes 101.


Would be quicker for you to just search for the word "dataframe" than to post on Hacker News that you don't know what a dataframe is.


It's useful feedback for whoever wrote the article, that there are terms they need to not take for granted.


But articles shouldn't have to teach readers all the background needed. Otherwise even a well written article would get a lot of bloat. Even linking other guides would require some context, and the total breadth of knowledge could be quite high. I could agree if there was to be a brief "areas you should know about" with a few keywords.

There are some areas where Hacker News might uniquely shine in answering questions, but "what are dataframes" isn't one of them. If someone sees "dataframe" and is confused, they have all the opportunity to search up "dataframes 101" and get up to speed.


“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”

“SQLite but only for Python and worse. Kind of.”

“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”

(I use this stuff daily)

[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”
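The dtype-mangling complaint above is easy to reproduce. A small sketch with pandas, round-tripping through an in-memory CSV (the column names and values are made up for illustration):

```python
import io

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01"]),  # real datetime64 column
    "code": ["007"],                         # string; the leading zero matters
})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# The datetime comes back as a plain object/string column, and "007"
# is silently parsed as the integer 7: the type information is gone.
print(back.dtypes)
```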


Is DuckDB a better tool for your purposes?

Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.


I've been testing Polars and DuckDB recently. Polars is an excellent dataframe package: extremely memory efficient and fast. I've experienced some issues with hive-partitioning of a large S3 dataset in Polars that DuckDB doesn't have. I've scanned multi-TB S3 parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other parquet datasets just as large, etc. Very impressive software. I haven't done the same types of things in Polars yet (simple selects).


Hmm, how are you getting acceptable performance doing that? Is your data in a delta table or something?

I've actually personally found that DuckDB is tremendously slow against the cloud, though perhaps I'm going through the wrong API?

I'm using https://duckdb.org/docs/guides/import/s3_import.

My data is hive-partitioned; when I monitor my network throughput, I only get a few MB/s with DuckDB but can achieve 1-2 GB/s through Polars.

Very possible it's a case of PEBKAC though.


Also, DuckDB allows you to convert scan results directly to both pandas and Polars dataframes, so you can mix and match based on need.


LOL.

I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.

Same with xarray datasets.

I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.


To me, Polars feels like almost exactly how I would want to redesign the Pandas interfaces for small-to-medium-sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi-index nonsense, throw out numpy and handle types properly, memory-map arrow, focus on a method-chaining interface, do standard stuff like group-bys and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.

To be fair, Polars has the benefit of hindsight and designed its interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for Python feels like it was mostly about timing the market rather than having the best software product.


Polars is at least more consistent & sensible.

The whole thing though is just shockingly unhelpful and half-baked for something that has taken over so completely in its niche. I guess my perspective is that of someone trying to build reliable automation, though; it's probably really nice if you're just noodlin' in notebooks or the REPL or whatever.


What part of df.loc[index_val, column_val] is hard here?

Oh you meant multiindex... yeah, slicing multiindexes sucks :)
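A runnable version of that accessor, for the record (plain pandas, made-up data; `.at` is the dedicated single-cell accessor):

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 29]}, index=["alice", "bob"])

print(df.loc["alice", "age"])  # scalar lookup by row label and column name
print(df.at["bob", "age"])     # .at does the same for single cells, faster
```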


Multi-indexes bake my data into the horrible pandas object model as weird tuples. Every gripe I've had with pandas starts with trying to do something "simple", then following the pandas object model to achieve it and overcomplicating things. Polars is awesome; it fits my numpy understanding of dataframes as dicts of labeled arrays. I even like the immutability aspect.


Accessing an individual cell value is slightly clunky, but that's not really where you use a DataFrame. A DataFrame is an object for when the entirety of the dataset is under study, where you are typically interested in the broad distributions contained within the data.

After you have highlighted trends (the majority of the work), then you might go spelunking at individual examples to see why something is funny.


I mean, if you are reading and writing CSV, then yeah, you've already fucked up.


CSV is a terrible format. But it is extensively used. See also:

Why isn’t there a decent file format for tabular data? https://news.ycombinator.com/item?id=31220841


parquet is perfectly fine

> There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

I can open parquet in excel



Polars allows you to use SQL, I think?


Eh, plenty of SQL-first analysts could find it helpful. Not just for people coming from pandas/r.


But it is such a shame that us plebes can't learn about it to maybe find the hammer to this nail that is sticking up in our projects because the product can't tell people what it is. And it's only made worse by arrogant responses like yours.


It is a data structure used by data scientists. Polars fills the same role as the previous solution (pandas), but faster. If you don't analyze data, you can ignore it.


Python framework for tables (arrays with column names).

You can query/filter/sort and pick/transform data in multiple directions (row/column).

It's used for data mining, ML, etc.

It's basically like working with a spreadsheet/SQL table.


Note that dataframes aren't Python-specific. R has them too, and R's dataframe was an inspiration for Pandas.


A wide-form data structure, often with an index or partition framework in place, able to apply column- and row-transformations quickly.



