So it's sort of an embedded relational database table, but live, and maybe with some bells and whistles? Sounds like a triumph of marketing. What am I missing?
But articles shouldn't have to teach readers all the background needed. Otherwise even a well-written article would get a lot of bloat. Even linking other guides would require some context, and the total breadth of knowledge could be quite high. I could agree if there were a brief "areas you should know about" with a few keywords.
There are some areas where Hacker News might uniquely shine in answering questions, but "what are dataframes" isn't one of them. If someone sees "dataframe" and is confused, they have all the opportunity to search up "dataframes 101" and get up to speed.
“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”
“SQLite but only for Python and worse. Kind of.”
“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”
(I use this stuff daily)
[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”
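To make that concrete, here's roughly the round-trip I'm complaining about (column names and data are made up):

```python
import io
import pandas as pd

# A made-up CSV with a zero-padded ID column and a blank value.
csv = io.StringIO("zip_code,count\n00501,3\n10001,\n")
df = pd.read_csv(csv)

print(df.dtypes)
# zip_code      int64    <- leading zero gone: "00501" is now the number 501
# count       float64    <- integer column silently became float because of the blank

df.to_csv("out.csv", index=False)  # writes 501 and 3.0, not what you started with
```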
Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.
I've been testing Polars and DuckDB recently. Polars is an excellent dataframe package: extremely memory-efficient and fast. I've experienced some issues with hive partitioning of a large S3 dataset, issues which DuckDB doesn't have. I've scanned multi-TB S3 Parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other Parquet datasets just as large, etc. Very impressive software.
I haven't done the same types of things in Polars yet (simple selects).
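For a flavour of the kind of query I mean (the bucket, paths, and columns are made up, and you'd need S3 access configured, e.g. the httpfs extension and credentials):

```python
import duckdb

con = duckdb.connect()
con.sql("""
    SELECT
        user_id,
        event_ts,
        SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
    FROM read_parquet('s3://my-bucket/events/**/*.parquet', hive_partitioning = true)
    ORDER BY user_id, event_ts
    LIMIT 100
""").show()
```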
I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.
Same with xarray datasets.
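For example, just reading one cell gives you at least four overlapping options (example data made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob"], "score": [91, 78]}, index=["r1", "r2"])

df.loc["r2", "score"]   # label-based
df.iloc[1, 1]           # position-based
df.at["r2", "score"]    # "fast" scalar access by label
df.iat[1, 1]            # "fast" scalar access by position
```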
I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.
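Roughly what that comparison looked like (file name made up): load the same file in both and look at what each library inferred.

```python
import pandas as pd
import polars as pl

pd_df = pd.read_csv("data.csv")
pl_df = pl.read_csv("data.csv")

print(pd_df.dtypes)   # what pandas inferred for each column
print(pl_df.schema)   # what Polars inferred for each column
```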
To me, Polars feels like almost exactly how I would want to redesign the Pandas interfaces for small- to medium-sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi-index nonsense, throw out numpy and handle types properly, memory-map Arrow, focus on a method-chaining interface, do standard stuff like groupby and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.
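A rough sketch of that style, assuming a recent Polars and made-up file and column names:

```python
import polars as pl

result = (
    pl.scan_csv("sales.csv")                  # lazy: nothing is read yet
    .with_columns(
        (pl.col("revenue") / pl.col("revenue").sum().over("region"))
        .alias("share_of_region")             # window function via .over()
    )
    .group_by("region")
    .agg(
        pl.col("revenue").sum().alias("total"),
        pl.col("share_of_region").max().alias("top_share"),
    )
    .sort("total", descending=True)
    .collect()                                # the query optimizer runs the whole plan here
)
print(result)
```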
To be fair, Polars has the benefit of hindsight and of designing its interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for Python feels like it was mostly about timing the market rather than having the best software product.
The whole thing, though, is just shockingly unhelpful and half-baked for something that has taken over so completely in its niche. I guess my perspective is that of someone trying to build reliable automation, though; it's probably really nice if you're just noodlin' in notebooks or the repl or whatever.
Multi-indexes bake my data into the horrible pandas object model as weird tuples. Every gripe I've had with pandas starts with trying to do something “simple”, then following the pandas object model to achieve it and over-complicating things. Polars is awesome; it fits my numpy understanding of dataframes as dicts of labeled arrays. I even like the immutability aspect.
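Something like this is the mental model I mean (data made up):

```python
import polars as pl

# A dataframe as a dict of labeled arrays.
df = pl.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.0, 21.5]})

# Operations hand back a new frame; the original stays untouched.
warmer = df.with_columns((pl.col("temp_c") + 1.0).alias("temp_c_plus_1"))
print(df.columns)      # ['city', 'temp_c']
print(warmer.columns)  # ['city', 'temp_c', 'temp_c_plus_1']
```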
Accessing an individual cell value is slightly clunky, but that's not really what you use a DataFrame for. A DataFrame is an object for when the entirety of the dataset is under study, where you are typically interested in the broad distributions contained within the data.
After you have highlighted trends (the majority of the work), then you might go spelunking at individual examples to see why something is funny.
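The workflow looks roughly like this (file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")

# Whole-dataset view first: distributions and group-level trends.
print(df.describe())
print(df.groupby("sensor")["value"].mean())

# Only then go spelunking at the individual rows that look funny.
outliers = df[df["value"] > df["value"].quantile(0.99)]
print(outliers.head())
```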
> There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so it can't be viewed or edited with standard tools, which is a pain.
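For what it's worth, the trade-off described there looks roughly like this (file name made up): Parquet is compact and keeps column types, but you have to go through a reader to look at it, unlike a plain CSV.

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})
df.write_parquet("example.parquet")        # compressed, columnar, typed
print(pl.read_parquet("example.parquet"))  # can't just open it in a text editor
```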
But it is such a shame that us plebes can't learn about it to maybe find the hammer to this nail that is sticking up in our projects because the product can't tell people what it is. And it's only made worse by arrogant responses like yours.
It is a data structure used by data scientists. Polars fills the same role as the previous go-to library (Pandas), but faster. If you don't analyze data, you can ignore it.
Guess that would help if I already knew what a dataframe is!