I came to this idea because I wanted the ergonomics and efficiency of Pandas DataFrames, but realised they make for messy production code that's hard to maintain. This package provides the best of both worlds.
Description:
A Python dataclass container with multi-indexing and bulk operations. Provides the typed benefits and ergonomics of dataclasses while having the efficiency of Pandas dataframes.
The container follows data-oriented design, optimising the memory layout of the stored data to provide fast bulk operations and a smaller memory footprint for large collections. Bulk operations are backed by Pandas, which has a rich set of vectorised methods for both numerical and string data types.
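For readers unfamiliar with data-oriented design, here is a minimal sketch of the columnar idea the description hints at. The `Container` class and its API are hypothetical, not this package's actual interface: the point is just that each dataclass field becomes one pandas Series, so a bulk operation over a field works on contiguous, homogeneous data and is vectorised.

```python
from dataclasses import dataclass, fields
import pandas as pd

@dataclass
class Point:
    x: float
    y: float
    label: str

class Container:
    """Hypothetical columnar container: one pandas Series per dataclass field."""

    def __init__(self, record_type, records):
        # Store field values column-wise instead of as a list of objects.
        self._columns = {
            f.name: pd.Series([getattr(r, f.name) for r in records])
            for f in fields(record_type)
        }

    def __getattr__(self, name):
        # Bulk access: returns the whole column as a pandas Series.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None

pts = Container(Point, [Point(1.0, 2.0, "a"), Point(3.0, 4.0, "b")])
# One vectorised operation touches every record at once:
shifted = pts.x + 10  # Series([11.0, 13.0])
```

Compared with a plain `list[Point]`, the same shift would need a Python-level loop over objects; the columnar layout is what lets Pandas do the work in one vectorised call.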
mmm, i wouldn't say that. arrow is really meant as a lower-level utility with usability abstractions layered on top. for low-level tasks, like serializing/deserializing, arrow is amazing, but if you want to actually explore or transform data, arrow's not really usable (based on about a day of trying).
this, on the other hand, looks like something i might use to actually play with data and allow others to follow along.
One of the reasons this is interesting to me is that we quickly hit one of MLflow's limitations[0] in our machine learning platform[1]. Users collaborate in near real-time on notebooks, schedule long-running notebooks, and we automatically detect users' models and save them so they don't have to. They can then deploy them in one click. However, MLflow has trouble with models requiring high-dimensional inputs, which is most non-toy models I've seen.
The usual "solution" is to write custom wrapping code, because MLflow only supports 2D DataFrames. That would mean our users writing the wrappers themselves, which is unacceptable for us, so we take care of this step too.
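The wrapping code in question typically just restores the tensor shape from the flattened 2D frame. A hedged sketch of the pattern (`ReshapeWrapper` is illustrative, shaped like an `mlflow.pyfunc.PythonModel`'s `predict`, but with mlflow itself left out so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

class ReshapeWrapper:
    """Hypothetical wrapper: accepts the 2D DataFrame MLflow hands over
    and restores the model's true per-sample input shape before predicting."""

    def __init__(self, model, sample_shape):
        self.model = model
        self.sample_shape = sample_shape  # e.g. (28, 28) for an image model

    def predict(self, model_input: pd.DataFrame):
        # Each DataFrame row is one flattened sample; rebuild (N, *sample_shape).
        arr = model_input.to_numpy().reshape(-1, *self.sample_shape)
        return self.model(arr)

# Toy "model" that sums over the spatial dimensions of each sample.
model = lambda x: x.sum(axis=(1, 2))
wrapper = ReshapeWrapper(model, (2, 2))

df = pd.DataFrame(np.arange(8).reshape(2, 4))  # two flattened 2x2 samples
out = wrapper.predict(df)  # array([ 6, 22])
```

The point of the comment stands: this boilerplate is mechanical, so a platform can generate it once rather than pushing it onto every user.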
I don't generally think that dataframes as a data interchange format are a good idea if you're looking to build long-term maintainable code. Their structure is too mutable – in my experience, they lead to kitchen sink APIs where the easiest way to add a feature is to shove more columns into the dataframe.
Excellent. I've been adding lots of dataclasses to a project that has some Pandas parts. I love the additional productivity from pyright type checking, and I found the type ambiguity of DataFrames annoying.
I don't suppose there's type checking on indexing via fields... i.e. df.at(field_value) type-checks no matter what type the first field of the record class is.