Show HN: A Python container for dataclasses with multi-indexing and vector opps (github.com/joshlk)
122 points by joshlk on Nov 2, 2020 | hide | past | favorite | 14 comments



I came to the idea because I wanted the ergonomics and efficiency of Pandas DataFrames, but realised they make for messy production code that's hard to maintain. This package aims to provide the best of both worlds.

Description: A Python dataclass container with multi-indexing and bulk operations. Provides the typed benefits and ergonomics of dataclasses while having the efficiency of Pandas dataframes.

The container is based on data-oriented design by optimising the memory layout of the stored data, providing fast bulk operations and a smaller memory footprint for large collections. Bulk operations are enabled using Pandas which has a rich set of vectorised methods for both numerical and string data types.
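The idea can be sketched in a few lines. This is a minimal, illustrative column store (the class and method names here are hypothetical, not the package's actual API): each dataclass field becomes one Pandas Series, so the data is laid out column-wise and bulk operations vectorise.

```python
from dataclasses import dataclass, fields
import pandas as pd

@dataclass
class Point:
    x: float
    y: float

class ColumnStore:
    """Illustrative column-oriented container (not the package's API)."""
    def __init__(self, record_type, records):
        self._type = record_type
        # One Series per field: contiguous storage gives a smaller
        # footprint and enables vectorised operations
        self._cols = {
            f.name: pd.Series([getattr(r, f.name) for r in records])
            for f in fields(record_type)
        }

    def __getitem__(self, i):
        # Reassemble a typed dataclass on element access
        return self._type(**{name: col.iloc[i] for name, col in self._cols.items()})

    def col(self, name):
        return self._cols[name]

store = ColumnStore(Point, [Point(1.0, 2.0), Point(3.0, 4.0)])
norms = (store.col("x") ** 2 + store.col("y") ** 2) ** 0.5  # vectorised over all rows
```

Element access still hands back a plain typed dataclass, while the math runs over whole columns at Pandas speed.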


Looks interesting; thanks for writing this!

Does it support nested structs? Sometimes that's more natural than a flat struct.


Yes - you can use nested dataclasses, but you will lose some of the nice features.


Meta: that's a very random backslash in the title, very confusing in my opinion.

It's just an abbreviated "with" so should be a forward slash ("w/") or, better imo, written out in full.

EDIT: And also "opps" isn't a very good way to abbreviate "operations", is it?


This appears to be encroaching on Apache Arrow's territory (which notably is headed by the creator of Pandas): https://arrow.apache.org/


mmm, i wouldn't say that. arrow is really meant as a lower-level utility with usability abstractions layered on top. for low-level tasks, like serializing/deserializing, arrow is amazing, but if you want to actually explore or transform data, arrow's not really usable (based on about a day of trying).

this, on the other hand, looks like something i might use to actually play with data and allow others to follow along.


One of the reasons this is interesting to me is that we quickly hit one of MLflow's limitations[0] in our machine learning platform[1]. Users collaborate in near real-time on notebooks, schedule long-running notebooks, and we automatically detect users' models and save them so they don't have to. They can then deploy them in one click. However, MLflow has trouble with models requiring high-dimensional inputs, which is most non-toy models I've seen.

The usual "solution" is to write custom wrapping code, because MLflow only supports 2D DataFrames. That's unacceptable for us because it would mean users have to do it themselves, so we'll take care of this too.

- [0]: https://github.com/mlflow/mlflow/issues/3570

- [1]: https://iko.ai


I don't generally think that dataframes as a data interchange format are a good idea if you're looking to build long-term maintainable code. Their structure is too mutable – in my experience, they lead to kitchen sink APIs where the easiest way to add a feature is to shove more columns into the dataframe.
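The "kitchen sink" failure mode is easy to demonstrate. A hedged sketch (hypothetical `Score` type, not from any library): nothing stops a DataFrame growing ad-hoc columns, whereas a frozen dataclass pins the schema down.

```python
from dataclasses import dataclass
import pandas as pd

# DataFrame as interchange format: the schema is silently mutable
df = pd.DataFrame({"user_id": [1], "score": [0.9]})
df["debug_blob"] = ["..."]  # nothing stops ad-hoc columns creeping in

# Frozen dataclass as interchange format: the schema is fixed
@dataclass(frozen=True)
class Score:
    user_id: int
    score: float

rec = Score(user_id=1, score=0.9)
# rec.debug_blob = "..."  # raises FrozenInstanceError: schema stays fixed
```

Adding a field to `Score` forces an explicit, reviewable change to the type, instead of a column quietly appearing somewhere upstream.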


This is what this package aims to prevent: providing rigidity to the data structure while giving the performance and API benefits of Pandas.


Excellent, I've been adding lots of dataclasses to a project that has some pandas parts. Love the additional productivity from pyright type checking and found the type ambiguity of Dataframes annoying.


This is nice!

I just implemented a pandas-like api for one of my projects, but I wound up building it on top of sortedcontainers: https://paramtools.dev/api/viewing-data.html

I almost swapped over to using Pandas Series like what you did, but I went with something that (I think) is lighter weight.

It's cool to see what an alternative implementation might have looked like!


Very cool, seems like something I wanted to implement for my life dashboard (health/sleep/exercise/etc) https://github.com/karlicoss/dashboard

I'm heavily relying on pandas frames for manipulating the data, and have been thinking of ways to make it type safe[r], will give your library a try!


I don't suppose there's type checking on indexing via fields... i.e., df.at(field_value) type-checks no matter what type the first field of the record class is.


This wasn't implemented, as I think it would be very hard, if not impossible, to do in Python.
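The difficulty is sketched below (hypothetical `Frame`/`at` names, not the package's API): a container can be generic over the record type `T`, but Python's type system has no way to express "the key type of `at()` is the type of `T`'s first field", so the key parameter ends up effectively untyped.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class User:
    name: str   # first field, used as the lookup key
    age: int

class Frame(Generic[T]):
    """Illustrative generic container: T is typed, but the key type of
    at() cannot be tied to T's first field in Python's type system."""
    def __init__(self, records: List[T], key_field: str):
        self._by_key = {getattr(r, key_field): r for r in records}

    def at(self, key: object) -> T:  # key must stay loosely typed
        return self._by_key[key]

frame = Frame([User("ada", 36)], key_field="name")
user = frame.at("ada")   # returns a typed User
# frame.at(123) also type-checks fine, but would raise KeyError at runtime
```

The return side works nicely (checkers infer `User`), but the key side would need something like per-field dependent typing, which Python doesn't have.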



