However, Prosto allows for data processing via column operations over many tables (implemented as pandas data frames) by providing column-oriented equivalents of join and groupby (hence it has no joins and no groupbys, which are known to be quite difficult and to require high expertise).
Prosto also provides Column-SQL, which might be simpler and more natural in many use cases.
The whole approach is based on the concept-oriented model of data, which makes functions first-class elements of the model, as opposed to having only sets as in the relational model.
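To give a concrete sense of the column-oriented idea, here is a plain-pandas sketch (not Prosto's actual API): a "link" column stands in for a join, and an "aggregate" column stands in for a groupby.

```python
# Plain-pandas sketch of the column-oriented idea (not Prosto's actual API):
# rather than a join or groupby producing a new table, new columns are added
# to the existing tables.
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer": ["a", "a", "b"],
                       "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"customer": ["a", "b"], "region": ["EU", "US"]})

# "Link" column: each order row picks up a customer attribute (join-like).
orders["region"] = orders["customer"].map(customers.set_index("customer")["region"])

# "Aggregate" column: each customer row gets a total over its orders (groupby-like).
customers["total_amount"] = customers["customer"].map(
    orders.groupby("customer")["amount"].sum())

print(orders)
print(customers)
```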
Yes, I can see how you could make that leap. Right now Hamilton isn't about scaling computation per se; it's more about scaling a code base that handles featurization, and the number of people using that code base.
Hamilton wouldn't replace Dask. Instead, Hamilton could use Dask; we even have a prototype PR to do so.
I haven't evaluated how Hamilton is implemented specifically but:
* It's solving a really different problem than Spark/Dask/etc. It could definitely use those tools but it's just not the same thing.
* If you're looking at this and thinking it's useless, even if you're familiar with Pandas/dataframes, it's probably just that you haven't had to work on the types of problems that this particular tool is intended to help with.
I should add I really like the general idea of what you're trying to do. I actually worked on some similar ideas for similar reasons (I think) a few years back.
It was in Matlab* though and not as slick as what you've created.
Is the idea with Hamilton to limit it specifically to columnar operations? e.g. no functions that filter the DataFrame? I think that's one thing that I attempted to do that ended up overcomplicating the framework I created.
So it's like Spark for pandas? Seems like it might be better to just use Spark and, if there are features missing, build a framework on top to add them in; that way you get a giant distributed processing engine for free. Would be interested to know if that was a consideration or not.
Hamilton seems to be a framework for expressing all of your feature extractions and transformations in such a way that they can be encapsulated and organized into functions and modules. These functions can then be composed into a set of transformation pipeline instructions. Hamilton can then inspect this composition of functions and understand all of the inter-dependencies, effectively constructing a DAG that can be used to visualize exactly what the pipeline is doing, while also providing some feature/column lineage. This DAG can also be used to figure out which functionality can be pruned from your modules if it is never used. Unit tests can be written easily.
There is nothing within Hamilton that dictates which data-processing engine should be used to do the work. In fact, they mention in their blog that Spark could indeed be used. But I imagine anyone could come along and write an integration for Pandas (if it doesn't already exist), BigQuery, Snowflake, Materialize, etc.
This creates an abstraction between what you're trying to do and the data processing engine used to do it. Not to mention being able to create some fantastic visualizations (I assume) of what is actually going on in your transformation or feature-extraction pipeline.
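As a rough sketch of the mechanism being described (dependencies inferred from parameter names, then resolved as a DAG), here is a plain-Python toy. This is not Hamilton's actual implementation, and the column names are made up; it just shows how far introspection alone gets you.

```python
# Toy version of the idea: each function's parameter names name the things it
# depends on, so a DAG can be built and walked with nothing but introspection.
import inspect
import pandas as pd

def spend_mean(spend: pd.Series) -> float:
    return spend.mean()

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    return spend - spend_mean

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups

FUNCS = {f.__name__: f for f in (spend_mean, spend_zero_mean, spend_per_signup)}

def resolve(name, inputs, cache):
    """Compute `name` by recursively resolving the parameters it asks for."""
    if name in cache:
        return cache[name]
    if name in inputs:                 # a raw input column
        return inputs[name]
    fn = FUNCS[name]
    kwargs = {p: resolve(p, inputs, cache)
              for p in inspect.signature(fn).parameters}
    cache[name] = fn(**kwargs)
    return cache[name]

inputs = {"spend": pd.Series([10.0, 10.0, 20.0]),
          "signups": pd.Series([1, 2, 4])}
print(resolve("spend_zero_mean", inputs, {}))   # walks spend_mean -> spend for you
```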
> These functions can then be composed into a set of transformation pipeline instructions. Hamilton can then inspect this composition of functions and understand all of the inter-dependencies, effectively constructing a DAG that can be used to visualize exactly what the pipeline is doing
It isn't immediately obvious, but that is exactly what Spark does - it builds a DAG out of your operations and only lazily executes what it has to when asked for a result. But on top of that, a tremendous amount of work has gone into auto-optimising that DAG execution, so it can be very smart while allowing relatively natural expression of queries and transformations at the top level. This is all going to have to be re-invented if you try and replace the DAG front end and just use Spark's execution back end.
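To make the lazy-DAG point concrete, a minimal PySpark example on toy data:

```python
# Transformations only build up a plan; Catalyst optimizes it, and nothing runs
# until an action (collect, count, write, ...) is called.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("x", F.rand(seed=42))
result = df.filter(F.col("x") > 0.5).groupBy().avg("x")   # still lazy: no work yet

result.explain()          # inspect the optimized physical plan
print(result.collect())   # the action: only now does Spark execute the DAG
```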
> This is all going to have to be re-invented if you try and replace the DAG front end and just use Spark's execution back end.
Again, I don't know enough about Hamilton, but I don't think it's a matter of replacing anything Spark does, or any other data execution runtime for that matter. It's just another layer, another abstraction.
Hamilton -> Spark SQL/API -> Spark artifact -> Spark execution
Hamilton -> Snowflake SQL -> Snowflake execution
etc...
Could someone please help me understand what a "dataframe" is? I see this term thrown around occasionally, but failed to find a definition/explanation for someone who doesn't actually already know what it is :(
> Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
It’s a special data structure that holds a 2 dimensional array and allows you to perform very efficient “vector” (all rows/all columns) operations on it.
It’s used a lot in data science applications, particularly for matrix math.
It's a table. Think of a table from a database or a 2-dimensional csv file that exists as an object in your programming language of choice. The exact implementation differs between languages and libraries. In Python the most popular one is the DataFrame class from the pandas library, R has data.frame built in, Julia has DataFrames.jl, and there are other implementations too - Spark, Dask, data.table, tibble, etc.
The easiest way for this to click is to open up Python or R and play around with a dataframe; it really isn't that complicated.
A dataframe is usually a list of columns. Columns are usually typed arrays that have a name. It's a bit like a struct of arrays (https://en.wikipedia.org/wiki/AoS_and_SoA) except that you can usually easily transform the dataframe (add columns, split columns, stuff like that).
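For instance, a quick pandas session (toy data) showing the "named, typed columns" picture described above:

```python
# A dataframe as a set of named, typed columns over shared row labels.
import pandas as pd

df = pd.DataFrame({
    "name": ["ada", "bob", "cleo"],      # string column
    "age": [36, 41, 29],                 # integer column
    "height_m": [1.70, 1.82, 1.65],      # float column
})

df["age_months"] = df["age"] * 12        # add a derived column (vectorized)
print(df.dtypes)                         # each column keeps its own type
print(df[df["height_m"] > 1.68])         # filter rows, much like a WHERE clause
```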
This reminds me a bit of a Clojure library called Plumbing (formerly Graph): https://github.com/plumatic/plumbing. It also lets you create a DAG for structured computation. At the time, it was used for a web service.
My pynto https://github.com/punkbrwstr/pynto is a similar framework for creating dataframes, but using a concatenative paradigm that treats the frame as a stack of columns. Functions ("words") operate on the stack to set up the graph for each column, and execution happens afterwards in parallel. Instead of function modifiers like @does it uses combinators to apply quoted operations to multiple columns. The postfix syntax (think postscript or factor) is unambiguous, if a bit old-school.
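To give a flavour of the stack-of-columns idea, here is a toy in plain Python (this is an illustration of the paradigm, not pynto's actual API):

```python
# Toy concatenative pipeline: "words" manipulate a stack of deferred columns,
# and nothing is evaluated until the frame is finally materialized.
import pandas as pd

stack = []                                   # the column stack: (name, thunk) pairs

def push(name, fn):                          # push a deferred column
    stack.append((name, fn))

def add(new_name):                           # word: pop two columns, push their sum
    (n2, f2), (n1, f1) = stack.pop(), stack.pop()
    stack.append((new_name, lambda: f1() + f2()))

push("a", lambda: pd.Series([1, 2, 3]))
push("b", lambda: pd.Series([10, 20, 30]))
add("a_plus_b")                              # postfix: operands first, then the word

df = pd.DataFrame({name: fn() for name, fn in stack})   # evaluation happens here
print(df)
```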
I'm curious what the average sentiment from the Stitch data team is on this. I see the small marginal utility of this, but it would be a massive pain to implement. Imagine going back through thousands of lines of transforms and adding the framework. Say you make a couple of small mistakes somewhere because the framework is new to you. Things seem fine at first, but weeks go by and something seems off. How do you find those mistakes?
Newly hired data scientists would have a "wtf is this thing?" response. You'd really need to "sell" people on this and it doesn't seem worth it.
Speaking as an author (so, some bias), you certainly make an interesting point about the cost of adoption. It's a high barrier to entry (migration requires a rewrite), but in our experience it actually is worth it.
So yeah, not for every problem (no framework is), but the problems that it solved, it solved quite adeptly, and it was worth the cost (both initial and ongoing) of migration. It particularly proved its worth in the case where pipelines have grown in scale and complexity over time and thus gotten out of control. Particularly when there are thousands of lines of transforms -- that's when the ongoing maintenance burden gets so high that it's completely worth the cost of migration. "Weeks going by and something seems off" we found often described the day-to-day of teams that manage codebases like that -- requiring them to spend tons of time working on the software engineering aspects of testing/visibility rather than developing new features to help the business.
Re: new data scientists, you'd be surprised at how adept they are at picking things up. There are a million frameworks/paradigms to learn when you join a company, so this isn't really an additional burden. And, once they learn that "one simple trick" (that parameter names map to the producing function), it ends up being pretty easy to use.
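To show what that trick looks like in practice (the column names here are hypothetical), the payoff is that every transform is a plain function, so unit testing needs no framework at all:

```python
# Each output column is an ordinary function; its parameter names are the
# columns (or upstream functions) it depends on.
import pandas as pd

def signup_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    """Fraction of visits that converted into signups."""
    return signups / visits

def test_signup_rate():
    signups = pd.Series([1, 2])
    visits = pd.Series([4, 4])
    pd.testing.assert_series_equal(signup_rate(signups, visits),
                                   pd.Series([0.25, 0.5]))
```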
This is neat for toy problems but I don't see it working well for "real" pipelines. The magical DAG creation is going to be super hard to wrap your head around and even worse to debug.
This reminds me of an internal Google tool for doing async programming in Java (ProducerGraph or something). The idea was that you'd just write annotated functions and the framework would handle all the async stuff. Wasted many thousands of engineering hours while giving an even worse experience than manipulating futures directly.
One of the authors here — I think that systems that produce things magically often run into this burden. That doesn’t mean that the magic is bad, per se, rather that the thought put into operationalizing it (the user experience) is often << that put into executing it (the engine).
In our case there’s very little magic, and a dead simple engine (so far). We (from a DS perspective) have found that expressing transforms in this manner simplifies the code making it well worth the added layer of abstraction. As a platform team we’ve focused on debugging (viz, etc…), but this is part of the reason we open sourced it! The more voices we get the more operational concerns we can iron out to avoid painful debugging experiences.
A lot of these libraries also suffer from really ugly syntax, vs. e.g. my library, which overloads operators and attempts to "transparently" plug into Python's native syntactic sugar: https://github.com/timkpaine/tributary
Looks like an interesting (& powerful) library - what problem is it trying to solve exactly? That wasn't clear from the README (the greek's library doesn't shed any more light on things either).
I could be mistaken, but I think it's aimed at domains where you want a reactive experience similar to what you might get if you build a complicated Excel spreadsheet.
In a spreadsheet you're basically building a DAG (assuming you're using it right) that automatically and efficiently recalculates the downstream nodes whenever you change any of the input nodes.
I think it's actually a pretty hard experience to recreate in many programming languages. I think in Python the thing that comes closest these days is Streamlit, but it can still be a lot slower to put something together than it is for a really fast Excel jockey.
Suppose you had a HamiltonFrame that knew the graph that generated it and knew which columns were inputs and which were outputs. When you update any of the input values it could automatically recalculate any columns downstream of the modified inputs.
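A toy sketch of that spreadsheet-style behaviour ("HamiltonFrame" above is hypothetical, and so is this class):

```python
# Recompute derived columns whenever an input changes. A real implementation
# would walk the DAG and recompute only the affected downstream nodes;
# this toy just recomputes everything.
class ReactiveFrame:
    def __init__(self, inputs, formulas):
        self.inputs = dict(inputs)          # name -> raw input value
        self.formulas = dict(formulas)      # name -> function of the inputs
        self._recalculate()

    def _recalculate(self):
        self.outputs = {name: fn(**self.inputs) for name, fn in self.formulas.items()}

    def set(self, name, value):
        self.inputs[name] = value
        self._recalculate()                 # dependents update automatically

frame = ReactiveFrame(
    inputs={"price": 100.0, "tax_rate": 0.25},
    formulas={"tax": lambda price, tax_rate: price * tax_rate,
              "total": lambda price, tax_rate: price * (1 + tax_rate)},
)
print(frame.outputs)        # {'tax': 25.0, 'total': 125.0}
frame.set("tax_rate", 0.5)  # change an input...
print(frame.outputs)        # ...and the derived values recompute: 50.0 and 150.0
```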
I think the README needs something to explain what you get when you write these functions. Apparently Hamilton then creates a DAG... OK, and what does that do for me?
I think the point is that it doesn't do anything itself (no data processing). Instead, it's a way to create a set of directives about how to do something, and at the same time understand the inter-dependencies of those directives.
The blog post gets into the problems created by doing "something simple".
Thanks, the more I understand this, the less I like it.
I'm still not seeing the usefulness but I see a lot of potential problems and complications from redesigning your codebase like this. I work with pandas dataframes all the time. It really looks like they're forcing a functional paradigm in a place where it doesn't fit well.
You'll have to excuse my ignorance about Pandas...
But I believe what this would enable you to do is continue writing your data processing pipelines and then have them execute in Pandas (assuming there is a Pandas implementation) while you're working locally.
When it comes time to deploy your transformations to a production environment that uses Spark, Snowflake, etc... for its data processing engine (assuming integrations exist), it should work just the same way.
Keep in mind that these are instructions that are being deployed. The power here is that the instructions are pushed down into the data processing engine. Data is not exhumed into the dataframe runtime! Hamilton functions can be "compiled" into data-engine runtime code (pandas, Spark SQL/DataFrames, SQL, etc.).
Then on top of all of this, you (and more importantly your teammates and/or replacements) will have a better visualization of what your pipeline is doing. You'll also have a better idea of what unused functionality you can prune out of your code base. Better unit testing, etc.