It provides a Python DAG building library like Airflow, but doesn't do Airflow's 'Operator ecosystem' thing.
It's also very opinionated about dependency management (Conda-only) and is Python-only, whereas Airflow, I think, has operators to run arbitrary containers. So Metaflow is a non-starter, I think, if you don't want to use Python exclusively.
Airflow also ships with built-in scheduler support (Celery?) or can run on K8s. Metaflow doesn't have this. Seems to rely on AWS Batch for production DAG execution.
Airflow ships with a pretty rich UI. Metaflow seems to be anti-UI, and provides a novel Notebook-oriented workflow interaction model.
Metaflow has pretty nice code artifact + params snapshotting functionality, which is a core selling point. Airflow doesn't support this as well, so reproducibility is harder (I think). This is encapsulated by Metaflow's "Datastore" model, which can persist flow code, config, and data either locally or in S3.
Metaflow does come bundled with a scheduler that can place jobs on a variety of compute platforms (the current release supports local on-instance execution and AWS Batch). In terms of dependencies, we went with conda because of its traction in the data science community as well as its excellent support for system packages. Our execution model also supports arbitrary Docker containers (on AWS Batch), where you can theoretically bake in your own dependencies. In terms of language support, we have bindings for R internally, which we plan to open source as well.
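As a rough sketch of how a step can opt into both per-step dependencies and Batch execution (the library version and resource numbers below are just placeholders, and the conda environment typically also needs to be enabled when running the flow):

```python
from metaflow import FlowSpec, step, batch, conda

class TrainFlow(FlowSpec):

    # Pin this step's dependencies with conda and run it on AWS Batch.
    # The library version and resource requests are placeholders.
    @conda(libraries={"scikit-learn": "0.21.3"})
    @batch(cpu=4, memory=16000)
    @step
    def start(self):
        from sklearn.linear_model import LogisticRegression
        self.model = LogisticRegression()
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```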
I wouldn’t qualify metaflow as anti-UI. For model monitoring, we haven’t found a good enough UI that can handle the diversity of models and use cases we see internally, and believe that notebooks are an excellent visualisation medium that gives the power to the end user (data scientists) to craft dashboards as they see fit. For tracking the execution of production runs, we have historically relied on the UI of the scheduler itself (meson). We are exploring what a metaflow-specific UI might look like.
As for comparisons with Airflow, it is an excellent production grade scheduler. Metaflow intends to solve a different problem of providing an excellent development and deployment experience for ML pipelines.
Thanks for open sourcing this! This seems like precisely the kind of tool I've been looking for. One thing that would be really great is support for HPC schedulers like SLURM. It seems relatively straightforward to add, so I might give it a shot myself.
How is this different from / better than existing tools or workflows? I don't like to criticise new frameworks / tools without first understanding them, but I like to know what some key differences are, without the marketing/PR fluff, before giving one a go.
- Metaflow snapshots your code, data, and dependencies automatically in a content-addressed datastore, which is typically backed by S3, although a local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about the workflow, e.g. in a notebook (see the short client-API sketch after this list). This is a core feature of Metaflow.
- Metaflow is designed to work well with a cloud backend. We support AWS today, but technically other clouds could be supported too. Quite a bit of engineering has gone into building this integration. For instance, using Metaflow's built-in S3 client, you can pull data at over 10Gbps, which is more than you can easily get with e.g. the aws CLI today.
- We have spent time and effort in keeping the API surface area clean and highly usable. YMMV, but it has been an appealing feature to many users thus far.
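On the first point, inspecting a past run and its artifacts from a notebook looks roughly like this (a minimal sketch; the flow and artifact names are hypothetical):

```python
from metaflow import Flow

# Grab the latest successful run of a (hypothetical) flow.
run = Flow("MovieStatsFlow").latest_successful_run
print(run.id, run.finished_at)

# Anything assigned to `self.<name>` in a step was snapshotted
# and can be read back later.
genre_stats = run.data.genre_stats  # hypothetical artifact name
```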
Your first bullet point should be highlighted on your project front page! I read over the page a couple of times, and couldn’t deduce that, and I think it’s a really enticing feature!
Dask is great if you want to distribute your algorithms / data processing at a granular level. Metaflow is a bit more "meta" in the sense that we take your Python function as-is, which may use e.g. TensorFlow or PyTorch, and we execute it as an atomic unit in a container. This is useful, since you can use any existing ML libraries.
Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):
It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by importing dask in the relevant @steps and using it as you would outside of Metaflow, correct?
Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could kinda replicate something a little more akin to the AWS integration?
Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?
To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?
Yes - you should be able to use dask the way you say.
The first part of your understanding matches my expectation too.
Dask's single-box parallelism is achieved through multiprocessing - akin to parallel_map.
And distributed compute is achieved by shipping the work to remote substrates.
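A rough sketch of what that could look like (assuming dask is installed in the step's environment; the input path is made up):

```python
from metaflow import FlowSpec, step

class LocalDaskFlow(FlowSpec):

    @step
    def start(self):
        # Use dask inside the step exactly as you would outside Metaflow.
        import dask.dataframe as dd
        df = dd.read_csv("data/*.csv")  # made-up input path
        self.counts = df.groupby("genre").size().compute()
        self.next(self.end)

    @step
    def end(self):
        print(self.counts)

if __name__ == "__main__":
    LocalDaskFlow()
```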
For your second comment - we leverage pickle mostly to keep things easy and simple for basic types.
For small dataframes, we just pickle for simplicity. For larger dataframes, we rely on users to store the data directly (probably encoded as Parquet) and pickle just the path instead of the whole dataframe.
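So, as a sketch, a large-dataframe step might look like this (the output path is a placeholder and could equally be an S3 URI; writing Parquet assumes pyarrow or fastparquet is available):

```python
import pandas as pd
from metaflow import FlowSpec, step

class BigFrameFlow(FlowSpec):

    @step
    def start(self):
        df = pd.DataFrame({"x": range(1_000_000)})
        # Write the big frame out as Parquet yourself and keep only the path
        # as a small, picklable artifact. The path is a placeholder.
        self.df_path = "/tmp/bigframe.parquet"
        df.to_parquet(self.df_path)
        self.next(self.end)

    @step
    def end(self):
        df = pd.read_parquet(self.df_path)
        print(len(df))

if __name__ == "__main__":
    BigFrameFlow()
```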
Our hope with metaflow is to make the transition to production schedulers like Airflow (and perhaps similar technologies) seamless once you write the DAG via the FlowSpec.
The user doesn’t have to care about the conversion to YAML etc.
So I would say metaflow works in tandem with existing schedulers.
I wouldn't exactly say that. Jupyter notebooks don't have an easy way to represent an arbitrary DAG. The flow is more linear and narrative like.
That said, we do expect metaflow (with the client API) to play very well with notebooks to support a narrative from a business use-case POV, which might be the end goal of most ML workloads (hopefully).
I would like to think of metaflow as your workflow construct - hopefully making your life simpler with ML workloads that involve interactions with existing pieces of infrastructure (infra pieces: storage, compute, notebooks or other UIs, HTTP service hosting, etc.; concepts: collaboration, versioning, archiving, dependency management).
The workflow DAG is a core concept of Metaflow, but Metaflow is not just that. Other core parts are managing and inspecting state, cloud integration, and organizing results.
Could you elaborate, or point me at any reviews of their product. It's closed-source so much harder to learn about without learning from people that have paid for it.
You need more connections than a single AWS CLI process opens in order to saturate the network on a big box. You can achieve the same by doing some "xargs | aws cli" trickery, but then error handling becomes harder.
Our S3 client just handles multiple worker processes correctly with error handling.
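A sketch of how it's used (the bucket prefix below is a placeholder):

```python
from metaflow import S3

# Downloads are spread across multiple worker processes, with errors
# handled internally. The s3root below is a placeholder.
with S3(s3root="s3://my-bucket/training-data/") as s3:
    objs = s3.get_all()                       # fetch every object under s3root
    local_paths = [obj.path for obj in objs]  # temp files, valid inside the with-block
```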
hey, I'm one of the authors of Metaflow. Happy to answer any questions! Netflix has been using Metaflow internally for about two years, so we have many war stories :)
Can you say a little about which niche this would occupy, and what the motivation is? Is it intended to compete with TensorFlow and PyTorch, or to be an industrial-strength version of sklearn?
I looked through the tutorial on my mobile and the answer was not immediately clear.
Is the benefit that it auto scales on AWS without having to think through the infrastructure?
Hi omarhaneef,
We don't intend to compete with TensorFlow, PyTorch, or sklearn. What we offer is a way to iterate on and productionize your models written using any of the aforementioned libraries (and more). https://docs.metaflow.org/introduction/what-is-metaflow contains further elaboration of our philosophy. Auto-scaling infrastructure is one piece of the puzzle, and Metaflow goes beyond that, offering a comprehensive solution for model management.
G'day, seems like a cool tool, thanks - the links to the GitHub tuts are currently broken...
I'm of the opinion that adopting some standard DAG meta-format for data science may make a positive impact on the reproducibility issues we have in science generally. So it's good to see the idea has real-world merit as well.
Hey I meant to track one of y'all down at the MLOps conference, but didn't get the chance. I've built a very shitty version of a cached-execution DAG thing internally, and one of the design decisions I made was to have it so that parent nodes in the DAG don't need to know anything about child nodes. This allows for larger DAG builders to be more easily subclassed.
MetaFlow doesn't do that -- instead each 'step' has to know what to call next, which means that if I wanted to subclass e.g. the MovieStatsFlow in [here](https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) and, say, add some sort of input pre-processing before the compute_statistics call, I'd essentially end up having to either override what compute_statistics does to not match its name _or_ copy-paste that first step just to replace its last line.
I'm sure this design decision was considered and/or that use-case doesn't come up a lot at Netflix (although I've encountered that a lot), or maybe I'm missing something very obvious, but I'd love to hear your thoughts on that.
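For concreteness, here is a stripped-down sketch of the pattern being described (not the actual tutorial code): each step names its successor via self.next, so a subclass that wants to splice in a pre-processing step ends up re-defining the parent step just to change that one call.

```python
from metaflow import FlowSpec, step

class MovieStatsFlow(FlowSpec):

    @step
    def start(self):
        self.movies = ["The Matrix", "Heat"]  # stand-in for loading real data
        # The successor is named inside the step itself, which is what makes it
        # awkward to insert a new step from a subclass without copying this body.
        self.next(self.compute_statistics)

    @step
    def compute_statistics(self):
        self.count = len(self.movies)
        self.next(self.end)

    @step
    def end(self):
        print(self.count)

if __name__ == "__main__":
    MovieStatsFlow()
```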
I have been looking for something exactly like this, but I use GCP not AWS. Is there a way to deploy outside of AWS? What would be involved in getting it to work with a different cloud?
Hi Ville, I cannot express how happy I am that you open-sourced Metaflow. I have three questions:
1) Do you know the ETA of the R API release?
2) Would you recommend using both Metaflow and MLflow in projects? Could you please explain why yes or why no :)
3) Do you plan to release integration with Spark/Yarn?
yep, Metaflow tracks everything automatically: your code, dependencies, and the internal state of the workflow.
A big difference between Metaflow and other workflow frameworks for ML is that Metaflow doesn't only execute your DAG, it helps you to design and implement the code that runs inside the DAG. Many frameworks leave these details to the data scientist to decide.
Thank you for sharing this. Would this be useful for me if I only need the deployment management part? Don't really need to track experiments, just looking for an easy way to deploy my models to Fargate.
No, it is not off-topic in the slightest... especially in response to an "author". The underlying questions remain obvious ones: Is Metaflow not used in Netflix's suggestions? What utility does it offer that other tooling doesn't, or at least how has Netflix extracted value? These are interesting questions.
This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.
1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?
2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.
3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?
4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.
1. Metaflow should help most when there is an element of collaboration - so a small-to-medium team of data scientists. Collaborating with yourself is another scenario where Metaflow can be useful, since it takes care of versioning and archiving various artifacts.
2. Keeping the language pythonic, without any additional need to learn a DSL has definitely been key to Metaflow's adoption internally. That said, this is something we are open to hearing back, esp. with this OSS launch.
3. Yes - I definitely think so. Personally, my favorite part is the local prototyping experience, when everything fits in memory and is blazing fast.
There is also an open issue for fast data access, which you can upvote if you're interested in seeing it open-sourced.
4. We don't think there is an exact equivalent either. :)
re: Kubeflow - imho it is quite coupled to Kubernetes. We don’t intend to be tied to a specific compute substrate even though the first launch is with AWS. We do follow a plugin architecture - so I’m hoping Kube happens sometime.
re: Flyte - I’m less informed on this but happy to educate myself and get back.
A good overview of Flyte can be found here: https://www.youtube.com/watch?v=KdUJGSP1h9U
It does appear to be quite similar, though it has native k8s integration and a central web-based UI for monitoring jobs. Flyte asks the user to turn on caching. I like that Metaflow does that for you by default.
That's true of Kubeflow. I'm not sure that project will be as keen on being "compute substrate" agnostic as Metaflow is, given its connection with Google.
If you feel inclined, jump into the Flyte Slack and share your thoughts :). At my company we're on Kubeflow/Argo now, but things are developing quite a lot in this space, so I'm keen not to be myopic.
Is there a reason to use this over DVC[1] which is language and framework agnostic and supports a large number of storage backends? It works with any git repo and even polyglot implementations and can run the DAG on any system.
Currently using DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (python) dependency management. We are an embedded shop so we don't deploy to the "cloud."
We are good friends with the DVC folks! If the DVC + MLFlow + Anaconda stack works for you, that's great. Metaflow provides similar features. The cloud integration is really important at Netflix's scale.
My team has a similar library called Loman, which we open-sourced. Instead of nodes representing tasks, they represent data, and the library keeps track of which nodes are up-to-date or stale as you provide new inputs or change how nodes are computed. Each node is either an input node with a provided value, or a computed node with a function to calculate its value. Think of it as a grown-up Excel calculation tree. We've found it quite useful for quant research, and in production it works nicely because you can serialize the entire computation graph, which gives you an easy way to diagnose what failed and why across hundreds of interdependent computations. It's also useful for real-time displays, where you can bind market and UI inputs to nodes and calculated nodes back to the UI - some things you want to recalculate frequently, whereas some are slow and need to happen infrequently in the background.
I am disappointed that when I click on documentation, "why metaflow," I get a bunch of cartoony BS instead of a simple text explanation. Glad these folks don't write RFCs.
Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.
We are on Azure, using Spark via Databricks. We had to abandon scikit-learn because of this choice. Does your service require AWS, and can it be used in conjunction with Spark? Thank you for your time and consideration.
btw, if you happen to be at AWS Reinvent right now, you can get a stylish, collector's edition Metaflow t-shirt if you drop by at the Netflix booth at the expo hall and/or ping us otherwise!
The fact that metaflow works directly in Python piques my interest. I can lint it, I can test it, I can format it, I can easily extend it.
I've been hesitant to commit myself and my collaborators to yet another DSL -- and that's part of why I haven't found much to like in snakemake and nextflow.
Can anybody provide a good comparison e.g. with Meltano?
I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.
I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, it's to not bet on one tool but to keep the whole stack flexible.
I am still missing well-established standards for data formats, workflow definitions, and project descriptions - hopefully open-source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "Autocad" or "Word" file format for data science, but I see no clear winner atm - hopefully my sight is bad, so please enlighten me!
Seems like a cool addition to the DAG ML tooling family. Thanks for sharing! Do you support, or plan to support, features commonly found in data science platform tools like Domino (https://www.dominodatalab.com/)? I'm thinking of container management, automatic publishing of web apps and API endpoints, providing a search for artifacts like code or projects, etc.
- Automatic publishing of web apps: we have this internally but it is not open-source yet. If it interests you, react to this issue (https://github.com/Netflix/metaflow/issues/3)
Let us know if you notice any other interesting features missing! Feel free to open a GitHub issue or reach out to us on http://chat.metaflow.org
Metaflow comes with a built-in scheduler. If your company has an existing scheduler, e.g. for ETL, you can translate Metaflow DAGs to the production scheduler automatically. This is what we do at Netflix.
We could provide similar support e.g. for Airflow, if there's interest.
For monitoring, we have relied on notebooks this far. Their flexibility and customizability are unbeatable. We might build some lightweight discovery / debugging UI later, if there's demand. We are a bit on the fence about it internally.
> For monitoring, we have relied on notebooks this far.
This is a pretty interesting approach. Notebooks are a natural environment for data scientists, and for monitoring they'd provide really good discoverability, flexibility, and interactivity.
Do you use more traditional monitoring that will alert you somehow if a workflow fails, or a workflow hasn't run at all for X hrs/days?
This isn't a scheduler that will trigger 'flows' though is it? If I search "Scheduler" in your docs the top result is a roadmap item, and searching "Cron" turns up nothing.
Is it true that right now that to run a DAG "every day at 3am UTC" requires an external service?
Hey tristanz - we had similar needs and built a lightweight scheduling, monitoring, and data flow UI. Originally for Airflow, but we're excited about metaflow and will be integrating. Shoot me a note if you're interested in seeing what we built.
Very interesting project. I love that this allows you to transparently switch "runtime" from local to cloud, like Spark does, but integrated with common Python tools like sklearn/tf etc. Looking forward to testing metaflow out myself.
I looked over the tutorials and am curious to know whether the tutorials are representative of how Netflix does ML?
Is data really being read in .csv format and processed in memory with pandas?
I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas - shouldn't a simple SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes, and the power of the SQL language?
I would love to take a look at one representative ML pipeline (even with masked names of datasets and features), just to see how "terabytes" of data get processed into a model.
Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4)
After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like scikit-learn or TensorFlow. Many workflows train a suite of models using the foreach construct.
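A minimal sketch of the foreach pattern (the flow name and hyperparameter values are made up; real training code would go where the placeholder computation is):

```python
from metaflow import FlowSpec, step

class TrainSuiteFlow(FlowSpec):

    @step
    def start(self):
        self.alphas = [0.01, 0.1, 1.0]            # made-up hyperparameter grid
        self.next(self.train, foreach="alphas")   # fan out: one branch per value

    @step
    def train(self):
        self.alpha = self.input                   # the value assigned to this branch
        self.score = 1.0 / self.alpha             # stand-in for real model training
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect results from all branches.
        self.results = {inp.alpha: inp.score for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == "__main__":
    TrainSuiteFlow()
```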
Orchestrating a workflow, which is what Dagster does, is just one part of Metaflow. Other important parts are dependency management, cloud integration, state transfer, inspecting and organizing results - features that are central to data science workflows.
Metaflow helps data scientists build and manage data science workflows, not just execute a DAG.
Here's my understanding:
- It's a python library for creating & executing DAGs
- Each node is a processing step & the results are stored after each step, so you can restart failed workflows from where they failed
- Tight integration with AWS ECS to run the whole DAG in the cloud
I don't know why their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open sourcing Metaflow.