
Here are some key features:

- Metaflow snapshots your code, data, and dependencies automatically in a content-addressed datastore, which is typically backed by S3, although a local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about the workflow, e.g. in a notebook (see the sketch after this list). This is a core feature of Metaflow.

- Metaflow is designed to work well with a cloud backend. We support AWS today, but technically other clouds could be supported too. Quite a bit of engineering has gone into building this integration. For instance, using Metaflow's built-in S3 client, you can pull over 10Gbps, which is more than you can easily get with e.g. the aws CLI today.

- We have spent time and effort keeping the API surface area clean and highly usable. YMMV, but it has been an appealing feature to many users thus far.
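
To illustrate the first point, here is a minimal sketch of inspecting a past run from a notebook with the client API; the flow name "TrainingFlow" and the "model_score" artifact are hypothetical:

    # Inspect a past run and its snapshotted artifacts from a notebook.
    from metaflow import Flow

    run = Flow('TrainingFlow').latest_successful_run
    print(run.id, run.finished_at)
    print(run.data.model_score)  # any artifact assigned to self.<name> in a step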

Hope this makes sense!




Your first bullet point should be highlighted on your project front page! I read over the page a couple of times, and couldn’t deduce that, and I think it’s a really enticing feature!


Thanks. Feedback noted.


Can you compare and contrast with tools such as Dask, dask-kubernetes, Prefect [1]?

[1] https://www.prefect.io/products/core


Dask is great if you want to distribute your algorithms / data processing at a granular level. Metaflow is a bit more "meta" in the sense that we take your Python function as-is, which may use e.g. TensorFlow or PyTorch, and we execute it as an atomic unit in a container (see the sketch below). This is useful since you can use any existing ML library.

I am not familiar with Prefect.
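
To make the "atomic unit" point concrete, here is a minimal sketch of a flow where a single step is shipped to AWS Batch as-is via the @batch decorator (drop the decorator to run it locally); the flow name, the resource numbers, and the training body are hypothetical:

    from metaflow import FlowSpec, step, batch

    class TrainFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.train)

        @batch(cpu=4, memory=16000)  # run this step as one unit in a cloud container
        @step
        def train(self):
            # e.g. import torch or tensorflow here and train a model;
            # whatever is assigned to self.* is snapshotted automatically
            self.model_score = 0.9  # hypothetical artifact
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        TrainFlow()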


Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):

It is conceivable to execute a Flow entirely locally yet achieve @step-wise, Batch-like parallelism/distributed computation by, in the relevant @steps, importing dask and using it as you would outside of Metaflow, correct?

Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could kinda replicate something a little more akin to the AWS integration?

Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?

To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?

As an aside: hot damn, the API - https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f... Really cool product. Thanks for all the effort poured into this by you and others.


Yes - you should be able to use dask the way you say.

The first part of your understanding matches my expectation too: Dask's single-box parallelism is achieved with multiprocessing, akin to parallel_map, and distributed compute is achieved by shipping the work to remote substrates.

For your second comment: we leverage pickle mostly to keep things easy and simple for basic types. For small dataframes we just pickle for simplicity. For larger dataframes we rely on users to store the data directly (probably encoded as parquet) and pickle just the path instead of the whole dataframe (see the sketch below).
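
A rough sketch of both points, with hypothetical paths and column names: parallel_map for single-box fan-out, and a parquet file whose path (rather than the dataframe itself) becomes the pickled artifact:

    from metaflow import FlowSpec, step, parallel_map
    import pandas as pd

    def square(x):
        return x * x

    class ETLFlow(FlowSpec):

        @step
        def start(self):
            # single-box parallelism, akin to what Dask does with multiprocessing
            self.squares = parallel_map(square, range(100))
            self.next(self.save)

        @step
        def save(self):
            df = pd.DataFrame({'x2': self.squares})
            path = '/tmp/etlflow_squares.parquet'  # could be an s3:// path instead
            df.to_parquet(path)                    # store the big data directly as parquet
            self.df_path = path                    # only the small path gets pickled
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        ETLFlow()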


Yes, you can use any Python library inside a Metaflow step.
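
For instance, a rough sketch of using dask inside a single @step (the dask-distributed angle would just mean pointing a dask Client at your on-prem cluster); the data path and column name are hypothetical:

    from metaflow import FlowSpec, step

    class DaskStepFlow(FlowSpec):

        @step
        def start(self):
            import dask.dataframe as dd
            # uses the local machine's cores by default; attach a
            # dask.distributed Client here to use an on-prem cluster instead
            df = dd.read_parquet('/data/events/*.parquet')  # hypothetical path
            self.daily_counts = df.groupby('day').size().compute().to_dict()
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        DaskStepFlow()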


Prefect is built by the Airflow core devs after they took their initial learnings and built something new. It's a reasonable orchestration engine.


Our hope with metaflow is to make the transition to production schedulers like Airflow (and perhaps similar technologies) seamless once you write the DAG via the FlowSpec. The user doesn’t have to care about the conversion to YAML etc. So I would say metaflow works in tandem with existing schedulers.


This is very interesting as a goal. Are you saying metaflow is a "Jupyter notebook for Airflow developers" kind of thing?


I wouldn't exactly say that. Jupyter notebooks don't have an easy way to represent an arbitrary DAG; their flow is more linear and narrative-like. That said, we do expect metaflow (with the client API) to play very well with notebooks to support a narrative from a business use-case point of view, which might be the end goal of most ML workloads (hopefully). I would like to think of metaflow as your workflow construct, hopefully making your life simpler when ML workloads involve interactions with existing pieces of infrastructure (storage, compute, notebooks or other UIs, HTTP service hosting, etc.) and concepts (collaboration, versioning, archiving, dependency management).


Sorry, I meant notebook figuratively.

So metaflow is the local dev version of the workflow construct, and then you export that to an Airflow/etc.-compatible format?

What workflow engine do you guys use and primarily support in metaflow?


The workflow DAG is a core concept of Metaflow, but Metaflow is not just that. Other core parts are managing and inspecting state, cloud integration, and organizing results.

At Netflix, we use an internal workflow engine called Meson https://www.youtube.com/watch?v=0R58_tx7azY


Correction: Prefect is built by one of the Airflow committers. At this time Airflow has ~50 committers. https://whimsy.apache.org/roster/ppmc/airflow


> It's a reasonable orchestration engine.

Could you elaborate, or point me at any reviews of their product? It's closed source, so it's much harder to learn about without learning from people who have paid for it.


So happy to learn about it. I have used Airflow in the past, and it seems they have addressed various pain points with this new library.


Sorry to be snarky, but after all this great work, I still have to put the DB password in clear text, as an environment variable...

Why not use Secrets Manager for this? It can even rotate the secret without much headache.

/edit: I could have a wrapper script that reads the secret and then os.execve()...
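
A rough sketch of that wrapper idea, assuming boto3 and a hypothetical secret name and wrapped command:

    # fetch_and_exec.py: read the secret, put it in the environment,
    # then replace this process with the real command,
    # e.g. `python fetch_and_exec.py python myflow.py run`
    import os
    import sys
    import boto3

    secret = boto3.client('secretsmanager').get_secret_value(SecretId='prod/db_password')
    env = dict(os.environ, DB_PASSWORD=secret['SecretString'])
    os.execvpe(sys.argv[1], sys.argv[1:], env)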


Good point. We will address it.


> using Metaflow's built-in S3 client, you can pull over 10Gbps, which is more than you can easily get with e.g. the aws CLI today

Can you please explain how you were able to beat the performance of the aws CLI?


You need more connections than a single aws CLI process opens in order to saturate the network on a big box. You can achieve the same by doing some "xargs | aws cli" trickery, but then error handling becomes harder.

Our S3 client just handles multiple worker processes correctly, with proper error handling (see the sketch below).
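
A rough sketch of what that looks like from user code, with a hypothetical bucket and prefix:

    from metaflow import S3

    # download everything under a prefix using many parallel workers
    with S3(s3root='s3://my-bucket/my-prefix/') as s3:
        objs = s3.get_all()
        # or: s3.get_many(['part-0000', 'part-0001'])  # keys relative to s3root
        for obj in objs:
            print(obj.key, obj.path)  # obj.path is a local temp file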



What does it mean to be able to pull over 10Gbps? With one object or many objects in the same prefix?


With many objects under the same S3 bucket, say for a flow or a run (with many tasks).



