- Metaflow automatically snapshots your code, data, and dependencies in a content-addressed datastore, which is typically backed by S3, although the local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about a workflow, e.g. in a notebook (see the sketch after this list). This is a core feature of Metaflow.
- Metaflow is designed to work well with a cloud backend. We support AWS today, but technically other clouds could be supported too. Quite a bit of engineering has gone into building this integration. For instance, using Metaflow's built-in S3 client, you can pull data at over 10Gbps, which is more than you can easily get with e.g. the AWS CLI today.
- We have spent time and effort keeping the API surface area clean and highly usable. YMMV, but it has been an appealing feature to many users thus far.
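To make the first bullet concrete, here is a minimal sketch of inspecting a finished run from a notebook with Metaflow's Client API; the flow and artifact names (`TrainingFlow`, `model_score`) are hypothetical:

```python
from metaflow import Flow

# Grab the most recent successful run of a (hypothetical) flow.
run = Flow("TrainingFlow").latest_successful_run
print(run.id, run.finished_at)

# Artifacts snapshotted by the run are attributes of run.data.
print(run.data.model_score)  # hypothetical artifact name
```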
Your first bullet point should be highlighted on your project front page! I read over the page a couple of times, and couldn’t deduce that, and I think it’s a really enticing feature!
Dask is great if you want to distribute your algorithms / data processing at a granular level. Metaflow is a bit more "meta" in the sense that we take your Python function as-is, which may use e.g. TensorFlow or PyTorch, and we execute it as an atomic unit in a container. This is useful, since you can keep using any existing ML libraries.
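A minimal sketch of what that looks like, assuming AWS Batch is configured; the flow itself is hypothetical, and `@batch` is optional (drop it and the step runs locally):

```python
from metaflow import FlowSpec, step, batch

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @batch(cpu=4, memory=16000)  # ship this step to a container on AWS Batch
    @step
    def train(self):
        # the body is ordinary Python; any ML library (TensorFlow,
        # PyTorch, ...) could be used here as-is
        self.result = sum(range(10))  # placeholder for real training
        self.next(self.end)

    @step
    def end(self):
        print(self.result)

if __name__ == "__main__":
    TrainFlow()
```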
Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):
It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by, in the relevant @steps, doing `import dask` and using it as you would outside of Metaflow, correct?
Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could replicate something a little more akin to the AWS integration?
Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?
To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?
Yes - you should be able to use Dask the way you describe.
The first part of your understanding matches my expectation too.
Dask's single-box parallelism is achieved via multiprocessing, akin to `parallel_map`.
And distributed compute is achieved by shipping the work to remote substrates.
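To make that concrete, here is a rough sketch of both patterns inside a flow; the Dask lines are commented out since they assume an existing on-prem cluster, and the scheduler address is hypothetical:

```python
from metaflow import FlowSpec, step, parallel_map

class ETLFlow(FlowSpec):

    @step
    def start(self):
        # single-box parallelism: fan the function out over local processes
        self.squares = parallel_map(lambda x: x * x, range(100))

        # distributed alternative, assuming an on-prem Dask cluster:
        # from dask.distributed import Client
        # client = Client("tcp://scheduler.internal:8786")  # hypothetical address
        # self.squares = client.gather(client.map(lambda x: x * x, range(100)))
        self.next(self.end)

    @step
    def end(self):
        print(sum(self.squares))

if __name__ == "__main__":
    ETLFlow()
```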
As for your second comment: we leverage pickle mostly to keep things easy and simple for basic types.
For small dataframes we just pickle them for simplicity. For larger dataframes we rely on users to store the data directly (probably encoded as Parquet) and pickle just the path instead of the whole dataframe.
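A minimal sketch of that pattern with Metaflow's S3 client, assuming pyarrow is installed (so `to_parquet()` returns bytes) and using a hypothetical bucket:

```python
import pandas as pd
from metaflow import S3

df = pd.DataFrame({"x": range(1_000_000)})  # stand-in for a large dataframe

# Writer side: encode as Parquet bytes and upload; keep only the URL.
with S3(s3root="s3://my-bucket/features/") as s3:  # hypothetical bucket
    url = s3.put("features.parquet", df.to_parquet())
    # inside a @step you would assign: self.features_url = url
    # so only this short string gets pickled, not the dataframe

# Reader side: a later step fetches the Parquet file via the stored URL.
with S3() as s3:
    obj = s3.get(url)
    df_again = pd.read_parquet(obj.path)  # obj.path is a local temp file
```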
Our hope with Metaflow is to make the transition to production schedulers like Airflow (and perhaps similar technologies) seamless once you have written the DAG via the FlowSpec.
The user doesn't have to care about the conversion to YAML etc.
So I would say Metaflow works in tandem with existing schedulers.
I wouldn't exactly say that. Jupyter notebooks don't have an easy way to represent an arbitrary DAG. The flow is more linear and narrative-like.
That said, we do expect Metaflow (with the Client API) to play very well with notebooks to support a narrative from a business use-case point of view, which might be the end goal of most ML workloads (hopefully).
I would like to think of Metaflow as your workflow construct, hopefully making your life simpler with ML workloads that involve interactions with existing pieces of infrastructure (infra pieces: storage, compute, notebooks or other UIs, HTTP service hosting, etc.; concepts: collaboration, versioning, archiving, dependency management).
The workflow DAG is a core concept of Metaflow, but Metaflow is not just that. Other core parts are managing and inspecting state, cloud integration, and organizing results.
Could you elaborate, or point me at any reviews of their product? It's closed-source, so it's much harder to learn about without hearing from people who have paid for it.
You need more connections than a single AWS CLI process opens to saturate the network on a big box. You can achieve the same by doing some "xargs | aws cli" trickery, but then error handling becomes harder.
Our S3 client just handles multiple worker processes correctly, with error handling.
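For example, a sketch of the fan-out download pattern (the URLs are hypothetical):

```python
from metaflow import S3

# get_many spreads the downloads across worker processes and surfaces
# failures as exceptions, instead of hand-rolled "xargs | aws cli" pipelines.
with S3() as s3:
    objs = s3.get_many([
        "s3://my-bucket/shard-000.parquet",
        "s3://my-bucket/shard-001.parquet",
        "s3://my-bucket/shard-002.parquet",
    ])
    for obj in objs:
        print(obj.url, obj.size, obj.path)  # obj.path is a local temp file
```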
Hope this makes sense!