Hacker News new | past | comments | ask | show | jobs | submit login

Can you compare and contrast with tools such as dask, dask-kubernetes, perfect[1]?

[1] https://www.prefect.io/products/core




Dask is great if you want to distribute your algorithms / data processing at a granular level. Metaflow is a bit more "meta" in a sense that we take your Python function as-is, which may use e.g. Tensorflow and PyTorch, and we execute it as an atomic unit on a container. This is useful, since you can use any existing ML libraries.

I am not familiar with Prefect.


Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):

It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by, in the relevant @step's, `import dask` and use it as you would outside of Metaflow, correct?

Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could kinda replicate something a little more akin to the AWS integration?

Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?

To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?

As an aside: hot damn, the API - https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f... Really cool product. Thanks for all the effort poured into this by you and others.


Yes - you should be able to use dask the way you say.

Your first part of the understanding matches my expectation too. Dask single box parallelism achieved by multi processing - akin to parallel map. And distributed compute is achieved by shipping the work to remote substrates.

For your second comment - we leverage pickle mostly to keep it easy and simple for basic types. For small dataframes we just pickle for simplicity. For larger dataframes we rely on users to directly store the data (probably encoded as parquet) and just pickle the path instead of the whole dataframe.


Yes, you can use any python library inside a metaflow step.


Prefect is built by the Airflow core devs after they took their initial learnings and built something new. It's a reasonable orchestration engine.


Our hope with metaflow is to make the transition to production schedulers like Airflow (and perhaps similar technologies) seamless once you write the DAG via the FlowSpec. The user doesn’t have to care about the conversion to YAML etc. So I would say metaflow works in tandem with existing schedulers.


This is very interesting as a goal. Are you saying metaflow is a "Jupyter notebook for airflow developers" kind of a thing ?


I wouldn't exactly say that. Jupyter notebooks don't have an easy way to represent an arbitrary DAG. The flow is more linear and narrative like. That said, we do expect metaflow (with client API) to play very well with notebooks to support a narrative from a business use-case pov; which might be the end-goal of most ML workloads (hopefully). I would like to think of metaflow, as your workflow construct - hopefully making your life simpler with ML workloads when involving interactions with existing pieces of infrastructure (infra pieces - storage, compute, notebooks or other UI, http service hosting etc.; concepts - collaboration, versioning, archiving, dependency management)


sorry- i meant notebook figuratively.

so metaflow is the local dev version of workflow construct. and then export that to airflow/etc compatible format ?

what workflow engine do you guys use and primarily support in metaflow ?


the workflow DAG is a core concept of Metaflow but Metaflow is not just that. Other core parts are managing and inspecting state, cloud integration, and organizing results.

At Netflix, we use an internal workflow engine called Meson https://www.youtube.com/watch?v=0R58_tx7azY


Correction: Prefect is built by one of the Airflow committers. At this time Airflow has ~50 committers. https://whimsy.apache.org/roster/ppmc/airflow


> It's a reasonable orchestration engine.

Could you elaborate, or point me at any reviews of their product. It's closed-source so much harder to learn about without learning from people that have paid for it.


so happy to learn about it. i have used airflow in the past and it seems they have addressed various pain points with this new library.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: