Pachyderm Hub: data science without the hassle of managing infrastructure (pachyderm.com)
107 points by jdoliner on Sept 3, 2020 | 31 comments



I'm quite excited for the emerging Python-, Git-, and DAG-driven data engineering workflows/orchestrators that are code-first, GUI-second. But is anyone daunted at the prospect of investing in one of these platforms, then having to pivot to the paradigm that emerges as the winner 3-5 years from now? I've been trying to keep up with all the tools, but am pretty overwhelmed at this point. Obviously there are differences between them, but off the top of my head I can think of: Airflow, Luigi, Pachyderm, Dagster, Google Cloud Composer, Kubeflow, MLflow, Azure ML Pipelines, Ascend.io, and so, so many more.


This is unfortunately always going to be a problem in a developing space. Pachyderm was designed with this issue in mind, though. While I don't think we've totally eliminated it, or can totally eliminate it, we have done some things to mitigate it. Pachyderm pipelines run your code in a Docker container that you define, which reads data from the local filesystem. It's designed to be simple and similar to the interface you'd use when playing around with data on your laptop. Users generally have a very easy time migrating existing code into Pachyderm, since that code is often already reading from the local filesystem and just needs to be pointed at a new path. Migrating to a different system is normally a little more complicated because it requires integrating with whatever that new system's data interface is, although in another system where you just read from local disk the migration should be quite simple.
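
For concreteness, a minimal pipeline spec looks roughly like this (a sketch; the repo, image, and script names are made up). Inside the container, input data shows up under /pfs/<repo> and anything written to /pfs/out becomes the pipeline's versioned output:

    {
      "pipeline": { "name": "wordcount" },
      "input": {
        "pfs": { "repo": "docs", "glob": "/*" }
      },
      "transform": {
        "image": "myorg/wordcount:1.0",
        "cmd": ["python3", "/count.py"]
      }
    }

You'd create it with pachctl create pipeline -f wordcount.json, and the code itself just reads /pfs/docs like any local directory.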


Shameless plug: my company Weights & Biases (https://wandb.com) has tackled this in a different way. We make tools to keep track of results across your pipelines regardless of how you choose to orchestrate execution. This is one of our big selling points: a very simple on-ramp to reproducible tracking, and infrastructure agnosticism. These have led to broad adoption across both companies and academia. We also have a pretty cool UI.

Pachyderm makes different tradeoffs and we're excited to see their launch. Seriously congrats to you all! This is an invigorating space to say the least ;).


We have at least one customer who's using both wandb and Pachyderm together. I actually don't think they overlap that much, although I may be misunderstanding what wandb does. Pachyderm doesn't do anything to track models explicitly. Our tracking is specifically about data lineage, i.e. what data was fed into this container to create this result. It looks like wandb does do dataset versioning, but it's unclear to me whether it's storing references to versions of the dataset or whether it's actually the source of truth that stores the data. I think it's the former, but I'm not sure. Pachyderm focuses on the latter: being the system that stores and versions the large datasets and presents a unified hash for them. Other systems can then record that hash to have immutable versioned datasets they can rely on.
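
For example (a sketch; the repo name is made up), every commit on a Pachyderm branch has an immutable ID that an external tracker like wandb could record:

    # list commits on a branch; each commit ID is immutable
    pachctl list commit training-data@master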

We do have a few philosophical differences, though. The biggest is that we're opposed to requiring data scientists to include tracking code to trace their results. Every company we've seen use systems like that winds up with different levels of instrumentation on different pipelines and a lot of experiments with no instrumentation at all. That makes it hard to answer questions like "what are all the places this data is being used?" conclusively, because there's always this "dark matter" of code that hasn't been instrumented and that your tracking doesn't see. We prefer a system that automatically tracks things without asking the user to do anything, so the information is always collected and there when you need it. That being said, I'm not sure how you could do the type of fine-grained model tracking you guys do automatically. We can track lineage automatically because we're the system that stores and exposes the data, so we know when it's being accessed.


FWIW, our Artifacts tool can track references to data that exists in other systems, or store the data directly, which allows us to ensure there are no "dark matter" users like you mentioned. It's a flexible system.
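
Roughly, the two modes look like this (a sketch; the project, file, and bucket names are all hypothetical):

    import wandb

    run = wandb.init(project="demo")  # hypothetical project

    # Mode 1: store the data in W&B directly
    ds = wandb.Artifact("training-data", type="dataset")
    ds.add_file("data/train.csv")
    run.log_artifact(ds)

    # Mode 2: track a reference to data that lives elsewhere
    ref = wandb.Artifact("training-data-ref", type="dataset")
    ref.add_reference("s3://my-bucket/train/")  # hypothetical bucket
    run.log_artifact(ref)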


The key difference is that it's something you _can_ do, rather than something that just happens automatically. It's not about what you're able to do, it's about what you can forget to do. Storing references to external systems is something that people do in Pachyderm as well, but we discourage it because those external systems generally aren't immutable. So they can be deleted, or even worse change, at which point your lineage tracking is actually misleading you rather than informing you.


sorry I forgot about W&B!


> But is anyone daunted at the prospect of investing in one of these platforms, then having to pivot to the paradigm that emerges as the winner 3-5 years from now?

This will always be an issue if you’re an early adopter. The three mitigating calculations in every case are: 1 - somebody will probably be the “winner”, and it could be the one you pick. 2 - between now and three years from now, will you get enough value that it will be worth migrating to the eventual winner rather than doing without? And 3 - three years from now you can just start new projects on the winning tech; most or all of your current projects will have withered and/or died anyway.


You are so right. The market is super fragmented, and it feels like there is a new workflow tool every day. Even I had built one similar to Kubeflow.

In the end, I think the survival of these tools will come down to being able to compete on price with the leading cloud vendors.

Just like the choice of programming language, the decision will come down to your team's past experience and the cost of ownership.


There isn’t a way around it. The best thing to do is to try to build your data pipelines in a way that makes it somewhat easier to swap tools if required, but I’m not sure there’s a general way to do that.


I've been following Pachyderm for a long time, and the docs and everything finally look to be at a point where I really want to try it out!

There's no point, though, in hiding the pricing behind the signup, especially when it's the very first thing you see after signing up. All you're doing is inflating your signup numbers (and maybe even deterring a few people).

PS: The link under "Resources -> What's the typical developer workflow?" is broken.


Thanks for staying involved with the project; we're so glad to hear that the improvements are working.

You bring up some good points about the pricing page. We'll be making the pricing visible without login soon. If you have some time, we'd love to get more feedback via our Slack channel (pachyderm.com/slack). Since you've been involved for so long, your feedback would be super insightful.

(Fixed that link. Email us and we'll send you some Pachyderm swag if you like.)


I agree. I left the page immediately when it wasn't clear what the pricing would look like. No use investing any of my time into something I don't know if I can afford or get into my work budget.


Yup. I've seen "hidden" prices range from as low as $100/yr to as much as $10k/mo. Now I don't dig any deeper and just move along, unless I'm desperate.


Awesome!

We switched from Airflow to Pachyderm recently, and we have loved it (we still have some legacy on Airflow, but all new data engineering dev is in Pachyderm).


Thanks for the kind words. Airflow -> Pachyderm has been an increasingly common migration path for us.


The data science platform looks great, but I have a fundamental issue with building a data versioning platform on git-like semantics; Pachyderm and DVC both follow this approach. Git-like approaches just track immutable files - they do not store the differences between files. That's fine when you're working with small amounts of data, but it doesn't scale to large volumes and prevents you from making some types of time-travel queries, e.g. "give me the data that changed in this time range". The emerging enterprise approach is to use a framework like Delta.io, Apache Hudi, or Apache Iceberg. These frameworks add metadata over Parquet and allow you to do point-in-time time-travel queries as well as interval-based time-travel queries (give me the data that changed in this time interval). Parquet is becoming the dominant file format for data lakes and is cheap to store on S3.
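
For example, in PySpark (a sketch; the paths and timestamps are made up, option names per the Delta and Hudi docs):

    from pyspark.sql import SparkSession

    # assumes the Delta and Hudi connectors are on the Spark classpath
    spark = SparkSession.builder.getOrCreate()

    # Delta: point-in-time - the table as it looked at a given timestamp
    snapshot = (spark.read.format("delta")
                .option("timestampAsOf", "2020-09-01")
                .load("s3://bucket/events"))

    # Hudi: interval-based - only the rows that changed between two commits
    changes = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20200901000000")
               .option("hoodie.datasource.read.end.instanttime", "20200903000000")
               .load("s3://bucket/events_hudi"))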

Disclosure: I am a co-founder of Logical Clocks.


Looking at this, it seems very similar to Verta.AI. Both are doing data lineage/workflow management ~ git for data science workflows. Verta recently raised their Series A and launched their platform too.


I'm not super familiar with Verta.AI, but reading about their product, it seems pretty different. I read Verta.AI as a model deployment and tracking solution; Seldon would be the open-source comparison that immediately comes to mind. Pachyderm is more of a general pipeline and data version control system. You can do some very basic model deployment and tracking in it, but we generally recommend people use an external system such as Seldon, and we even have some published examples of how to integrate the two.

Disclosure: I founded Pachyderm.


I wanted to install R to try it out and see if it was something I could work with on macOS (Mojave).

It was nothing short of an f-ing nightmare.

Wonder if this service will help.


I don't think this solves that problem, or intends to. You can run R in Pachyderm via Docker, but that's unlikely to be a good way to test R out.

If you like Docker, try the images from Rocker: https://www.rocker-project.org/. The one with RStudio server is especially useful for local experimentation.
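
For example (the password is just a placeholder):

    docker run -d -p 8787:8787 -e PASSWORD=changeme rocker/rstudio
    # then open http://localhost:8787 and log in as user "rstudio"

That gets you a full RStudio in the browser without touching the host R install at all.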


Thanks, I'll take a look at this.


:wave:


We have lots of users who develop pipelines in R. Here's the standard iris example in R (the example also includes Python and Julia), if you're interested: https://github.com/pachyderm/pachyderm/tree/master/examples/...


> It was nothing short of an f-ing nightmare.

Honestly I think I've only ever heard of installation issues from Windows users. What was the problem?


The problem was with C dependencies. I'd install the Xcode CLI tools; that didn't work. Then I'd try installing them with brew, and that didn't work either.

I was trying to get Jupyter Notebook and an R engine to work as advertised in a Packt ebook.

The only way I got it to work was not by installing it on macOS, but in a Vagrant box and then doing port forwarding.

But I'll have to look into the Docker way. Maybe that way is better.


I often need Dataflow, Spark engineering, NoSQL and SQL datastores, enterprise security, and hybrid cloud with my data science: https://docs.cloudera.com/machine-learning/cloud/product/top...


Quick note for @jdoliner: there's a typo on the homepage in the Useful Links section - "Technicial Slack Channel"


Can this run in a peered VPC connection or will I have to pay for data transferred from my public cloud infrastructure to this service?


It can't right now; it all runs in GCP. You can still spin up Pachyderm yourself in whatever environment you like. We're working on enabling custom deploys in more environments than the one we have now, but it's still a little ways out.


Couldn't find any info on pricing. How much more expensive is this than running your own cluster?

P.S. Couldn't log in. Stuck on https://hub.pachyderm.com/pricing. I am just seeing a white screen. (Login with Google, Linux, Firefox, Ukraine)



