
Shameless plug: my company Weights & Biases (https://wandb.com) has tackled this in a different way. We make tools to keep track of results across your pipelines regardless of how you choose to orchestrate execution. This is one of our big selling points: very simple on-ramp to reproducible tracking, and infrastructure agnosticism. These have led to broad adoption across companies and academics. We also have a pretty cool UI.

Pachyderm makes different tradeoffs and we're excited to see their launch. Seriously congrats to you all! This is an invigorating space to say the least ;).




We have at least one customer who's using both wandb and Pachyderm together. I actually don't think they overlap that much, although I may be misunderstanding what wandb does. Pachyderm doesn't do anything to track models explicitly. Our tracking is specifically about data lineage, i.e. what data was fed into this container to create this result. It looks like wandb does do dataset versioning, but it's unclear to me if it's storing references to versions of the dataset or if it's actually the source of truth that's storing the data. I think it's the former but I'm not sure. Pachyderm focuses on the latter: being the system that stores and versions the large datasets and presents a unified hash for them. Other systems can then record that hash to have immutable versioned datasets they can rely on.
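To make the "unified hash" idea concrete, here is a minimal sketch of deriving one digest for a whole dataset from its paths and contents. This is not Pachyderm's actual scheme; `dataset_hash` and the in-memory dict of files are hypothetical, chosen only to illustrate the concept.

```python
import hashlib

def dataset_hash(files: dict[str, bytes]) -> str:
    """Return one hex digest covering every path and its contents.

    Hypothetical helper: hashes each file, then hashes the sorted
    (path, content digest) pairs so the result is order-independent.
    """
    outer = hashlib.sha256()
    for path in sorted(files):
        content_digest = hashlib.sha256(files[path]).hexdigest()
        outer.update(f"{path}\0{content_digest}\n".encode("utf-8"))
    return outer.hexdigest()

# Two versions of a toy dataset that differ in one byte of one file.
v1 = {"train.csv": b"a,b\n1,2\n", "labels.csv": b"y\n0\n"}
v2 = {**v1, "train.csv": b"a,b\n1,3\n"}

print(dataset_hash(v1) == dataset_hash(v2))  # any change yields a new hash
```

Another system only needs to record the returned string to pin the exact dataset version it consumed.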

We do have a few philosophical differences though. The biggest is that we're opposed to requiring data scientists to include tracking code to trace their results. Every company we've seen use systems like that winds up with different levels of instrumentation on different pipelines and a lot of experiments with no instrumentation. It makes it hard to answer questions like "what are all the places this data is being used?" conclusively, because there's always this "dark matter" of code that hasn't been instrumented and that your tracking doesn't see. We prefer a system that automatically tracks things without asking the user to do anything, so that the information is always collected and there when you need it. That being said, I'm not sure how you could do the type of fine-grained model tracking you guys do automatically. We can track lineage automatically because we're the system that stores and exposes the data, so we know when it's being accessed.


FWIW, our Artifacts tool can track references to data that exists in other systems, or store the data directly, which allows us to ensure there are no "dark matter" users like you mentioned. It's a flexible system.


The key difference is that it's something you _can_ do, rather than something that just happens automatically. It's not about what you're able to do, it's about what you can forget to do. Storing references to external systems is something that people do in Pachyderm as well, but we discourage it because those external systems generally aren't immutable. So the data can be deleted or, even worse, changed, at which point your lineage tracking is actually misleading you rather than informing you.
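A rough sketch of that failure mode, assuming a toy mutable store: the lineage record keeps a reference plus a content digest taken at record time, and a later check shows whether the reference still points at the same bytes. The `external_store` dict, the URI, and both function names are made up for illustration and do not correspond to any Pachyderm or wandb API.

```python
import hashlib

# Hypothetical stand-in for a mutable external system (e.g. an object store).
external_store = {"s3://example-bucket/train.csv": b"a,b\n1,2\n"}

def record_reference(uri: str) -> dict:
    """Record a lineage entry: the URI plus a digest of the data as seen now."""
    data = external_store[uri]
    return {"uri": uri, "sha256": hashlib.sha256(data).hexdigest()}

def reference_is_valid(ref: dict) -> bool:
    """Check that the referenced data still exists and is unchanged."""
    data = external_store.get(ref["uri"])
    if data is None:
        return False  # the data was deleted out from under the reference
    return hashlib.sha256(data).hexdigest() == ref["sha256"]

ref = record_reference("s3://example-bucket/train.csv")
print(reference_is_valid(ref))
```

If the external copy is later overwritten or removed, `reference_is_valid` returns False. A record that stored only the URI would have no way to notice.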


sorry I forgot about W&B!



