Shameless plug: my company Weights & Biases (https://wandb.com) has tackled this...

jdoliner · on Sept 3, 2020

We have at least one customer who's using both wandb and Pachyderm together. I actually don't think they overlap that much, although I may be misunderstanding what wandb does. Pachyderm doesn't do anything to track models explicitly. Our tracking is specifically about data lineage, i.e. what data was fed into this container to create this result. It looks like wandb does do dataset versioning, but it's unclear to me if it's storing references to versions of the dataset or if it's actually the source of truth that's storing the data. I think it's the former but I'm not sure. Pachyderm focuses on the latter: being the system that stores and versions the large datasets and presents a unified hash for them. Other systems can then record that hash to have immutable versioned datasets they can rely on.

We do have a few philosophical differences though. The biggest is that we're opposed to requiring data scientists to include tracking code to trace their results. Every company we've seen use systems like that winds up with different levels of instrumentation on different pipelines and a lot of experiments with no instrumentation. It makes it hard to answer questions like "what are all the places this data is being used?" conclusively, because there's always this "dark matter" of code that hasn't been instrumented that your code doesn't see. We prefer a system that automatically tracks things without asking the user to do anything, so that information is always collected and there when you need it. That being said, I'm not sure how you could do the type of fine-grained model tracking you guys do automatically. We can track lineage automatically because we're the system that stores and exposes the data, so we know when it's being accessed.
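
To make the "tracking as a side effect of storage" idea concrete, here is a minimal Python sketch. It is not Pachyderm's implementation; the `TrackedStore` class, its methods, and the job and path names are all hypothetical, and the store is just an in-memory dict. The point is only that when reads go through the storage layer, the usage record is written whether or not the caller remembers to log anything.

```python
from collections import defaultdict


class TrackedStore:
    """Hypothetical in-memory store that records every read it serves."""

    def __init__(self):
        self._data = {}
        self._readers = defaultdict(set)

    def put(self, path, content):
        self._data[path] = content

    def get(self, path, job):
        # Lineage is recorded here, as a side effect of the access itself,
        # so no job can read the data without leaving a trace.
        self._readers[path].add(job)
        return self._data[path]

    def used_by(self, path):
        # Answers "what are all the places this data is being used?"
        return sorted(self._readers.get(path, set()))


store = TrackedStore()
store.put("images/train.csv", b"a,b\n1,2\n")  # hypothetical dataset path

# Two hypothetical jobs read the data; neither contains any tracking code.
store.get("images/train.csv", job="train-model")
store.get("images/train.csv", job="eval-model")

consumers = store.used_by("images/train.csv")
```

With opt-in instrumentation, the equivalent of `used_by` can only list the jobs whose authors added a logging call.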

slewis · on Sept 4, 2020

FWIW, our Artifacts tool can track references to data that exists in other systems, or store the data directly, which allows us to ensure there are no "dark matter" users like you mentioned. It's a flexible system.

jdoliner · on Sept 4, 2020

The key difference is that it's something you _can_ do, rather than something that just happens automatically. It's not about what you're able to do, it's about what you can forget to do. Storing references to external systems is something that people do in Pachyderm as well, but we discourage it because those external systems generally aren't immutable. So they can be deleted or, even worse, changed, at which point your lineage tracking is actually misleading you rather than informing you.
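
A small Python sketch of why a reference into a mutable system can mislead. Everything here is an assumption for illustration: the `external` dict stands in for some outside storage system, the URI is made up, and SHA-256 is just one reasonable choice of content hash. Recording a hash next to the reference at least lets you detect that the data moved underneath you, although it cannot bring the original bytes back.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical external system: mutable, not under the tracker's control.
uri = "external://datasets/train.csv"
external = {uri: b"x,y\n1,2\n"}

# Lineage record taken when the data was used: location plus content hash.
recorded = {"uri": uri, "sha256": content_hash(external[uri])}


def reference_is_valid(ref, store):
    """True only if the referenced data still exists and is unchanged."""
    data = store.get(ref["uri"])
    return data is not None and content_hash(data) == ref["sha256"]


valid_before = reference_is_valid(recorded, external)

# Someone edits the file in place: the URI still resolves, but to other data.
external[uri] = b"x,y\n1,3\n"
valid_after_change = reference_is_valid(recorded, external)

# Someone deletes it: the reference now points at nothing.
del external[uri]
valid_after_delete = reference_is_valid(recorded, external)
```

Without the stored hash, the in-place edit would go unnoticed and the lineage record would silently describe data that no longer exists in that form.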

data_ders · on Sept 4, 2020

sorry I forgot about W&B!