How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?
Pachyderm does a ton of data deduplication, both for input data added to Pachyderm repos and for output files.
Pachyderm's pipelines are also smart enough to know which data has changed and which hasn't, and to process only the incremental "diffs" as needed. If your pipeline is one giant reduce or training job that can't be broken up at all, this isn't valuable, but most workloads include lots of Map steps where processing only the diffs can be incredibly powerful.
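To make that concrete, here's a rough sketch of a pipeline spec. The names ("edges", "images", the opencv image and script) are just illustrative, and the exact field names and pachctl subcommands have shifted a bit between Pachyderm versions, but the idea is that the glob pattern splits the input repo into datums, and only datums whose data changed get reprocessed:

    # Illustrative sketch; check the pipeline spec reference for your Pachyderm version.
    cat > edges.json <<'EOF'
    {
      "pipeline": { "name": "edges" },
      "transform": {
        "image": "pachyderm/opencv",
        "cmd": ["python3", "/edges.py"]
      },
      "input": {
        "pfs": { "repo": "images", "glob": "/*" }
      }
    }
    EOF
    # "glob": "/*" makes each top-level file in the images repo its own datum,
    # so a commit that adds a few new files only triggers work on those files.
    pachctl create pipeline -f edges.json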
Pachyderm does this. It's like half of what Pachyderm does: managing the versioned data and scheduling workers to run your containerized processes against it.
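For a feel of the versioning half, the commands look roughly like this (repo and file names are made up, and this is the newer repo@branch syntax; older releases spelled these as put-file, list-commit, etc.):

    pachctl create repo images                          # a versioned data repo
    pachctl put file images@master:/cat.png -f cat.png  # every put creates a new commit
    pachctl list commit images                          # git-style commit history
    pachctl list file images@master                     # contents of the latest commit

The other half is the pipeline spec above, which is what gets your container scheduled against whatever commits land in its input repos.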
FYI, it's ridiculously easy to start playing with Pachyderm if you just want to check it out. You can run it on Minikube.
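Roughly like this (the deploy step has changed across releases, with newer versions installing through the Pachyderm Helm chart, so treat this as a sketch and check the docs for your version):

    minikube start          # local single-node Kubernetes cluster
    pachctl deploy local    # older pachctl releases; newer ones deploy via the Helm chart instead
    pachctl version         # confirm pachctl can reach the cluster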