How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?
Pachyderm does a ton of data deduplication, both for input data added to Pachyderm repos and for output files.
Pachyderm's pipelines are also smart enough to know which data has changed and which hasn't, and to process only the incremental "diffs" as needed. If your pipeline is one giant reduce or training job that can't be broken up at all, this isn't valuable, but most workloads include lots of Map steps where processing only the diffs can be incredibly powerful.
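To make that concrete, here's a rough sketch of a pipeline spec. The names ("edges", "images", the opencv image and script) are just illustrative, and the exact field names and pachctl subcommands have shifted a bit between Pachyderm versions, but the idea is that the glob pattern splits the input repo into datums, and only datums whose data changed get reprocessed:

    # Illustrative sketch; check the pipeline spec reference for your Pachyderm version.
    cat > edges.json <<'EOF'
    {
      "pipeline": { "name": "edges" },
      "transform": {
        "image": "pachyderm/opencv",
        "cmd": ["python3", "/edges.py"]
      },
      "input": {
        "pfs": { "repo": "images", "glob": "/*" }
      }
    }
    EOF
    # "glob": "/*" makes each top-level file in the images repo its own datum,
    # so a commit that adds a few new files only triggers work on those files.
    pachctl create pipeline -f edges.json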
Pachyderm does this. It's like half of what Pachyderm does: managing the versioned data and scheduling workers to run your containerized processes against it.
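For a feel of the versioning half, the commands look roughly like this (repo and file names are made up, and this is the newer repo@branch syntax; older releases spelled these as put-file, list-commit, etc.):

    pachctl create repo images                          # a versioned data repo
    pachctl put file images@master:/cat.png -f cat.png  # every put creates a new commit
    pachctl list commit images                          # git-style commit history
    pachctl list file images@master                     # contents of the latest commit

The other half is the pipeline spec above, which is what gets your container scheduled against whatever commits land in its input repos.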
FYI, it's ridiculously easy to start playing with Pachyderm if you just want to check it out. You can run it on Minikube.
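Roughly like this (the deploy step has changed across releases, with newer versions installing through the Pachyderm Helm chart, so treat this as a sketch and check the docs for your version):

    minikube start          # local single-node Kubernetes cluster
    pachctl deploy local    # older pachctl releases; newer ones deploy via the Helm chart instead
    pachctl version         # confirm pachctl can reach the cluster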