
How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?



Pachyderm does a ton of data deduplication, both for input data added to Pachyderm repos and for output files.
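Rough sketch of what that looks like from the CLI (the repo and file names below are placeholders, and the repo@branch:/path syntax is the newer form; older releases spelled it pachctl put-file). As I understand it, storage is content-addressed, so re-adding identical bytes creates a new commit without duplicating the underlying chunks:

    # create a versioned repo and add some input data
    pachctl create repo images
    pachctl put file images@master:/cat.png -f cat.png

    # same bytes under a different path: new commit, no new data stored
    pachctl put file images@master:/cat-copy.png -f cat.png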

Pachyderm's pipelines are also smart enough to know what data has changed and what hasn't, and they only process the incremental data "diffs" as needed. If your pipeline is just one giant reduce or training job that can't be broken up at all, then this isn't valuable, but most workloads include lots of Map steps where processing only the diffs can be incredibly powerful.
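Concretely, that incrementality falls out of the pipeline spec: the glob pattern splits the input repo into "datums", and only datums whose files changed since the last run get reprocessed. A minimal sketch (all names here are placeholders), saved as edges.json and created with pachctl create pipeline -f edges.json:

    {
      "pipeline": { "name": "edges" },
      "transform": {
        "image": "my-image",
        "cmd": ["python3", "/edges.py"]
      },
      "input": {
        "pfs": { "repo": "images", "glob": "/*" }
      }
    }

With glob "/*", each top-level path in the images repo is its own datum, so adding one new image only runs the transform over that image.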


This is super cool, thanks for pointing that out. Is the hard part done by Pachyderm itself, or by some layer over container file systems?


Pachyderm does it. That's roughly half of what Pachyderm does: managing the versioned data and scheduling workers to run your containerized processes against it.
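For the versioned-data half, every put lands in a commit, so you can inspect history and read files as of any commit (the commit ID below is a placeholder):

    # list the commit history of a repo's master branch
    pachctl list commit images@master
    # read a file as it existed at a specific commit
    pachctl get file images@<commit-id>:/cat.png > cat.png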

FYI, it's ridiculously easy to get going with Pachyderm if you just want to check it out. You can run it on Minikube.
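Sketch of the local test drive (assumes kubectl, minikube, and pachctl are already installed; the exact deploy step depends on the Pachyderm version, with newer releases installing via the official Helm chart instead):

    minikube start
    pachctl deploy local   # 1.x-era local deploy; newer versions use the Helm chart
    pachctl version        # prints pachd's version once it's reachable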


> You can run it on Minikube

Thanks for the tip. I just started down the k8s path from a bare-metal cluster and will try this.



