I cannot give enough thanks to everyone that works so hard on Kubernetes. The sp...

ericand · on Dec 18, 2017

Great post. Could you describe further what you mean by data pipelining? Is this something that you could also use Spark, Kafka for? I'd be interested to know if Kubernetes would work for me as well.

atombender · on Dec 18, 2017

One promising piece of tech is Pachyderm [1], which runs natively on Kubernetes. It allows you to set up pipelines of batch jobs (no streaming yet, as far as I know) that operate on immutable, versioned data.

There are some gotchas, though. For one [2], it requires read-write access to your storage buckets (!), and so on GKE you have to modify your nodepool's access scopes to give the whole node read-write access to all your buckets. Not is this terrible security practice, but access scopes are deprecated on Google Cloud. You should always grant access through service accounts, not scopes. We spent some time fighting this, as there's literally no way to make Pachyderm use a service account.

There are some other niggling issues, and the Pachyderm team has been less than enthuastic in addressing them (e.g. see [3], which has zero activity despite being absolutely crucial to run in production), but it does seem quite promising. The architecture, at least, is sound.

[1] http://pachyderm.io

[2] https://github.com/pachyderm/pachyderm/issues/2538

[3] https://github.com/pachyderm/pachyderm/issues/2537

jdoliner · on Dec 18, 2017

Hey lobster_johnson, Pachyderm CEO here. Thanks for the shout out and we're glad you feel Pachyderm is promising for these use cases. I just wanted to clarify / respond to a few things.

> no streaming yet, as far as I know

Pachyderm actually does do streaming, there's nothing in the API about it because it all happens automatically. This is one of the huge advantages of immutable data, the system can automatically detect if it's already done a computation by looking at a hash of the input and the code. The system will never do the same computation twice, all pipelines are streaming by default.

Sorry we've been less than enthusiastic in addressing these niggling issues. The truth is it's pretty hard to be enthusiastic in addressing issues for specific deployment scenarios. AFAIK the Google Cloud Storage library we're using doesn't support service accounts, the whole library has been deprecated, the second GCS library we've had deprecated underneath us. So upgrading that is a bit of a pain but the bigger pain is that now users with existing Google clusters will have their clusters break under them unless we can figure out a way to get scopes and service accounts to both work. And we have every reason to believe that all of this will be deprecated within the next 6 months and we'll be on to our 4th GCS library. And that's only one deployment scenario, we're doing the same thing for AWS, Azure and a host of local deployment options.

Anyways, these are our problems to solve, and we will solve them. We're thrilled that you want to use Pachyderm and we want to make that experience as smooth as possible. We just need to ask for a bit of patience regarding these deployment scenarios.

atombender · on Dec 19, 2017

Thanks for responding. First of all, I don't buy that "deployment scenarios" shouldn't be a primary concern of yours. Both Pachyderm and Kubernetes are frameworks for deploying code.

Secondly, it's not like Google Cloud Storage is some kind of esoteric platform. You decided to support GCS, and so with that comes to expectation of up-to-date support. The details is where the devil is, and that's where you have to be diligent in ironing out the bumps.

Thirdly, while it's absolutely frustrating that Google shuffles the Go SDK around so much, it seems that that the surface area you need to cover here is rather small, and GCS itself hasn't changed. As far as I can tell, the fix for #2538 requires just a single line change. (I'll comment on the issue.)

That said, I'm sure this is just a temporary speed bump while Pachyderm matures. I'm looking forward to using it.

As for streaming, I know that Pachyderm streams the data internally, but I was referring to real-time streaming processing á la Spark, as opposed to batch jobs. From what I can tell, Pachyderm is not only batch-job-based, but its data model is based around files, so it's not suitable for situations where you have a continuous, fast, unbounded stream of data (e.g. analytics events).

dsnuh · on Dec 19, 2017

Do you guys have any deployment docs or Kubernetes deployment for on-prem Pachyderm? I tried setting it up a while back on our cluster and it was pretty challenging.

jdoliner · on Dec 19, 2017

We do: http://docs.pachyderm.io/en/latest/deployment/on_premises.ht...

Hopefully that should work for you. Deploying on prem can mean a lot of different things so some paths will work better than others. The most common on prem deployments we see are back by Minio.

dsnuh · on Dec 19, 2017

Great, thanks for the info. Admittedly, this was a while back and I know docs change quickly in this space.

I agree that on-prem can be a challenge. You really have to know your stuff, and your environment. We run it in production and it has been quite a journey.

jaz46 · on Dec 19, 2017

Deploying Pachyderm on-prem is very similar to deployment on a cloud provider, you just need Kubernetes and an object store. The challenging part in my experience is getting an on-prem Kubernetes cluster, but this is well-tested territory for Kubernetes these days. We point people towards this guide to figure out the best Kubernetes deployment option for their infra:

https://kubernetes.io/docs/setup/pick-right-solution/

Pachyderm on-prem guide: http://pachyderm.readthedocs.io/en/latest/deployment/on_prem...

lemoncucumber · on Dec 18, 2017

As much as I like kubernetes, I find the whole ecosystem to be frustratingly immature and unreliable. I'm sure it's trending in the right direction with respect to reliability, but it just doesn't feel like it's there yet. Minikube seems especially buggy and is missing some obvious features.

akvadrako · on Dec 19, 2017

Well minikube is not for production, so maybe it's buginess is evidence they are focusing their time on the more important components.

jaz46 · on Dec 18, 2017

Have you checked out Pachyderm (disclaimer: my company) for data pipelining on Kubernetes?

Our product has been available since the first Kubernetes release and is specifically designed for data analysis and pipelining tasks on Kubernetes.

github.com/pachyderm/pachyderm

pachyderm.io

cirowrc · on Dec 18, 2017

That's so great to hear! I love to see how it empowers small teams to deliver big stuff.

What was the trickiest part for you when it comes to real production usage?

linkmotif · on Dec 18, 2017

Using it on GKE. Literally no tricky parts except figuring out how to keep my images on GCR in sync with my manifests. For that this tool is quite helpful: https://github.com/dminkovsky/kube-cloud-build.

Otherwise it's been such a pleasure. PVs and StatefulSets are a total game changer. Looking forward to 2018.

EDIT: I should say RBAC has been a little frustrating. But obviously it's for the better.

linkmotif · on Dec 18, 2017

Couldn’t have said it better myself. I am so grateful.

ec109685 · on Dec 18, 2017

What type of infrastructures do you run kubernetes on?

slap_shot · on Dec 18, 2017

We run Kubernetes on all the three major clouds: AWS, GCP and Azure.