I cannot give enough thanks to everyone that works so hard on Kubernetes. The speed of these releases, community support, industry adoption, feature set, and ease of use are beyond anything I could have asked for.
I started a startup about 18 months ago and our product requires a massive amount of data pipelining. The infrastructure we have running on Kubernetes is beyond what a small group should be able to deploy and maintain by ourselves, but Kubernetes makes it not only possible, but enjoyable (really).
To everyone involved (and there are a lot): Thank you SO much. I can't wait to see what 2018 looks like for Kubernetes.
Great post. Could you describe further what you mean by data pipelining? Is this something that you could also use Spark, Kafka for? I'd be interested to know if Kubernetes would work for me as well.
One promising piece of tech is Pachyderm [1], which runs natively on Kubernetes. It allows you to set up pipelines of batch jobs (no streaming yet, as far as I know) that operate on immutable, versioned data.
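To give a flavor of the "pipelines of batch jobs" model: a Pachyderm pipeline is declared as a JSON spec that names an input repo, a glob pattern over its versioned files, and a container to run over them. This is a hedged sketch; the repo/pipeline names are illustrative and the exact schema varies between Pachyderm versions.

```json
{
  "pipeline": { "name": "wordcount" },
  "transform": {
    "image": "alpine:3.6",
    "cmd": ["sh", "-c", "wc -w /pfs/input/* > /pfs/out/counts"]
  },
  "input": {
    "atom": { "repo": "input", "glob": "/*" }
  }
}
```

Each commit of new data to the input repo triggers the transform, and the output lands in a versioned output repo of the same name as the pipeline.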
There are some gotchas, though. For one [2], it requires read-write access to your storage buckets (!), so on GKE you have to modify your node pool's access scopes to give the whole node read-write access to all your buckets. Not only is this terrible security practice, but access scopes are deprecated on Google Cloud; you should always grant access through service accounts, not scopes. We spent some time fighting this, as there's literally no way to make Pachyderm use a service account.
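For anyone unfamiliar with the distinction, here's roughly what the two approaches look like on GKE (a sketch; the cluster, project, bucket, and service-account names are placeholders):

```shell
# What you're forced into today: a node pool whose nodes get read-write
# access to EVERY bucket in the project via a legacy access scope.
gcloud container node-pools create pachyderm-pool \
    --cluster=my-cluster \
    --scopes=storage-rw

# The recommended alternative: a dedicated service account with IAM
# granted on just the one bucket, attached to the node pool.
gcloud iam service-accounts create pachyderm-sa
gsutil iam ch \
    serviceAccount:pachyderm-sa@my-project.iam.gserviceaccount.com:objectAdmin \
    gs://my-pachyderm-bucket
gcloud container node-pools create pachyderm-pool \
    --cluster=my-cluster \
    --service-account=pachyderm-sa@my-project.iam.gserviceaccount.com
```

The difference is blast radius: a scope applies to everything running on the node, while a service account can be scoped to a single bucket.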
There are some other niggling issues, and the Pachyderm team has been less than enthusiastic in addressing them (e.g. see [3], which has zero activity despite being absolutely crucial to run in production), but it does seem quite promising. The architecture, at least, is sound.
Hey lobster_johnson, Pachyderm CEO here. Thanks for the shout out and we're glad you feel Pachyderm is promising for these use cases. I just wanted to clarify / respond to a few things.
> no streaming yet, as far as I know
Pachyderm actually does do streaming; there's nothing in the API about it because it all happens automatically. This is one of the huge advantages of immutable data: the system can detect whether it's already done a computation by looking at a hash of the input and the code. The system will never do the same computation twice, so all pipelines are streaming by default.
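The hashing idea above can be sketched in a few lines. This is a toy illustration of the technique, not Pachyderm's actual implementation: key a cache on a hash of (code, input), and only recompute on a miss.

```python
import hashlib

# Toy sketch of content-addressed memoization: hash the code together with
# the input data, and skip the computation if that hash was seen before.
_cache = {}  # hash -> previously computed result

def run_step(code: str, input_data: bytes, compute):
    """Run `compute` only if this (code, input) pair hasn't been seen yet.

    Returns (result, ran) where `ran` is False on a cache hit.
    """
    key = hashlib.sha256(code.encode() + input_data).hexdigest()
    if key in _cache:
        return _cache[key], False  # cache hit: no recomputation
    result = compute(input_data)
    _cache[key] = result
    return result, True  # cache miss: the computation actually ran

# Same code + same input: the second call is served from the cache.
out1, ran1 = run_step("wordcount-v1", b"hello world", lambda d: len(d.split()))
out2, ran2 = run_step("wordcount-v1", b"hello world", lambda d: len(d.split()))
# Changing the code (or the input) produces a new hash and a real run.
out3, ran3 = run_step("wordcount-v2", b"hello world", lambda d: len(d.split()))
```

Because a pipeline only ever processes hashes it hasn't seen, new data flowing into a repo is processed incrementally, which is what makes every pipeline "streaming by default."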
Sorry we've been less than enthusiastic in addressing these niggling issues. The truth is it's pretty hard to be enthusiastic about issues in specific deployment scenarios. AFAIK the Google Cloud Storage library we're using doesn't support service accounts, and the whole library has been deprecated; it's the second GCS library that's been deprecated underneath us. So upgrading is a bit of a pain, but the bigger pain is that users with existing Google clusters will have their clusters break under them unless we can figure out a way to get scopes and service accounts to both work. And we have every reason to believe that all of this will be deprecated within the next 6 months and we'll be on to our 4th GCS library. And that's only one deployment scenario; we're doing the same thing for AWS, Azure, and a host of local deployment options.
Anyways, these are our problems to solve, and we will solve them. We're thrilled that you want to use Pachyderm and we want to make that experience as smooth as possible. We just need to ask for a bit of patience regarding these deployment scenarios.
Thanks for responding. First of all, I don't buy that "deployment scenarios" shouldn't be a primary concern of yours. Both Pachyderm and Kubernetes are frameworks for deploying code.
Secondly, it's not like Google Cloud Storage is some kind of esoteric platform. You decided to support GCS, and with that comes the expectation of up-to-date support. The devil is in the details, and that's where you have to be diligent about ironing out the bumps.
Thirdly, while it's absolutely frustrating that Google shuffles the Go SDK around so much, it seems that the surface area you need to cover here is rather small, and GCS itself hasn't changed. As far as I can tell, the fix for #2538 requires just a single line change. (I'll comment on the issue.)
That said, I'm sure this is just a temporary speed bump while Pachyderm matures. I'm looking forward to using it.
As for streaming, I know that Pachyderm streams the data internally, but I was referring to real-time stream processing à la Spark, as opposed to batch jobs. From what I can tell, Pachyderm is not only batch-job-based, but its data model is based around files, so it's not suitable for situations where you have a continuous, fast, unbounded stream of data (e.g. analytics events).
Do you guys have any deployment docs or Kubernetes deployment for on-prem Pachyderm? I tried setting it up a while back on our cluster and it was pretty challenging.
Hopefully that should work for you. Deploying on-prem can mean a lot of different things, so some paths will work better than others. The most common on-prem deployments we see are backed by Minio.
Great, thanks for the info. Admittedly, this was a while back and I know docs change quickly in this space.
I agree that on-prem can be a challenge. You really have to know your stuff, and your environment. We run it in production and it has been quite a journey.
Deploying Pachyderm on-prem is very similar to deployment on a cloud provider, you just need Kubernetes and an object store. The challenging part in my experience is getting an on-prem Kubernetes cluster, but this is well-tested territory for Kubernetes these days. We point people towards this guide to figure out the best Kubernetes deployment option for their infra:
As much as I like kubernetes, I find the whole ecosystem to be frustratingly immature and unreliable. I'm sure it's trending in the right direction with respect to reliability, but it just doesn't feel like it's there yet. Minikube seems especially buggy and is missing some obvious features.
Using it on GKE. Literally no tricky parts except figuring out how to keep my images on GCR in sync with my manifests. For that this tool is quite helpful: https://github.com/dminkovsky/kube-cloud-build.
Otherwise it's been such a pleasure. PVs and StatefulSets are a total game changer. Looking forward to 2018.
EDIT: I should say RBAC has been a little frustrating. But obviously it's for the better.