From Google to the world: the Kubernetes origin story (googleblog.com)
159 points by striking on Aug 5, 2018 | 52 comments



It's an interesting article, though I kinda feel like we're missing a good chunk of the story. As dominant as AWS is now, it was probably even more dominant in the time period we're talking about, since it had gotten a really big head start. And every time that AWS rolled out some new proprietary feature, it tied users even more tightly to the AWS platform.

IMO, what Google realized they needed was a technology to prevent people from getting more dependent on AWS. Being open source is key to that goal - if they just offer another proprietary service, it's hard to go to someone and tell them not to use an AWS proprietary service but to instead use a GCP one. However, with Kubernetes being open, users can host their own cluster, which makes it easy to move between cloud providers. For customers on AWS, I feel like this provides a bit of an incentive not to use too many AWS proprietary services, or they lose the advantage of mobility. For everyone else, GCP gets to offer the strongest Kubernetes implementation and draw customers in that way.

It's pretty win-win for Google. The article talks about the decision to open source Kubernetes as some sort of decision for the betterment of the world. But if that were really the only reason, why not open source Spanner or BigTable? And that's not to say that Google did anything wrong - corporations are designed to make decisions that are in their own best interests. What I think is interesting is that this is a really powerful example of how sometimes things that are beneficial to the world AND to business interests line up, and when they do, kinda cool things can happen.


>why not open source Spanner [or] BigTable?

Because it's genuinely difficult to open source things like that. They're way too tied to Google and have many closed-source dependencies on internal libraries and services. I think they regret missing the opportunity on some of these, and that may be why Google has started open sourcing more 'core' things like gRPC and Bazel - having the core pieces in the open might make it easier next time.


Totally! You could steal any big tech company's entire source code and it wouldn't do you much good in replicating it. You'd quickly realize most of it depends on thousands of internal services, maintained from that same source tree, that all depend on each other - and if each one isn't running on the specific infrastructure it was designed for, it won't work. Also, all those services depend on the company's internal, company-specific server hardware and network designs, which take years to plan, buy, and build - and by the time you did, they'd be out of date.


I think that is an underappreciated point. What makes a lot of open source software productive only exists as proprietary code, infrastructure, and knowledge guarded as a competitive advantage at large companies. I have never really understood why companies like that get so much open source credit, while someone independently trying to make a living shipping a proprietary product for an open source system - making that system more attractive - gets a lot of hate.


Google does not use gRPC internally; like k8s, it is an open re-implementation of an internal Google technology. Also, like k8s and Bazel, it is less advanced and lower performance than the internal technology.

At a high level, Google is open sourcing things that have genuine value but also make integration with its money-making services easier.


Note: mix of facts and opinions ahead; the facts mostly boil down to "all of these technologies _are_ used within Google"; the opinions are my own thoughts on what our internal future looks like. Other Googlers may see a different future or different reasons for the status quo :)

We do use gRPC internally, but the bulk of services were built prior to its existence/maturity and use its predecessors. My expectation is that services will eventually migrate, but that's a long way out. I'd say it's less to do with performance than a clear reason for teams to migrate, and several compelling reasons not to (yet).

Bazel is mostly a subset of Blaze, and many packages would "just work" if pulled into a Bazel workspace. Those that wouldn't are mostly broken because they rely on proprietary (but in many cases deprecated) Blaze features. I see a handful of changes go by every week migrating packages to newer Bazel-friendly definitions (a big one here is the historical handling of proto_library rules).

Kubernetes is used for some things built on GCP, but that's mostly suitable for green-field work (it's easier to integrate with existing things on Borg if you just run on Borg).


Eh, you shouldn’t use whether an open sourced (Google-driven) equivalent to an internal Google project is widely used internally as any sort of guide to which is better or worse.

Google SWEs, like most software engineers, have a lot of work to do, and “learn a new framework that is sort of the same as the framework you’re using now and convert the entire code base to it” is pretty low on anyone’s list. We all have to do it from time to time, but you usually wait until you’ve had a few new employees learn the new framework and use it on a couple of smaller new projects, and then wait until you’re doing a big enough rewrite that the framework change is noise.

My guess is that the major differences between an open source Google project and an equivalent internal one are (1) mistakes that are too deep in the original design are corrected, (2) some edge-case features are not immediately available, (3) the software is written to be run without a team of 10 SREs managing it, and (4) some novel bugs are created that will take a few iterations to correct.

I’d be surprised if there was a significant “performance” difference; the people writing it will probably be able to benefit from the experience on the first implementation to not make too many mistakes (and they have solid performance targets to shoot for, as opposed to not knowing what an efficient implementation should look like).

Everyone complains about (2) and (4), which are the inevitable result of developing software in the open, but you cannot imagine how wonderful (1) is. I use k8s for my projects at home and borg for my projects at work, and dear god, the configuration language differences alone. (Don’t spit into the wind; don’t tug on Superman’s cape; don’t become the borg configuration expert for your team.)
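
For anyone who hasn't seen the k8s half of that comparison: it's plain declarative YAML. A minimal, purely illustrative pod spec might look like this (the name and image are made up):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app         # hypothetical name
    spec:
      containers:
      - name: app
        image: example/app:1.0  # hypothetical image
        ports:
        - containerPort: 8080

(BCL itself is internal, so the Borg half of the comparison can't be shown here.)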


You implied this, but Google does use k8s internally. I saw a recent tweet where two employees (both kubernetes maintainers) said it's used more and more.


> Also, like k8s & Bazel, [gRPC] is less advanced and lower performance than their internal technology.

Are you sure about that? As far as I know gRPC is literally just as good as Bazel, which is why Google is even migrating to it internally.

For instance this comment agrees with me: https://news.ycombinator.com/item?id=12348286


Note that Bazel is a build system (https://bazel.build). I believe you are confusing it with Stubby. That's not to say that the general thrust of what you're saying is necessarily wrong though.


I think you are misreading that comment. He's saying that Stubby is fast, not gRPC. In fact, performance is the big unknown with gRPC adoption within Google. It definitely isn't on par with Stubby today, and it has to get there before anyone significant will switch to it.


I can't comment on raw numbers (because I simply don't have them) but at least for the service I work on, replacing Stubby with gRPC wouldn't really move the needle even if it was 2-3x slower (it might be faster, this is just for illustration) -- we spend our time waiting on IO from other services or crunching numbers in the CPU. Being a Java service, gRPC/Java might well be just as fast or faster than Stubby Java, but I could understand that Stubby C++ has been hyperoptimized over the years vs. gRPC C core which might have a ways to go. By the latest performance dashboard [1, 2], gRPC/Java is leading the pack but gRPC C++ doesn't seem like it's slouching too much either. I seem to remember the C++ impl crushing Java at performance a while back, so I'm sure that'll change in the future.

Honestly though? It'd take a _very_ demanding workload for your RPC system to be the bottleneck (so long as the implementations are within constant factors of each other). There are services like that, but they're the exception, not the norm. Most services don't need to do 100kQPS/task. Even then, at that point you're spending a lot of time on serialization/deserialization, auth, logging, etc. Your service is more than its communication layer; even if that layer is important to optimize, it's still just a minor constant factor.

The real problem is inertia. There's a lot of code/tools/patterns built up around Stubby and the semantics of Stubby (including all its features which likely haven't been ported to gRPC yet) and that's difficult to overcome.

Our #1 use of gRPC so far, I would imagine, is at the edge. gRPC is making its way into Android apps, since it's pretty trivial for translating proxies to convert gRPC calls to Stubby calls more or less 1:1.

[1] https://performance-dot-grpc-testing.appspot.com/explore?das...

[2] https://performance-dot-grpc-testing.appspot.com/explore?das...


You and I seem to be using a different denominator to quantify "most" services. I'm thinking of it as "most" in terms of who has all the resources / budget. You seem to be thinking of it in terms of sheer number of services or engineers working on them. The fact is that the highly demanding services have the huge majority of the resources, and are the most sensitive to performance issues. If your service uses 10% of Google's datacenter space, you won't accept a 5% or even 1% regression just so you can port to gRPC, because at that scale your team can just staff someone or even several people to maintain the pre-gRPC system forever and still come out ahead on the budget.

Totally agree that world-facing APIs will all be gRPC and that makes perfect sense to me.


> You seem to be thinking of it in terms of sheer number of services or engineers working on them.

I'm not sure where I said that, but yes, that's part of the switching cost.

> The fact is that the highly demanding services have the huge majority of the resources, and are the most sensitive to performance issues. If your service uses 10% of Google's datacenter space, you won't accept a 5% or even 1% regression just so you can port to gRPC,

The thrust of my statement was that for many services, RPC overhead is minimal. So even a 2x or 3x increase in RPC overhead is still minimal. I agree, a 5% increase in resource utilization for a large service is something that would be weighed. But let's explore that idea for a moment:

> because at that scale your team can just staff someone or even several people to maintain the pre-gRPC system forever and still come out ahead on the budget.

Not necessarily. Engineers are expensive and becoming ever more expensive, while computing resources keep getting cheaper. Not only that, but engineers tend to be specialized, so you can't just task anyone with maintaining the previous system; it tends to be people with deep expertise already. And those people also have career aims beyond long-term support of a deprecated system, so there's retention to be considered.

Pretending for a moment that all your services except a small handful have moved from some system A to some system B: if the maintenance burden of keeping system A alive starts to eclipse the resource cost of moving to system B (a cost which decreases all the time, due to improvements in system B, the increasing cost of maintaining system A, and the monotonic reduction in computing resource cost), then you might well just swallow the 5%-10% increase in resources, either permanently or temporarily, and come out ahead in the end.

Additionally, as system B moves on, staying on system A becomes increasingly risky: security improvements, features, layers which don't know about system A anymore all threaten the stability of your service. If you've checked out the SRE book, you'll know that our SLOs are more important than any one resource. If nobody trusts your service to operate, then they won't use it and then you won't have to worry about resources anymore since the users will have moved on.

> because at that scale your team can just staff someone or even several people to maintain the pre-gRPC system forever and still come out ahead on the budget.

To reiterate the point above, these roles tend to be fairly specialized and hard to staff. Arguably these same engineers are better tasked making system B good enough to switch to so you can thank system A for its service and show it the door.

Bringing this back to Stubby vs. gRPC, it's a pretty academic argument so far. They're both here to stay. And honestly, when we say "Stubby" there are already different versions of Stubby which interoperate with each other, and gRPC will not be any different. Likewise, we still use proto1 in addition to proto2 and proto3 (the public versions), since migrating just takes time and energy.

We do make these kinds of decisions every day, and it's not always in favor of reduced resources. If we cared for nothing other than resource utilization, we'd be completely C++ - no Java, no Python. Realistically, when systems with equivalent roles compete, one or the other wins out, usually in favor of maintainability so long as their feature sets are roughly equivalent. We're fortunate to be in a position where we can choose code health and uniformity of vision over absolute minimum resource utilization. And again, even if we choose system B (higher resources) over system A, differences in architecture or design choices may mean that system B's absolute performance ceiling ends up higher than system A's, despite starting lower. Sometimes it takes a critical mass of adopters to really shake out all those issues.

I know that quotes from Knuth are often trotted out during these kinds of discussions, but it's true: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

That 3% is where we choose to spend our effort, and that critical 3% includes the ability of our engineering force to make forward progress and not be hindered by too much debt. It also includes real data; check out Google-Wide Profiling [1].

> Totally agree that world-facing APIs will all be gRPC and that makes perfect sense to me.

Probably not all. We still fully support HTTP/JSON APIs, but at least in our little corner of the world we've chosen to take full advantage of gRPC.

Anyways, thanks for letting me stand on my soapbox for a bit.

[1] https://storage.googleapis.com/pub-tools-public-publication-...


Interesting that you allude to the coexistence of C++, Java, Python, and Go, because I think this bolsters my point. The overwhelming majority of services at Google are in C++. There are individual C++ services that consume more resources than all Java products combined. I think this speaks to the appetite for performance and efficiency within the company, since C++ is demonstrably the most difficult of these languages.


Good points. But they also faced the same issue with Borg - presumably it was hard to open source, so they created Kubernetes instead. There is nothing stopping Google from doing the same thing with Spanner or BigTable. Such a move would probably be well received by the community, and it seems likely that people at Google would want to work on such projects - AIUI, CockroachDB was created by ex-Googlers. (And I didn't mean to focus on those technologies too specifically - with HBase and CockroachDB already existing, maybe Google wouldn't want to compete - but I suspect they have other technologies for which open source versions would be well received, and they choose not to create them.)

My point isn't that Google could directly open source those technologies. Nor is it that they should create open source versions of them. I'm more just musing on how when the selfish reasons to do something line up with the good reasons to do something, you tend to get cool results.

At a previous job, we had this process that everyone was supposed to do via a script. Doing the process via the script meant we got the logging that we wanted and such. However, there were no technical controls preventing anyone from doing that process by hand - we just told people not to. So, one day, a colleague asks me how to make sure that everyone does the process the right way. And my answer is that everyone already does - the script is way easier to run than doing it by hand. The good thing to do (run it the way you were told to) lined up with the selfish reason (it was way easier that way). Making sure those two things were in alignment was effectively as powerful as making it impossible to do the wrong way. (Since then, the process has been significantly changed.)

I feel like it's a similar case here. The good thing to do was to create a cool technology and make it free for everyone. The selfish thing to do was to create a technology that made business sense. The result is that Kubernetes was created as open source (at least, that's my theory). If it hadn't been the good thing to do, maybe the person writing the article never talks to the VP on the shuttle about it and it never happens. If it hadn't been the selfish thing to do (no business case), maybe the VP wouldn't have approved it. But getting those two things aligned produces positive reinforcement and results.


They mention some of the benefits of open sourcing in this article:

>We always believed that open-sourcing Kubernetes was the right way to go, bringing many benefits to the project. For one, feedback loops were essentially instantaneous — if there was a problem or something didn’t work quite right, we knew about it immediately. But most importantly, we were able to work with lots of great engineers, many of whom really understood the needs of businesses who would benefit from deploying containers (have a look at the Kubernetes blog for perspectives from some of the early contributors). It was a virtuous cycle: the work of talented engineers led to more interest in the project, which further increased the rate of improvement and usage.

Of course, open sourcing software was never about charity - but I do think that it was always a community thing, and big companies choosing to be a part of that community are neither selfish nor selfless for simply participating.


> Of course, open sourcing software was never about charity - but I do think that it was always a community thing, and big companies choosing to be a part of that community are neither selfish nor selfless for simply participating.

Strongly agree. And to be clear, I didn't use the term "selfish" as a pejorative - I think it's pretty reasonable to expect any entity to analyze how a decision will directly benefit them and to act accordingly.


Disagree. Take a look at CockroachDB, mostly based on Spanner.

Google open sourcing Spanner would have helped tremendously.


Good points, I have my own inferences too:

At the time of k8s' inception, GCP was behind AWS in UX and features. App Engine had (very arguably) failed, falling well short of its deserved potential. The allure of Kubernetes tided people over while Google played catch-up. Now many of Google's services are superior to AWS's (very arguably). Developers notice the better network performance, the improved GCP console UI, the GCP mobile app, etc.

Now with Kubernetes maturing to the point of production usability, Google's sales team is starting to outperform AWS's. AWS will have dominant market share for many years to come, but GCP is in the driver's seat.

The generous free tier of GCP (both the $300 credits and the free tiers of various services) is a big help as well.


> The article talks about the decision to open source Kubernetes as some sort of decision for the betterment of the world.

I've never liked that Google feels the need to couch everything in making the world better.

Their open source strategy has mainly been a series of business moves to stop competitive threats. Android was about stopping Apple from dominating mobile OSs. Kubernetes was about stopping AWS from dominating cloud. TensorFlow was about making sure Nvidia didn't dominate deep learning.


Google purchased Android before the iPhone was created. I agree that open source is good business, but some of the causes you listed aren't the whole story.

Also, they learned a lesson from Hadoop. If they'd open-sourced Google MapReduce, the cloud industry might look very different.


This is precisely on point and is a proven strategy for Google. Chrome is another example.

Almost all their big open source projects are to mitigate existing threats, and it’s working out very well for them.

It's annoying to see them trying to frame it as a "public service", but I guess that's how to do proper PR.


I don't get any odd vibe like that from this article. They are pretty upfront about the business motivations in this post:

> with the launch of our Infrastructure-as-a-Service platform Google Compute Engine, we noticed an interesting problem: customers were paying for a lot of CPUs, but their utilization rates were extremely low because they were running VMs. We knew we had an internal solution for this.

Later,

> most importantly, we were able to work with lots of great engineers, many of whom really understood the needs of businesses who would benefit from deploying containers

That does not read like a decision to pursue this for the betterment of the world. It sounds like a decision to pursue this for the betterment of their customers. That's capitalism, not benevolence, and they're pretty transparent about it.


Yup. Kubernetes is a clear platform play to commoditize AWS, their biggest competitor in the cloud platform space. The strategy wouldn't have worked without open sourcing it, because it relied on people building k8s clusters on AWS, even if AWS didn't support it directly.

Leaving out the major business rationale for k8s makes the article seem childish.


Agreed! This was also explained nicely in https://stratechery.com/2016/how-google-cloud-platform-is-ch...


LevelDB with a server wrapped around it is substantially like Bigtable.


If you really want Bigtable, go use Apache HBase. It's very mature these days and performed really well at $dayjob-1.

There's a lot that goes into turning LevelDB into a real database. :)


Well written.


I am a bit tired of these VP-speak articles. So what really was the discussion with Eric Brewer? What arguments made him suddenly agree to open sourcing? Why had Urs rejected this before? Why are you not allowed to write about this?

This whole article is like "I talked to my VP and convinced him to open source an internal tool. We at Google are the best, everything we have built is the best, and here's the link for a free trial."

And how the heck did you get that "number of years' worth of coding" number?


Yeah, I felt like barfing when he said "Good ideas usually win out at Google, and we were convinced this was a good idea." It's like, dude, you're using an anecdote about running into a VP on a shuttle ride to support this? Google is such a magic non-political place.


Most business ideas aren't rejected for any kind of compact reason. The decisionmaker typically has other things that are higher priority, or they just kinda avoid giving a green light until it's interpreted as a rejection.


> ... And, on top of that, you want to open source it?

(speculative continuation)

> But if we do that and make people believe that containers - the massive, heavy, broken abstraction - are the future, and provide complicated infrastructure that will magically fix the problems of containers, and then also provide this complicated infrastructure managed - this will be our way to beat AWS. It will be difficult and clumsy to set it up on your own - you need many machines to set up the cluster so that it's "Google-scalable" and "fault-tolerant", so for most people and companies it will be way too much hassle and too expensive to manage their own cluster on plain VMs or physical instances. We "just" provide you with the best managed infra, because c'mon - it will be open source, and we'll even give up control on paper - but everyone will associate it with us, and we will make sure to have podcasts, and blogs, and marketing that talk about containers, the future, and how G created this project. The best people will help us build it "out in the open", and then when we hire them, it will be easy to teach them this Borg monstrosity that we have here. So you know, win-win-win - devs think they solved their Docker-is-shit problem with the magic of Kubernetes (yeah, I got a name for it already), so G is now the savior; it's a win for GCP; it's a win for hiring.

> [Urs]: Now we talkin...


That made me chuckle. I’m afraid it might even be half true. Someone should write a good blog post on how Kubernetes is a tech marketing success story. Same could be said for Docker of course.


> A turning point was a fateful shuttle ride where I found myself sitting next to Eric Brewer, VP of Cloud, [...] Soon after, we got the green light from Urs.

This seems like a red flag for management at Google, if the best way to pitch ideas up the chain is hoping you can ambush someone important on the company shuttle.


You say red flag, others might say 'how the world really works'. There's a lot more luck and talking to the right person than we'd like to admit in this world.


To me "this is a red flag" and "how the world really works" are not incompatible statement.


The strategy appears to be working, I notice GitLab announced a migration from Azure to GCP and the reason given is to have better Kubernetes support. https://about.gitlab.com/2018/06/25/moving-to-gcp/


I'm sure the fact that Google Ventures is, afaik, GitLab's biggest investor has a lot to do with it too.


GV operates independently of Google (part of why the name officially changed from "Google Ventures" to GV). GitLab partners with and supports multiple clouds - e.g. GitLab announced official support for Amazon EKS, and IBM Cloud runs GitLab Core for version control.


GV isn't truly independent. Best way to describe it is that multiple companies in Alphabet have access to Google's tech stack and resource manager. In the early days, I advised against GV trying to get its funded companies to GCP because it wasn't mature enough and I wanted the startups to just be successful, not be political statements about Google's cloud.

That changed over the past few years; now I would easily recommend that startups funded by GV use Google Cloud if they wanted to.


Title needs a (2016).


I am soon to venture into the SaaS business. Should I be interested in Kubernetes? Is the old way of doing things still relevant?


If you are indie, avoid it until you have 100 users and some revenue.

The last thing you want is to be fiddling with k8s when you have 3 whole users and $0 revenue.

Unless your niche strongly requires fancy architecture, start with a $5 VPS and take it up from there.

You will never look back and think "shit if only I had used k8s on day one it wouldn't have failed".

More often it's closer to "shit if I had focused on non-tech stuff more maybe it would have gone somewhere ... instead I spent 2 months fiddling with YAML files".


I strongly disagree. Initially I was reluctant myself; however, once I read a bit more about it, it turned out to be fairly simple for simple things and only a bit more complex for very complex stuff. It's no harder or more involved than setting up a VM on your own. The cost is also pretty much the same at small scale, as you basically get free k8s masters by now.

Also, if you host a bunch of side projects, it's actually better in terms of resource utilization and separation of projects: k8s dynamically schedules pods based on resource requests across the nodes it has at hand, so you can host several fully separate projects on a single node or a handful of nodes, depending on requirements.
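
As a concrete sketch (all names and numbers made up): the per-project ask is just a resources stanza on the container, and the scheduler bin-packs pods onto whatever nodes it has:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: side-project              # hypothetical project
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: side-project
      template:
        metadata:
          labels:
            app: side-project
        spec:
          containers:
          - name: web
            image: example/side-project:latest  # hypothetical image
            resources:
              requests:               # what the scheduler bin-packs on
                cpu: 100m             # a tenth of a core
                memory: 128Mi
              limits:                 # hard caps to keep neighbors safe
                cpu: 500m
                memory: 256Mi

A few projects with requests like these fit comfortably on a single small node, each still isolated in its own pod.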


This completely depends on your business. Are you running many cloud based workloads in a Dockerized manner? Are you having trouble orchestrating, monitoring and scaling those workloads? Then use Kubernetes. In most if not all other cases, Heroku will probably get you through the first couple of years.


I would look hard at k8s or serverless. I went with k8s for my SaaS business and it is a huge competitive edge.


What do you mean by "competitive edge"? Do you regret choosing it, and why?


Competitive edge. Like it helps me beat my competition. My competitors deploy once a week. I deploy 30x a day. I get CI/CD for near free, which is amazing. All else equal between us, this will let me beat them to market.
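
For what it's worth, the bit that makes deploying 30x a day cheap is the rolling update built into a Deployment. A minimal sketch (the service name, replica counts, and image tag are all made up):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: saas-api                  # hypothetical service
    spec:
      replicas: 4
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1           # keep 3 of 4 pods serving mid-deploy
          maxSurge: 1                 # allow 1 extra pod for the new version
      selector:
        matchLabels:
          app: saas-api
      template:
        metadata:
          labels:
            app: saas-api
        spec:
          containers:
          - name: api
            image: example/saas-api:v123   # CI stamps a new tag per deploy

Each deploy then amounts to pointing the Deployment at the new image tag and letting k8s swap pods out gradually.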

The opposite of regrets. I suggest k8s or serverless for almost anyone...


Wait, so Kubernetes is Google's take on Docker, invented independently?


No, it's Google's take on container/compute task scheduling - deciding when and on which machines a task (e.g. a Docker container) should run. Docker Swarm is Docker's offering in the same category as Kubernetes.

Edit: Though Google could be credited with introducing containers to Linux, as they added the main missing piece, cgroups, to the Linux kernel.


Cgroups are not container-specific; they are used to limit the resources (CPU, memory, etc.) given to processes. Namespaces are what make Linux containers possible.



