Common mistakes using Kubernetes (pipetail.io)
549 points by marekaf on May 17, 2020 | hide | past | favorite | 140 comments



I wish there was a way to upvote something 10x once a month here. This would be the post I use that on.

When I was writing my book my editor asked me to remove any writing about mistakes and changes I made in the project for each chapter. I had a bug that appeared and I wanted to write about how I determined that and fixed it. They said the reader wants to see an expert talking, as if experts never make mistakes or need to shift from one tact to another.

But, I find I learn the most from explanations that share how your mental model was wrong initially and how you figured it out and how you did it "more right" the next time.

That's really how people build things.


>They said the reader wants to see an expert talking, as if experts never make mistakes or need to shift from one tact to another.

Your editor was very fucking wrong.


So, so wrong. How did (s)he think experts get so good? Isn't the phrase `the master has failed more times than the apprentice has even tried` well known for a reason?


> the master has failed more times than the apprentice has even tried

I've never actually heard that one before, but it's so very, very true.


It obviously depends on the type of the book and the reader's expectations. It just might not have been the type of book where you write about things like that.


>It just might not have been the type of book where you write about things like that.

I think that'd be more of the author's choice, instead of the editor.


> Your editor was very fucking wrong.

The editor is completely right in what they were saying. You just want them to be wrong, because you'd prefer to live in the fantasy world where they are wrong.

Let's say you go to get a surgery. You don't want the doctor to tell you about all the times they fucked up and what the awful consequences were. It doesn't matter that they're probably a better surgeon now, having learned from their mistakes. Psychologically, you need that person with the sharp tool poking around inside your body to be a superhuman.

To a lesser degree, the same is true for any expert. Of course everybody makes mistakes. Notice the de-personalization in the word "everybody". You can talk about the mistakes everybody makes, or those ones that many people make. If you talk about your own mistakes however, you lose the superhuman status. There may be a few situations where that somehow helps you, but not when you want to sell books.


I sure do want that surgeon to have been presented/instructed on the common ways that surgeons before them have made mistakes and how to avoid or overcome them.

That they won’t tell me (the patient) is quite a different question from whether they got the material from someone more experienced in their primary or continuing medical education.


The situation is different, the psychological effect is not.


> Let's say you go to get a surgery.

Shouldn't that be more like "Let's say you're learning to be a surgeon."?

For that situation, the person they're learning from discussing problems they hit and how they solved them does sound like it would be very useful.


> Shouldn't that be more like "Let's say you're learning to be a surgeon."?

No. It is an exaggerated example, because I am illustrating a psychological effect.

The point is that you are consulting an expert, and you don't want that expert to tell you about all their mistakes, because your mind is now occupied with doubting that expert. It's not rational.

> For that situation, the person they're learning from discussing problems they hit and how they solved them does sound like it would be very useful.

It's useful to present common mistakes, but mentioning who made the mistake is both irrelevant and damaging.


>The point is that you are consulting an expert, and you don't want that expert to tell you about all their mistakes, because your mind is now occupied with doubting that expert. It's not rational.

Sometimes books are written for entry level people (to the technology or topic) and they're written in an authoritative tone. But I work with people who write books. And I read their books. I don't read them like some kind of expert oracle distributing the blessed texts. I read them as a work of documentation from a friend or colleague, so I don't really feel this veneer of expertise is necessary.

Sometimes experts try hard to scrape away the veneer of expertise by adding side fluff, like in Learn You a Haskell for Great Good. Or Alex Crichton, who is undoubtedly an expert in Rust and compiler technology, giving this talk where he keeps saying things like "it does a process called quasiquoting which doesn't make any sense to me" and "I don't know what hygiene means":

https://www.youtube.com/watch?v=g4SYTOc8fL0


> But I work with people who write books. And I read their books. I don't read them like some kind of expert oracle distributing the blessed texts. I read them as a work of documentation from a friend or colleague, so I don't really feel this veneer of expertise is necessary.

Yes, you feel that way. You feel like you're a rational person who values humility, whose judgement isn't clouded by the effects that I am describing.

Well, maybe you really are that person. It doesn't matter. We cannot assume that the buyers are such noble beings.


> The point is that you are consulting an expert, and you don't want that expert to tell you about all their mistakes, because your mind is now occupied with doubting that expert. It's not rational.

In the situation where I'm consulting an expert - hired to fix a problem or find a solution - then I'd tend to agree with you.

But when I'm reading text from a person in my field in a subject area of interest (eg learning from a book), then I definitely want to hear of things that went wrong, and how they fixed them.

I think they're two different situations, and the one you described above isn't the correct one for a person who's bought a book to learn from.

It could just be us valuing different things in our learning though. :)


A patient is not consulting an expert. A medical student is. Who would be very keen to hear some war stories.


I wonder if we're on the same page. Should a book on a programming language discuss compilation errors and how to interpret them? Should a book like Effective C++ (afaik the most popular series of books on C++) exist?


The answer (yes) is already in my reply, in the second paragraph.


The book is not presenting to the patient, the book is presenting to another surgeon. If I was a surgeon, I would find value in learning about previous mistakes, and thankfully in programming as a general rule the cost of error is lower (but still a valuable learning).


"The book is not presenting to the patient, the book is presenting to another surgeon"

I like this analogy and I agree that the previous person saying the publisher was right, completely missed the point.


Found the editor!


Most books frame this as "one common mistake is ...". I always understood this phrase to mean that the author has personally made this mistake more than once.


Agreed. The entire expertise of programming is based on the willingness to find these sorts of issues and fix them. Over and over, for your entire career.


Obviously, I agree with you!

I also think this is the way the majority of tech books are written.

Can you think of another where the author goes from mistake to mistake and then finally gets it right?

I believe there is a space in tech writing for this kind of writing, but it is not something traditional book publishers believe.

This was an O'Reilly book by the way, with really good editors and a really good editorial process.

That editor was right most of the time, IMHO.


>Can you think of another where the author goes from mistake to mistake and then finally gets it right?

Not a book, but Raymond Hettinger often presents in this way and it's fantastic: https://www.youtube.com/watch?v=wf-BqAjZb8M


Most (in-depth, technical) security books follow this sort of thought process, I think. "A Bug Hunter's Diary" comes to mind.


Ignition! A history of liquid rocket propellants.

The Little Lisper and The Little Schemer use a conversational style that discusses a lot of false starts.

{Coders|Founders} at Work has a lot of frank talk about early mistakes people make and their pivots.


s/tact/tack/

Tact means skill in dealing with other people, particularly in sensitive situations.

Tack has to do with sailing boats into the wind, and crucially "changing tack" means changing direction.


Thanks, I'm glad I had you as an editor this time! I never noticed that before.


I hate this so much. But it's so common.

Many times in my career, I was told not to mention the caveats and instead render a confident image for the team.

It's like doctors prescribing drugs to their patients without mentioning any side effects.

It reminds me of the video[0] in which the developer is asked to draw red lines with blue ink, while the project manager keeps pushing him: "You're the expert, of course you can draw red lines with blue ink!"

[0] https://www.youtube.com/watch?v=BKorP55Aqvg


> asked me to remove any writing about mistakes and changes I made in the project for each chapter. I had a bug that appeared and I wanted to write about how I determined that and fixed it.

Your editor is right. Not because this information is not interesting.

But it is distracting and off-topic to go off on a tangent every now and then. Especially for someone just learning the material, it will confuse them more than anything else.


Years (and years) ago, if I wanted to learn a new computer technology or language, I would pick up a book and learn it.

One language that didn't go how I expected was AppleScript.

The book on AppleScript (from O'Reilly, I believe) was necessarily different.

Instead of writing "normal" top-down programs, AppleScript hooks into the OS from the side.

Many books can concentrate on what you can do, but the AppleScript book had to do everything by example. This is because all of Apple's applications expose interfaces that you have to figure out on the fly.

I had to learn a different way from that book, because it necessarily concentrated on "how" instead of "what".


personally... I'd like to see that sort of info, but typically not in the middle of a chapter/section. make a note or call out in the text pointing to a section on why/how you got to the 'correct' position. That info is often helpful info, but can disrupt the flow of the 'good' information.


In my opinion, the most common mistake is not in the article : using kubernetes when you don't need to.

Kubernetes has a lot of pros on paper, but in practice it's not worth it for most small and medium companies.


This was my thought exactly. The article is great assuming you need to use k8s, but does leave out the important question: does your project or product require k8s and all the overhead it unavoidably entails?

Amazon's Elastic Container Service (ECS) with the Fargate deployment type is probably a better option much of the time. Until you've maintained your own k8s cluster (including the hosted variants on AWS, GCP, etc.), you might not realize how complex configuring k8s is.

While AWS's ECS service may be more limited, I've found it leaves less room to do the wrong thing. Unfortunately, the documentation and community support for ECS are inferior to k8s's, but I'll accept that if I don't have to spend a whole day researching which service mesh to use with k8s and how to configure a load balancer and SSL certs.


You are right, but talking about whether to use k8s or not would make the article 2-3x longer. It is a totally different topic.

ECS (with Fargate or EC2) is awesome, but one might argue that it is still overkill for a lot of people.

Some people look for scaling, immutability, effective resource utilization and they think they NEED containers. That is not entirely true. A lot of people would do totally ok with one EC2 instance.

Either way, kubernetes is awesome, there is a big community behind it maintaining a lot of tooling, docs, how-to guides, etc. That is very valuable for a lot of people too :)


> You are right, but talking about whether to use k8s or not would make the article 2-3x longer. It is a totally different topic.

Sorry—my comment was written poorly. I think the article is good as is and agree k8s or not is a different topic. My comment intended to be a reminder for all of us reading it that we should ask more questions before choosing a technology.

I've noticed a trend in engineering (myself included): we sometimes forget to take a step back and see the forest.


Unless something has changed recently, Fargate is a terrible choice for something that you want up 24x7. IIRC it’s like 2x the cost of similar on demand EC2 instances. It’s better for discrete tasks that can’t run in Lambda because timeout or possibly peak load (although EC2 is probably just as easy if not easier to scale).

But I generally agree with the premise that ECS is a better starting point for people just dipping their toe into containers.


2x the line item cost but generally much cheaper in terms of Total Cost of Ownership.


Most of the time you don't even need a cluster of any kind. We live in a world where you can spin up a server with 3TB of RAM and 128 cores whenever you want, and it will cost you less than a senior developer.


https://aws.amazon.com/ec2/pricing/on-demand/

x1e.16xlarge, 64 CPU, 2TB of RAM, $10k per month.


So if you pay a senior developer less than 120k then the parent is right.


It costs the employer a lot more than just the salary to have a salaried developer.


What do you do when that one machine goes down?


Are there good resources for making that decision according to good criteria?


A simple rule of thumb is the number of services you have. For instance, my current employer has been working on getting everything on Kubernetes for the last year or so and we have two services... the frontend server-side renderer, and API.


/snark So is that a cluster per app or a namespace per app?


The criterion is basically: do you have the time, people, expertise and willingness to build a custom dev platform on top of Kubernetes?

If yes, Kubernetes might make sense for you, if your company is large enough that a PaaS like Heroku is prohibitively expensive.

If no, you don't want Kubernetes.

https://twitter.com/kelseyhightower/status/93525292372179353...


Do you also include managed kubernetes offerings, such as from Digital Ocean, in that assessment?


Honestly my experience has been that managed k8s is often more complicated from a developer perspective than just k8s - sure, you don't have to deal with setting it up, but you have to figure out how all the 'management' and provider-specific features work, and they often seem pretty clumsily integrated.


And in many cases features that would help your use case aren't enabled on that platform, or are in a release that's still a year or two from being supported on that platform... I'm looking at you, EKS.


I really wish cloud providers would offer a managed Docker Swarm service - it's so much simpler than k8s, but not having to manage the underlying machines would be great.

I might be remembering wrongly, but I think Azure actually used to offer multiple managed orchestration platforms, including DCOS and Swarm, before they went all-in on k8s.


Oh yes. Managed kubernetes is full of various issues. Some major cloud providers sell very poor managed kubernetes.

If someone knows about a reliable managed kubernetes, please let me know.


Can you elaborate on the cloud providers selling poorly managed k8s - are they all problematic? I have no experience with cloud provided k8s.


Not the OP but yes if cost is a factor. As far as I know no managed K8S offerings are cheap.


- DigitalOcean offers free clusters.

- Azure (still?) offers free clusters.

- GCP offers one free single-zone cluster.

If you need more than DO can offer in terms of compute instances, you can probably afford GKE/EKS, which is around $75/month.

---

Running a highly-available control plane (K8s masters & etcd) by yourself is NOT cheaper than using EKS.

To achieve high availability, EKS runs 3 masters and 3 etcd instances in different availability zones. Provisioning 3 t3.medium instances (4 GB of memory and 2 CPUs each) would cost the same as a completely managed EKS.

Not to mention the manual work needed to set up, maintain and upgrade such instances.


Well, nobody asked me, and I'm no expert, but here's my list of what (not) to do in Kubernetes (if I had the authority).

1. There. Is. No. Machine. (Insert Matrix meme here.) Before you open up your cluster to the rest of the company, drill it into them. Maybe even create a Google Form where they have to sign "I hereby acknowledge that there is no machine in k8s and any attempt to tie my job to a particular machine means a broken config by definition."

2. Thanks to 1, don't let anyone use hostNetwork, hostIPC, hostPID, hostPorts, host whatever, unless you have a really good reason to (with explicit approval process).

3. Don't let anybody start a job without memory/CPU limit. Make sure they understand that, if the job goes over the memory limit, it dies, and it's not k8s admin's problem.

4. You can't keep logs inside the pod - when the pod dies, the logs are gone. You can't log into the machine, either (see 1). Therefore, you really need some kind of logging framework that takes the logs from your pod and saves them, in raw form, somewhere safe (like S3). I don't know if there's any such framework, but there had better be.

5. Make sure every manual operation is logged (who did what to which job when), unless you like asking "@here Does anybody know who owns fooservice?" every month.

6. Kubernetes is not magic: if it takes thirty minutes to provision your service, fix that, instead of moving thirty minutes of manual provisioning into k8s and somehow expect it to be magically reliable.

7. Don't bring in existing dependencies uncritically. If your job connects to a zookeeper server to find out its peers, don't bring it into k8s as is, but rewrite it to use a k8s Service instead.

8. Take extra extra care when writing down your first job specification, because there are a lot of yaml files to write, and people will just copy what's already there. If your first k8s job mounts host /tmp directory just because you were testing something and forgot to delete the line, soon you will have fifty jobs all mounting host /tmp directory. Good luck figuring out which job actually needs it then.

Yeah, again, I'm by no means an expert - I'm not even an admin, so just consider the list as a rambling of some poor soul who has seen some stuff. Here be dragons, have fun.
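
Point 3 from the list above can be sketched as a spec fragment. This is only an illustration (the job name, image and numbers are all made up): a container that exceeds `limits.memory` is OOM-killed, exactly as the point warns.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox         # placeholder image
        command: ["sh", "-c", "echo done"]
        resources:
          requests:            # used for scheduling
            cpu: 100m
            memory: 256Mi
          limits:              # exceeding memory here -> OOMKilled
            cpu: "1"
            memory: 256Mi
```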


> 8. Take extra extra care when writing down your first job specification, because there are a lot of yaml files to write, and people will just copy what's already there. If your first k8s job mounts host /tmp directory just because you were testing something and forgot to delete the line, soon you will have fifty jobs all mounting host /tmp directory. Good luck figuring out which job actually needs it then.

This piece of advice is so underrated, it's hard to put into words. Every senior developer/architect should know that.


4. Centralized logging is a basic requirement in any company. A container simply logs to stdout (in Kubernetes) and a fluentd/logstash agent can forward it.

7. Sadly, if existing software can't run in Kubernetes, this severely limits the benefits of having Kubernetes - why use something that can't be used? If the jobs already have service discovery, they might be better off running on hostNetwork or whatever allows them to work as is.


Re: 7, if you use a separate discovery like zookeeper, then it will have its own idea of which jobs are alive and ready, and k8s will have its own idea of the same thing - they may not necessarily agree with each other. They don't even know the existence of each other. Maybe it would work, but it seems like more hassle than rewriting the offending part.

From what I've seen, Kubernetes is a quintessential Google product - it has a very particular idea of how jobs should be run, and the farther you stray away from it, the more it will cost your sanity. If that curtails the benefits of k8s for you, then yes, I think you should revisit whether you really need k8s, based on that limitation.

Just my two cents.


If we're talking about zookeeper for service discovery, the registering application can maintain a connection to zookeeper and ping every few seconds. So zookeeper has a very good idea of what's running or not, usually more accurate than kubernetes.

Actually, zookeeper has an API to register only when ready and to listen to events/changes. I'm not sure kubernetes has an equivalent stable API, so it might be hard to port over.

A rewrite might be a solution, except it's not, because it's doomed to fail. We're discussing service discovery, which implies multiple clients and servers, probably managed by different teams and written in different languages. The odds of completing a coordinated rewrite effort are abysmal. ^^

Well, I am thinking out loud. It's a real problem my company was facing. We've got kubernetes clusters that are supposed to run applications, and we've got apps using zookeeper that can't run in kubernetes because it breaks the service discovery. I will eventually have to hint to people how to make it happen, after one year of kubernetes hardly going anywhere.


These are all excellent ideas and honestly better than the article. You're being too modest.

We solve #2 (and more) by having a highly restrictive default PodSecurityPolicy. We started with a combination of GCE and OpenShift default examples (although we use neither) which are published on github. PSP lets admins and security relax a little at night.

Anything #3 is solved by setting highly restrictive LimitRanges in each namespace that must be overridden in the deployment specs.
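
A restrictive per-namespace LimitRange of the kind described here might look like this sketch (the namespace name and numbers are illustrative, not the commenter's actual values). Containers that don't declare their own resources get these defaults injected.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: restrictive-defaults
  namespace: team-a            # illustrative namespace
spec:
  limits:
  - type: Container
    default:                   # applied as the limit when none is set
      cpu: 100m
      memory: 128Mi
    defaultRequest:            # applied as the request when none is set
      cpu: 50m
      memory: 64Mi
```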

It's not mandatory that you override the defaults, but if you don't you're going to get poor performance. If you're fine with that the admins are also fine with that.

Sometimes some occasional throttling during spikes is totally acceptable - it all just depends on whether the app actually needs maximum performance. There are many other apps in the cluster that DO, and in the grand scheme of things they benefit by having the ones that DON'T get throttled.

Apparently Borg handles CPU differently (and better) than k8s in the multitenancy model, but this is the best "poor man's" borg I can come up with and it works for us.


Great post! If you're in the Kubernetes space for long enough, you'll see all of these configuration mistakes happening over and over again.

I've created a static code analyzer for Kubernetes objects, called kube-score, that can identify and prevent many of these issues. It checks for resource limits, probes, podAntiAffinities and much more.

1: https://github.com/zegl/kube-score


Excellent tool. Can this analyse the result from kustomize files rather than actual k8s YAML?


Yes - kustomize is not supported natively, but you can achieve the same effect by piping the kustomize output into kube-score:

    kustomize build | kube-score score -


Thanks. One more question. For Visual Studio Code, Microsoft has a plugin called Kubernetes - which I currently use. Have you done a comparison against that?


I actually disagree with the first recommendation as written - specifically, not to set a CPU resource request to a small amount. It's not always as harmful as it might sound to the novice.

It's important to understand that CPU resource requests are used for scheduling and not for limiting. As the author suggests, this can be an issue when there is CPU contention, but on the other hand, it might not be. That's because memory limits are even more important than CPU requests when scheduling: most applications use far more memory as a proportion of overall host resources than CPU.

Let's take an example. Suppose we have a 64GB worker node with 8 CPUs in it. Now suppose we have a number of pods to schedule on it, each with a memory limit of 2GB and a CPU request of 1 millicore (0.001CPU). On this node, we will be able to accommodate 32 such pods.
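
The hypothetical pod from this example could be written roughly as follows (names and image are placeholders). Note that the CPU request is tiny and there is deliberately no CPU limit, so the pod can burst:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busy-pod               # illustrative name
spec:
  containers:
  - name: app
    image: myapp:latest        # placeholder image
    resources:
      requests:
        cpu: 1m                # 0.001 CPU: scheduling weight only
        memory: 2Gi
      limits:
        memory: 2Gi            # memory is the real packing constraint
```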

Now suppose one of the pods gets busy. This pod can have all the idle CPU it wants! That's because it's a request and not a limit.

Now suppose all of the pods become fully CPU contended. The way the Linux scheduler works is that it will use the CPU request as a relative weight with respect to the other processes in the parent cgroup. It doesn't matter that they're small as an absolute value; what matters is their relative proportion. So if they're all 1 millicore, they will all get equal time. In this example, we have 32 pods and 8 CPUs, so under full contention, each will get 0.25 CPU shares.
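
The arithmetic in that paragraph can be sketched in a few lines of Python. This is a simplified model of CFS weighting under full contention, not the kernel's actual algorithm; the numbers mirror the 32-pod example above.

```python
def fair_share(requests_millicores, total_cpus):
    """Under full contention, each process gets CPU time proportional
    to its relative weight, regardless of the absolute request size."""
    total_weight = sum(requests_millicores)
    return [total_cpus * w / total_weight for w in requests_millicores]

# 32 pods, each requesting 1 millicore, on an 8-CPU node:
shares = fair_share([1] * 32, total_cpus=8)
print(shares[0])  # 0.25 CPUs each, as in the example above
```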

So when I talk to customers about resource planning, I actually usually recommend that they start with low CPU reservation, and optimize for memory consumption until their workloads dictate otherwise. It does happen that particularly greedy pods are out there, but that's not the typical case - and for those that are, they will often allocate all of a worker's CPUs in which case you might as well dedicate nodes to them and forget about how to micromanage the situation.


If you ask for 0.001 CPU shares, you might get exactly that. I would advise caution: if that pod gets scheduled on a node with another pod that asks for 4 CPUs and 100MB of memory, it's not going to get any time.


It depends. If the second pod requests 4 CPUs, it doesn't necessarily mean that the first pod can't use all the CPUs in the uncontended case.

A lot of this depends on policy and cooperation, which is true for any multitenant system. If the policy is that nobody requests CPU, then the behavior will be like an ordinary shared Linux server under load - the scheduler will manage it as fairly as possible. OTOH, if there are pods that are greedy and pods that are parsimonious in terms of their requests, the greedy pods will get the lion's share of the resources if it needs them.

The flip side of overallocating CPU requests is cost. This value is subtracted from the available resources, making the node unavailable to do other useful work. Most of the time I see customers making the opposite mistake - overallocating CPU requests so much that their overall CPU utilization is well under 25% during peak periods.


Most people would be thrilled to get anything close to 25% CPU utilization. I guess one of the big missing pieces from Borg that hasn't landed in k8s is node resource estimation. If you have a functional estimator, setting requests and limits becomes a bit less critical.


1000% agree. Former employer had a proprietary app scheduler that worked like this. We would frequently tell users to request as little CPU as possible. Extra CPU would be shared, but if you made an unreasonable request you’d never get scheduled in the shared environment.


I agree! I assumed one shouldn't trust anyone, given all the greedy pods they can schedule.

The example you are describing is probably not super common, but I will try to rephrase my blog post so that it reflects this comment. :)


Sorry, what is not super common? With my customers I rarely see incidents due to CPU starvation of a pod in their K8S clusters.


Great article, I've learned many of these firsthand and agree with their conclusions. I have some more reading to do on PDBs!

K8s is a powerful and complex tool that should only be used when needed. IMO you should be wary of using it if you're trying to host less than a dozen applications - unless it's for learning/testing purposes.

It's a complex beast with many footguns for the uninitiated. For those with the right problems, motivations and skillsets, k8s is the holy grail of scale and automation.


What would you suggest as an alternative, simpler form for docker deploy, running and managing? Docker-compose?


What do you mean by docker hosting? Kubernetes (and other related tools) are container orchestration/management tools. As is often the case in the management space, if you're just running at small scale, you may not need anything beyond container command line tools and some scripts. You could also use Ansible to automate.


Thanks, we are running a few Node servers, which we currently deploy from the command line. In dev we use docker-compose. But we are looking for a way to easily share our servers. We developed it for Amsterdam as open source, and around 20-30 cities are in line to start using it. It doesn't have to be scalable or highly available; we need ease of deployment, an easy way to update, and basic security. All the sysadmins are pushing for kubernetes. Although for the big cities it makes sense, it's really starting to feel like overkill for small cities who will run 1-3 non-critical sites with 0.5-5k users per month. Heard a lot about Ansible, will look into it, thanks!


So it sounds as if you've sort of outgrown the command line but aren't sure you want to jump in on self-managed Kubernetes. You'd have to look at the costs but maybe some sort of managed offering would work for you. It could scale up for larger sites but would be fairly simple for smaller ones--especially with standardized configurations.

You could look at the big cloud providers directly. OpenShift [I work at Red Hat] also has a few different types of managed offerings.


I'm not necessarily agreeing with you. Kubernetes really is a complex beast, and I wouldn't recommend self-hosting it for companies that don't have people that can focus solely on managing it.

I would also not recommend it for hosting a WordPress site, or a simple CRUD app.

However, when you get to the level where autoscaling is required, and where you are deploying multiple services, managed Kubernetes is not such a bad idea.

Using EKS (especially with Fargate) on AWS is not much harder than figuring out and properly utilizing EC2/ASG/ELB. GKE or DigitalOcean offerings seem to be even easier to use and understand.


The guaranteed QoS example in the article is wrong. Kubernetes only sets the Guaranteed QoS if the CPU count is an integer (which 0.5 is not).

Also, to take full advantage of the QoS class you need to configure the kubelet with "--cpu-manager-policy=static"[0].

[0]: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-...
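
For reference, a container spec satisfying both conditions mentioned in this comment (requests equal to limits, and an integer CPU count, which the static CPU manager policy needs to grant exclusive cores) might look like this sketch; the name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example     # illustrative name
spec:
  containers:
  - name: app
    image: nginx               # placeholder image
    resources:
      requests:
        cpu: "1"               # integer CPU count, not 0.5
        memory: 256Mi
      limits:
        cpu: "1"               # must equal the request for Guaranteed QoS
        memory: 256Mi
```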


Thanks for pointing that out! I will edit it.


Shameless plug, but on topic: I recently wrote about readiness and liveness probes in Kubernetes. If you're looking for an educational perspective, you can check: https://medium.com/aiincube-engineering/kubernetes-liveness-...
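
For readers who haven't used probes, a minimal container-level sketch (paths, port and timings are illustrative, not from the linked article). Liveness failures restart a wedged container; readiness failures just take the pod out of Service endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /healthz             # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 10      # give the app time to boot
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready               # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 5             # gates traffic, never restarts
```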


    ...
    requiredDuringSchedulingIgnoredDuringExecution:
        ...
This instantly reminded me of this: https://thedailywtf.com/articles/the-longest-method

Kubernetes sometimes shows its Java roots.


> its Java roots

Citation needed.

AFAIR Borg is implemented in C++ and k8s has been implemented in Go from day 0.

Am I missing some crucial steps in k8s's history?


Apparently the first version was based on a Java prototype, but it's unclear to me how much that's visible:

https://kubernetes.io/blog/2018/06/06/4-years-of-k8s/

> Concretely, Kubernetes started as some prototypes from Brendan Burns combined with ongoing work from me and Craig McLuckie to better align the internal Google experience with the Google Cloud experience. Brendan, Craig, and I really wanted people to use this, so we made the case to build out this prototype as an open source project that would bring the best ideas from Borg out into the open.

> After we got the nod, it was time to actually build the system. We took Brendan’s prototype (in Java), rewrote it in Go, and built just enough to get the core ideas across


I remember a comment from someone on the go team (bradfitz maybe) saying that the initial plan for Kubernetes was to be implemented in java, but go was chosen because it already had a lot of momentum in the container world.


To be fair, I don't see how this could be shorter in any other language without losing readability.


You don't have to. Kubernetes should just say "required" here, and the documentation should say "Warning: this is only checked during scheduling."

printf isn't named printIntoBufferWhichMayNotFlushUntilLinefeed, and people are fine with it.


I believe there may have been a plan to allow checks during runtime at some point, although this feature is no longer necessary since https://github.com/kubernetes-sigs/descheduler#removeduplica... can do it.


They could have used: soft, hard and strict with the documentation explaining the differences.


That Java link in the article goes to a completely not-Java Chinese site btw


And the Spring link is a 404


I'm glad readers are liking the article, but please read and follow the site guidelines. Note this one: If the title begins with a number or number + gratuitous adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."

The submitted title was "10 most common mistakes using kubernetes", HN's software correctly chopped off the "10", but then you added it back. Submitters are welcome to edit titles when the software gets things wrong, but not to reverse things it got right.


Oops, sorry. I thought I had made a typo; that's why I corrected it. Thanks for pointing that out.


Really nice article, it shows a lot of the small details you have to take into account to go from "deploying Kubernetes" to "deploy a production-grade Kubernetes" where production means some real, trafficked site.


"more tenants or envs in shared cluster"

This is what I'm trying to convince my current company about. They want everything in a single cluster (prod, test, stage, qa).

Of course self hosting makes this more difficult to justify, since it is additional expenses for more machines.


Have you considered using OpenShift instead of Kubernetes? It comes with vastly improved multi-tenancy features, among other improvements over plain Kubernetes. OKD, the open-source distribution of OpenShift, allows full self-hosting: https://www.okd.io


OpenShift comes with its own headaches from my understanding. And we are too deep into Kube to switch now.


That sounds like a disaster waiting to happen!

Seems like a perfect use-case for Cluster API: https://cluster-api.sigs.k8s.io/user/quick-start.html

Have one global "mgmt cluster" with several workload clusters


You bet it is! Haha

Ah interesting! We'll have to look into that.


What would good liveness and readiness probes do for a Rails app? What kind of work and metrics would these 2 endpoints do in my app?


This is a good question, and I think the article doesn't cover this topic well.

From the article: > The other one is to tell if during a pod's life the pod becomes too hot handling too much traffic (or an expensive computation) so that we don't send her more work to do and let her cool down, then the readiness probe succeeds and we start sending in more traffic again.

Well... maybe. Is it a routine occurrence that an individual Pod becomes "too hot"? If your load balancer can retry a request on, say, a 503 Service Unavailable, you may be better off relying on that retry combined with CPU-based autoscaling to add another Pod (it's simpler, tradeoff is the load balancer may spend too much time retrying).

If you can't or don't want to add additional Pods, then your client is going to see that 503 (or similar) anyway. I'd say, then, that the point of a Pod claiming it's "not ready" to get itself removed from the load balanced pool is to allow the load balancer to more quickly find an available Pod, but this adds complexity and may be irrelevant if you run enough Pods to have some overhead capacity.

A Rails app is a bit different from a node/go/java app in that (typically at least, if you're using Unicorn or other forking servers) each individual Pod can only handle a limited number of concurrent requests (8, 16, whatever it is). It's more likely then that any given Pod is at capacity.

But, liveness/readiness are not so simple. If these probes go through the main application stack, then they're tying up one of the precious few worker processes, even if only momentarily. I haven't worked with Ruby in a number of years, but I remember running a webrick server in the unicorn master process, separate from the main app stack, to respond to these checks. But I did not implement a readiness check that tracked the number of requests and reported "not ready" if all the workers are busy.


If it has network problems, Kubernetes can take that instance out of serving traffic.

When you're doing rolling upgrades, it can signal your app is ready to take traffic.

Those are the main uses.


For the readiness probe a simple endpoint that returns 200 is enough. This tests your service’s ability to respond to requests without depending on any other dependencies (sessions which might use Redis or a user auth service which might use a database).

For liveness probe I guess you could check if your service is accepting TCP connections? I don’t think there should ever be a reason for your service to outright refuse connections unless the main service process has crashed (in which case it’s best to let Kubernetes restart the container instead of having a recovery mechanism inside the container itself like supervisord or daemon tools).
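A minimal sketch of that setup (the path, port, and timings are assumptions, not anything the thread specifies):

```yaml
# Illustrative probes: a trivial HTTP 200 endpoint for readiness that
# avoids touching external dependencies, and a plain TCP check for
# liveness that only fails if the server process itself is down.
livenessProbe:
  tcpSocket:
    port: 3000           # hypothetical app port
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz       # hypothetical endpoint that just returns 200
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```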


> For the readiness probe a simple endpoint that returns 200 is enough. This tests your service’s ability to respond to requests without depending on any other dependencies (sessions which might use Redis or a user auth service which might use a database).

If the underlying dependencies aren't working, can a pod actually be considered ready and able to serve traffic? For example, if database calls are essential to a pod being functional and the pod can't communicate with the database, should the pod actually be eligible for traffic?


The article explicitly warns against that:

> Do not fail either of the probes if any of your shared dependencies is down, it would cause cascading failure of all the pods.

The idea would be that the downstream dependencies have their own probes and if they fail they will get restarted in isolation without touching the services that depend on them (that are only temporarily degraded because of the dependency failure and will recover as soon as the dependency is fixed).


Really great article.

I have used Kubernetes pretty heavily in the past, and didn't know about PodDisruptionBudget.


One thing to watch for with pod antiAffinity - if you use required vs preferred, and your pod count exceeds the node count, the remainder will be left in Pending and won't spin up anywhere.
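One way around that, if strict separation isn't essential, is the soft variant; pods still get scheduled when no free node remains (the label is hypothetical):

```yaml
# Illustrative soft anti-affinity: the scheduler prefers spreading
# replicas across nodes but will co-locate them rather than leave
# pods Pending when pod count exceeds node count.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app              # hypothetical label
        topologyKey: kubernetes.io/hostname
```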


There's a new feature which does a better job of spreading Pods without blocking scheduling quite as badly: https://kubernetes.io/docs/concepts/workloads/pods/pod-topol...
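A soft spread with that feature might look like this (the label is hypothetical); `ScheduleAnyway` spreads pods when possible without blocking scheduling:

```yaml
# Illustrative topology spread constraint: keep the replica count per
# node within 1 of each other, but still schedule if that's impossible.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway  # soft: prefer spreading, never block
  labelSelector:
    matchLabels:
      app: my-app                    # hypothetical label
```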


> You can't expect kubernetes scheduler to enforce anti-affinites for your pods. You have to define them explicitly.

Why isn't this the default behavior? Why don't I have to go in and tell it that it's okay to have multiple instances on the same node? Why? So that I somehow feel like I've contributed to the whole process by fixing something that never should break in the first place?

I know of a few pieces of code where I definitely want to run N copies on one machine, but for all of the rest? Why am I even running 2 copies if they're just going to compete for resources?


Simply put, what the article recommends is, in most situations, dead wrong advice.

If configured as suggested, and for some reason you lose enough nodes that there's no longer a dedicated node for each of your replicas, you will end up with fewer replicas than you intended, since the scheduler can't schedule a pod to a node where an identical pod is already running.

Additionally, the affinity and anti-affinity features are costly from the cluster's perspective, so the configuration recommended by the author costs you scheduling performance.

And why isn't Kubernetes doing the obvious thing and spreading your pods apart? Well, it's simple: it already does the right thing:

https://v1-15.docs.kubernetes.io/docs/concepts/scheduling/ku...

The first scoring parameter is SelectorSpreadPriority.


It's quite possible that you have a machine with 192 CPU cores in it, but it's very unlikely that you are able to write a service that scales to that level ... and if you write it in Go it's really unlikely that you can scale even to 8 CPUs. There's nothing weird about having multiple replicas of the same job on the same node. If you look through the Borg traces that Google recently published you can find lots of jobs with multiple replicas per node.


This is not how defaults work.

When you are talking about the realm of the possible, you provide settings that allow you to reach the scenarios that you feel are reasonable, desirable, or lucrative (or commonly enough, some happy combination of the three).

Defaults are the realm of the probable. And nobody is requisitioning a 192 core machine without a good bit of due diligence, which would include deciding how to set server affinity.


You're suggesting that preventing multiple replicas of the same job from scheduling on the same machine is a good default. There's no evidence to support that conclusion, and my experience is quite the opposite. It is much better if people running batch jobs just schedule 100000 tiny replicas and let the scheduler sort it out. This provides the cluster scheduler with plenty of liquidity. Multiple small processes are more efficient than a shared-nothing single process.


Still the same question.

Do you think that batch processing is the default activity in Kubernetes, or something that people find after they are familiar with the system?


Yes, I think batch workloads are the most common workloads, in resource-weighted terms, among k8s users.


You're being slippery, which comes across as dishonest.

Why does the resource weight have anything to do with the choice of defaults? Settings don't care how often they are read, they only care how often they are set. Large jobs use a disproportionate amount of total resources, sure, but they are tiny uptick in total configuration.

The stakes are higher, but so is the 'budget' for getting things right. I can deploy 5 servers and just wait to see what happens. If I'm doing an overnight job to process a billion records, I'd better be doing some due diligence beforehand, or I have nobody to blame but me. And the failure mode here is that I didn't spend money fast enough to get the job done.

With the current defaults what happens is I blow my monthly budget in one night. Which is very convenient for the vendor, but not convenient for my company.

"It is difficult to get a man to understand something when his salary depends upon his not understanding it." - Upton Sinclair


Pod anti-affinities historically did dramatically increase scheduling times. Not sure this is the primary reason, but it's probably one.


TBH, for me it's usually "Using Kubernetes."

Maybe on GCP (I don't see a lot of companies on GCP) it makes sense, but ECS is AWS native and on bare hardware I immediately go to docker swarm since it ships with the container runtime (instead of a bolted on sidecar container thing).

I like the primitives and features of Kubernetes, but the implementation doesn't give me warm fuzzies and it always gets passed over for safer bets for me. Even very early on in its development I always went to Mesos over Kubernetes (though Mesos is pretty dead at this point).


Don't use Kubernetes unless you Know you need it.


Oh I thought this was

Common mistakes: using Kubernetes


Lots of good advice in this article.


I think there is more to the story for some of these points and it can be dangerous to just take this at face value of best practices.

For example on the liveness / readiness probe item, the article says,

> “ The other one is to tell if during a pod's life the pod becomes too hot handling too much traffic (or an expensive computation) so that we don't send her more work to do and let her cool down, then the readiness probe succeeds and we start sending in more traffic again.”

But this is often a very bad idea and masks long term errors in underprovisioning a service.

If the contention of readiness / liveness checks vs real traffic is ever resulting in congestion, you need the failure of the checks to surface it so you can increase resources. If you set things up so this failure won’t surface, like allowing the readiness check to take that pod out of service until the congestion subsides, you’re only hurting yourself by masking the issue. It basically means your readiness check is like a latency exception handler outside the application, very bad idea.

The other item that is way more complicated than it seems is the issue about IAM roles / service accounts instead of single shared credentials.

In cases where your company has an enterprise security team that creates extremely low-friction tools to generate service account credentials and inject them, then sure, I would agree it’s a best practice to ruthlessly split the credentialing of every application to a shared resource, so you can isolate access and revoking.

But if you are on some application team and your company doesn’t have a mature enough security tooling setup managed by a separate security team, this can become a bad idea.

It can lead to superlinear growth in secrets management as there will be manual service account creation and credential propagation overhead for every separate application. Non-security engineers will store things in a password manager, copy/paste into some CI/CD tool, embed credentials as ENV permanently in a container, etc., all because they can’t create and maintain the end to end service account credential tools in addition to their job as an application team engineer. It’s something they think about twice per year and need off their plate immediately to move on to other work.

Across teams it means you end up with 20 different team-specific ways to cope with rapid growth of service accounts, leading to an even worse security surface area, risk of credential-based outages, omission of important testing because ensuring ability to impersonate the right service account at the right place is too hard, etc.

Very often it is a real trade-off to consider that one single service account credential that has just one way to be injected for every service is safer in the bigger picture.

Yes it means a credential issue for any service becomes an issue for all, and this is a risk and you want automated tooling to mitigate it, but it very often will be less of a risk than insisting on a parochial best practice of individual service account credentials, resulting in much worse and less auditable secrets workflows overall unless it is completely owned and operated by a central security team in such a way that it doesn’t create any approval delays or workflow friction for application teams.


You of course should monitor the rate of liveness flapping for your services. The need to monitor it does not imply that it's a bad feature.


You can’t have it both ways. If you need to monitor it and take corrective action (which you do) then you shouldn’t rely on it.

This is an argument for making your liveness probe == readiness probe. It should just check pod availability in a minimal way, and if continuing to send the pod traffic based on this indicator turns out bad because of congestion, you want to see that causing errors and react, not let the scheduler take it out of service for new traffic.

You want liveness & readiness to check the same thing, and it should be a non-trivial check of service health that is also very low latency. And as long as that check is passing, keep sending traffic.

When the check fails, it should always be for a “hard down” reason that tells you the pod could not, regardless of traffic levels, accept traffic because it’s fundamentally internally down.
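In Kubernetes terms, that just means pointing both probes at the same cheap endpoint (the path and port here are hypothetical):

```yaml
# Sketch of the "liveness == readiness" approach described above:
# both probes hit the same minimal, low-latency health endpoint, so a
# failure always means the pod is fundamentally down, not merely busy.
livenessProbe:
  httpGet:
    path: /healthz   # hypothetical endpoint
    port: 8080
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
```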


I don't want the pager to go off just because of some slight non-liveness. That's a likely outcome of high utilization (usually viewed as a good thing, isomorphic with low cost). If you're running really hot and a few tasks are shedding load by playing dead intermittently, that's OK up to a point; if a large portion of pods are doing that at a high rate, that might be bad. You might not even alert on it, just throw it up on a dashboard as informative indicator for operators.


> “ I don't want the pager to go off just because of some slight non-liveness.“

That’s just bad engineering. Really, one should want the pager to go off for that and be really pedantic to actually sniff out the root cause and actually fix it.

Hiding that type of issue by letting something like liveness/readiness policy tacitly conceal it is just going to result in a far worse or more systemic issue later with far worse pager disruptions to your life.

You’re skipping flossing every now and then only to need serious root canals later.


[flagged]


You might be joking but I'd say using k8s when you don't need to, is definitely a mistake.


You spent the time signing up just for this?


It’s an important point for those who may unknowingly over engineer.


And they are probably not going to change their mind by reading an anonymous sarcastic comment on HN.


For someone with a short time scale, only trawling this thread, of course not.

For someone young in this space, this comment and one hundred others -for and against- sift into his or her consciousness as part of the perceived zeitgeist of kubernetes within the larger community. Perhaps after several years, this person may have an intuition to avoid kubernetes in favor of separate docker or lxd containers.

That association of kubernetes as a FAANG-level tool builds stronger with the linked article: this hypothetical person can compare the struggles against perceived resources and so on. But not everyone has time to read the article, any given kubernetes article, that we come across. Some of those times, it's enough to take the temperature and move on. So, -over years- that may build an aversion that would not have otherwise formed had commenters avoided denouncing (or endorsing) kubernetes with less than full commitment toward convincing others.

Also, in this toneless medium, I can't intuit much of the emotional weight lolkube conveyed the sentence with. Was this person rueful, playful ?


But there are several other actually useful comments in this thread warning about using kubernetes when it's not needed, that cite actual reasons. We didn't need one by "lolkube", not at any timescale. By the way, that name is the clue you need about the tone.


This also needs a companion post called "common mistakes: using kubernetes".

I feel like it's a weekly occurrence now where I hear of a startup launching their mvp on kubernetes having spent 8 months too long on Dev as a result.

The other day in an interview someone bragged to me how he had convinced his team to spend 12 months moving to K8s. Upper management thought it was a waste of time but eventually agreed. I asked him if there were any measurable benefits and he said no.

I totally understand why Google needs it. Do you?


Yes we need it.

This is absolutely becoming a tiresome trope. K8s is a huge benefit to tons of companies and none of them are Google.

Yes some people are using K8s when they don't need it. Just like many are using cloud managed services when they don't need them. Or vms. Or insert any technology here.

This article has nothing to do with whether K8s fits some particular use case but may be of help (although I disagree with the entire section on resources which reflects a lack of long term experience with K8s in production) to those who do want to use it.

You're the 10th person in this thread saying the same thing and it doesn't appear you even have that much experience with operations in general.

Sorry to go off on you but I'm really seeing these types of tropes and quick depthless one liner comments and offtopic snipes lately as the downfall of the hn comment section.


Kubernetes doesn't benefit Google? How not?


I think the intention of the post you replied to was to say that many companies other than google have a legitimate need for kubernetes.


Even Google doesn't need it that much: back when I was there, each Borg cluster had something like 10,000+ cores. Large enough to run a typical SV startup wholescale. The ratio of "cluster management work" vs. "actual work being done on it" was not that high.

These days, some people are like "Dude, if you don't have one cluster per AWS availability zone per each environment, you're doing it wrong." Why, just why.


It misses the biggest one: using it. I ranted about the cloud a decade ago http://drupal4hu.com/node/305 and there's nothing new under the Sun. Still most companies doing cloud and Kubernetes doesn't need it... practice YAGNI ferociously.


I think you may be on the wrong side of history here


Nothing new with that one. I still think git was the wrong choice for a DVCS, and yet I have been using it for a decade or more now. I still feel I have an uneasy truce with it, not a friendship. I still think github is a shitty choice for hosted git -- at least most large open source projects went with gitlab, so I am not utterly alone in that. I am using Kubernetes now because my primary client is using it. That doesn't mean I am happy with it or that I think it's necessary by any means. It's fine. I am getting old but I can still learn. Doesn't mean I can't be grumpy about it.



