I don't understand why you would use HAProxy here. Generally on k8s you'd use a LoadBalancer Service (backed by, e.g., MetalLB on bare metal) to route external traffic into your cluster. This goes for both 'normal' payloads/services and the Kubernetes API endpoint itself.
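For context, a minimal sketch of what that looks like through the API, assuming a LoadBalancer provider (MetalLB, a cloud controller manager, ...) is installed to actually fulfill the request; the app name and ports are made up:

```python
# Sketch: expose an app via a Service of type LoadBalancer.
# Assumes some provider (MetalLB, cloud controller, ...) fulfills the request;
# "my-app" and the ports are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "my-app"},
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
# The provider then assigns an external IP, which shows up under
# .status.loadBalancer.ingress on the Service.
```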
In addition, the quorum configuration shown here seems pretty poor: it only uses two nodes, which is a recipe for split-brain.
Which will be internally backed by exactly the kind of infra the OP is describing.
* MetalLB doesn't work in most cloud environments because the networking is weird.
* MetalLB is extremely new while VIP+Nginx/HAP has been working forever and is the standard choice for implementing a HA LB.
* MetalLB plays the exact same role as what the OP is laying out, just in a different style that has nothing to do with k8s. MetalLB could have been unicast VRRP for all it matters. The value is the code to instrument the LB from within k8s, which can work with any external LB.
In k8s load balancers are inherently external to the cluster and need to maintain their state. It doesn't really matter how this is accomplished but Pacemaker/Corosync is the "off the shelf and supported by your distro with good docs" option.
"Why don't you just use $managed_k8s?" -- Because it defeats the purpose of learning how to administer it.
Sidebar:
Why is everybody treating k8s like it's something special and magical here that needs its own solution? Your k8s cluster is a group of app servers with a little bit of networking sprinkled on so that you can reach every app from any node, and some API sprinkled on to allow the app servers to instrument external resources.
The k8s cluster needs something outside itself to route traffic to all the nodes. There isn't a way around this. From an infra perspective a k8s cluster is a black box of generic app servers that want to be able to control the LB with an API integration. Nothing else.
> MetalLB doesn't work in most cloud environments because the networking is weird.
Sure, but most cloud environments are handled by other LoadBalancer controllers specific to that environment. See: Kubernetes Cloud Controller Manager.
> MetalLB is extremely new while VIP+Nginx/HAP has been working forever and is the standard choice for implementing a HA LB.
I do not disagree with that. Hell, I've filed/fixed metallb issues myself already.
> MetalLB plays the exact same role as what the OP is laying out, just in a different style that has nothing to do with k8s. MetalLB could have been unicast VRRP for all it matters. The value is the code to instrument the LB from within k8s, which can work with any external LB.
Sure, but the author is not describing it as acting as a Kube LoadBalancer, but as a separate entity with its own configuration. So to then use Kubernetes meaningfully, they still need to bring something that actually fulfills LoadBalancer requests. Might as well use the same code for both cases.
> In k8s load balancers are inherently external to the cluster and need to maintain their state. It doesn't really matter how this is accomplished but Pacemaker/Corosync is the "off the shelf and supported by your distro with good docs" option.
Why? You can easily keep all state and components within k8s, as MetalLB does.
> The k8s cluster needs something outside itself to route traffic to all the nodes. There isn't a way around this.
Sure, that's called BGP to a ToR that can then ECMP-route traffic where needed, and that's what metallb gives you.
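For reference, this is roughly what MetalLB's legacy ConfigMap-based BGP setup looked like (newer releases moved to CRDs such as IPAddressPool/BGPPeer); the peer address, ASNs and pool below are placeholders:

```python
# Sketch: create the legacy MetalLB config (BGP mode). The ToR address, ASNs
# and the address range are placeholders.
from kubernetes import client, config

METALLB_CONFIG = """\
peers:
- peer-address: 10.0.0.1    # the ToR router
  peer-asn: 64512
  my-asn: 64513
address-pools:
- name: default
  protocol: bgp
  addresses:
  - 192.0.2.0/24            # range handed out to LoadBalancer Services
"""

config.load_kube_config()
cm = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="config", namespace="metallb-system"),
    data={"config": METALLB_CONFIG},
)
client.CoreV1Api().create_namespaced_config_map(namespace="metallb-system", body=cm)
# Each LoadBalancer Service then gets an address from the pool announced as a
# /32 to the ToR, which can ECMP across the announcing nodes.
```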
> Why? You can easily keep all state and components within k8s, as MetalLB does.
I guess I should have said "its own state." I mean you can shove just about any software into a container and get k8s to run it but the point is that k8s can't load balance by itself without managing an environment specific external resource. Because I would count BGP as using the router in the same way that it would use HAProxy.
Would "load balancers are necessarily on the outside of the cluster" be better phrasing?
Are you sure? All the k8s installations I've encountered (including ones I rolled from scratch) use Ingress controllers as the way to get traffic into the cluster. The community NGINX Ingress controller is the de facto standard. You only need the LoadBalancer service type because of k8s's managed-cloud origins (MetalLB is a different beast and I wouldn't recommend it; if you need to load balance traffic to your ingresses on on-premises infra, it's better to do it outside of k8s). Anyway, you lose all the flexibility and observability of Ingress solutions if you use LoadBalancer service types directly as traffic routers to your backends.
They're complementary - LoadBalancer Services operate at L3/L4, Ingresses at L7. More often than not, in-cluster Ingress providers (e.g. nginx-ingress-controller) will in fact use a LoadBalancer Service to actually route the raw traffic into the cluster in the first place.
Without being able to create LoadBalancer services there's no easy way to get any traffic into your cluster other than using NodePorts, and these have tons of shortcomings.
NodePort is a bad practice for such a case, you are right. You get additional, unnecessary routing at the CNI level between cluster nodes just to get traffic into the Ingress controller pod, for example. The solution here is to use hostNetwork mode with a pool of workers solely dedicated to scheduling Ingress controllers. Traffic directly hits NGINX/Envoy/HAProxy/Traefik/whatever and gets into the cluster without additional intermediaries. You need a load balancing solution for this pool of Ingress controllers, that's right, but as I said before, this setup gives you the flexibility to cook load balancing as you desire.
BTW, the community NGINX Ingress controller is able to ingress plain L4 (TCP) traffic into the cluster.
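A rough sketch of that hostNetwork setup, assuming a node pool labeled for ingress; the label, image and stripped-down DaemonSet are illustrative, not a complete ingress-nginx install (which needs RBAC, config flags, etc.):

```python
# Sketch: run the ingress controller with hostNetwork on dedicated nodes so
# traffic hits it directly, without a NodePort/CNI hop. Labels and image are
# illustrative only.
from kubernetes import client, config

config.load_kube_config()

pod_spec = client.V1PodSpec(
    host_network=True,                      # bind directly on the node's IPs
    dns_policy="ClusterFirstWithHostNet",   # keep cluster DNS working
    node_selector={"node-role": "ingress"}, # only the dedicated ingress pool
    containers=[client.V1Container(
        name="controller",
        image="registry.k8s.io/ingress-nginx/controller:v1.9.4",
        ports=[client.V1ContainerPort(container_port=80),
               client.V1ContainerPort(container_port=443)],
    )],
)

ds = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="ingress-nginx", namespace="ingress-nginx"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "ingress-nginx"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "ingress-nginx"}),
            spec=pod_spec,
        ),
    ),
)
client.AppsV1Api().create_namespaced_daemon_set(namespace="ingress-nginx", body=ds)
# The external HAProxy/VRRP/BGP layer then balances across these nodes' IPs.
```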
> You need a load balancing solution for this pool of Ingress controllers, that's right, but as I said before, this setup gives you the flexibility to cook load balancing as you desire.
Sure, but this means that you cannot use LoadBalancers, which is painful. It means every payload has to be configured both at the k8s level and then externally. That somewhat defeats the use of k8s as a self-service platform inside an organization (other dev/ops teams need to go through a centralized ops channel to get traffic into the cluster if for some reason they can't use an Ingress).
> BTW, the community NGINX Ingress controller is able to ingress plain L4 (TCP) traffic into the cluster.
Yes, but it's configured via a single ConfigMap (which limits self-service if you're running a multi-tenant, org-wide cluster, unless you bring your own automation), and you still only have one namespace of ports, i.e. a single external address for all ports - unless you complicate your 'external' LB system even further.
With all these caveats, I really don't understand why not just run metallb.
Do you know whether ingress controllers like nginx-ingress-controller honor the readiness status of pods behind a service, and so only send traffic to ready pods from the deployment concerned?
I remember seeing documentation on this a while ago, but I can't seem to find anything from nginx-ingress that talks about how it uses readinessProbes or livenessProbes.
I was looking into this because I have a pod with 3 containers, and if 1 container wasn't running, nginx-ingress stopped serving traffic to it. I actually want to keep serving traffic as long as a certain container is still running, not necessarily all 3.
Do you know or have any documentation on how nginx-ingress actually does its readiness/liveness checks?
ingress-nginx doesn't do readiness/liveness checks itself; the kubelet does[1].
If any of the containers in a Pod isn't ready, the endpoints controller[2] removes that Pod's address from the corresponding Endpoints object. ingress-nginx watches these Endpoints objects[3] to determine which Pods it should send traffic to.
Edit: As to your use case, I think you should remove the readiness probe from that one container you don't care about.
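For example, something along these lines -- a sketch of a pod spec where only the container that should gate traffic carries a readinessProbe; names, images and ports are made up:

```python
# Sketch: only the container whose health should gate traffic has a
# readinessProbe; the sidecar has none, so its probe failures can't mark the
# Pod NotReady and pull it out of the Endpoints that ingress-nginx watches.
# Names, images and ports are made up.
from kubernetes import client

web = client.V1Container(
    name="web",
    image="example/web:1.0",
    ports=[client.V1ContainerPort(container_port=8080)],
    readiness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=5,
    ),
)

sidecar = client.V1Container(
    name="metrics-sidecar",
    image="example/exporter:1.0",
    # no readinessProbe here on purpose
)

pod_spec = client.V1PodSpec(containers=[web, sidecar])
```

(Note this only helps with probe failures; if the sidecar container isn't running at all, the Pod still goes NotReady.)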
Yes, Ingress resources implicitly assume that you want to get HTTP(S) traffic into the cluster, but ingress-nginx, for example, is able to expose gRPC via additional annotations and generic TCP (L4) via a dedicated ConfigMap.
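For reference, that TCP exposure works through a ConfigMap mapping external ports to services -- a sketch, assuming the controller is started with --tcp-services-configmap pointing at it; the service names and ports are placeholders:

```python
# Sketch: ingress-nginx "tcp-services" ConfigMap. Each key is an external port,
# each value is "<namespace>/<service>:<port>". The controller must be started
# with --tcp-services-configmap=ingress-nginx/tcp-services for this to apply.
# Service names and ports are placeholders.
from kubernetes import client, config

config.load_kube_config()
cm = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="tcp-services", namespace="ingress-nginx"),
    data={
        "5432": "databases/postgres:5432",
        "6379": "cache/redis:6379",
    },
)
client.CoreV1Api().create_namespaced_config_map(namespace="ingress-nginx", body=cm)
# Note the single flat port namespace: every exposed TCP port shares the
# controller's external address, which is the limitation mentioned upthread.
```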
This isn't always true. On GKE, for example, LoadBalancer is first of all very expensive. Second, if you use Let's Encrypt via something like cert-manager, the GKE load balancer is awful to configure and takes forever to get a cert provisioned.
Also, by using the GKE load balancer for ingress, you lose out on a lot of things, like password protection or certain nginx rules you might want.
I've tried my best to stick with the GKE LoadBalancer, but it's just an awful experience. Now I only use GKE to load balance traffic to nginx-ingress; the L4 load balancer on its own is just not flexible enough and is generally annoying to configure.
I think you're conflating the GKE Ingress controller (which you don't have to use - and yes, confusingly it's named the GCLB controller, but that's because it configures L7 GCP LBs, not because it's there for k8s LoadBalancers) with the GKE load balancer controller/provider for LoadBalancer services (which you do have to use, even if you run nginx-ingress-controller).
And yes, I agree that both are extremely slow to reconfigure and that the Ingress controller is inflexible.
I'm not saying that you need to always use a LoadBalancer for all payloads. But you do need one to ingress traffic in a sensible manner if you're running your own N-I-C (which you have to do on bare metal, and which, as you said, you end up using even on GKE).
Keep an open mind: you can add as many HAProxy nodes as you like. Here, for practical reasons, it is shown with only 2, but it could be 3, 5, 7 ...
In turn, this LB handles the availability and load balancing of the control plane, without exposing the master nodes directly.
When we migrated to k8s we stuck with haproxy instead of using ingress for some of the reasons others have outlined already -- we've been running haproxy for a decade. Our configurations are tuned for our applications, we know the CPU usage and failure modes, and haproxy 1.9/2.x support for SRV records made it really easy. Being able to trust k8s and our health checks and drop our previous 3 VM + VRRP setup was a no-brainer!
That got us onto k8s faster and gave us time to evaluate whether we /really/ needed a service mesh or not (we don't, but the tracing is nice so we might still add Istio yet). We may move to ingress-based solutions eventually, but our ecosystem is big and there are bigger fish to fry for now!
I'm tired of k8s being used as a hammer that makes every project look like a nail.
It's scary to imagine how many servers are sitting there idle with zero traffic, doing nothing -- even in-house systems without any scaling ambitions get deployed to k8s.
Of course, you have to reserve 1 CPU + 1 GB for k8s's own needs on each server (and you have 3 at minimum), sitting there reserved, doing nothing. You can't use that capacity for rare spikes of load (compiling, occasional data migrations), so you have to buy additional resources to cover your needs while the default reservations just idle, because the system sees essentially zero traffic.
I'm sure it wastes more resources than all the crypto fads combined.
I'm happy to see more companies realise that and use just bare servers, or Docker Swarm / Nomad if they want to dive into DevOps and CI/CD practices.
K8s is a great tool for particular problems (you have at least tens or hundreds of servers, a DevOps team of more than 2 people, at least 1k RPS during the low-traffic part of the day, several SREs, and so on).
Most probably your system is overengineered and resume-driven if you run a static k8s cluster and don't have any consumer-facing interfaces (i.e. a generally low-traffic usage pattern).
If you have a different opinion, I'll be happy to hear your points.
I think there is a lot of value in just following the k8s workflow for everything rather than a different deployment solution for every random app.
Not that I disagree with your post - I really dislike Kube and would like to see something else take its place - but there are situations where I could see using it just to have the same tools and workflows across the entire org.
Our company has been trying to get production sites on Kube for 4 years. We have had so many problems with our development sites that it just isn’t ready. Maybe at the end of the year.
OpenStack didn't go away for large users who need to run IaaS/*aaS. The OpenStack codebase is/was a giant, warm pile, but honestly k8s isn't all that either. I've heard and seen tons of k8s horror stories, and I won't touch it. Give me Xen/KVM/ESXi to chop up a large box into a virtual lab and then Docker inside that on a custom Docker VM. These giant, support-everything (poorly) "solutions" aren't all that helpful.
The problem of hype is that it's usually to evangelize marginal projects while better ones march along in obscurity.
Kubernetes is a nice choice for specific use cases (like, well, battle-tested container orchestration & scheduling), but overhyping it with claims like "Kubernetes is a data center operating system" is an absolute mess, IMO. You can't put everything in Kubernetes.
And the same applies to OpenStack, which is more than alive. While its ecosystem consists of a horde of different solutions, plain nova/neutron/cinder are mature and efficient building blocks for rolling your own private IaaS.
Well, if you're uncomfortable with its privileged daemon, you can always switch to CRI-O with Red Hat's tooling for it. But in all my years with Docker as the container runtime, all security-related problems have occurred within the backend code, not Docker, not Linux cgroups, not Linux itself.
I've worked with some big customers in the financial industry, and this is exactly what we do. Podman implements the same CLI as docker, so you can basically just `s/docker/podman/g` (as long as you don't use docker-compose).
It's also a lot easier to debug and see what's happening without that daemon sitting in the middle of all the traditional Linux tools.
Containers in general are horrible w.r.t. security because they are architecturally flawed - they pretend to have some sort of 'isolation', but that was crap Docker marketing people just made up. There is no isolation. k8s pushes this agenda further by declaring that multi-tenant workloads are perfectly normal and OK for containers, which they absolutely are not.
just look at the CVEs from recent years:
* docker doomsday
* escaping like a rkt
* cryptojacking? - that didn't even exist until containers were here!
If you have a L3 IP network within your datacenter, doing a BGP anycast of the VIP to a bunch of HAP servers is the best way to go. With ECMP, you can have your cake and eat it too – all of your HAP nodes will be handling traffic and when one of the nodes die, other nodes pick up the load nicely.
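A minimal sketch of the per-node announcer for that setup, assuming ExaBGP runs it as a process and peers with the ToR; the VIP and health-check URL are placeholders:

```python
#!/usr/bin/env python3
# Sketch: run under ExaBGP (via a `process` entry in its config) on each HAP
# node. Announce the anycast VIP while the local HAProxy is healthy, withdraw
# it when it isn't, so the ToR's ECMP group drops the dead node.
# The VIP (192.0.2.10/32) and health URL are placeholders.
import sys
import time
import urllib.request

VIP = "192.0.2.10/32"
HEALTH_URL = "http://127.0.0.1:8080/healthz"   # e.g. an haproxy monitor-uri
announced = False

def healthy() -> bool:
    try:
        return urllib.request.urlopen(HEALTH_URL, timeout=1).status == 200
    except Exception:
        return False

while True:
    up = healthy()
    if up and not announced:
        sys.stdout.write(f"announce route {VIP} next-hop self\n")
        announced = True
    elif not up and announced:
        sys.stdout.write(f"withdraw route {VIP} next-hop self\n")
        announced = False
    sys.stdout.flush()
    time.sleep(2)
```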
Everything old is new again? If there is a good, easy method in Kubernetes, as q3k describes, then why should somebody go for this overengineered approach?
> As someone that works in operations... this is pretty common thinking: let's still do things inside Kubernetes the way we did them on bare metal.
Because haproxy and nginx are proven technologies with limited and very well-known failure modes, which means there are maybe 5-6 well-documented, well-understood ways haproxy and nginx can fail.
Experienced ops people understand that you don't optimize for blue skies -- when "everything works wonderfully", all technologies that aren't completely broken perform at approximately the same level. Instead, they optimize for quick recovery from the "it is not working" state.
I believe the OP's point is that using HAProxy with a floating IP is a bit of an anti-pattern in Kubernetes. The idiomatic Kubernetes way would be to use an Ingress object. Both HAProxy and Nginx have ingress controllers for Kubernetes which use a load balancer in front of them. The article then goes on to talk about using corosync and pacemaker, which are cluster technologies in their own right. This is really bizarre. Running a cluster on a cluster would not be many people's idea of "optimizing for quick recovery."
The article, though, suggests also adding corosync and pacemaker. So 4 things on top of the already complex K8S. I bet someone later throws in a service mesh. Imagine troubleshooting all that.
That's exactly why, to blow away this fog, you want to install a layer 7 load balancer designed with observability in mind. Seeing what's happening is critical when everything changes by itself under your feet. With a component like haproxy you get accurate and detailed information about what's happening and the cause of occasional failures, allowing you to address them before they become the new normal.
1000 times this. Corosync and Pacemaker alone are more or less as complex as K8S itself. Well, I'm exaggerating a bit, but really, all the HA clusters I've seen built with corosync in the past 10 years ended up failing anyway (and with fireworks!) one way or another.
Add this on top of Kubernetes? No, thanks. Life is stressful enough.
Yup. Corosync + Pacemaker can and will implode in spectacular fashion exactly when you don't want it to. And a 2-node cluster will split-brain sooner rather than later. I'd rather use keepalived if required, since it's a lot easier to understand and manage.
I agree, that's a very odd way of addressing the problem of entry-point failure:
* If one is doing it in a cloud and wants to avoid potential issues with clients behind broken DNS resolvers, then one simply nails the entry point instances to specific IP addresses and, in the event of an entry-point failure, assigns the IP address of the failed instance to a standby instance, resulting in a nearly immediate traffic swap (see the sketch after this list).
* If one is running entry points on physical hardware, then the solution is to bind the entry points to a virtual IP address floating between the instances using VRRP.
* Finally, if one wants to be really super-clever and not drop sessions during controlled fail-over, one does VRRP + service MAC addresses, similar to Fastly's faild or Google's Maglev. (But really, this is chasing 99.99999% reliability when in most business cases 99.9% would do just fine and 99.99% would be amazing.)
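For the first (cloud) option, the actual failover is a single API call from whatever does the health checking -- a sketch assuming AWS Elastic IPs and boto3, with placeholder IDs:

```python
# Sketch: move the entry point's Elastic IP to a standby instance once the
# primary is considered dead. Assumes AWS + boto3; the allocation and instance
# IDs are placeholders, and the health checking that triggers this lives
# elsewhere.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def fail_over(allocation_id: str, standby_instance_id: str) -> None:
    ec2.associate_address(
        AllocationId=allocation_id,        # the fixed entry-point IP
        InstanceId=standby_instance_id,    # the standby taking over
        AllowReassociation=True,           # steal it from the failed instance
    )

fail_over("eipalloc-0123456789abcdef0", "i-0fedcba9876543210")
# Clients keep talking to the same IP; only its backing instance changes.
```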
You wouldn't use VRRP in these cases these days - since the networking world has moved on and L3 is now generally pushed to the ToRs, you would use BGP to announce a /32 or /128 and configure an ECMP group on the ToR. This gives you not only redundancy but also traffic splitting. Maglev uses BGP too (although somewhat indirectly, for scale reasons).
Just because a customer has access to an L2 fabric via a VLAN between their hardware does not mean the customer is able to tweak the ToR switch configuration.
Using BGP on an L2 VLAN to handle a single-digit number of IP addresses allocated to the entry points is akin to using K8S to host a static HTML page.