Kelsey Hightower had a good take in https://twitter.com/kelseyhightower/status/12963119515307048... "AWS Controllers for Kubernetes is pretty dope. You can leverage Kubernetes to manage AWS resources such as API gateways and S3 buckets. Think Terraform but backed by Kubernetes style APIs and "realtime" control loops."
Every Kubernetes shop I've seen uses Helm to deploy their workloads, & now you can set up all your S3 buckets, SQS queues, and SNS topics in the same Helm chart as your app. This is hella ideal.
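For a sense of what that looks like in practice, here's a minimal sketch of an ACK-managed bucket living in the same chart as the app (the group/version and field names are from my reading of the project and may differ between releases):

  # templates/bucket.yaml, rendered alongside the app's Deployment/Service
  apiVersion: s3.services.k8s.aws/v1alpha1
  kind: Bucket
  metadata:
    name: {{ .Release.Name }}-uploads
  spec:
    name: {{ .Release.Name }}-uploads   # the actual S3 bucket name ACK will create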
I also just generally love the thought of being able to manage any cloud resources at all via standard, open protocols & systems. Compounding the investment, rather than having to invest in a bunch of specific, disconnected areas, is going to lead to great things.
It'll be interesting to see what, if any, architectural flourishes or innovations went into ACK's control loops.
Deletion is typically specified by an immutable finalizer on the resource. For example, you can create a StorageClass that saves your PersistentVolume when you delete the PersistentVolumeClaim, and you can't change the class of a claim, so even if your templating goes haywire and deletes your volume claim, the actual data is safe and can be recovered. I have to imagine that any API that lets you create things like RDS instances would have similar protection. (You might think that the k8s world is very mutable, but there are some immutable things hanging around. For cases exactly like this.)
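For reference, the "keep the PersistentVolume around" behavior described here is normally set up front via the reclaim policy on the StorageClass; a minimal sketch (the provisioner is just an example):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: retained-ssd
  provisioner: kubernetes.io/aws-ebs   # example provisioner
  reclaimPolicy: Retain                # dynamically provisioned PVs (and their data) survive PVC deletion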
I don't use AWS so I didn't look into this thoroughly, but the word "finalizer" does show up in the code frequently:
// MarkManaged places the supplied resource under the management of ACK.
// What this typically means is that the resource manager will decorate the
// underlying custom resource (CR) with a finalizer that indicates ACK is
// managing the resource and the underlying CR may not be deleted until ACK
// is finished cleaning up any backend AWS service resources associated
// with the CR.
I dunno how good it is, but clearly they've thought about it. Test it before you invest billions of dollars into it, though.
Finalizers don't do much for safety. They are simply there to ensure the controller (in this case ACK) won't miss the deletion event and leave the resource dangling. To actually prevent an object from being deleted, you need a validating webhook.
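For anyone who hasn't written one, the registration side of such a webhook looks roughly like this (a sketch: the group/resource names follow the illustrative ACK example above, the deletion-guard service is hypothetical, and the server that actually rejects the DELETE AdmissionReview is not shown):

  apiVersion: admissionregistration.k8s.io/v1
  kind: ValidatingWebhookConfiguration
  metadata:
    name: deny-bucket-deletion
  webhooks:
    - name: buckets.deletion-guard.example.com
      admissionReviewVersions: ["v1"]
      sideEffects: None
      rules:
        - apiGroups: ["s3.services.k8s.aws"]   # illustrative group
          apiVersions: ["*"]
          operations: ["DELETE"]
          resources: ["buckets"]
      clientConfig:
        service:
          namespace: ack-system
          name: deletion-guard               # hypothetical webhook server
          path: /validate-delete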
An interesting part of the Operator Lifecycle Manager (OLM) is the capability to scaffold webhooks for Operators, rotate their secrets, etc., to make this type of protection easy for everyone to provide. All you need to do is bring the validation logic specific to your app.
This is, in my opinion, the most dangerous thing about CRDs, especially for resources that are not scoped to the cluster.
It is way too easy to accidentally delete something you need. PVs were my first experience dealing with this (a chart upgrade recreated a PVC, the old one unbound and was immediately cleaned up), and it's not a risk that I want to see extended to buckets, RDS instances, etc.
The other side of this is that CRDs can lead to abandoned resources; if you find your cluster borked, or shut down improperly, any resources which existed as CRDs (or, in cloud Kubernetes land, LoadBalancers) probably did not get cleaned up, and will be abandoned (but left running).
It's not clear that there is actually a good solution here that fits neatly in with existing CRD behaviour.
If you've used Helm for more than one day you'd know about the helm.sh/resource-policy: keep annotation,
Which makes your first big "accidentally deleted" concern pretty near moot, so long as it's your CI/CD tools that have the permissions & they are using Helm, & so long as random cluster users aren't futzing around randomly poking at & deleting things.
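i.e. (a sketch), one annotation on whatever templated resource you want Helm to leave alone:

  metadata:
    annotations:
      "helm.sh/resource-policy": keep   # helm upgrade/uninstall will not delete this object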
Yes, you have to go way further & prevent your idiot users from being dumb, if you are giving them cluster role permissions. But again, have you considered not doing that?
Abandoned resources are no less of an issue. If there are external resources that don't have state (such as your load balancer example), the good news is you can delete them all & let the resource controllers recreate the real ones.
In general I feel like you are letting the 1% of concerns dominate & dissuade you, & that most people can make it very far with nothing extra.
I'm not sure why you're popping in to a thread from a week ago, starting with personal insinuations against my qualifications to have an opinion, and being a generally uncivil person.
You should try to make your points without the condescending attitude.
The entire idea of your infrastructure not being tied to your provider is completely lost when you are at any type of scale. I have a feeling that anyone who thinks it is easy or even usually worth it has never been part of planning a large scale migration.
At the same time, you usually end up spending more money and having worse results when you don’t go all in.
As far as why use EKS vs ECS - the "native service"? They seem to have feature parity, and ECS is easier to use for the uninitiated. But there are so many people who know k8s, and your knowledge is portable.
Which brings up my second point. Most software engineers don’t care about cloud mobility as much as they claim. They care about career mobility. There is a much better chance that you will leave a company and move to a company on a different provider than your company will. I’m not saying it’s a bad thing to focus on technologies that give you as an individual the most optionality.
> The entire idea of your infrastructure not being tied to your provider is completely lost when you are at any type of scale
Pretty categorical statement, and I don't find this to be true at all if you avoid using anything managed except block devices and the VMs themselves. That is, unless "any type of scale" is tens of thousands of VMs and petabytes of storage, in which case it is indeed a moot endeavor, since someone who is still on public cloud at that point apparently likes giving all their money to cloud vendors anyway.
If all you’re doing is using a cloud provider to host a bunch of VMs, you’ve already lost the plot. Once you use any cloud provider as a glorified colo without changing any of your processes you’re already spending more than a colo and not getting any of the benefits of managed services.
Have you ever budgeted a project plan involving a large migration?
> Once you use any cloud provider as a glorified colo without changing any of your processes you’re already spending more than a colo and not getting any of the benefits of managed services.
So you’re saying running multi regional, burstable (metered by minutes or even seconds) VMs with redundant power and networking and without having to hire a bunch of maintenance staff is not one of the benefits? Imho it’s perfectly economical at small to medium scale.
> Have you ever budgeted a project plan involving a large migration?
I'm doing one right about now. If you design your stack to be multi-regional from the get-go and don't get yourself into the aforementioned cloud lock-in, it's a simple matter of bringing up new Kubernetes clusters in a new cloud region and possibly setting up some peering.
“Managed services” in cloud are a Faustian bargain.
It's also totally untrue that you need to adopt any specific proprietary features of public cloud to change your IT processes and break out of the ITIL straitjacket into a sane process. It's just an abstraction, and there are plenty to choose from that work with whatever you already have.
First of all, they’re not actually managed services as we typically expect. The support experience is much more like a “hosted service”, with very circumscribed boundaries around feature scope, mostly because it is managed entirely by the kind robots rather than having humans actually manage anything.
Secondly, they’re lock in for debatable benefit.
Third, they’re almost always the way the clouds make their margin and you’re stuck with their design decisions, and have little control plane adjustability. Running Google Cloud SQL is expensive and crappy, I would much rather run my own DB on a Kubernetes cluster via a mature and commercially supported Kubernetes operator. Because it’s automated, reliable, and if I need to, I can get into the weeds.
Because the dirty secret is that all of the managed cloud services are just software, no different from what you install yourself; they've just closed-sourced certain components to make it seem magical. And yes, you can file a ticket to get them to fix it, but remember the law of outsourced availability: you still own your SLO to your customer, you still own your end-to-end availability. All the "managed cloud" will do is cut you a check for the wasted pennies on a bill, while your customers leave and your business tanks.
If you treat cloud like a commodity player that you can fire (not likely - more "shrink investments in favor of an alternative") every few years if their quality goes down or price goes up, you're in a much better position for cost, performance, and reliability. Same as it always was. The "one throat to choke" theory of outsourcing was always a business-school bullshit line that never really worked, and it has now been adopted by tech experts who want to ensure their cloud-specific skills are always relevant, regardless of whether all these proprietary features are really a good idea.
Speed + having no idea what you're doing = use all the managed services! This seems to be the cloud pundit elevator pitch. But as with anything, if you don't know what you're doing, at some point you will be caught being a sucker.
The multi-cloud, open source ecosystem continues to grow over many proprietary cloud services for these reasons.
You act as if the people who manage servers are free. How much time are you spending to avoid "cloud lock-in"? How much money are you spending on extra support staff, or are you making your developers do double duty? Is doing the grunt work of managing servers and services helping you win against your competitors? Is it helping you go to market faster?
I’m not just speaking about cloud providers. Do you know how many healthcare systems are so tightly “locked in” to their EMR/EHR systems that it would make you cry?
Would you also tell hospitals not to lock themselves into their EMR/EHR? Would you tell Enterprises to get rid of their entire dependence on Microsoft and go Linux only?
Management needs to act, and if locking yourself in is the only choice, go for it.
In the case of IT, composing across clouds is pretty common because it solves problems. Note I am not talking about brokerage or constant migration or other such nonsense. I’m talking about treating your clouds as a portfolio where you have levels of investment for particular reasons, and can make long run decisions about upping or lowering investment in. You compose across them. Salesforce for CRM, AWS for my front end apps, Google for my analytics, Azure and O365 for internal apps, whatever. It’s reality.
Forcing everything on one cloud, boy I donno. For an SMB, sure.
My point is, don’t tell me how tasty and good for me the shit sandwich I am forced to eat is. Users of EPIC hate it, that’s a market opportunity.
Yes, it is helping me go to market faster. Because the biggest lie people tell themselves is that you can separate the operations of a service from its development. They’re intrinsically linked.
Netflix for example has a metric ton of OSS they’ve built to ensure smooth global operations even though they for a long time ran exclusively on AWS, and have now gone multi-cloud. I wonder why that is? Could it be that running a large service efficiently requires more than clicking some boxes and swiping a credit card?
Apple also is multi cloud and has been investing heavily in Kubernetes talent. I wonder why that is?
Handing developers the keys to use any managed cloud service without regard to quality or price, and especially when they don't know what they're doing operationally, hurts my go-to-market and risk posture when
- I get hit by hacks because they didn’t understand the security posture of the service
- I get hit by bugs that they didn't know existed, because they thought the cloud magically makes software bug-free
- The solution doesn’t perform or scale because the developers didn’t understand the nuances of a particular cloud’s networking or storage abstraction
- I have 15 different ways of solving the same problem and nothing coordinated
- I have outages due to poor AZ or region planning
Now look, I think it's great to avoid undifferentiated heavy lifting. I wasn't implying that I was racking and stacking, or managing my own artisanally installed Kubernetes clusters; I have one that is automated by one of the many solutions out there.
This is purely about choosing the abstractions that I run my business on. I would rather use open and composable abstractions than proprietary dead ends... UNLESS it’s a throw away experiment or skunkworks and even then I’d be skeptical.
Tell me what staff I am saving by running a database on a Kubernetes cluster vs. RDS or Cloud SQL? I still maybe need a DBA. I still need to have someone keep an eye on upgrades, availability, and performance across the fleet (which aren't necessarily automagical). I still am responsible for the overall security posture.
We aren’t talking about racking and stacking servers and switches here, we are talking about administrator to server or administrator to developer ratios. And while you certainly can have a terrible ratio with Kubernetes (it’s early days) you also can get that down to a very respectable number with good practices and modern tools + abstractions.
As an aside, monolithic EMR/EHR systems are the bane of most health systems’ existence and the vendors like Cerner and EPIC know they have to change (to various degrees) or they will be changed by the market.
Multi cloud undoubtedly saves me money and risk because (a) there is no large company in existence outside of AWS, Google and Microsoft that won't be multi cloud due to M&A and risk posture, (b) I can use the right tools for the job in a particular cloud if they do have a unique offering, (c) I can play them off each other when prices and quality suffer, and (d) I can use the OSS that is driving the industry rather than using expensive knock-offs (looking at you, Kinesis!)
This belief that you can pick one outsourcing vendor and you’ll be better sticking with it for all your needs until the end of time is the height of arrogant naivety and has been historically disproven. But every generation needs to learn the hard way.
I've been in companies that used managed services and companies running databases on premises. As long as you used some kind of configuration management software and deployed tools like barman and repmgr, there wasn't any operational work. The only exception was a hardware failure, but that doesn't happen often, and when it happens you spin up a new machine and invoke a few commands (the tools mentioned make it really easy) and are back in business. You also have more control over how replication and failover are set up (so it's tuned to your use case), and the ability to use all available Postgres extensions. The only "good" thing about RDS is that management can't pressure you to do some things. For example, you can say there must be downtime to upgrade the database and there's not much that can be done about it.
You might run into performance issues, where you might need a DBA, but RDS doesn't solve this, that part is still exactly the same, and RDS gives you less control.
BTW: I also noticed that AWS really makes it so the RDS service feels less attractive. For example, if you want to create a serverless PostgreSQL instance, I don't think you can do it from the UI. I was only able to do it from CloudFormation, and the only version available is 10.7. If you allow the instance to be put to sleep it requires 30 minutes to spin up, and API Gateway has a 29s (yes, 29s) maximum timeout that can't be increased.
I think AWS wants to discourage use of Postgres, and encourages everyone to use DynamoDB (which doesn't have a 1:1 equivalent outside of AWS). I'm wondering if it's due to the introduction of logical replication in version 10. The biggest asset of most businesses is the data, and not being able to get it out without a lengthy downtime might be what keeps businesses from moving.
This is just one small step in that direction. Once everyone is on the same page about Infrastructure-as-Data and behaves in a consistent and expected manner, we can move forward with wrapping these interfaces. Crossplane is one such project.
Is there something like CloudFormation console for EKS/ACK?
Last time I deleted a cluster it failed because of an NLB that was still around but not accounted for as a CFN resource, even though I provisioned the cluster with CFN.
One consequence is the accidental deletion of AWS things...
If a CRD is deleted, the CRs it describes are also deleted. So deleting a CRD (even accidentally) could end up deleting resources in AWS (e.g., backups). So, be careful.
Some things being managed by Kubernetes would be really cool. Other things being managed by k8s could break things if something goes wrong. I would plan accordingly.
Hi I'm a developer advocate in the AWS Container organization. This question of how to handle resource destruction is an active issue on the project, and we'd welcome your comments (or those of anyone else) on this Github issue: https://github.com/aws/aws-controllers-k8s/issues/82
Our goal is to make this project have "no surprises" and therefore no unexpected destruction of resources. The specifics of how we mark resources as safe to delete, instead of retaining them by default, are under discussion on that Github issue.
One way our team has dealt with this question in a highly serverless-oriented and infrastructure-as-code-driven (how's THAT for buzzword soup) environment is to explicitly separate stateful resources from stateless, while exposing reference hooks in a configuration store to cross from one to the other. We've found that doing so _greatly_ reduces the blast radius of mistakes and lets us move more quickly and confidently.
The stateless stacks generally have a lot of development activity going on, and rapidly iterate. This is where most of our code and logic lives. This is where the vast majority of our deployment (and related cloud configuration) activity happens.
All of that thrash is kept away from the stateful stacks - think S3 buckets or DynamoDB tables - where, if THOSE thrash, we potentially get an outage at best, or lose data at worst (backups notwithstanding).
We DO NOT WANT stateless oriented stacks to own the lifecycle for stateful stacks. They inherently need to be treated differently. Or, at least the impact of mistakes is different.
The trick comes when you need to tie them together. To do this, we've added CloudFormation hooks and other deployment time logic that publish ARN and other connectivity info to our configuration store. The stateless services look up config values either during deployment or at runtime and are able to find the details they need to reference the state resources they need access to.
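A rough sketch of that hand-off in CloudFormation (the resource names and parameter path are made up): the stateful stack owns the bucket and publishes its ARN to SSM Parameter Store, and the stateless stacks only ever read the parameter at deploy time or runtime.

  # stateful stack
  Resources:
    UploadsBucket:
      Type: AWS::S3::Bucket
      DeletionPolicy: Retain              # extra protection for stateful resources
    UploadsBucketArnParam:
      Type: AWS::SSM::Parameter
      Properties:
        Name: /myapp/uploads-bucket-arn   # hypothetical config-store path
        Type: String
        Value: !GetAtt UploadsBucket.Arn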
We've poked at toolsets like Amplify that lump everything together and have already been bitten numerous times. We've found that the difference between stateful and stateless resources should not be papered over, but instead emphasized and supported explicitly by tooling.
... all of this being one team's experience over the years, of course.
Very curious to see how this paradigm evolves here!
[edit]…
Riffing on this just a little bit further… as I’m thinking about it here, it comes down to abstraction level. In a deployment or resource management domain, a generic “this is a cloud resource” isn’t very useful. What’s way _more_ useful is something like “this is a stateful resource” or “this is a stateless resource”, because that level describes resource behavior more clearly, AND how to interface with or manage those resources.
There are echoes of code development principles here intentionally - robust cloud infrastructure management mirrors software dev practices as much as infrastructure management ones!
Went a very similar route with Serverless and Terraform. Things we couldn't afford to lose went into Terraform (RDS instances, Kinesis, etc.). AWS resources that were more coupled to the individual services and could be deleted and recreated went into the Serverless configs.
A large reason for this was that we couldn't trust CloudFormation or Lambda not to require blasting away and recreating resources. At least at the time, it was not uncommon for a stack to get "stuck", or for a Lambda function to stop having its ENIs properly configured (essentially stuck).
I sincerely hope this works well. CloudFormation already has many issues with unintended changes, visibility of the changes to be made, auto update/deletion of resources, etc. Having another path to provision infrastructure that is safe, consistent, and provides visibility at all phases would be great.
Also, not all AWS resources follow the same deletion semantics. Example: S3 buckets. They report as being deleted fairly quickly, but their name may not be available again for an hour or so.
In this case the delete will appear to succeed, but the recreation, if done with the same name, may fail.
Does the create during this time window return a specific enough error? This seems like the exact case where a controller that never gives up could provide value. Though I'm kind of amazed this is on the order of an hour instead of minutes.
This issue already exists with, say, Terraform to orchestrate infra with code. The solution is to append a random hex string to the resource unique identifier.
This AWS project will need to support a feature like that.
Exactly. Planning accordingly is the right answer.
Assume anyone can destroy your infrastructure at any time (by mistake or otherwise). This could be done with CloudFormation, Terraform, API calls, and essentially any automation (with different levels of safeguards).
Be prepared for that. Be careful with your data. Not so careful with individual servers - they should be cattle, not pets.
EDIT: If this is a production system, one could take away any 'delete' permissions until they are needed again.
"cattle, not pets" is a nice slogan, but I think most ranchers would be angry if a junior ranch hand accidentally poisoned the entire herd with a typo.
I think you have mostly the same problems in Terraform, Pulumi, or Cloudformation though right? Is there anything that makes it easier to accidentally do in k8s?
One layer of defense in all of these cases is keeping the IAM credentials that the configuration management tool uses from having any deletion permissions.
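For example (a sketch in CloudFormation; the action list is illustrative, not exhaustive), the role your config management tooling assumes can carry an explicit deny on destructive calls:

  DenyDestructiveActions:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Roles: [!Ref DeployRole]            # hypothetical role defined elsewhere in the stack
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Deny
            Action:
              - s3:DeleteBucket
              - rds:DeleteDBInstance
              - sqs:DeleteQueue
            Resource: "*"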
There are certain things that freak me out in IaC, like losing a primary DB or secrets. I want IaC to create them, wire things together, and index them, but the idea of an accidental misconfiguration deleting a production database gives me heartburn.
That's what you should have a QA pipeline ready to check before you can apply the config, right?
Secrets should be encrypted in git, to the best of my awareness.
I started building something along this line a few years ago: the ability to control AWS VMs and treat them as pods (i.e., they can back a service, access other services) to have a hybrid (VM / container) infrastructure that is all managed in the Kubernetes way. Future work would have been to try to manage other resources similarly.
Sadly, the startup's interest changed and then it went under (but the freedom I was given to explore there was the best experience I have ever had).
On the one hand this could be such a cool and powerful concept.
On the other hand my brain segfaults on the recursive loop of how the layer-inversion gets modeled as IaC with a CI/CD pipeline. I guess if you were very strict about having your provider-infra layer (cloudformation/terraform) do only the bare minimum to get your kube environment up, and then within that kube environment you used something like ACK to provision any cloud-provider resources that your kube-managed apps/pipelines needed.
Yet another case where I'm like "I don't know if kube should be the answer to everything, but I sure as shit won't miss <x>".
Yes, the goal is very much to bring provisioning & operations of all resources on to the core platform we're using, Kubernetes. I mentioned in another reply how excited I am to be able to manage things like SQS queues now with my app, via Helm Charts, rather than need separate machinery to manage/operate my app & the various resources it needs.
And now there is a control loop. So if Ben in support accidentally deletes my queue, it'll get recreated.
Breaking away from the proprietary platform underlay is going to be great. Managing things more consistently is going to be really great.
Traditional Infrastructure-as-Code solutions were based on edge triggers: create the ingress, delete the ingress. But what about when the ingress misbehaves or is in an unrecoverable state?
Kubernetes introduced edge-triggered, level-driven "controllers" with resync-based reconciliation. The user defines a state and the controller does its best to keep the infra in this desired state at all times. (Although Terraform has also moved toward this same design in recent times.)
This establishes a consistent experience. Everyone knows you just need to do kubectl get my-resource to check your desired state; any issues will be surfaced in the resource's status and the controller's logs. You can combine multiple controllers to achieve your desired application design. For example, Knative has its own kind called "Service" which has some custom components, some inherited from Istio, and things like ReplicaSets from the default Kubernetes controllers.
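e.g., roughly what you'd expect to see surfaced on a resource that failed to reconcile (a sketch; the condition type and message are illustrative, not necessarily what ACK emits):

  # kubectl get bucket my-bucket -o yaml
  apiVersion: s3.services.k8s.aws/v1alpha1
  kind: Bucket
  metadata:
    name: my-bucket
  spec:
    name: my-bucket-prod
  status:
    conditions:
      - type: ResourceSynced              # illustrative condition name
        status: "False"
        reason: CreateFailed
        message: "bucket name is not available"   # illustrative message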
And in the thread he mentions Crossplane as the cross-cloud way to do this: https://twitter.com/kelseyhightower/status/12963213771342315...