Kelsey Hightower had a good take in https://twitter.com/kelseyhightower/status/12963119515307048... "AWS Controllers for Kubernetes is pretty dope. You can leverage Kubernetes to manage AWS resources such as API gateways and S3 buckets. Think Terraform but backed by Kubernetes style APIs and "realtime" control loops."
Every Kubernetes shop I've seen uses Helm to deploy their workloads, & now you can set up all your S3 buckets, SQS queues, and SNS topics in the same Helm chart as your app. This is hella ideal.
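For a sense of what that looks like in practice, here's a minimal sketch of an ACK-managed bucket living in the same chart as the app (the group/version and field names are from my reading of the project and may differ between releases):

  # templates/bucket.yaml, rendered alongside the app's Deployment/Service
  apiVersion: s3.services.k8s.aws/v1alpha1
  kind: Bucket
  metadata:
    name: {{ .Release.Name }}-uploads
  spec:
    name: {{ .Release.Name }}-uploads   # the actual S3 bucket name ACK will create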
I also just generally love the thought of being able to manage any cloud resources at all via standard, open protocols & systems. Compounding the investment, rather than having to invest in a bunch of specific, disconnected areas, is going to lead to great things.
It'll be interesting to see what, if any, architectural flourishes or innovations went into ACK's control loops.
Deletion is typically specified by an immutable finalizer on the resource. For example, you can create a StorageClass that saves your PersistentVolume when you delete the PersistentVolumeClaim, and you can't change the class of a claim, so even if your templating goes haywire and deletes your volume claim, the actual data is safe and can be recovered. I have to imagine that any API that lets you create things like RDS instances would have similar protection. (You might think that the k8s world is very mutable, but there are some immutable things hanging around. For cases exactly like this.)
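For reference, the "keep the PersistentVolume around" behavior described here is normally set up front via the reclaim policy on the StorageClass; a minimal sketch (the provisioner is just an example):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: retained-ssd
  provisioner: kubernetes.io/aws-ebs   # example provisioner
  reclaimPolicy: Retain                # dynamically provisioned PVs (and their data) survive PVC deletion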
I don't use AWS so I didn't look into this thoroughly, but the word "finalizer" does show up in the code frequently:
// MarkManaged places the supplied resource under the management of ACK.
// What this typically means is that the resource manager will decorate the
// underlying custom resource (CR) with a finalizer that indicates ACK is
// managing the resource and the underlying CR may not be deleted until ACK
// is finished cleaning up any backend AWS service resources associated
// with the CR.
I dunno how good it is, but clearly they've thought about it. Test it before you invest billions of dollars into it, though.
Finalizers don't do much for safety. They are simply there to ensure the controller (in this case ACK) won't miss the deletion event and leave the resource dangling. To actually prevent an object from being deleted, you need a validating webhook.
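For anyone who hasn't written one, the registration side of such a webhook looks roughly like this (a sketch: the group/resource names follow the illustrative ACK example above, the deletion-guard service is hypothetical, and the server that actually rejects the DELETE AdmissionReview is not shown):

  apiVersion: admissionregistration.k8s.io/v1
  kind: ValidatingWebhookConfiguration
  metadata:
    name: deny-bucket-deletion
  webhooks:
    - name: buckets.deletion-guard.example.com
      admissionReviewVersions: ["v1"]
      sideEffects: None
      rules:
        - apiGroups: ["s3.services.k8s.aws"]   # illustrative group
          apiVersions: ["*"]
          operations: ["DELETE"]
          resources: ["buckets"]
      clientConfig:
        service:
          namespace: ack-system
          name: deletion-guard               # hypothetical webhook server
          path: /validate-delete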
An interesting part of the Operator Lifecycle Manager (OLM) is the capability to scaffold webhooks for Operators, rotate their secrets, etc., to make this type of protection easy for everyone to provide. All you need to do is bring the validation logic specific to your app.
This is, in my opinion, the most dangerous thing about CRDs, especially for resources that are not scoped to the cluster.
It is way too easy to accidentally delete something you need. PVs were my first experience dealing with this (a chart upgrade recreated a PVC, the old one unbound and was immediately cleaned up), and it's not a risk that I want to see extended to buckets, RDS instances, etc.
The other side of this is that CRDs can lead to abandoned resources; if you find your cluster borked, or shut down improperly, any resources which existed as CRDs (or, in cloud Kubernetes land, LoadBalancers) probably did not get cleaned up, and will be abandoned (but left running).
It's not clear that there is actually a good solution here that fits neatly in with existing CRD behaviour.
If you've used Helm for more than one day you'd know about the helm.sh/resource-policy: keep annotation,
Which makes your first big "accidentally deleted" concern pretty near moot, so long as it's your CI/CD tools that have the permissions & they are using Helm, & so long as random cluster users aren't futzing around randomly poking at & deleting things.
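i.e. (a sketch), one annotation on whatever templated resource you want Helm to leave alone:

  metadata:
    annotations:
      "helm.sh/resource-policy": keep   # helm upgrade/uninstall will not delete this object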
Yes, you have to go way further & prevent your idiot users from being dumb, if you are giving them cluster role permissions. But again, have you considered not doing that?
Abandoned resources are no less of an issue. If there are external resources that don't have state (such as your load balancer example), the good news is you can delete them all & let the resource controllers recreate the real ones.
In general I feel like you are letting the 1% of concerns dominate & dissuade you, & that most people can make it very far with nothing extra.
I'm not sure why you're popping in to a thread from a week ago, starting with personal insinuations against my qualifications to have an opinion, and being a generally uncivil person.
You should try to make your points without the condescending attitude.
The entire idea of your infrastructure not being tied to your provider is completely lost when you are at any type of scale. I have a feeling that anyone who thinks it is easy or even usually worth it has never been part of planning a large scale migration.
At the same time, you usually end up spending more money and having worse results when you don’t go all in.
As far as why use EKS vs ECS - the "native service"? They seem to have feature parity, and ECS is easier to use for the uninitiated. But there are so many people who know k8s, and your knowledge is portable.
Which brings up my second point. Most software engineers don’t care about cloud mobility as much as they claim. They care about career mobility. There is a much better chance that you will leave a company and move to a company on a different provider than your company will. I’m not saying it’s a bad thing to focus on technologies that give you as an individual the most optionality.
> The entire idea of your infrastructure not being tied to your provider is completely lost when you are at any type of scale
Pretty categorical statement, and I don't find this to be true at all if you avoid using anything managed except block devices and the VMs themselves. That is, unless "any type of scale" is tens of thousands of VMs and petabytes of storage, in which case it is indeed a moot endeavor, since someone who is still on public cloud at that point apparently likes giving all their money to cloud vendors anyway.
If all you’re doing is using a cloud provider to host a bunch of VMs, you’ve already lost the plot. Once you use any cloud provider as a glorified colo without changing any of your processes you’re already spending more than a colo and not getting any of the benefits of managed services.
Have you ever budgeted a project plan involving a large migration?
> Once you use any cloud provider as a glorified colo without changing any of your processes you’re already spending more than a colo and not getting any of the benefits of managed services.
So you’re saying running multi regional, burstable (metered by minutes or even seconds) VMs with redundant power and networking and without having to hire a bunch of maintenance staff is not one of the benefits? Imho it’s perfectly economical at small to medium scale.
> Have you ever budgeted a project plan involving a large migration?
I'm doing one right about now. If you design your stack to be multi-regional from the get-go and don't get yourself into the aforementioned cloud lock-in, it's a simple matter of bringing up new Kubernetes clusters in a new cloud region and possibly setting up some peering.
“Managed services” in cloud are a Faustian bargain.
It's also totally untrue that you need to adopt any specific proprietary features of public cloud to change your IT processes and break out of the ITIL straitjacket into a sane process. It's just an abstraction, and there are plenty to choose from that work with whatever you already have.
First of all, they’re not actually managed services as we typically expect. The support experience is much more like a “hosted service”, with very circumscribed boundaries around feature scope, mostly because it is managed entirely by the kind robots rather than having humans actually manage anything.
Secondly, they’re lock in for debatable benefit.
Third, they’re almost always the way the clouds make their margin and you’re stuck with their design decisions, and have little control plane adjustability. Running Google Cloud SQL is expensive and crappy, I would much rather run my own DB on a Kubernetes cluster via a mature and commercially supported Kubernetes operator. Because it’s automated, reliable, and if I need to, I can get into the weeds.
Because the dirty secret is that all of the managed cloud services are just software, no different from what you install yourself; they've just closed-sourced certain components to make it seem magical. And yes, you can file a ticket to get them to fix it, but remember the law of outsourced availability: you still own your SLO to your customer, you still own your end-to-end availability. All the "managed cloud" will do is cut you a check for the wasted pennies on a bill, while your customers leave and your business tanks.
If you treat cloud like a commodity player that you can fire (not likely - more "shrink investments in favor of an alternative") every few years if their quality goes down or price goes up, you're in a much better position for cost, performance, and reliability. Same as it always was. The "one throat to choke" theory of outsourcing was always a business-school bullshit line that never really worked, and it has now been adopted by tech experts who want to ensure their cloud-specific skills are always relevant, regardless of whether all these proprietary features are really a good idea.
Speed + having no idea what you're doing = use all the managed services! This seems to be the cloud pundit elevator pitch. But as with anything, if you don't know what you're doing, at some point you will be caught being a sucker.
The multi-cloud, open source ecosystem continues to grow over many proprietary cloud services for these reasons.
You act as if the people who manage servers are free. How much time are you spending to avoid "cloud lock-in"? How much money are you spending on extra support staff, or are you making your developers do double duty? Is doing the grunt work of managing servers and services helping you win against your competitors? Is it helping you go to market faster?
I’m not just speaking about cloud providers. Do you know how many healthcare systems are so tightly “locked in” to their EMR/EHR systems that it would make you cry?
Would you also tell hospitals not to lock themselves into their EMR/EHR? Would you tell Enterprises to get rid of their entire dependence on Microsoft and go Linux only?
Management needs to act, and if locking yourself in is the only choice, go for it.
In the case of IT, composing across clouds is pretty common because it solves problems. Note I am not talking about brokerage or constant migration or other such nonsense. I’m talking about treating your clouds as a portfolio where you have levels of investment for particular reasons, and can make long run decisions about upping or lowering investment in. You compose across them. Salesforce for CRM, AWS for my front end apps, Google for my analytics, Azure and O365 for internal apps, whatever. It’s reality.
Forcing everything on one cloud, boy I donno. For an SMB, sure.
My point is, don’t tell me how tasty and good for me the shit sandwich I am forced to eat is. Users of EPIC hate it, that’s a market opportunity.
Yes, it is helping me go to market faster. Because the biggest lie people tell themselves is that you can separate the operations of a service from its development. They’re intrinsically linked.
Netflix for example has a metric ton of OSS they’ve built to ensure smooth global operations even though they for a long time ran exclusively on AWS, and have now gone multi-cloud. I wonder why that is? Could it be that running a large service efficiently requires more than clicking some boxes and swiping a credit card?
Apple also is multi cloud and has been investing heavily in Kubernetes talent. I wonder why that is?
Handing developers the keys to use any managed cloud service without regard to quality or price, and especially when they don't know what they're doing operationally, hurts my go-to-market and risk posture when
- I get hit by hacks because they didn’t understand the security posture of the service
- I get hit by bugs that they didn't know existed, because they thought the cloud magically makes software bug-free
- The solution doesn’t perform or scale because the developers didn’t understand the nuances of a particular cloud’s networking or storage abstraction
- I have 15 different ways of solving the same problem and nothing coordinated
- I have outages due to poor AZ or region planning
Now look, I think it's great to avoid undifferentiated heavy lifting. I wasn't implying that I was racking and stacking, or managing my own artisanally installed Kubernetes clusters; I have one that is automated by one of the many solutions out there.
This is purely about choosing the abstractions that I run my business on. I would rather use open and composable abstractions than proprietary dead ends... UNLESS it’s a throw away experiment or skunkworks and even then I’d be skeptical.
Tell me what staff I am saving by running a database on a Kubernetes cluster vs. RDS or Cloud SQL? I still maybe need a DBA. I still need to have someone keep an eye on upgrades, availability, and performance across the fleet (which aren't necessarily automagical). I still am responsible for the overall security posture.
We aren’t talking about racking and stacking servers and switches here, we are talking about administrator to server or administrator to developer ratios. And while you certainly can have a terrible ratio with Kubernetes (it’s early days) you also can get that down to a very respectable number with good practices and modern tools + abstractions.
As an aside, monolithic EMR/EHR systems are the bane of most health systems’ existence and the vendors like Cerner and EPIC know they have to change (to various degrees) or they will be changed by the market.
Multi cloud undoubtedly saves me money and risk because (a) there is no large company in existence outside of AWS, Google and Microsoft that won't be multi cloud due to M&A and risk posture, (b) I can use the right tools for the job in a particular cloud if they do have a unique offering, (c) I can play them off each other when prices and quality suffer, and (d) I can use the OSS that is driving the industry rather than using expensive knock-offs (looking at you, Kinesis!)
This belief that you can pick one outsourcing vendor and you’ll be better sticking with it for all your needs until the end of time is the height of arrogant naivety and has been historically disproven. But every generation needs to learn the hard way.
I've been in companies that used managed services and companies running databases on premises. As long as you used some kind of configuration management software and deployed tools like barman and repmgr, there wasn't any operational work. The only exception was a hardware failure, but that doesn't happen often, and when it happens you spin up a new machine and invoke a few commands (the tools mentioned make it really easy) and are back in business. You also have more control over how replication and failover are set up (so it's tuned to your use case), and the ability to use all available Postgres extensions. The only "good" thing about RDS is that management can't pressure you to do some things. For example, you can say there must be downtime to upgrade the database and there's not much that can be done about it.
You might run into performance issues, where you might need a DBA, but RDS doesn't solve this, that part is still exactly the same, and RDS gives you less control.
BTW: I also noticed that AWS really makes it so the RDS service feels less attractive. For example, if you want to create a serverless PostgreSQL instance, I don't think you can do it from the UI. I was only able to do it from CloudFormation, and the only version available is 10.7. If you allow the instance to be put to sleep it requires 30 minutes to spin up, and API Gateway has a 29s (yes, 29s) maximum timeout that can't be increased.
I think AWS wants to discourage use of Postgres, and encourages everyone to use DynamoDB (which doesn't have a 1:1 equivalent outside of AWS). I'm wondering if it's due to the introduction of logical replication in version 10. The biggest asset of most businesses is the data, and not being able to get it out without a lengthy downtime might be what keeps businesses from moving.
This is just one small step in that direction. Once everyone is on the same page about Infrastructure-as-Data and behaves in a consistent and expected manner, we can move forward with wrapping these interfaces. Crossplane is one such project.
Is there something like CloudFormation console for EKS/ACK?
Last time I deleted a cluster it failed because of an NLB that was still around but not accounted for as a CFN resource, even though I provisioned the cluster with CFN.
One consequence is the accidental deletion of AWS things...
If a CRD is deleted, the CRs it describes are also deleted. So deleting a CRD (even accidentally) could end up deleting resources in AWS (e.g., backups). So, be careful.
Some things being managed by Kubernetes would be really cool. Other things being managed by k8s could break things if something goes wrong. I would plan accordingly.
Hi I'm a developer advocate in the AWS Container organization. This question of how to handle resource destruction is an active issue on the project, and we'd welcome your comments (or those of anyone else) on this Github issue: https://github.com/aws/aws-controllers-k8s/issues/82
Our goal is to make this project have "no surprises" and therefore no unexpected destruction of resources. The specifics of how we mark resources as safe to delete, instead of retaining them by default, are under discussion on that Github issue.
One way our team has dealt with this question in a highly serverless-oriented and infrastructure-as-code-driven (how's THAT for buzzword soup) environment is to explicitly separate stateful resources from stateless, while exposing reference hooks in a configuration store to cross from one to the other. We've found that doing so _greatly_ reduces the blast radius of mistakes and lets us move more quickly and confidently.
The stateless stacks generally have a lot of development activity going on, and rapidly iterate. This is where most of our code and logic lives. This is where the vast majority of our deployment (and related cloud configuration) activity happens.
All of that thrash is kept away from the stateful stacks - think S3 buckets or DynamoDB tables - where, if THOSE thrash, we potentially get an outage at best, or lose data at worst (backups notwithstanding).
We DO NOT WANT stateless oriented stacks to own the lifecycle for stateful stacks. They inherently need to be treated differently. Or, at least the impact of mistakes is different.
The trick comes when you need to tie them together. To do this, we've added CloudFormation hooks and other deployment time logic that publish ARN and other connectivity info to our configuration store. The stateless services look up config values either during deployment or at runtime and are able to find the details they need to reference the state resources they need access to.
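A rough sketch of that hand-off in CloudFormation (the resource names and parameter path are made up): the stateful stack owns the bucket and publishes its ARN to SSM Parameter Store, and the stateless stacks only ever read the parameter at deploy time or runtime.

  # stateful stack
  Resources:
    UploadsBucket:
      Type: AWS::S3::Bucket
      DeletionPolicy: Retain              # extra protection for stateful resources
    UploadsBucketArnParam:
      Type: AWS::SSM::Parameter
      Properties:
        Name: /myapp/uploads-bucket-arn   # hypothetical config-store path
        Type: String
        Value: !GetAtt UploadsBucket.Arn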
We've poked at toolsets like Amplify that lump everything together and have already been bitten numerous times. We've found that the difference between stateful and stateless resources should not be papered over, but instead emphasized and supported explicitly by tooling.
... all of this being one team's experience over the years, of course.
Very curious to see how this paradigm evolves here!
[edit]…
Riffing on this just a little bit further… as I’m thinking about it here, it comes down to abstraction level. In a deployment or resource management domain, a generic “this is a cloud resource” isn’t very useful. What’s way _more_ useful is something like “this is a stateful resource” or “this is a stateless resource”, because that level describes resource behavior more clearly, AND how to interface with or manage those resources.
There are echoes of code development principles here intentionally - robust cloud infrastructure management mirrors software dev practices as much as infrastructure management ones!
Went a very similar route with Serverless and Terraform. Things we couldn't afford to lose went into Terraform (RDS instances, Kinesis, etc.). AWS resources that were more coupled to the individual services and could be deleted and recreated went into the Serverless configs.
A large reason for this was that we couldn't trust CloudFormation or Lambda not to require blasting away and recreating resources. At least at the time, it was not uncommon for a stack to get "stuck", or for a Lambda function to stop having its ENIs properly configured (essentially stuck).
I sincerely hope this works well. CloudFormation already has many issues with unintended changes, visibility of the changes to be made, auto update/deletion of resources, etc. Having another path to provision infrastructure that is safe, consistent, and provides visibility at all phases would be great.
Also, not all AWS resources follow the same deletion semantics. Example: S3 buckets. They report as being deleted fairly quickly, but their name may not be available again for an hour or so.
In this case the delete will appear to succeed, but the recreation, if done with the same name, may fail.
Does the create during this time window return a specific enough error? This seems like the exact case where a controller that never gives up could provide value. Though I'm kind of amazed this is on the order of an hour instead of minutes.
This issue already exists with, say, Terraform to orchestrate infra with code. The solution is to append a random hex string to the resource unique identifier.
This AWS project will need to support a feature like that.
Exactly. Planning accordingly is the right answer.
Assume anyone can destroy your infrastructure at any time (by mistake or otherwise). This could be done with CloudFormation, Terraform, API calls, and essentially any automation (with different levels of safeguards).
Be prepared for that. Be careful with your data. Not so careful with individual servers - they should be cattle, not pets.
EDIT: If this is a production system, one could take away any 'delete' permissions until they are needed again.
"cattle, not pets" is a nice slogan, but I think most ranchers would be angry if a junior ranch hand accidentally poisoned the entire herd with a typo.
I think you have mostly the same problems in Terraform, Pulumi, or Cloudformation though right? Is there anything that makes it easier to accidentally do in k8s?
One layer of defense in all of these cases is keeping the IAM credentials that the configuration management tool uses from having any deletion permissions.
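For example (a sketch in CloudFormation; the action list is illustrative, not exhaustive), the role your config management tooling assumes can carry an explicit deny on destructive calls:

  DenyDestructiveActions:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Roles: [!Ref DeployRole]            # hypothetical role defined elsewhere in the stack
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Deny
            Action:
              - s3:DeleteBucket
              - rds:DeleteDBInstance
              - sqs:DeleteQueue
            Resource: "*"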
There are certain things that freak me out in IaC, like losing a primary DB or secrets. I want IaC to create them, wire things together, and index them, but the idea of an accidental misconfiguration deleting a production database gives me heartburn.
That's what you should have a QA pipeline ready to check before you can apply the config, right?
Secrets should be encrypted in git, to the best of my awareness.
I started building something along this line a few years ago: the ability to control AWS VMs and treat them as pods (i.e., they can back a service, access other services) to have a hybrid (VM / container) infrastructure that is all managed in the Kubernetes way. Future work would have been to try to manage other resources similarly.
Sadly, the startup's interest changed and then it went under (but the freedom I was given to explore there was the best experience I have ever had).
On the one hand this could be such a cool and powerful concept.
On the other hand my brain segfaults on the recursive loop of how the layer-inversion gets modeled as IaC with a CI/CD pipeline. I guess if you were very strict about having your provider-infra layer (cloudformation/terraform) do only the bare minimum to get your kube environment up, and then within that kube environment you used something like ACK to provision any cloud-provider resources that your kube-managed apps/pipelines needed.
Yet another case where I'm like "I don't know if kube should be the answer to everything, but I sure as shit won't miss <x>".
Yes, the goal is very much to bring provisioning & operations of all resources on to the core platform we're using, Kubernetes. I mentioned in another reply how excited I am to be able to manage things like SQS queues now with my app, via Helm Charts, rather than need separate machinery to manage/operate my app & the various resources it needs.
And now there is a control loop. So if Ben in support accidentally deletes my queue, it'll get recreated.
Breaking away from the proprietary platform underlay is going to be great. Managing things more consistently is going to be really great.
Traditional Infrastructure-as-Code solutions were based on edge triggers: create the ingress, delete the ingress. But what about when the ingress misbehaves or is in an unrecoverable state?
Kubernetes introduced edge-triggered, level-driven "controllers" with resync-based reconciliation. The user defines a state and the controller does its best to keep the infra in this desired state at all times. (Although Terraform has also moved toward this same design in recent times.)
This establishes a consistent experience. Everyone knows you just need to do kubectl get my-resource to check your desired state; any issues will be surfaced in the resource's status and the controller's logs. You can combine multiple controllers to achieve your desired application design. For example, Knative has its own kind called "Service" which has some custom components, some inherited from Istio, and things like ReplicaSets from the default Kubernetes controllers.
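e.g., roughly what you'd expect to see surfaced on a resource that failed to reconcile (a sketch; the condition type and message are illustrative, not necessarily what ACK emits):

  # kubectl get bucket my-bucket -o yaml
  apiVersion: s3.services.k8s.aws/v1alpha1
  kind: Bucket
  metadata:
    name: my-bucket
  spec:
    name: my-bucket-prod
  status:
    conditions:
      - type: ResourceSynced              # illustrative condition name
        status: "False"
        reason: CreateFailed
        message: "bucket name is not available"   # illustrative message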
And in the thread he mentions Crossplane as the cross-cloud way to do this: https://twitter.com/kelseyhightower/status/12963213771342315...