This shared responsibility principle that underlies cloud marketing speak sounds a lot like the self-driving mess we find ourselves in today, i.e. the responsibility boundary between the parties exists in a fog of war and results in more exceptions than if one or the other were totally responsible.
We have been a customer of Amazon AWS for ~6 years now, and we still really only use ~3 of their products: EC2, Route53 and S3. I.e. the actual compute/memory/storage/network capacity, and the mapping of the outside world to it. Because we are a software company, we write most of our own software. There is no value to our customers in us stringing together a pile of someone else's products, especially in a way that we cannot guarantee will be sustainable for >5 years. We cannot afford to constantly rework completed product installations.
We strongly feel that any deeper buy-in with 3rd party technology vendors would compromise our agility and put us at their total mercy. Where we are currently positioned in the lock-in game, we could pull the ripcord and be sitting in a private datacenter within a week. All we need to do is move VMs, domain registrations and DNS nameservers if we want to kill the AWS bill.
I feel for those who are up to their eyeballs in cloud infrastructure. Perhaps you made your own bed, but you shouldn't have to suffer in it. These are very complex decisions. We didn't get it right at first either. Maybe consider pleading with your executive management for mercy now. Perhaps you get a shot at a complete redo before it all comes crashing down. We certainly did. It's amazing what can happen if you have the guts to own up to bad choices and start an honest conversation.
I would also be interested to hear the other side of the coin. Who out there is using 20+ AWS/Azure/GCP products to back a single business app and is having a fantastic time of it?
I've worked with a number of teams over the last few years who use AWS and I'd say from top to bottom they all build their strategy more or less the same way:
0. Whatever is the minimum needed to get a VPC stood up.
1. EC2 as 90%+ of whatever they're doing
2. S3 for storing lots of stuff and/or crossing VPC boundaries for data ingress/egress (like seriously, S3 seems to be used more as an alternative to SFTP than for anything else). This makes up usually the rest of the thinking.
3. Maybe one other technology that's usually from the set of {Lambda, Batch, Redshift, SQS} but rarely any combination of two or more of those.
And that's it. I know there are teams that go all in. But for the dozen or so teams I've personally interacted with, this is it. The rest of the stack is usually something stuffed into an EC2 instance instead of using the AWS version, and it comes down to one thing: the difficulty of estimating pricing for those pieces. EC2 instances are drop-dead simple to price estimate forward 6 months, 12 months or longer.
Amazon is probably leaving billions on the table every year because nobody can figure out how to price things so their department can make their yearly budget requests. The one time somebody tries some managed service, it goes over budget by 3000%, the after-action review figures out that it would have been within budget using <open source technology> on EC2, and so they just do that instead -- even though it increases the staff cost and maintenance complexity.
In fact, just this past week a team was looking at using SageMaker in an effort to go all "cloud native", took one look at the pricing sheet and noped right back to Jupyter and scikit-learn in a few EC2 instances.
An entire different group I'm working with is evaluating cloud management tools and most of them just simplify provisioning EC2 instances and tracking instance costs. They really don't do much for tracking costs from almost any of the other services.
I bet cloud providers are incentivized not to provide detailed billing/usage stats. I remember having to use a 3rd party service to analyze our S3 usage.
Infinite scalability is also a curse - we had a case where pruning history from an S3 bucket was failing for months and we didn’t know until the storage bill became significant enough to notice. I guess in some ways it is better than being woken up in the middle of night but we wasted millions storing useless data
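For what it's worth, this kind of silent accumulation is exactly the case where a bucket lifecycle rule (plus reading the rule back so you notice if it's missing) beats a pruning job that can fail quietly. A minimal boto3 sketch, with a hypothetical bucket name and prefix standing in for the real ones:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix: expire "history" objects after 90 days
# instead of relying on a pruning job that may silently fail for months.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-app-history",           # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-history",
            "Filter": {"Prefix": "history/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        }]
    },
)

# Read the configuration back so a missing or broken rule is noticed
# by a human (or a CI check), not by the storage bill.
print(s3.get_bucket_lifecycle_configuration(Bucket="example-app-history")["Rules"])
```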
Azure also has similar issues - deleting a VM sometimes doesn't clean up dependent resources, and they are a mess to find and delete later, only because the dependent resources are deliberately not named with a matching tag.
People don't like to admit it, but in many circumstances, having a service that is escalating to 10x or 100x its normal demand go off line is probably the desirable thing.
There seem to be plenty of successful online businesses that share some information about their back end infrastructure, and when you look it turns out they just have a handful of web servers running their app behind a load balancer, enough DB servers for the throughput and redundancy they need and some sort of failover arrangements in case something dies, and some sort of cache and/or CDN arrangements just for efficiency. Even if they're running in the cloud, it's probably just a load of manually allocated EC2 instances, maybe RDS to save the hassle of managing the database manually, and maybe S3.
I wonder how many businesses exist in the entire world that truly need to scale their server counts up or down by a factor of several so quickly that it has to be done automatically and not as a result of, say, daily human reviews of what's happening. I feel like even large businesses that need to run thousands of servers at all are probably relatively rare, as indeed large businesses themselves are. The number that might realistically need to scale from say 1,000 web servers to 1,500 within a matter of hours must be even rarer. But I have nothing even resembling a useful data source to tell whether this intuition is correct.
This was the key sentence, I think. This type of problem actually shows up in other domains as well, queueing theory comes immediately to mind. Even the halting problem is only a problem with infinite tape, and becomes easier with (known?) limited resources.
When you have some parameter that is unbounded, you need to add extra checks to bound it yourself to some sane value. You are right that the parent failed to monitor some infrastructure, but if they were in their own datacenter, once they filled their NAS I'm positive someone would have noticed, if only because other checks, like disk space, are less likely to be forgotten.
Also, getting a huge surprise bill is a downside of any option, and the risk needs to be factored into the cost. I'm constantly paranoid when working in a cloud environment; even doing something as trivial as a directory listing from the command line on S3 costs money. I had a back and forth with AWS support just to be clear what the order of magnitude of the bill would be for a simple cleanup action, since there were 2 documented ways to do what I needed, and one appeared to be easier, yet significantly more expensive.
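One cheap mitigation is a billing alarm on the EstimatedCharges metric, so a runaway cleanup or listing job shows up as an alert rather than as an invoice. A rough boto3 sketch, assuming billing alerts are enabled on the account and an SNS topic for notifications already exists (the topic ARN and threshold below are placeholders):

```python
import boto3

# Billing metrics are only published in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-over-500-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 3600,             # the billing metric only updates a few times a day
    EvaluationPeriods=1,
    Threshold=500.0,             # placeholder monthly threshold in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
)
```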
AWS monitoring (and billing) is garbage because they make an extraordinary amount of money on unintentional spend.
"But look at how many monitoring solutions they have in the dashboard! Why, just last re:invent they announced 20 new monitoring features!"
They make a big fuss and show about improving monitoring but it's always crippled in some way that makes it easy to get wrong and time-consuming or expensive to get right.
I'm not very familiar with AWS or The Cloud, but I'm having trouble understanding what you said about Amazon leaving money on the table by not directing customers toward specific-purpose services as opposed to EC2?
Wouldn't (for AWS to make a profit anyway) whatever managed service have to be cheaper than some equivalent service running on an EC2 VM?
I get the concerns re: pricing and predictability, but it still seems like more $$$ for AWS.
No, usually the managed services are a premium over the bare hardware.
When you use RDS for example, you’re paying for the compute resources but also paying for the extra functionality they provide and their management and maintenance they’re doing for you.
You can run your own Postgres database, or you can pay the premium for Aurora on RDS and get a multi-region setup with point in time restore and one-click scaling and automatically managed storage size and automatic patching and monitoring integrated into AWS monitoring tools and...
They’re leaving money on the table because instead of using “Amazon Managed $X” - potentially at a premium, or paying a similar amount but in a way where AWS can provide the service with fewer compute resources than you or I would need because of their scale, and thus more profitably - people look and see they’ll be paying $0.10/1000 requests and $0.05/1GB of data processed in a query and $0.10/GB for bandwidth for any transfer that leaves the region and... people just give up and go “I have no idea what that will cost or whether I can afford it, but this EC2 instance is $150/mo, I can afford that.”
It's not just about a straight cost comparison. It's about how organizational decision-making works.
The people shopping for products are not spending their own money, but they are spending their own time and energy. The people approving budgets are not considering all possible alternatives, they are only considering the ones that have been presented to them by the people doing the shopping.
If the shoppers decide that an option will cost them too much in time and irritation, then it may be that the people holding the purse-strings are never even made aware that it exists. Even if it is the cheapest option.
This is a really good summary of the situation, and I'd add a bit about risk:
It's relatively easy to estimate EC2 costs for running some random service, because it's literally just a per-hour fee times the number of instances. If you're wrong, the bigger instance size or a few more instances isn't that much more expensive.
For almost every other service, you have to estimate some other, much more detailed metric: number of HTTP requests, bytes per message, etc. When you haven't yet written the software, those details can be very fuzzy, making the whole estimation process extremely risky - it could be cheaper than EC2, it could be 10x more, and we won't really know until we've spent at least a couple of months writing code. And let's hope we don't pivot or have our customers do anything in a way we're not expecting...
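To make the risk asymmetry concrete, here is a toy comparison (all prices are made-up placeholders, not from any actual pricing sheet): the fixed fleet barely cares how wrong your traffic estimate is, while per-request pricing scales linearly with it, so a 10x estimation error becomes a 10x budget error.

```python
# Toy numbers only; real prices vary by service, region and tier.
EC2_HOURLY = 0.10             # $/hour per instance (placeholder)
INSTANCES = 4
PER_MILLION_REQUESTS = 2.00   # $ per 1M requests on a managed service (placeholder)

def monthly_ec2_cost() -> float:
    return EC2_HOURLY * INSTANCES * 730          # ~730 hours in a month

def monthly_managed_cost(requests: int) -> float:
    return PER_MILLION_REQUESTS * requests / 1_000_000

for estimate in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{estimate:>13,} req/mo:  EC2 ${monthly_ec2_cost():>6,.0f}   "
          f"managed ${monthly_managed_cost(estimate):>6,.0f}")
```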
Yeah good question. Sibling comments to this one explain it well, but basically AWS managed services come at a premium price over some equivalent running in just EC2. (Some services in fact do charge you for EC2 time + the service + storage etc.)
"Managed" usually means "pay us more in exchange for less work on your part". This is usually pitched as a way to reduce admin/infrastructure/devops type staff and the overhead that goes along with having those people on the payroll.
Managed Airflow Scheduler on AWS with "large" size costs $0.99/hour, or $8,672/year per instance. That's ~ $17,500 considering Airflow for at least non-prod and prod instances.
Building it on your own on a same-size EC2 instance would cost $3,363/year for the EC2. Times two for two environments, let's say $6,700 - or $4,000 if you prepay the instances.
That looks way cheaper, but then you have to do the engineering and the operational support yourself.
If you consider just the engineering, assume an engineer costs $50/hour, and estimate an initial three weeks of work and then 2.5 days/month for support (upgrades, tuning, ...), that's an extra $4,000 upfront and $1,000/month.
So on AWS you're at $17,500/year, and on-prem you're at best $20,000 the first year and $16,000 in subsequent years.
So AWS comes out only a bit more expensive - but the math is tricky in several ways:
- maybe you need 4 environments deployed instead of 2, which is more for AWS but not much more for engineering?
- maybe there's less sustaining cost because you're ok with upgrading Airflow only once a quarter?
- you probably already pay the engineers, so it's not extra money out the door, it's the cost of them not working on other stuff - different boxes and budgets
- maybe you're in a part of the world where a good devops engineer doesn't cost $50/hour but $15/hour
- I'm ignoring cost of operational support, which can be a lot for on-prem if you need 24/7
- maybe you need 12+ Airflow instances thanks to your fragmented / federated IT and can share the engineering cost
- etc, etc.
So I think what OP was saying is that if AWS priced Managed Airflow at $0.50 per hour, it would be a no-brainer to use it instead of building your own. The way it is, some customers will surely go for their own Airflow instead, because the math favors it.
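For anyone who wants to poke at the break-even point themselves, the arithmetic above fits in a few lines. These are the figures from this comment, not official AWS pricing, and the engineering numbers are the same rough guesses:

```python
HOURS_PER_YEAR = 8760

managed_rate = 0.99        # $/hour for the "large" managed Airflow size (figure quoted above)
environments = 2           # non-prod + prod

managed_per_year = managed_rate * HOURS_PER_YEAR * environments    # ~$17,300/year

ec2_two_envs = 4000        # $/year for two prepaid instances ("at best"); ~$6,700 on-demand
setup_engineering = 4000   # rough one-off build cost assumed above
monthly_support = 1000     # upgrades, tuning, patching (2.5 days/month at $50/hour)

diy_first_year = ec2_two_envs + setup_engineering + 12 * monthly_support   # ~$20,000
diy_later_years = ec2_two_envs + 12 * monthly_support                      # ~$16,000

print(f"managed:  ${managed_per_year:,.0f}/year")
print(f"DIY:      ${diy_first_year:,.0f} first year, then ${diy_later_years:,.0f}/year")
```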
Management decided to get rid of all the on-prem, capex-heavy HVAC systems (that are depreciating as-we-speak!) for budget-friendly cloud thermostats? Maybe they can send you the climate-controlled blown air on your office's Amazon Day. ;)
> That looks way cheaper, but then you have to do the engineering and the operational support yourself.
In my experience, this is the piece that engineers rarely realize and that is actually one of the biggest factors in evaluating cloud providers vs. home-rolled. Especially if you're a small company, engineering time (really any employee time) is _insanely valuable_. Valuable such that even if managed Airflow is cash-expensive, if using it allows your engineers to focus on building whatever makes _your business successful_, it is usually a much better idea to just use it and keep moving. Clients usually will not care whether you implemented your own version of an AWS product (unless that's your company's specific business). Clients will care about the features you ship. If you spend a ton of time re-inventing Airflow to save some cost, but then go bankrupt before you ever ship, rolling your own Airflow implementation clearly didn't save you anything.
If you just run Redis on a virtual server in the cloud, you're not replacing the devops engineer who manages Redis. You're replacing:
- the network engineers who manage switches, firewalls, routers, nats and connectivity
- the people who manage on-prem hardware, install new servers, replace failing servers, swap broken disks for working ones, and install the base OS using some ILOM
- the whole process for ordering that hardware and getting it delivered on time - including knowing when it's going to be necessary
- if your on-prem had virtual servers, the people who manage and operate vmware
- if your on-prem had SAN, the people who manage and operate that SAN (and buy new disks and take care of capacity planning)
Some of those things you still have to do - for example, configure the firewall, or say "I want to take a snapshot of this virtual disk drive" - but instead of doing a difficult technical thing, you can do it in a web UI. You still need to know what to do, though.
And of course, if you never had a SAN and virtual servers and two data centers with your team managing the interconnect between private networks, there's a lot of stuff that the Cloud could give you that you probably don't need.
Now if you move to managed Redis, you're also replacing the person who installs and patches the Linux that Redis runs on, and the one who installs, backs up and configures Redis. And you get Redis at the click of a button, so if you suddenly need three more, you're also replacing the person who automates building Redis instances.
You are right that it is not as if the operational support is just gone. Some of it is gone. Some of it is replaced by doing cloud stuff (like thinking about buying reserved instances instead of actual physical servers). Some of it is just more efficient.
Now if any of this doesn't fit your use case because you're at too small or too large a scale, then the Cloud is of course a bad idea.
You've just wrapped a very specific definition of "Operational support" into a very vague term.
From another post in this thread: If you want to deploy distributed stream processing like Apache Kafka, but do not want to roll it yourself, you can use a tool like AWS Kinesis or Azure Event Hubs.
You still need "Operational Support" to manage EH/Kinesis, but generally it's closer to the type of "Operational Support" that can be provided by a general backend software engineer, as opposed to a DevOps/Infrastructure-specific dev. By using a Cloud (Managed) service, you're removing the need to manage:
* Uptime management
* Scaling
* Custom-multi-AZ
And probably a lot more. Sure, you still have to actually handle events appropriately, but you have to do that either way.
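To make the "general backend engineer" framing concrete: producing to and consuming from a managed stream is a couple of API calls, with none of the broker provisioning, ZooKeeper/KRaft care or partition-rebalancing work a self-hosted Kafka cluster implies. A minimal boto3 sketch against a hypothetical Kinesis stream (the stream name is a placeholder and the stream is assumed to already exist; a real consumer would track shard iterators or use KCL/enhanced fan-out rather than this simplified read):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-events"   # placeholder stream name

# Publish one event; sharding, replication and broker health are AWS's problem.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"order_id": 42, "status": "created"}).encode(),
    PartitionKey="42",
)

# Simplified read from the first shard, starting at the oldest record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(json.loads(record["Data"]))
```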
The only caveat is that this goes for founders or engineers who are financially tied with the company success. If the engineer just collects paycheck, they might prioritize fun - and I feel that might be behind a lot of the "reinventing the wheel" efforts you see in the industry.
Especially if you're a small company, engineering time (really any employee time) is _insanely valuable_.
This is true, but it is balanced by the fact that uncertainty can be insanely expensive. And diving into complicated cloud infrastructure with a small business, if you're not already an expert on it, is a very uncertain endeavour in terms of whether you'll get everything set up right (and not find out otherwise at 3am when it turns out your redundancy wasn't, for example) and what everything will cost. By the time you have either become an expert yourself or hired someone who already is, your costs have already increased significantly too, for exactly the reason you've just stated yourself.
> And diving into complicated cloud infrastructure with a small business, if you're not already an expert on it, is a very uncertain endeavour in terms of whether you'll get everything set up right
I do not agree. The entire point of using cloud offerings, as opposed to rolling your own, is that cloud offerings are usually several orders of magnitude easier to configure. Using Event Hub, as an example, means that you're getting a similar experience to Apache Kafka, but without having to scale/configure everything yourself.
Sure, you have to become proficient with Event Hub, but becoming proficient with EH is probably 1/100th as difficult as becoming proficient enough in Kafka to support the same scalability/workload.
All I can say is that this hasn't been my experience. Setting up a single server is much the same whether it's on-prem or at a colo facility or some VM in the cloud, but the amount of added complexity to set up non-trivial networking and security and redundancy/failover and backups and all that stuff in the cloud is far more complicated -- if you don't already know how to do it -- than just setting up a few servers directly. The only exceptions IME tend to be the most basic and essential services, like effectively infinitely scalable storage and managed databases. Not coincidentally, I suspect, these are the kinds of cloud services that almost everyone uses, and often (as this very HN discussion demonstrates) the only ones.
There is still considerable value in the on-demand nature of cloud hardware, instead of waiting for stuff like physically assembling new servers and then shipping them and then installing them in the rack all before you can even start setting up the software, but IME the simplicity emperor has no clothes. Just look at the number of anecdotes about even quite large and well-established businesses that have had systems go down because they didn't fully understand AZs and failover arrangements, or have been landed with some crazy high bill because they didn't fully understand all the different things they were going to be charged for, or have been breached because they left some S3 bucket open.
Watching your discussion I think the truth might be somewhere in the middle - the Cloud can do things for you but you still need to learn how to use it.
So once you know your way around, it can be a force multiplier but learning cloud can be as much work as learning how to do it on your own.
Or said from a different angle, it helps you outsource (part of) operations, but it does not actually help you engineer and architect things correctly, and you can shoot yourself in the foot pretty easily.
Personally, I think the cloud is most transformative for large companies with dinosaur internal IT, where it brings a lot of engineer self-service into the picture - where I work, I can have RDS in minutes, or internally provisioned on-prem Oracle in a week, and the Oracle ends up being more expensive because 24/7 support has to be ordered from a specific list of vendors... but that's not going to be the case in an agile company with a strong engineering culture.
I suspect much or all of this is true, including the observation about large companies with dinosaur IT. The comment I originally replied to was specifically talking about the environment in small companies, which is the environment I'm most familiar with, and my comments should be read in that context.
In my own businesses or the clients we work with, we too could add RDS quickly for something we were hosting on AWS. On the other hand, we could also spin up a pair of Postgres instances and configure streaming replication quickly if we were running on-prem or colo. Each approach has its pros and cons, and we'd look at each situation on its merits and choose accordingly. But IME, it's more like choosing from a restaurant menu depending on what looks most appealing at the time. The way some people talk about cloud, it's like they see it as choosing between a gourmet restaurant with a Michelin-starred team and the local burger van.
> but the amount of added complexity to set up non-trivial networking and security and redundancy/failover and backups and all that stuff in the cloud is far more complicated
This complexity exists either way, is my point. Whether you're managing your own servers, or using barebones cloud VMs, or using a bunch of cloud fanciness, the complexity you just defined still exists. And if that complexity is a constant, why is it only being used as a negative against cloud services?
> Just look at the number of anecdotes about even quite large and well-established businesses that have had systems go down because they didn't fully understand AZs and failover arrangements, or have been landed with some crazy high bill because they didn't fully understand all the different things they were going to be charged for, or have been breached because they left some S3 bucket open.
If your argument is "It's not better when done badly", definitely, I agree, because what is?
I guess, my overall point is that cloud-based infrastructure shifts your focus. Yes, you have to know how to configure cloud resources, but in 2021, do you think it's easier to find people with AWS experience, or people with custom in-house or colo server management experience?
The thing is, I don't think the complexity is even close to the same in the two cases.
AWS and similar services are an abstraction over hardware, software and networking all at once. There are well over 100 different services available on AWS alone. Just to get a basic configuration up and running, someone new to the system has to figure out which of those services they actually need, which is a barrier in itself given the obfuscated names they have.
Then you have much the same network and security details to set up as you would have for on-prem or colo infrastructure, but now with lots of non-standard terminology and proprietary UIs, which are so cumbersome at times that an entire generation of overcomplicated "orchestration" tools has been developed, each of which typically adds yet another layer of leaky abstraction.
Hopefully some time before this all happened you tried to work out what it was going to cost, and maybe you were close or maybe you are in for a nasty surprise because those handy managed services cost several times what the equivalent server + software would have cost either running on real servers or just on cloud VMs.
And if you fall back to that latter case as your safe default, you still get all the same issues to deal with as you would have had on your own servers and network, except that now you need to figure out what is really going on behind all those virtualised systems and quasi-geographical organisation layers before you can tell whether one unfortunate event could take down all the instances of any vital services you need.
In comparison, literally every small business I have ever worked in as a tech worker has had several people at the office who were perfectly capable of buying a switch or firewall or router and spending the few minutes required to configure it or buying a server and installing Linux/Windows and then whatever server software it needed again very quickly. Cloud systems can make it faster to deploy new hardware and connectivity, because you save the time required for all the physical steps, but after that the time and knowledge required to get a small network's worth of equipment up and running really isn't that great. After all, we used to do that all the time until the cloud hype took hold, and it's not as if that has suddenly stopped working or all the people with that knowledge suddenly left the industry in the past 5 years.
> The thing is, I don't think the complexity is even close to the same in the two cases.
Agreed (but probably on the opposite end as you)
It seems a lot like you've been scorned in the past and that's driving a lot of your statements now (which is totally fine and fair). I'm trying to bring up that, for every problem you've just defined, the literal exact same problem exists for colo/managed servers, except it is now also your problem to keep the lights on and the machine running.
> literally every small business I have ever worked in as a tech worker has had several people at the office who were perfectly capable of buying a switch or firewall or router and spending the few minutes required to configure it or buying a server and installing Linux/Windows and then whatever server software it needed again very quickly.
I'm sorry, if you believe that building and deploying production-ready server infrastructure is as easy as "Just going out and buying a switch and spending a few MINUTES installing linux" (emphasis mine) - I feel like we aren't talking about the same thing at all. Not even close.
Not scorned, just a little bored of being told by advocates how the cloud will do wonders for my businesses or my clients and then seeing the end results not live up to the hype.
I'm not saying there are no benefits to cloud deployment. It does have some clear advantages, and I've repeatedly cited the rapid deployment of the hardware and connectivity in this very discussion, for example. It hasn't come up much so far in the parts of the discussion I've been in, but I would never claim there is no-one with a good use for the more niche services among the hundreds available from the likes of AWS, either.
However, I mostly operate in the world of smaller businesses, and in this world simplicity is king when it comes to infrastructure. We are interested in deploying our software so people can run it, whether that's for a client's internal use or something that's running online and publicly accessible. Setting up a new server is mostly a case of installing the required OS and hosting tools, and then our software will take over (and that work would be essentially the same wherever it is hosted), once you have the hardware itself installed and connected. Configuring a new office network is something you'd probably do in a day, again once the physical aspects have been completed. You slightly mangled the timescales I actually suggested in your quote, BTW.
These systems are often maintained by a small number of team members who have the relevant knowledge as a sideline to their day jobs. And this approach has been working for decades, and continues to work fine today. Perhaps I have just never met the bogeyman where the operational requirements to maintain the IT infrastructure for a small business (say up to 50 people) are somehow unmanageable by normal people with readily available skills in a reasonable amount of time, so the arguments about somehow radically improving efficiency by outsourcing those aspects to cloud services have never resonated much with me. It's a big world, and of course YMMV.
I recently inherited a product that was developed from the ground up on AWS. It's been a real eye opener.
Yes, it absolutely is locked in, and will never run on anything but AWS. That doesn't surprise me. What surprises me is all of the unnecessary complexity. It's one big Rube Goldberg contraption, stringing together different AWS products, with a great deal of "tool in search of problem" syndrome for good measure. I am pretty sure that, in at least a few spots, the glue code used to plug into Amazon XYZ amounted to a greater development and maintenance burden than a homegrown module for solving the same problem would have been.
NIH syndrome is certainly not any fun. But IH syndrome seems to be no better.
>I would also be interested to hear the other side of the coin. Who out there is using 20+ AWS/Azure/GCP products to back a single business app and is having a fantastic time of it?
Netflix uses a lot of AWS higher-level services beyond the basics of EC2 + S3. Netflix definitely doesn't restrict its use of AWS to only be a "dumb data center". Across various tech presentations by Netflix engineers, I count at least 17 AWS services they use.
Certain services IMHO have to be discounted from this list:
- VPC - basic building block for any AWS-based infra that isn't ancient
- CloudTrail - only way to get audit logs out of AWS, no matter what you feed them into
- CloudWatch - similar to CloudTrail; many things (but not all) will log to CloudWatch, and if you use your own log infra you'll have to pull from it. Also necessary for metrics.
- ELB/ELBv2/NLB/ALB - for many reasons they are often the only ways to pull traffic to your services deployed on AWS. Yes, you can sometimes do it another way around, but you have high chances of feeling the pain.
My personal typical set for AWS is EC2, RDS, all the VPC/ELB/NLB/ALB stack, Route53, CloudTrail + CloudWatch. S3 and RDS as needed, as both are easily moved elsewhere.
I don't think you can discount them like that. Maybe they aren't as front of mind as services like S3, EC2, etc, but if you were to try to rebuild your setup in a personal data center, replacing the capabilities of VPC, IAM, CloudTrail, NAT gateways, ELBs, KMS etc would be a huge effort on your part. The fact that they are "basic building blocks" makes them more important, not less. In a discussion about the complexity of cloud providers versus other setups, that seems especially relevant.
Oh, I meant it more in terms of "can you count on them as optional services".
Because they aren't optional, and yes, it takes a non-trivial amount of effort to replicate them... but funnily enough, several of them have to be replicated elsewhere too.
NAT gateways usually aren't an issue, KMS for many places can be done relatively quickly with Hashicorp Vault.
IAM is a weird case, because unless you're building a cloud for others to use it's not necessarily that important; meanwhile your own authorization framework is necessary even on AWS because you can't just piggyback on IAM (I wish I could).
So your company cautiously chooses which services in AWS to use, and sticks to infrastructure offerings for now. Netflix called it the "paved path", and it worked really well for Netflix too. Over the years, though, the "paved path" expanded and extended to more services. It's worth noting that EC2 alone is a huge productivity booster, bar none. Nothing beats setting up a cluster of machines, with a few clicks, that auto-scales per dynamic scaling policies. In contrast, Uber couldn't do this for at least 5 years, and their Docker-based cluster system is crippled by not supporting the damn persistent volumes. God knows how much productivity was lost because of the bogus reasons Uber had for not going to the cloud.
We carefully select and use PaaS and managed cloud services to construct our infrastructure. This allows us to maximize our focus on what our customers are paying for: creating software for them which will typically be in use for 5+ years.
We spend close to zero time on infrastructure maintenance and management; we pay others to do this for us, cheaper and more reliably. Having to swap out one service for another hasn't given us any trouble or unreasonable costs yet in the past 5 years.
Contrary to what the article is trying to convince us of, it has massively reduced complexity for us.
> There is no value to our customers in us stringing together a pile of someone else's products
Maybe not your business, but there are many businesses in which this is exactly what happens. Any managed-service is just combining other people's work into a "product" that gets sold to customers. And that's great! AWS has a staggering amount of products, and lots of business don't even want to have to care about AWS.
> Who out there is using 20+ AWS/Azure/GCP products to back a single business app and is having a fantastic time of it?
Several times. I think cloud products are just tools to get you further along in your business. Most of the tools I use are distributed systems tools, because I don't want to have to own them, and container runtimes/datastores. Every single thing I've ever deployed across AWS/Azure is used as a generic interface that could be replaced relatively easily if necessary, and I've used Terraform to manage my infrastructure creation/deployment process, so that I can swap resources in and out without having to change tech.
If, for some reason, Azure Event Hub stopped providing what we needed it for, we could certainly deploy a customized Kafka implementation and have the rest of our code not really know or care, but from when we set out to build our products, that has always been a "If we need to" problem, and we've never needed to.
It depends on your company. For new startups, I'd stick with delivering your product over building your own cloud services on top of EC2/S3.
At my last startup, the engineering head shared the Not-Invented-Here view. We couldn't use any handy managed services. We had to run our own Cassandra and every other service ourselves. It was a huge time sink for a small team that didn't deliver differentiating value.
At my current startup, we're almost 100% serverless on SaaS providers. We operate a thousand or more nodes with no ops team (a couple of engineers know ops, when needed). We leave the complexity of scaling and maintenance to our cloud vendor's autoscaling services. Sure, we could reduce our opex in places by trying to roll our own, but the ability to have the vast majority of engineers delivering customer product, rather than reinventing the wheel, is more valuable to us than fear of vendor lock-in. If our cloud vendor wanted to jack rates, we'd take the necessary action, which we've done before when we banged out a service in a week that a vendor was going to hold us over a barrel for, then dropped the vendor.
> The one time somebody tries some managed service, it goes over budget by 3000%, the after-action review figures out that it would have been within budget using <open source technology> on EC2, and so they just do that instead
This impacts casual dabblers too. More than once I've seen HN comments on how someone is wary of experimenting with cloud computing because a single screw-up can lead to an essentially unlimited bill. Judging by Hacker News anecdotes of when this does happen (unexpected overruns of thousands of dollars), there's a reasonable chance Amazon will refund you out of goodwill, but that's not enough to lay the fears to rest.
Linode lets you pre-pay, for instance, but (to my knowledge) this isn't an option offered by Amazon, Microsoft, or Google.
You seem to have made up your mind, but for the benefit of others: yes, we use many AWS products and are having a fantastic time of it. AWS services are more reliable, less buggy and have more stable APIs than any of the alternatives. Specific services that made a difference to us are Fargate, Spot, Batch, SQS/SNS, Lambda, ECR, EFS, ELB/ACM, CloudWatch Logs, IAM, Aurora RDS, SSM/Secrets Manager, and Step Functions - in addition to EC2/VPC/EBS, Route53 and S3 that you already listed. Each of these services does a lot to free us up to do more value added domain-specific work.
Terraform has emerged as a key tool to manage AWS resources - so much so that it really adds a lot to the value prop of AWS itself. I can do stuff with Terraform that was only aspirational until it existed.
Personally, I wouldn't plan on using on-prem except for niche applications that involve heavy data streams from local hardware. In the time that I've spent with companies working on AWS, I've seen a number of other companies waste lots of time and resources on heterogeneous strategies while complaining about their AWS bill - which was high because someone got so fed up with IT dysfunction, they went and used AWS but left behind inefficiently configured resources that were on all the time. Cloud often ends up being an escape hatch for teams that are not adequately served by dogmatic IT departments.
The key glue to a lot of what you mentioned missing in your list to make a lot of this work for outside consumers: API Gateway, arguably their worst product.
I agree, they really need to replace that product. It's not built to the standard that users expect from AWS. They have burned a lot of goodwill on that one.
That said, it's only truly necessary for Lambda and while it's frustrating and painful there, it's usually not a complete showstopper.
I'm in two minds about this (deeper integration with a particular vendor - i.e. "serverless")
Reduced time to market is incredibly valuable. Current client base is well in its millions. Ability to test to few and roll out to many instantly is invaluable. You no longer have to hire competent software developers who understand all patterns and practices to make scalable code and infrastructure. Just need them to work on a particular unit or function.
The thing which scares me is that some of these companies are decades old, some hundreds of years. How long have the AWS/GCP/Azure abstractions been around for? How quick are we to graveyard some of these platforms? Quite. A lot quicker than you can lift, shift and rewrite your solution to run elsewhere.
It's always a trade-off though. You say you write most of your own software, but that's probably not true for, say your OS or programming language, or editors, or a million other things. Cloud software is the same; you might not be producing the most value if you spend your engineering hours (re)creating something you could buy.
In my own experience:
- AWS SNS and SQS are rock solid and provide excellent foundations for distributed systems. I know I would struggle to create the same level of reliability if I wrote my own publish-subscribe primitives and I've played enough with some of the open source alternatives to know they require operational costs that I don't want to pay.
- I use EC2 some of the time (e.g. when I need GPUs), but I prefer to use containers because they offer a superior solution for reproducible installation. I tend to use ECS because I don't want to take on the complexity of K8S and it offers me enough to have reliable, load-balanced services. ECS with Fargate is a great building block for many, run-of-the-mill services (e.g. no GPU, not crazy resource usages).
- Lambda is incredibly useful as glue between systems. I use Lambda to connect S3, SES, CloudWatch, and SQS to application code. I've also gone without Lambda on the SQS side and written my framework layers to dispatch messages to application code. This has advantages (e.g. finer-grain backoff control) but isn't worth it for smaller projects. (A minimal sketch of the SQS-to-Lambda glue pattern follows this list.)
- Secrets manager is a nice foundational component. There are alternatives out there, but it integrates so well with ECS that I rarely consider them.
- RDS is terrific. In a past life, I spent time writing database failover logic and it was way too hard to get right consistently. I love not having to think about it. Plus encryption, backup, and monitoring are all batteries included.
- VPC networking is essential. I've seen too many setups that just use the default VPC and run an EC2 instance on a public IP. The horror.
- I've recently started to appreciate the value of Step Functions. When I write distributed systems, I tend to end up with a number of discrete components that each handle one part of a problem domain. This works, but creates understandability problems. I don't love writing Step Functions using a JSON grammar that isn't easy to test locally, but I find that the visibility they offer in terms of tracing a workflow is very nice.
- CloudFront isn't the best CDN, but it is often good enough. I tend to use it for frontend application hosting (along with S3, Route53, and ACM).
- CloudWatch is hard to avoid, though I rather dislike it. CloudWatch rules are useful for implementing cron-like triggers and detecting events in AWS systems, for example knowing whether EC2 failed to provision spot capacity.
- I have mixed feelings about DynamoDB as well. It offers a nice set of primitives and is often easier to start using for small projects than something like RDS, but I rarely operate at the scales where it's a better solution than something like RDS PostgreSQL with all the terrific libraries and frameworks that work with it.
- At some scale, you want to segregate AWS resources across different accounts, usually with SSO and some level of automated provisioning. You can't escape IAM here and Control Tower is a pretty nice solution element as well.
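As mentioned in the Lambda point above, here is roughly what the SQS-to-Lambda glue boils down to: a handler that unpacks the SQS event batch and hands each message to application code. This is only a sketch; the function and module names are hypothetical, and it assumes the Lambda is wired to the queue via an SQS event source mapping with partial batch failure reporting enabled.

```python
import json

def process_order_event(payload: dict) -> None:
    # Hypothetical application-level entry point; in a real project this
    # would live in your own package and contain the domain logic.
    print(f"processing order {payload.get('order_id')}")

def handler(event, context):
    """Lambda entry point for an SQS event source mapping.

    SQS delivers a batch under event["Records"]; each record's body is the
    raw message string. Returning the IDs of failed messages lets SQS
    retry only those rather than the whole batch.
    """
    failures = []
    for record in event["Records"]:
        try:
            process_order_event(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```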
I'm not sure if I'm up to 20 services yet, but it's probably close enough to answer your question. There are better and worse services out there, but you can get a lot of business value by making the right trade-offs, both because you get something that would be hard to build with the same level of reliability and security and because you can spend your time writing software that speaks more directly to product needs.
As for "having a fantastic time", YMMV. I am a huge fan of Terraform and tend to enjoy developing at that level. The solutions I've built provide building blocks for development teams who mostly don't have to think about the services.
Given the situation with Amazon Essentials, anyone who sells a product made substantially of stringing together AWS technology may find themselves cut out of the picture soon enough.
Many other companies have ended up competing with their own vendors. The only thing unique about Amazon is that they are boiling the frog much more slowly than previous generations of companies have.
I second that. It's not only that you make yourself completely intertwined with a Cloud by using more than fundamental services.
The costs of lambda or even DDB are IMMENSE. These only pay off for services that have a high return per request. I.e. if you get a lot of value out of lambda calls, sure, use them. But for anything high-frequency that earns you little to nothing on its own, forget about it.
Generally all your critical infrastructure should be cloud-independent. That narrows your choices largely to EC2, SQS, perhaps Kinesis, Route53, and the like. And even there you should implement all your features with two clouds, i.e. Azure and AWS, just to be sure.
The good news is also the bad news. There are effectively only two options: Azure or AWS. Google Cloud is a joke. They arbitrarily change their prices, terminate existing products, and offer zero support. It's just the Google we have come to love. They just don't give a shit about customers. Google only cares about "architecture", i.e. how cool do I feel as an engineer having built that service. Customer service is something that Google doesn't seem to understand. So think carefully if you want to buy into their "product". Google, literally, only develops products for their own benefit.
A very long time ago, App Engine went out of beta and there was a price hike, leaving many scrambling. App Engine was in beta so long that many people didn't think that label meant anything.
Google Cloud has quite good support and professional services.
I’ve worked with them for 3 years and can’t think of any services that have been killed.
They are very customer focused. From my perspective as a partner cloud services are more built for customer use cases than Google internal use cases. GKE and Anthos for example.
The optionality of being cloud agnostic comes with a huge cost, both because of all the pieces you have to build+operate and because of the functionality you have to exclude from your systems.
I am sure there are scales where you either have such a large engineering budget that you can ignore these costs or where decreasing your cloud spend is the only way to scale your business. But for the average company, I can't see how spending so much on infrastructure (and future optionality) pays off, especially when you could spend on product or marketing or anything else that has a more direct impact on your success.
> But for the average company, I can't see how spending so much on infrastructure (and future optionality) pays off, especially when you could spend on product or marketing or anything else that has a more direct impact on your success.
If you change "average company" to "average startup" then your point makes sense. But for a normal company, not everything needs to make a direct impact on your success. For example, guaranteeing long-term business continuity is an important factor too.
Unless you’re planning for the possibility of AWS dropping offline permanently with little to no notice, it really feels like you’re just paying a huge insurance premium. Like any insurance, it’s down to whether you need insurance or could cover the loss. Whether you’d rather incur a smaller ongoing cost to avoid the possibility of a large one time loss.
If AWS suddenly raised their prices 10x overnight, it would hurt but not be an existential threat for most companies. At that point they could invest six months or a year into migrating off of AWS.
In rough numbers, that would end up costing us something like $4m in cloud spend and staff if we retasked the entire org to accomplishing it for a year.
There’s certainly an opportunity cost as well, but I’d argue it’s not dissimilar to the opportunity cost we’d have been paying all along to maintain compatibility with multiple clouds.
Obviously it’s just conjecture, but my gut says the increased velocity of working on a single cloud and using existing Amazon services and tools where appropriate has made us significantly more than the costs of something that may never happen.
Plus I've seen more than a few efforts at multi-cloud that resulted in a strong dependency on all clouds vs the ability to switch between them. So not only do you not get to use cloud-specific services, you don't really get any benefit in terms of decoupling.
There are obviously plenty of companies that are willing to couple themselves to a single cloud vendor (e.g. Netflix with AWS) and plenty of business continuity risks that companies don't find cost-effective to pursue. Has anyone been as vocal about decoupling from CRM or ERP systems as they are with cloud?
My own view is that these kinds of infrastructure projects create as many risks as they solve, and happen at least as much because engineers like to solve these kinds of problems as for any other reason.
> The optionality of being cloud agnostic comes with a huge cost, both because of all the pieces you have to build+operate
This sounds like cloud vendor kool-aid to me. Nearly every cloud vendor product above the infrastructure layer is a version of something that already exists in the world. When you outsource management of that to your cloud vendor you might lose 50% of the need to operationally manage that product, but about 50% of it is irreducible. You still need internal competence in understanding that infrastructure, and one way or another you're going to develop it over time. But if it's your cloud vendor's proprietary stack, then you are investing all your internal learning into non-transferrable skills instead of ones that can be generalised.
I can run PostgreSQL myself, but there's a ton that goes into running it with redundancy, backup, encryption, etc. that takes deep expertise to do well. I know from experience that it's easy to get database failover wrong. I could probably cobble something together, but it would be mediocre and wouldn't handle network partitions reliably. On the other hand, RDS is used at a scale well beyond what I could afford to build out and has benefitted from much more usage. Problems that are low-probability for me are regular events for RDS and have that experience built in.
Ultimately, you are paying a cloud vendor for service value and operational experience. Some services aren't as good as others in both regards, but for the ones that are good, the overhead of doing it yourself is an exponent, not a fraction.
To be honest, I would probably fit RDS more into the infrastructure layer than the application layer. I think there is value in having things at that level be managed.
I have never used Pulumi, I've used Terraform a bit. I like Terraform so this isn't a dig at the tools.
Abstracting your cloud provider is similar to abstracting your database. At a high level they appear to be the same and do similar things; however, they are very different when you get into the fine detail.
Pulumi / Terraform are useful in that they provide a common language / API across cloud providers; however, you will never just switch cloud providers by changing a few lines of code.