It's interesting to see the rapidly growing demand graph. Is that because people want to adopt it, or is it being mandated or encouraged as "best practice", etc.? I'm not implying that true organic demand doesn't exist, because it definitely might, but I have seen in practice that leadership encourages or even mandates usage of FaaS, so the numbers go up even though on a neutral field people wouldn't necessarily choose it. There's also the "I'd like to try it" group who haven't yet had experience with it and choose it as a target for learning/curiosity reasons.
I wonder because my own experience with FaaS has been mostly bad. There are some nice things for sure, and a handful of use cases where it's wonderfully superior, such as executing jobs where the number of concurrent processes fluctuates wildly, making rapid scalability highly desirable. The canonical use case being "make a thumbnail for this image."
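For that canonical case, a minimal Lambda-style handler sketch might look like the following; the bucket names, the THUMB_BUCKET variable, and the key layout are assumptions for illustration, not anything from the article:

    # Hypothetical AWS Lambda-style handler for the "make a thumbnail" case.
    import os
    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    THUMB_BUCKET = os.environ.get("THUMB_BUCKET", "my-thumbnails")  # assumed config

    def handler(event, context):
        # S3 "object created" events carry the bucket and key of the new upload.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            src = f"/tmp/{os.path.basename(key)}"
            s3.download_file(bucket, key, src)

            # Resize in place and write the thumbnail under a predictable key.
            with Image.open(src) as img:
                img.thumbnail((128, 128))
                dst = f"/tmp/thumb-{os.path.basename(key)}"
                img.save(dst)

            s3.upload_file(dst, THUMB_BUCKET, f"thumbs/{key}")

The appeal is that there is no server, queue, or autoscaling policy to own: the bursty "thousands of uploads at once" case and the idle case cost the same amount of operational attention.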
For web servers though, I find the opacity (poor observability) and the difficulty of running locally to be significant hindrances. There's also lock-in, which I hate with a passion. At this point I'd rather manage static EC2 instances than Lambdas, for example. (To be clear, I'm not advocating static EC2 instances. My preference is Kubernetes all the things. Not perfect of course, but K8s makes horizontal scalability very easy while improving on visibility. But that's a different conversation.)
I think one of the core arguments for larger organisations is that incompetence cannot ruin everything.
Firebase is a good case study for this: they argue heavily for using Firestore and serverless functions. If you succeed in solving your problems using their offerings, then they will also guarantee that things run well and scale well.
Firestore, as of when I last used it, did not support all the operations that can bring a traditional DBMS to its knees. You have document-level isolation, i.e. no joins or aggregating functions (like count or sum across documents). These need to be implemented in another way, using aggregators or indices.
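As a concrete example of "implemented in another way": a minimal sketch of keeping an aggregate up to date at write time, using the google-cloud-firestore Python client. The collection and field names are hypothetical, and it assumes the aggregate document already exists:

    from google.cloud import firestore

    db = firestore.Client()

    def create_order(order_id: str, order: dict) -> None:
        batch = db.batch()
        # Write the new document...
        batch.set(db.collection("orders").document(order_id), order)
        # ...and bump the aggregate document in the same atomic batch,
        # since there is no COUNT/SUM across documents to fall back on.
        batch.update(
            db.collection("aggregates").document("orders"),
            {
                "count": firestore.Increment(1),
                "revenue_cents": firestore.Increment(order["total_cents"]),
            },
        )
        batch.commit()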
So I agree that development is more fun when developing on proper runtimes with fully fledged databases. But when you manage several thousand developers at all levels, then I think it makes sense to impose another architecture.
> Is that because people want to adopt it, or is it being mandated or encouraged as "best practice"
It's because it's the simplest, fastest way to get compute for non-realtime bits of code. It's much less hard to deploy stuff to, and it's really simple to trigger it from other services.
A lot of things in FB are communicated by RPC, so it's not really "web" functions that run on there, it's more generic ETL-type stuff (as in: system x has updated y, and this triggers a function to update paths to use the latest version).
> It's because it's the simplest, fastest way to get compute for non-realtime bits of code. It's much less hard to deploy stuff to, and it's really simple to trigger it from other services.
That's all true, but IMO the problem with functions is that they are initially so simple. But deploying code isn't actually that hard of a problem. The hard part is growing and maintaining the codebase over time.
I'm not saying there ISN'T a use case for them, but there should be a very good reason why you want to split them off of other services.
I worked at a place that went full Lambda for a website (this was possibly 2016). They started out with huge velocity; things were much quicker to build and test.
Serverless (the framework) was a joy to deploy with, compared to what they were used to. They had complete control over their architecture for the first time. However, they then slammed into paying down all the innovation tokens they had spent (new DB, new message routing, new hosting arch, new auth methods) and hit the productivity wall.
> as in system x has updated y, this triggers a function to update paths to use the latest version
This is the sweet spot: eventing. Any kind of queue-based workload, especially one with variable load, is a potential candidate for a FaaS-style architecture. The alternative is a worker pool specific to the workload. FaaS just moves the abstraction up, to how processes are managed in a global worker pool.
Yeah it sucks to be a FaaS customer on a cloud run by someone else... you're just overpaying for easy instead of simple etc.
But if you're in Meta, and you're running on an abstraction built and maintained by Meta, then it's running for cost instead of profit, and the incentives all align between user and infrastructure provider?
I imagine the development and deployment ease for a system at Meta for Meta could be dreamy. At least, it has the potential to be... :)
> It's interesting to see the heavily growing demand graph. Is that because people want to adopt it, or is it being mandated or encouraged as "best practice" etc?
FaaS is well justified from the point of view of an infrastructure provider. You get far better utilization from your hardware with a tradeoff of a convoluted software architecture and development model.
In theory you also get systems that are easier to manage as you don't have teams owning deployments from the OS and up, nor do they have to bother with managing their scaling needs.
It also makes sense in the technical side because when a team launches a service, 90% of the thing is just infrastructure code that needs to be in place to ultimately implement request handlers.
If that's all your team needs, why not get that redundancy out of the way?
Nevertheless we need to keep things in perspective, and avoid this FANG-focused cargo-cult idiocy of mindlessly imitating any arbitrary decision regardless of whether it makes sense. FaaS makes sense if you are the infrastructure provider, and only if you have a pressing need to squeeze every single drop of utilization from your hardware. If your company does not fit this pattern, odds are you will be making a huge mistake by mimicking this decision.
> The rapid growth at the end of 2022 is due to the launch of a new feature that allows for the use of Kafka-like data streams [12] to trigger function calls
With regard to opacity and observability, I don't see why it would be any worse to run PHP code on XFaaS than to run the PHP code on another host.
Most teams at meta are free to choose which internal tools to use. We evaluated the maturity, performance, staffing levels, roadmap when choosing tools.
> There's also lock-in which I hate with a passion.
I wonder if this isn't clouding your judgement here. FaaS is subject to lock-in, absolutely, but for teams that need a bit of code run and need to not manage an instance, functions are the way to go. Need to be at an organization that's able to support that properly, but that's table stakes at this point.
As a developer who worked at small to medium sized companies, mainly developing microservices running on Kubernetes, I don't see a big advantage using FaaS as a customer. Maybe I am missing something? I am sure the provider gets more utilization for the hardware, but for the customer what's the point?
Most of the companies I've worked for or consulted with had some sort of background worker system, usually with a queue in postgres/redis/rabbitMQ/kafka/whatever in between. It usually gets used for non-latency-sensitive work like sending emails, instead of doing that from the main request handler, so that the user gets their response without having to wait for the other stuff to be completed. Bigger companies sometimes have multiple groups of background workers, one for each service and each with their own queues, servers and autoscaling policies.
As I understand the XFaaS system from the article it functions a bit like a consolidated background worker system that any service can submit their work to and it will (try to) make sure it gets done in whatever SLO is specified. This is usually cheaper because now you can do low-urgency work from service B when there are urgent jobs from service A and vice versa, leading to higher average server usage. For small/medium businesses this is probably not something you'd want, since developing and operating such a system costs more than you save. But if you have a multi-million dollar bill for your background workers alone then it might be worth it.
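A toy sketch of that scheduling idea (not Meta's implementation): jobs from any service land in one shared pool, ordered by a deadline derived from each job's SLO, so workers always pull the most urgent work regardless of who submitted it:

    import heapq
    import time
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass(order=True)
    class Job:
        deadline: float                                  # enqueue time + SLO, in seconds
        run: Callable[[], None] = field(compare=False)   # excluded from ordering

    queue: list[Job] = []

    def submit(fn: Callable[[], None], slo_seconds: float) -> None:
        """Any service can submit work; urgency is derived from its SLO."""
        heapq.heappush(queue, Job(time.time() + slo_seconds, fn))

    def worker_loop() -> None:
        # Workers always pick the job with the nearest deadline, regardless of
        # which service submitted it -- this is what smooths out the peaks.
        while queue:
            heapq.heappop(queue).run()

    # Low-urgency cleanup from service B coexists with urgent email from service A.
    submit(lambda: print("send signup email"), slo_seconds=60)
    submit(lambda: print("compact old logs"), slo_seconds=24 * 3600)
    worker_loop()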
Right, you just need a collection of XFaaS services, which need to be hosted somewhere... So you'll want some container orchestration tool, might as well use k8s.
If setting up a Kubernetes cluster isn't a terrible overhead for us, would switching to FaaS mean we have more control or less control? I am thinking mainly of migration to another provider.
Maybe we would benefit more if we would have large spikes in resource usage, but we don't.
> Furthermore, XFaaS explicitly does not handle functions and the path of a user-interaction
The wording of this is a little strange, but does this really mean that XFaaS is not used to serve end-user traffic at all?
The general approach here is interesting. I've always thought that there are two potential benefits for Function-based hosting ("Serverless") – low cost of components via scale-to-zero infrastructure (good for rare events, or highly variable traffic), and developer experience (coding against an infra framework rather than build-your-own infra). I'd have expected the former to be much less of an issue at scale, to the point where it's not the leading benefit, so I expected to hear more about the dev productivity side, but instead this article focuses significantly on the performance and cost side of things.
> but does this really mean that XFaaS is not used to serve end-user traffic at all?
When I worked at a company with a similar-ish setup, we were told that if you could do it in less than a minute and the user has to wait on it, do it in the request, don't send it to a job. This was because it was "cheaper" to keep a thread running vs. spinning up the resources to do a job. When I say cheaper, I mean in dev-time, user experience, and actual resources.
1) On the front-end, the dev doesn't have to "refresh" or "poll" an endpoint to get the status.
2) User experience is better. Many people naively check "every 30s" or something ... well, what happens if services are degraded and it takes more than 30s to respond to your status check? Now, after a few minutes, you have dozens of pending requests for the exact same resource. (A sketch of a safer polling loop follows this list.)
3) We didn't have a fancy SLO priority queue, just a regular one, so sometimes you could have your job stuck behind a ton of jobs that were higher priority than yours, causing what was only 30s in your tests to take 2 hours, with the user polling your status endpoint and randomly refreshing, then eventually contacting support asking why it isn't working.
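A sketch of a safer client-side poller for point 2: never issue a new status check while one is still outstanding, and back off when things are slow. The /jobs/{id}/status endpoint and response shape are made up for illustration:

    import time
    import requests

    def wait_for_job(job_id: str, base_url: str, timeout: float = 600.0) -> dict:
        delay = 1.0
        start = time.monotonic()
        while time.monotonic() - start < timeout:
            # The blocking call is itself the in-flight guard: a second status
            # check is never sent until the previous one has returned.
            resp = requests.get(f"{base_url}/jobs/{job_id}/status", timeout=30)
            body = resp.json()
            if body.get("state") in ("done", "failed"):
                return body
            time.sleep(delay)
            delay = min(delay * 2, 30.0)   # exponential backoff, capped at 30s
        raise TimeoutError(f"job {job_id} still pending after {timeout}s")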
I think they are trying to hammer home that it's not a realtime system. Anything you pass into it will be executed at some point in the future, so if you want something synchronous, it's not the tool to use.
Author of the paper summary here - my understanding is that XFaaS doesn't run functions that are run in response to user input (e.g. XFaaS does not execute code that fetches and returns data because a user clicked on a button).
Intuitively, the popularisation of FaaS feels inevitable as lower level technical challenges get solved and give way to abstractions like this. Who knows though what our stack will look like in 20 years.
Looking back at what I have seen since I started paying attention in the 1980s, much of it looks the same today, resold under new marketing terms by newly founded startups that are disrupting the ecosystem.
I agree, FaaS is treading deep into "CGI" territory, except there are a lot more resources available for nice-to-have operational overheads, so people think it's largely different.
As Gibson said, "The future is already here, it's just not very evenly distributed". And in IT, sometimes the future takes a long nap. I wouldn't be too surprised if e.g. descendants of Linda spaces get re-discovered because external circumstances favor that model.
Although with FaaS, I don't see anything particularly new from a development perspective. It's, well, functions. Most of the time they aren't communicating in any kind of novel way, and often they're doing the decade-old stuff of reading files and slurping databases. The abstraction being more on the operational side of things this time.
Not that I'm complaining or doing the "it's just CGI" dance. Compared to other "cloudy" tech, there's actually potential for simplifying things rather than just simulating VAX computers with more effort…
> To reduce latency, a common approach is to keep a VM idle for 10 minutes or longer after a function invocation to allow for potential reuse [45]. In contrast, if a FaaS platform is optimized for hardware utilization and throughput, this waiting time should be reduced by a factor of 10 or more, because starting a VM consumes significantly fewer resources than having a VM idle for 10 minutes
This seems somewhat surprising in such a controlled environment. Why does an idle function consume so many resources? If I think of a normal Linux system, an idle process doesn't really consume many resources at all.
A server idling on Function A can't be called for Function B. 144 servers idling for 10 minutes each before expiring makes up a day of compute-time for 1 server. For every 144 expected idle servers, Meta needs to purchase and maintain an additional physical machine to keep up with healthy throughput on xfaas.
Every 15 minutes, the 100k server network has 1,500,000 compute-minutes available.
For an extreme example, every 15 minutes, 50k DB Cleanup processes run for 5 minutes, and then sit idle for 10 minutes, totalling 15 mins each. In this scenario, to satisfy 250,000 demanded compute-minutes (16.6%), 750,000 compute-minutes were supplied (50%). All else equal, to keep up healthy throughput on the rest of the network, Meta needs to purchase and maintain an excess 33,400 (50%-16.6%) physical servers to satisfy this DB Cleanup process.
Now reduce the idle time to 1 minute: the 50k processes each run for 5 minutes, then sit idle for 1 minute, totalling 6 minutes each. In this scenario, to satisfy 250,000 demanded compute-minutes (16.6%), 300,000 compute-minutes were supplied (20%). All else equal, to keep up healthy throughput on the rest of the network, Meta only needs to purchase and maintain an excess of roughly 3,400 (20% - 16.6%) physical servers.
That's an extreme example, but hopefully it demonstrates why, at Meta's scale, the energy requirements of the servers can be secondary when optimising for hardware, since optimising for hardware utilisation can make a difference large enough to mothball a moderately sized datacenter.
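Here's the same back-of-the-envelope arithmetic as a quick script; the fleet size, job counts and run times are just the illustrative numbers from the example above:

    FLEET = 100_000            # servers in the example fleet
    WINDOW = 15                # minutes per cycle
    JOBS = 50_000              # concurrent DB-cleanup invocations per cycle
    RUN_MIN = 5                # minutes of real work per invocation

    def excess_servers(idle_min: float) -> float:
        demanded = JOBS * RUN_MIN                 # useful compute-minutes
        supplied = JOBS * (RUN_MIN + idle_min)    # occupied compute-minutes
        capacity = FLEET * WINDOW                 # fleet compute-minutes per cycle
        # Extra machines needed just to cover the idle tail of warm VMs.
        return FLEET * (supplied - demanded) / capacity

    print(excess_servers(10))   # ~33,333 extra servers with a 10-minute idle window
    print(excess_servers(1))    # ~3,333 with a 1-minute idle window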
The server in this context is a "warmed" Function VM. The first time a VM instantiates, it goes through all of the setup required to run a function. That includes generic reusable stuff like Operating System setup, but also Function-specific things like JIT compilation of the Function software, library loading, language loading, all of which are specific to the Function being executed.
The Function VM is hyper-optimised for that one Function's task. There are compute-time costs in changing its role from Function A to Function B. In the public cloud, the only cost you save on is starting up the OS, and only if both Functions happen to use the same OS. It's faster in most public FaaS infrastructure settings to just destroy a Function VM than to convert it from Function A's configuration into Function B's. Similarly, it's faster to just leave the VM running for a little while, in case the Function is invoked again.
When the paper discusses VMs idling, that's the scenario they describe, where servers (or more accurately software VMs inside of physical servers), are locked into a specific function for some number of minutes. These minutes are dead computing time.
Meta, as a private cloud provider to themselves, can do some interesting things to reduce the cold-start of their Functions, meaning they can run more efficient clouds if they choose, by massively reducing the idle-time that's an industry standard in the public cloud.
> There's compute-time costs in changing its role from Function A to Function B.
Why do you need to change the existing VM in any way? Having an idle VM for function A should not prevent the host from instantiating new VMs for function B. I still do not see where the cost of an idle VM is.
> servers (or more accurately software VMs inside of physical servers), are locked into a specific function
I think this needs to be more specific; what resources are exactly locked to a VM? Are they pinning VMs to specific CPUs or something that prevents the host from scheduling other tasks there? If so, why?
Also, you sound very confident in your answers, do you have some additional sources you could point me to, or are you also basing all this on this one paper?
> Why do you need to change existing VM in any way? Having an idle VM for function A should not prevent the host from instantiating new VMs for function B. I still do not see where is the cost of an idle VM.
Because the host can only run so many VMs. What's being described is host resource exhaustion from running idle VMs.
But that still doesn't explain the original statement:
> waiting time should be reduced by a factor of 10 or more, because starting a VM consumes significantly fewer resources than having a VM idle for 10 minutes
To me it's not obvious why you would ever want time-based eviction of VMs. Surely it would be more efficient to evict VMs only in response to some resource pressure? Basically I'm imagining keeping VMs in some LRU-style structure where they get kicked out when needed, instead of just having a 1-minute timer.
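For what it's worth, a toy sketch of the pressure-based LRU eviction I have in mind, as opposed to a fixed timer. The WarmPool class and its per-function memory accounting are made up for illustration, not something the paper describes:

    from collections import OrderedDict

    class WarmPool:
        def __init__(self, memory_budget_mb: int):
            self.budget = memory_budget_mb
            self.used = 0
            self.vms: OrderedDict[str, int] = OrderedDict()   # function -> MB

        def get_or_start(self, function: str, mem_mb: int) -> str:
            if function in self.vms:
                self.vms.move_to_end(function)        # mark as recently used
                return "reused warm VM"
            # Evict least-recently-used warm VMs only when room is actually needed.
            while self.used + mem_mb > self.budget and self.vms:
                _, freed = self.vms.popitem(last=False)
                self.used -= freed
            self.vms[function] = mem_mb
            self.used += mem_mb
            return "cold start"

    pool = WarmPool(memory_budget_mb=1024)
    print(pool.get_or_start("resize_image", 256))   # cold start
    print(pool.get_or_start("resize_image", 256))   # reused warm VM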
But that is at the abstract level; at the concrete level it's also not obvious at all what the constraining resources consumed by idle VMs are, especially in a sense that would be comparable to the resource consumption of starting a new VM.
I realize that they are probably implying that they have some sort of static resource allocation for VMs, meaning that an idle VM consumes as many "resources" as an active VM. But that's not stated anywhere in the paper, and it is weird to make claims (or recommendations) that hinge on such silent assumptions, especially if they are in no way universal. I feel this business of improving utilization by reducing idleness is a somewhat central part of the paper, so that's why I latched onto it.
I also realize that managing resources efficiently with VMs is a bit more involved than with traditional processes (= containers), but at the same time this is where Meta, running a private cloud and practically owning the whole stack, could really do much more than a public cloud ever could. They could rely on more cooperation between VMs and the host on resource allocation, and even have the language runtime (HHVM etc.) cooperate here somehow.
Indeed, now that I think about it, it sounds like an interesting question how you would design a FaaS-optimized combined VMM, VM, and language runtime stack. They touch on that with their JIT caches, but that seems like one fairly narrow optimization; I'd imagine there is a lot you could do here. The end-game might look something like unikernels, or maybe something completely different.
(and I'm sorry if my comments have come off as combative, but this did stick out to me and I am just curious)
There's a table on the 2nd page of the paper that gives a summary of what goes on when the VM's container is initialised the first time:
INITIALIZATION PHASE:
(1) Start the VM.
(2) Fetch the container image and the function's code.
(3) Initialize the container.
(4) Start the language runtime such as Python or PHP.
(5) Load common libraries into memory.
(6) Load the function code into memory.
(7) Optionally, do JIT compilation.
INVOCATION PHASE:
(8) Invoke the function multiple times as needed.
SHUTDOWN PHASE:
(9) Stop the container if it receives no requests for X minutes (X=10/20/10 minutes for AWS/Azure/OpenWhisk respectively).
(10) Optionally, stop the VM
>I think this needs to be more specific; what resources are exactly locked to a VM?
Steps 1-7 are time-consuming processes, which load a container into memory so that step 8 can run rapidly when invoked later. Step 8 can be invoked an unlimited number of times on a VM once steps 1-7 have completed. Only the first Function execution on a VM pays the startup penalty.
>Are they pinning VMs to specific CPUs or something that prevents the host from scheduling other tasks there? If so, why?
They're pinning memory to VMs in their FaaS infrastructure, which in turn is memory consumed on their physical servers. A VM uses some number of MBs of memory to host a Function's unique container, its runtime, and its executables. Meta's FaaS solution completes several trillion operations per day, which is in the tens of millions per second avg. This relatively small amount of memory per VM becomes enormous at tens-of-millions-per-second scale.
>If so, why?
The resources are pinned to the memory by the first function initialisation, so that if there's a second request for the same Function, it can be executed on a pre-warmed VM, without going through the time-consuming steps 1-7. A trade-off between memory-consumption and time, both cost money.
There is some optimal amount of time between the end of the final Function invocation of step 8, and the start of step 9. If that amount of time is 0 seconds, then every Function request goes through steps 1-7. Whereas if it is 1 hour, then at midnight, when Meta's large once-a-day batch jobs run, most of the servers fill with idle VMs until 1am, waiting to satisfy a batch job that never instantiates.
Given that the infrastructure cannot be aware of which Function execution is "the final Function invocation", it's challenging to pin this amount of time down. In a public cloud, that optimal amount of time is set in the 10s of minutes. In Meta's case, as a private cloud provider to themselves, they have a unique awareness of the usage patterns of their Functions, and therefore can predict more accurately that they don't need 10-minute-idle Functions. They've instead been able to reduce the idle time to 1 minute. This releases containers & VMs, and returns memory back into the server pool.
>do you have some additional sources you could point me to, or are you also basing all this on this one paper?
Everything I've said about Meta is based on this paper, because it comprehensively describes Meta's unique circumstances, and how they approached the problem. The paper itself has dozens of additional sources.
> They're pinning memory to VMs in their FaaS infrastructure, which in turn is memory consumed on their physical servers
The paper doesn't say this anywhere. There is no reason why an idle VM could not release a significant amount of its memory, or have it paged to disk, or some hybrid in between. Especially on Linux you have the whole virtual memory, disk cache, and memory mapping machinery tightly coupled together, so there are all sorts of things that could happen. It would be naive to just assume that a VM uses some static amount of memory, especially when the paper doesn't actually indicate anything like that.
> This relatively small amount of memory per VM becomes enormous at tens-of-millions-per-second scale.
Not really. A single Firecracker VM has a memory overhead of <5MB; even a thousand VMs on a host is still just a few gigabytes of overhead, not exactly enormous. It's very unlikely that they have anywhere near a thousand VMs on a single host.
> The resources are pinned to the memory by the first function initialisation, so that if there's a second request for the same Function, it can be executed on a pre-warmed VM, without going through the time-consuming steps 1-7. A trade-off between memory-consumption and time, both cost money.
That just explains why you want to have the VM running, not why you'd have some fixed resources pinned to it. They are two very different things. And it especially doesn't explain why you'd stop the VM on a time basis instead of, e.g., based on memory pressure.
Honestly, this feels like talking to an AI. Not sure why you fill your comments with the elementary basics of FaaS and make a whole lot of unfounded assumptions, while not really giving much of any real substance.
I think it means reserving resources. When resources are reserved, they cannot be used for anything else (even if you're not actually using that reservation) so one can also say that those resources are "consumed".
RAM, disk. Functions are often written in high level JITd languages, they may need to load large reference datasets from disk, they may require a lot of code to be transferred over the network before they can begin running, and because nobody trusts the security of the Linux kernel you also have to pay the VM startup time.
An idle process on Linux isn't much different: it consumes RAM, disk and a kernel. If you trust the software you're running enough you can amortize the kernel cost but the others remain.
What they mean is that if you trusted the kernel you’d run functions in containers instead of micro-VMs, since containers need fewer resources and better start/stop latency.
Yes. Old-style pre-AWS clouds (ISPs with an FTP+LAMP stack) worked by using UNIX user-sharing features. It went out of fashion because Linux wasn't good enough at workload isolation and because it was insecure - too many local root escalation vulns. So now clouds all run on custom hypervisors. The VM/hypervisor interface is smaller and easier to secure than the userspace/kernelspace interface. Strangers don't share kernels, they share hypervisors.
Unfortunately FaaS platforms actually suffer twice, because Linux userland is too chaotic to use directly. So you have to boot a clean VM with Docker in it, then install a container into that VM so the user can send you software in the now-standard format, then start up the function.
Booting a Docker container from a standard running Linux system already takes ~100msec if I recall correctly. But 100msec of added latency can actually reduce usage and hurt revenue on big sites, so it's not acceptable. And that's just the inner container. Then you have the cost of booting Linux, cost of downloading the container (docker format is highly unoptimized) etc.
So all these costs add up and then the only way to solve them is to amortize them.
Oracle Cloud is working on a thing called GraalOS which is intended to help address this. It works by letting apps share a Linux kernel and using a userspace "supervisor" that relies on CFI, MPKs and NaCL style binary analysis to prevent code from connecting to the kernel directly. Containers and VMs are no longer necessary in that model and loading an app is just a case of copying it down to a node and mmapping it, but it does mean you have to be able to compile your app for this alternative pseudo-operating system. If you work with Java then you use native-image that makes Java apps compile to native code and start up super fast, so it works.
WASM also tries to solve this, essentially borrowing the isolation features from browsers for cheap startup latency. Cloudflare workers is the classic example.
Any insight into what kinds of work are being run in the functions? I run a lot of jobs in background queues, but I'm wondering if I'm missing out on some new paradigm.
It's fundamentally the same as a job queue, but the difference is that the people writing the job are not creating a running OS process. You literally just write a function, and it gets compiled into a process owned and executed by the job system.
Why would you want that? Well, who really wants to think about the OS, or how to get their data into main()? You just want to write business logic, and FaaS lets your developers focus on that. It's a small development process optimization, but a significant one at scale if you have enough developers / unique jobs being created. And it lets platform engineers focus on the best way to shovel data into main() in your particular environment.
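To make that concrete, a toy illustration of the split: the decorator/registry/dispatch machinery below is made up, it just shows that the developer's contribution is only the decorated function, while the platform owns the process, the queue, retries, and logging around it:

    from typing import Callable

    REGISTRY: dict[str, Callable] = {}

    def faas_function(name: str):
        """What the developer touches: decorate a plain function, nothing else."""
        def wrap(fn):
            REGISTRY[name] = fn
            return fn
        return wrap

    @faas_function("send_welcome_email")
    def send_welcome_email(payload: dict) -> None:
        print(f"emailing user {payload['user_id']}")   # pure business logic

    # What the platform team owns: main(), deserialisation, dispatch, retries...
    def dispatch(event: dict) -> None:
        REGISTRY[event["function"]](event["payload"])

    dispatch({"function": "send_welcome_email", "payload": {"user_id": 42}})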
It's a "job queue" in old-people-words, just managed by someone else. Sometimes they even offer integrations like putting a job on the queue on database triggers, or file changes.
Nothing really new, except that you don't have to build it yourself ... but if you do and you sell it to your coworkers/clients/customers, you just call it FaaS.
No it's basically just a big background job cluster, but for all the microservices at the same time so they can get a higher average utilization out of their servers by averaging out the peaks.
Innovation regarding FaaS comes from the big tech companies. I wish we had more open-source FaaS solutions; despite OpenFaaS's amazing achievements, it still lacks a more robust ecosystem.
It is literally "worth it" for companies much smaller than Meta due to the reduced infra management costs plus increased hardware utilization. A few hundred unique VMs is a boatload of work (even if you distribute it down to the dev teams), and most of those VMs will be sitting idle most of the time, just burning electricity to keep the RAM on.
But you would still need a boatload of a hundred unique VMs to run your serverless functions, right?
If you have a big enough SRE team to manage your serverless stack, economy of scale means that development teams utilising these can reduce their infra management costs. It is about the ratio of engineers utilising the stack compared to those that need to maintain it.
> But you would still need a boatload of a hundred unique VMS
No, you would have a boatload of generic worker VMs that could all be spun up as needed (autoscaled) from a common base image and deleted without any need to preserve state. Effectively, you're managing one VM image, which can be rolled out across your entire fleet very quickly and with zero downtime / disruption. This is even less disruption than with k8s because FAAS design is fundamentally short-lived processes, resulting in more automation (less work) for your SRE / VM team.
As far as I am aware there is no easy-to-set-up open-source FaaS framework. At least with Kubernetes there are a million and one tutorials, and there is a large support community.
The pendulum has swung (is swinging?) for many companies back to on-prem for cost savings. Self-hosted FAAS allows your developers to retain the abstraction over the compute platform (a step beyond what containers provide), and grants those running the physical infra significant flexibility in managing it. It's also arguably less complex than k8s for basically everyone, if your use case supports short-lived functions.
> The pendulum has swung (is swinging?) for many companies back to on-prem for cost savings.
This is only true of very large companies, so it doesn't change anything about this thread, which was saying that FaaS only makes sense if you're a public cloud or a large company with many on-prem servers.
FaaS could be sharing your hardware at a function level inside the JVMs or whatever you are running, rather than at the VM or hypervisor level. You can (potentially) get much better elasticity and utilisation?
Software already shares hardware at the function level. You have far more visibility and control over this within your program than you do when you add an unnecessary layer of abstraction like FaaS.
Think more about the name: Function as a Service.
Why do you need to serve yourself your own functions? It makes sense only in the context of a cloud provider that's trying to solve the problem of underutilized resources used across many customers.
Yes, but I'm just confused because saying "Why isn't there an open source FaaS for on prem that shares hardware" makes no sense. At that point, you're just talking about regular coding with functions. Unless there's something I'm missing.
I'm curious whether their "Locality Groups" concept can be implemented using stock AWS. Imagine you have 1000 types of tasks that can be called millions of times by users; each node can do any task, but resource usage would be better if the same task were performed on a node that has already done it. Sticky sessions are not it, as the same task can be initiated by different users.
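For what it's worth, one way to approximate it with stock AWS might be to shard task types across a fixed set of SQS queues, each drained by its own worker group, so the same task type keeps landing on workers that have already warmed up for it. The queue URLs below are placeholders, and this is only a sketch of the idea, not a claim about how XFaaS does it:

    import hashlib
    import json
    import boto3

    QUEUE_URLS = [
        "https://sqs.us-east-1.amazonaws.com/123456789012/tasks-group-0",
        "https://sqs.us-east-1.amazonaws.com/123456789012/tasks-group-1",
        "https://sqs.us-east-1.amazonaws.com/123456789012/tasks-group-2",
    ]

    sqs = boto3.client("sqs")

    def submit(task_type: str, payload: dict) -> None:
        # A stable hash of the task type picks the queue, independent of which
        # user triggered it, so warm state (loaded code, JIT caches) gets reused
        # by the worker group that owns that queue.
        digest = hashlib.sha256(task_type.encode()).digest()
        queue = QUEUE_URLS[int.from_bytes(digest[:4], "big") % len(QUEUE_URLS)]
        sqs.send_message(
            QueueUrl=queue,
            MessageBody=json.dumps({"task_type": task_type, "payload": payload}),
        )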