I'm not in the web or cloud business, but I've filled a rack with my stuff before. My impression is that hardware has become a lot more capable even relative to its tasks. With high-IOPS storage, many cores, and obscene amounts of RAM, I would expect that companies of a much larger scale (in $, FTEs, or most other metrics) could be served by one 4U machine, or by one rack, or by one room. Thus I would expect the knowledge of how to handle 5000 hard drives to become more obscure, naturally, but the skill to run a decently sized web application to remain almost constant.
Does this math work out, or have the tasks become more demanding at the same speed that hardware has improved?
A bank running 50 different services, on different platforms, with serious audit requirements, physical and logical access control, strict change and configuration management, etc., has two orders of magnitude more complexity. And that shit is very expensive in manpower.
There are now businesses that explicitly depend on the elasticity of the cloud and can never really be moved on premise without a massive up-front investment in hardware that may only be used a few times a year for their biggest customers. Trying to hybridize these workloads hasn't been very successful as of yet. It is possible that K8s could relieve this problem, but I haven't seen it in practice at scale.
Instant elasticity in the cloud is a myth. If you think you are going to get 1k hosts from AWS just like that, you will have an unpleasant experience.
I work at a decent-size tech company and we are split between cloud and on-prem. In our experience you have to inform AWS/GCP in advance (sometimes way in advance) if you are looking to meaningfully increase capacity in a zone/region.
Sure, auto-scaling a few hundred hosts may be possible, but people who run a service that needs a few hundred hosts don't run it directly on AWS; they run it under some kind of scheduler + resource manager that keeps an operational buffer anyway (as in, you would already have those hosts, so cloud elasticity is not a factor here).
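A minimal sketch of the kind of buffer logic such a scheduler keeps. The function name, the 20% ratio, and all the numbers are made up for illustration; real schedulers track far richer signals:

```python
# Hypothetical operational-buffer check for a scheduler/resource manager.
# All names and numbers here are illustrative assumptions, not any real API.

def hosts_to_acquire(in_use: int, total: int, buffer_ratio: float = 0.2) -> int:
    """Return how many extra hosts to request so the idle pool
    stays at least `buffer_ratio` of current usage."""
    desired_idle = int(in_use * buffer_ratio)
    idle = total - in_use
    return max(0, desired_idle - idle)

# With 300 of 330 hosts in use and a 20% buffer target (60 idle hosts),
# the manager would request 30 more ahead of demand:
print(hosts_to_acquire(in_use=300, total=330))  # -> 30
```

The point is that the buffer absorbs normal demand spikes, which is why "instant" cloud elasticity rarely matters once you're at this scale.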
How early is "way early"? Because as long as it's shorter than the two to three weeks it'd take to order boxes, rack them, provision them (which would be automated but might still take an afternoon), and deal with any QA hiccups... I'd much rather call my AWS rep and say "can we add 30% by Thursday" and have them figure it out (and at such a large scale you might be able to spread it out across a couple of regions anyway, unless you only serve a specific part of the world).
From what I have seen, the lead time is actually of the same order, or sometimes longer. In one region/zone we add a few hundred hosts every week, but that is after telling them we plan to scale up in that region to some big X number.
This is the same with disaster recovery too. The idea that "oh, our main DC went down, we'll just spin it up in another region" is great until you realize it means you need reserved instances in that other region: capacity that, just like another physical DC, you won't be using most of the time.
Right now there is no specific distinction between what we run in the cloud vs on-prem. The important thing to note is that we use the cloud as IaaS only. We have our own stack that prepares the hosts before they are ingested into clusters as usable capacity.
We actually recommend against using the cloud providers' managed databases or any other value-added services.
Why we didn't go completely one way or the other (on-prem vs cloud) is a decision made way before I joined the group, but I think the main reason is to keep a tactical edge in the long run by avoiding lock-in. I guess in some ways it also helps us negotiate pricing.
Imagine moving a certain workload from a GCP region to an AWS region as part of a failover drill.
I was talking about at scale, not a rack. If you can get by with a rack, you will pay more for the people to support it than the incremental cost of the cloud.
> If you can get by with a rack, you will pay more for the people to support it than the incremental cost of the cloud.
Probably a whole lot less.
At larger scale, I would guess it's the same thing. If an organization needs more than a rack during peak use, it can probably benefit from setting up its own infrastructure. Only in the uncommon case of short extreme peaks and almost no use the rest of the time does such elasticity make a cloud solution attractive. IMHO.
> My impression is that hardware has become a lot more capable even relative to its tasks.
Indeed. The margins are bonkers high. As an example, the amount of RAM that you can stuff into a physical machine has at least doubled in the last five years, but the price of the average virtual machine has not.
You still want HA, failover, and disaster recovery. Then you need to set up stuff like BGP, DNS, security rules, etc. Complexity mounts pretty quickly.
Indeed. It seems that most of the people saying that cloud hosting is expensive have never run into the issues of making their own SAN, managing the provisioning of 20 different teams, etc.
The organizational complexity and specialist knowledge is mind-boggling and there is zero chance that your in-house knowledge is better than what Amazon can provide.
Installing rack servers and setting up services to run a site used to be a sort of rite of passage 15-20 years ago, but that period of the web was different. Still, I would consider basic familiarity with the infrastructure necessary today as well.
Increasing hardware performance relative to task load created the rationale for virtualization. Virtualization also turned out to be rational with respect to consistency, convenience, maintenance, and so on. At that point, outsourcing to a cloud can be rational.
But fewer people get hands-on experience with the infrastructure, and it sounds like many consider it almost mythical. For example, few realize the amount of work that can be done in 4U today. What does Amazon charge for 96 cores and 256GB?
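As a back-of-envelope, with entirely made-up prices (the hardware cost, colo fee, and per-hour rate below are assumptions, not quotes; check current AWS and vendor pricing yourself), the rent-vs-buy break-even for such a box might look like:

```python
# Back-of-envelope rent-vs-buy for a 96-core / 256 GB 4U server.
# Every number below is an assumed figure for illustration only.

server_cost = 15_000.0   # assumed one-time hardware cost (USD)
colo_per_month = 300.0   # assumed power + rack space per month
cloud_per_hour = 4.0     # assumed on-demand rate for a comparable VM

cloud_per_month = cloud_per_hour * 24 * 30          # 2880.0
monthly_saving = cloud_per_month - colo_per_month   # 2580.0
breakeven_months = server_cost / monthly_saving     # ~5.8

print(f"cloud: ${cloud_per_month:.0f}/mo, "
      f"break-even after {breakeven_months:.1f} months")
```

Under those assumptions the box pays for itself in about half a year, which is exactly why the argument then shifts to the staff cost of running it.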