
I'd originally posted this here: https://lobste.rs/s/surdxc/is_billion_dollar_worth_server_ly...

But cross posting in case it's interesting to this audience.

Over the past few years of my career, I was responsible for over $20M/year in physical infra spend. Colocation, network backbone, etc. And then 2 companies that were 100% cloud with over $20M/year in spend.

When I was doing the physical infra, my team was managing roughly 75 racks of servers in 4 US datacenters, 2 on each coast, and an N+2 network backbone connecting them together. That roughly $20M/year counts both OpEx and CapEx, but not engineering costs. I haven’t done this in about 3 years, but for 6+ years in a row, I’d model out the physical infra costs vs AWS prices at 3-year reserved pricing. Our infra always came out about 40% cheaper than buying from AWS, for as apples-to-apples a comparison as I could get. Now I would model this with a savings plan, and probably bake in some of what I know about the discounts you can get when you’re willing to sign a multi-year commit.
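To make that comparison concrete, here's a rough sketch of the shape of the model I mean: amortized CapEx plus annual OpEx on one side, equivalent reserved-rate spend on the other. Every number below is a made-up placeholder for illustration, not a real quote from my spreadsheets; plug in your own colo, hardware, and AWS pricing.

```python
# Sketch of an apples-to-apples cost comparison. All numbers are placeholders.

YEARS = 3

# Owned infra: CapEx amortized over the comparison window, plus annual OpEx.
server_capex_per_rack = 250_000        # servers, top-of-rack switches, PDUs (assumed)
racks = 75
colo_opex_per_rack_year = 30_000       # space, power, cooling, remote hands (assumed)
network_backbone_year = 2_000_000      # transit, cross-connects, backbone (assumed)

owned_total = (
    server_capex_per_rack * racks
    + (colo_opex_per_rack_year * racks + network_backbone_year) * YEARS
)

# Cloud: an equivalent instance footprint at 3-year reserved pricing,
# plus a rough allowance for egress and managed-service overhead.
equiv_instance_cost_year = 14_000_000  # assumed 3-yr RI / savings-plan rate
egress_and_extras_year = 3_000_000     # assumed

cloud_total = (equiv_instance_cost_year + egress_and_extras_year) * YEARS

print(f"Owned over {YEARS}y: ${owned_total / 1e6:.1f}M")
print(f"Cloud over {YEARS}y: ${cloud_total / 1e6:.1f}M")
print(f"Owned is {(1 - owned_total / cloud_total) * 100:.0f}% cheaper")
```

With these particular placeholders the owned side lands roughly 40% cheaper, but the output is only as good as the inputs; the real work is getting honest, comparable numbers on both sides.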

That said, cost is not the only factor. Now bear in mind, my perspective is not 1 server, or 1 instance. It’s single-digit thousands. But here are a few tradeoffs to consider:

Do you have the staff / skillset to manage physical datacenters and a network? In my experience you don’t need a huge team to be successful at this. I think I could do the above $20M/year, 75 rack scale, with 4-8 of the right people. Maybe even less. But you do have to be able to hire and retain those people. We also ended up having 1-2 people who did nothing but vendor management and logistics.

Is your workload predictable? This is a key consideration. If you have a steady or highly predictable workload, owning your own equipment is almost always more cost-effective, even when considering that 4-8 person team you need to operate it at the scale I’ve done it at. But if you need new servers in a hurry, well, you basically can’t get them. It takes 6-8 weeks to get a rack built, and then you have to have it shipped, installed, bolted down, etc. All this takes scheduling and logistics, so you have to do substantial planning. That said, these days I also regularly run into issues where the big 3 cloud providers don’t have the gear either, and we have to work directly with them for capacity planning. So this problem doesn’t go away completely; once your scale is substantial enough, it gets worse again, even with cloud.
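The lead-time math is simple but unforgiving: whatever demand shows up while the racks are being built and shipped has to be ordered before you actually need it. A toy illustration, with invented numbers:

```python
# Toy lead-time planning calc. All figures are made up for illustration.

current_capacity = 1000        # usable server slots across existing racks
in_service = 780               # servers in use today
weekly_growth = 25             # net new servers needed per week (forecast)
lead_time_weeks = 8            # order -> built, shipped, racked, bolted down
servers_per_rack = 40

# Weeks until existing capacity is exhausted at the current growth rate.
weeks_to_full = (current_capacity - in_service) / weekly_growth

# Demand that will arrive while the new racks are still in transit.
demand_during_lead_time = weekly_growth * lead_time_weeks
racks_to_order_now = -(-demand_during_lead_time // servers_per_rack)  # ceiling division

print(f"Capacity exhausted in {weeks_to_full:.1f} weeks")
print(f"Need to order at least {racks_to_order_now} racks now to cover the lead time")
```

If the weeks-to-full number is smaller than the lead time, you're already late, which is exactly why the planning discipline matters.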

If your workload is NOT predictable, or you have crazy fast growth, deploying mostly or all cloud can make huge sense. Your tradeoff is that you pay more, but you get a lot of agility for the privilege.

Network costs are absolutely egregious on the cloud. Especially AWS. I’m not talking about a 2x or 10x markup. By my last estimate, AWS charges roughly 200-300x their underlying cost for egress! This is based on my estimates of what it would take to buy the network transit and routers/switches you’d need to egress a handful of Gbps. I’m sure this is an intentional lock-in strategy on their part. That said, I have heard rumors of quite deep discounts on the network if you spend enough $$$. We’re talking three-digit-million, multi-year commits to get the really good discounts.
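Here's a back-of-the-envelope version of that markup estimate. The transit price and utilization below are assumptions (and the router/switch amortization is left out, which would push the DIY number up a bit); swap in your own quotes.

```python
# Rough egress markup estimate. Transit price and utilization are assumptions.

aws_egress_per_gb = 0.09        # typical on-demand internet egress rate, $/GB
transit_per_mbps_month = 0.10   # assumed blended IP transit price at a large commit
utilization = 0.8               # assumed average fill on the committed capacity

seconds_per_month = 30 * 24 * 3600

# GB actually pushed per Mbps of committed capacity in a month.
gb_per_mbps_month = (1e6 / 8) * seconds_per_month * utilization / 1e9

diy_cost_per_gb = transit_per_mbps_month / gb_per_mbps_month
print(f"DIY transit: ~${diy_cost_per_gb:.5f}/GB")
print(f"AWS egress markup: ~{aws_egress_per_gb / diy_cost_per_gb:.0f}x")
```

With those assumptions the markup comes out in the low hundreds; cheaper transit or higher utilization pushes it higher, hardware and lower fill pull it back down.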

My final point, and a major downside of cloud deployments combined with a Service Ownership / DevOps model, is that you can see your cloud costs grow to insane levels due to simple waste. Many engineering teams just don’t think about the costs. The cloud makes lots of things seem “free” from a friction standpoint, so it’s very, very easy to have a ton of resources running, racking up the bill, and then it’s a lot of work to claw that back. You either need a set of gatekeepers, which I don’t love because that ends up looking like an Ops team, or you have to build a team to provide cost visibility and attribution.
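The attribution side doesn't have to be fancy to be useful. A minimal sketch, assuming a generic billing export with per-line-item costs and a team tag (the column names and tag key here are hypothetical, and real exports such as AWS Cost and Usage Reports are shaped differently):

```python
# Minimal cost-attribution sketch: roll up a billing export by a team tag so
# every team sees its own number. Column names and tag key are hypothetical.

import csv
from collections import defaultdict

def cost_by_team(billing_csv_path, tag_column="tag_team", cost_column="unblended_cost"):
    totals = defaultdict(float)
    with open(billing_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            team = row.get(tag_column) or "untagged"   # surface untagged spend too
            totals[team] += float(row[cost_column] or 0.0)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

if __name__ == "__main__":
    for team, cost in cost_by_team("billing_export.csv").items():
        print(f"{team:20s} ${cost:12,.2f}")
```

The "untagged" bucket is usually the most interesting line on the report, because it's the spend nobody is accountable for.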

On the physical infra side, people are forced to plan, forced to come ask for servers. And when the next set of racks isn’t arriving for 6 weeks, they have to get creative and find ways to squeeze more performance out of their existing applications. This can lead to more efficient use of infra. In the cloud world, you just turn up more instances and move on. The bill doesn’t come until next month.

Lots of other thoughts in this area, but this got long already.

As an aside, for my personal projects, I mostly do OVH dedicated servers. Cheap and they work well. Though their management console leaves much to be desired.



