I don't think that's the real point. The real point is that 'big 3' cloud providers are so overpriced that you could run hugely over provisioned infra 24/7 for your load (to cope with any spikes) and still save a fortune.

The other thing is that cloud hardware is generally very very slow and many engineers don't seem to appreciate how bad it is. Slow single-thread performance because they use the most parallel CPUs possible (the cheapest per watt for the hyperscalers), very poor IO speeds, etc.

So often a lot of this devops/infra work is solved by just using much faster hardware. If you have a fairly IO-heavy workload, then switching from slow storage to PCIe 4.0 NVMe drives doing ~7 GB/s is going to solve so many problems. If your app can't do much work in parallel, then a CPU with much faster single-thread performance can bring huge gains.
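
If you want to see the gap on your own boxes, a rough sketch like this makes the storage difference obvious (the file path is a placeholder for a large file on the volume under test; it's no substitute for fio, since page cache and queue depth matter):

    # Rough sketch: time random 4 KiB reads against whatever backs your DB volume.
    # Assumes a large pre-existing file at PATH; the page cache will flatter the numbers.
    import os, random, time

    PATH = "testfile.bin"   # placeholder
    BLOCK = 4096
    READS = 1000

    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    samples = []
    for _ in range(READS):
        offset = random.randrange(0, size - BLOCK)
        t0 = time.perf_counter()
        os.pread(fd, BLOCK, offset)
        samples.append((time.perf_counter() - t0) * 1000)
    os.close(fd)

    samples.sort()
    print(f"p50 {samples[len(samples)//2]:.2f} ms, p99 {samples[int(len(samples)*0.99)]:.2f} ms")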




> The other thing is that cloud hardware is generally very very slow and many engineers don't seem to appreciate how bad it is.

This. Mostly disk latency, for me. People who have only ever known DBaaS have no idea how absurdly fast they can be when you don’t have compute and disk split by network hops, and your disks are NVMe.

Of course, it doesn’t matter, because the 10x latency hit is overshadowed by the miasma of everything else in a modern stack. My favorite is introducing a caching layer because you can’t write performant SQL, and your DB would struggle to deliver it anyway.


> Of course, it doesn’t matter, because the 10x latency hit is overshadowed by the miasma of everything else in a modern stack.

This. The complaints about performance seem to come from people who are not aware of latency numbers.

Sure, the latency from reading data from a local drive can be lower than 1ms, whereas in block storage services like AWS EBS it can take more than 10ms. An order of magnitude slower. Gosh, that's a lot.

But whatever your disk access needs are, your response still has to be sent over the wire to clients, and that takes 100-250ms.

Will your users even notice a difference if your response times are 110ms instead of 100ms? Come on.


While network latency may overshadow that of a single query, many apps have many such queries to accomplish one action, and it can start to add up.

I was referring more to how it's extremely rare to have a stack as simple as request --> LB --> app --> DB. Instead, the app is almost always split into microservices, even when that wasn't warranted, and each service is still making calls to DBs. Many of the services depend on other services, so there's no parallelization there. Then there's the caching layer stuck between service --> DB, because by and large the RDBMS isn't understood or managed well, so the fix is to just throw Redis between them.
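
A toy illustration of the difference (the latency numbers are made up): a chain of dependent service calls sums, while independent calls could at least overlap.

    # Toy latency budget; milliseconds are illustrative assumptions.
    calls = {"auth": 5, "profile": 8, "orders": 12, "recommendations": 20}

    sequential = sum(calls.values())   # each service waits on the previous one
    parallel = max(calls.values())     # best case if nothing depended on anything

    print(f"dependent chain: {sequential} ms")   # 45 ms
    print(f"fully parallel:  {parallel} ms")     # 20 ms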


> While network latency may overshadow that of a single query, many apps have many such queries to accomplish one action, and it can start to add up.

I don't think this is a good argument. Even though disk latencies can add up, unless you're doing IO-heavy operations that should really be async calls anyway, they are always a few orders of magnitude smaller than the overall response time.

The hypothetical gains you get from getting rid of 100% of your IO latencies top out at a couple of dozen milliseconds. In platform-as-a-service offerings such as AWS's DynamoDB or Azure's CosmosDB, which involve a few network calls, an index query normally takes between 10 and 20ms. You barely get above single-digit percentage gains even if you lower disk latencies to zero.

In relative terms, if you are operating an app where single-millisecond deltas in latency matter, you get far greater decreases in response times from regional and edge deployments than from switching to bare metal. And forget about doing regional deployments if you're running your hardware in-house.

This is why any talk about performance needs to start with getting performance numbers and figuring out the bottlenecks.
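
In that spirit, even something as crude as this beats guessing at where the time goes (a sketch; a real setup would use proper tracing, and the handler names in the usage comment are hypothetical):

    # Minimal timing helper: wrap the suspect sections and log what they cost.
    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(label):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            print(f"{label}: {(time.perf_counter() - t0) * 1000:.1f} ms")

    # usage inside a request handler:
    # with timed("db query"):
    #     rows = run_query(...)
    # with timed("render"):
    #     body = render(rows)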


Did you miss where I said “…each service is still making calls to DBs. Many of the services depend on other services…?”

I’ve seen API calls that result in hundreds of DB calls. While yes, of course refactoring should be done to drop that, the fact remains that if even a small number of those calls have to read from disk, the latency starts adding up.

It’s also not uncommon to have horrendously suboptimal schema, with UUIDv4 as PK, JSON blobs, etc. Querying those often results in lots of disk reads simply due to RDBMS design. The only way those result in anything resembling acceptable UX is with local NVMe drives for the DB, because EBS just isn’t going to cut it.
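
Using the rough numbers from this thread (sub-millisecond reads on local NVMe, ~10ms on network block storage; the call counts and miss ratio below are made up), the accumulation is easy to ballpark:

    # Back-of-envelope: one API call fanning out into many DB calls.
    LOCAL_NVME_MS = 0.5
    NETWORK_BLOCK_MS = 10.0

    def disk_time_ms(db_calls, fraction_hitting_disk, per_read_ms):
        return db_calls * fraction_hitting_disk * per_read_ms

    for db_calls in (20, 200):
        nvme = disk_time_ms(db_calls, 0.2, LOCAL_NVME_MS)
        ebs = disk_time_ms(db_calls, 0.2, NETWORK_BLOCK_MS)
        print(f"{db_calls} DB calls: ~{nvme:.0f} ms local NVMe vs ~{ebs:.0f} ms network storage")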


It's still a problem if you need to do multiple sequential IO requests that depend on each other (example: read index to find a record, then read the actual record) and thus can't be parallelized. These batches of IO sometimes must themselves be sequential and can't be parallelized either, and suddenly this is bottlenecking the total throughput of your system.
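
A quick sketch of why that caps throughput (the per-read latencies are assumptions): two reads that must happen back to back put a hard ceiling on operations per second per in-flight request.

    # Dependent IO: index read, then record read, strictly in that order.
    def max_ops_per_sec(reads_per_op, per_read_ms, concurrency):
        op_ms = reads_per_op * per_read_ms   # reads are serialized within one op
        return concurrency * 1000 / op_ms

    print(max_ops_per_sec(2, 0.1, 1))    # local NVMe, one stream: ~5000 ops/s
    print(max_ops_per_sec(2, 10.0, 1))   # network block storage:  ~50 ops/s
    print(max_ops_per_sec(2, 10.0, 32))  # same, 32 in flight:     ~1600 ops/s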


I'm using a managed Postgres instance from a well-known provider and holy shit, I couldn't believe how slow it is. For small datasets I didn't notice, but when one of the tables reached 100K rows, queries started to take 5-10 seconds (the same query takes 0.5-0.6 seconds on my standard i5 Dell laptop).

I wasn't expecting blazing speed on the lowest tier, but 10x slower is bonkers.


Laptop SSDs are _shockingly_ fast, and getting equivalent speed from something in a datacenter (where you'll want at least two disks) is pretty expensive. It's so annoying.


To clarify, are you talking about when you buy your own servers, or when you rent from an IaaS provider?


It's sad that what should have been a huge efficiency win, amortizing hardware costs across many customers, ended up often being more expensive than just buying big servers and letting them idle most of the time. Not to say the efficiency isn't there, but the cloud providers are pocketing the savings.


If you want a compute co-op, build a co-op (think VCs building their own GPU compute clusters for portfolio companies). Public cloud was always about using marketing and the illusion of need for dev velocity (which is real, hypergrowth startups and such, just not nearly as prevalent as the zeitgeist would have you believe) to justify the eye watering profit margin.

Most businesses have fairly predictable interactive workload patterns, and their batch jobs are not high priority and can be managed as such (with the usual scheduling and bin packing orchestration). Wikipedia is one of the top 10 visited sites on the internet, and they run in their own datacenter, for example. The FedNow instant payment system the Federal Reserve recently went live with still runs on a mainframe. Bank of America was saving $2B a year running their own internal cloud (although I have heard they are making an attempt to try to move to a public cloud).

My hot take is public cloud was an artifact of ZIRP and cheap money, where speed and scale were paramount, cost being an afterthought (Russ Hanneman pre-revenue bit here, "get big fast and sell"; great fit for cloud). With that macro over, and profitability over growth being the go forward MO, the equation might change. Too early to tell imho. Public cloud margins are compute customer opportunities.


Wikipedia is often brought up in these discussions, but it's a really bad example.

To the vast majority of Wikipedia users, who are not logged in, all it needs to do is show (potentially pre-rendered) article pages with no dynamic, per-user content. Those pages are easy to cache or even offload to a CDN. For all the users care, it could be a giant key-value store mapping article slugs to HTML pages.

This simplicity allows them to keep costs down, and the low costs mean that they don't have to be a business and care about time-on-page, personalized article recommendations or advertising.

Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.


> Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.

Reddit can’t turn a profit, Signal is in financial peril. Meta runs their own data centers. WhatsApp could handle ~3M open TCP connections per server, running the operation with under 300 servers [1] and serving ~200M users. StackOverflow was running their Q&A platform off of 9 on prem servers as of 2022 [2]. Can you make a profitable business out of the expensive complex machine? That is rare, based on the evidence. If you’re not a business, you’re better off on Hetzner (or some other dedicated server provider) boxes with backups. If you’re down you’re down, you’ll be back up shortly. Downtime is cheaper than five 9s or whatever.

I’m not saying “cloud bad,” I’m saying cloud where it makes sense. And those use cases are the exception, not the rule. If you're not scaling to an event where you can dump these cloud costs on someone else (acquisition event), or pay for them yourself (either donations, profitability, or wealthy benefactor), then it's pointless. It's techno performance art or fancy make work, depending on your perspective.

[1] https://news.ycombinator.com/item?id=33710911

[2] https://www.datacenterdynamics.com/en/news/stack-overflow-st...


You can always buy some servers to handle your base load, and then get extra cloud instances when needed.

If you're running an ecommerce store for example, you could buy some extra capacity from AWS for Christmas and Black Friday, and rely on your own servers exclusively for the rest of the year.


But the ridiculous egress costs of the big clouds really reduce the feasibility of this. If you have some 'bare metal' boxes in the same city as your cloud instances you are going to be absolutely clobbered with the cost of database traffic from your additional AWS/azure/whatever boxes.


Is database traffic really all that significant in this scenario? I'd expect the bulk of the cost to be the end-user traffic (serving web pages to clients) with database/other traffic to your existing infra a relatively minor line-item?
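
One way to ballpark it (the per-GB rate is an assumed figure in the neighborhood of big-cloud internet egress pricing, not a quote; real rates vary by region and volume tier):

    # Back-of-envelope egress cost for DB traffic leaving the cloud.
    EGRESS_PER_GB = 0.09   # assumed ballpark, USD

    def monthly_egress_cost(avg_mbit_per_sec):
        gb_per_month = avg_mbit_per_sec / 8 / 1000 * 3600 * 24 * 30
        return gb_per_month * EGRESS_PER_GB

    for mbps in (10, 100, 1000):
        print(f"{mbps} Mbit/s average: ~${monthly_egress_cost(mbps):,.0f}/month")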


> I don't think that's the real point. The real point is that 'big 3' cloud providers are so overpriced that you could run hugely over provisioned infra 24/7 for your load (to cope with any spikes) and still save a fortune.

You don't need to roll out your own reverse proxy project to run services in-house.

Any container orchestration service was designed for that scenario. It's why they exist.

Under the hood, these services already include a reverse proxy to handle deployment scenarios like blue/green, onebox, canary, etc.

You definitely do not need to roll your own project to do that.
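
For a sense of what that built-in routing boils down to (a minimal sketch, not any particular orchestrator's implementation; the upstream names are made up): canary and blue/green are essentially weighted selection between upstream pools, and shifting traffic is just changing the weights.

    # Minimal weighted canary routing sketch.
    import random

    upstreams = [
        ("app-v1", 0.95),   # stable fleet
        ("app-v2", 0.05),   # canary fleet
    ]

    def pick_upstream():
        r = random.random()
        cumulative = 0.0
        for name, weight in upstreams:
            cumulative += weight
            if r < cumulative:
                return name
        return upstreams[-1][0]

    print(pick_upstream())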



