I run a small service, ifconfig.io, that is now getting 200 million hits a day from around the world.
The response from it is about as small as you could make it; however, at that volume it is about 150 GB a day.
If I hosted this on AWS, the bandwidth alone without any compute would cost $900 a month. Prohibitively expensive for a service I just made for fun.
The cost of sending the HTTP response headers alone is the majority of that cost, too. There is no way to shrink it.
It is currently hosted on a single $40 Linode instance and can easily keep up with the ~2400 sustained QPS. I think it can take about 50% more traffic before I have to scale it. And Linode includes enough bandwidth with that compute to support the service without extra costs.
I don't see how anyone pays the bandwidth ransom that GCP and AWS charge.
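As a quick sanity check on those figures, here is a minimal sketch that derives the average request rate and the average bytes sent per response. It assumes decimal gigabytes and evenly spread traffic, neither of which the comment states; the function names are mine.

```python
# Back-of-the-envelope check of the figures quoted above.
HITS_PER_DAY = 200_000_000       # "200 million hits a day"
BYTES_PER_DAY = 150 * 10**9      # "150 GB a day", assuming decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(hits_per_day: int) -> float:
    """Average queries per second if traffic were spread evenly over the day."""
    return hits_per_day / SECONDS_PER_DAY


def bytes_per_response(bytes_per_day: int, hits_per_day: int) -> float:
    """Average bytes on the wire per response (headers plus body)."""
    return bytes_per_day / hits_per_day


print(f"average QPS: {average_qps(HITS_PER_DAY):.0f}")
print(f"bytes per response: {bytes_per_response(BYTES_PER_DAY, HITS_PER_DAY):.0f}")
```

The average comes out a little under the quoted ~2400 sustained QPS, and roughly 750 bytes per response is consistent with headers dominating the payload.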
Well, to be fair, most of us who are paying the "bandwidth ransom" do have to scale, quite significantly I might add, and so the value is the platform as a whole.
Furthermore, if you are doing something for fun like you are, the bandwidth ransom definitely comes into play for elastic cloud environments, but anyone doing anything significant on AWS/GCP has definitely already negotiated down their bandwidth spend with their AWS/GCP account management team.
At large scales, decisions need to get made. AWS and GCP will not negotiate with you unless you're big enough to make any of that worth their time.
Netflix is a great example. They run most of their services on AWS. But they also run their own CDN with real hardware in data centers because serving it from Amazon would be a deal breaker.
There are reasons to use AWS and GCP. But when I start a project, I don't start there. It's too expensive one way or another, and the "free" tier gets blown out extremely quickly.
A smaller provider will provide what you need, will normally be cheaper, and has no lock-in. If you later decide that you really want autoscaling or managed databases then you can move easily. And if you do switch, you'll at least know what your product even wants to be, and its projected growth.
For a lot of services, bandwidth is the smallest part of the hosting cost, often around 10%. It really depends on the kind of workloads and traffic you are getting. Of course, the low percentage is partially because their other services are also all very expensive relative to a VPS or dedicated host, but it's not really a comparable service offering.
This is a good list of ways to reduce outgoing bandwidth costs, but as someone who has switched from backend developer to running a small business, I can't help but notice that they don't talk at all about whether any of their cost savings were meaningful to the business.
Sure, it looks like they saved about $2000/month, but consider that those savings probably won't even pay for more than a quarter of one of their developers.
Even though their service is free (their parent company gets business value from the aggregate analytics they obtain through their service), it's very possible that there's something they could have done to bring more value to their parent company than the money they saved here.
Maybe it's unreasonable to expect a company to talk about that in a blog post, but it left me wondering.
My read was that they actually saved over $8000 per month:
- They mention that the initial savings of $1500/mo from omitting unnecessary headers was 12% of their egress cost (so the total before this was $12500)
- Then they got an additional 8% of savings by increasing the ALB idle connection timeout to 10 minutes (down to $10120)
- Finally they said they saved $200 per day by switching to a lighter TLS certificate chain ($6000/mo, so down to $4120)
None of those steps seem to have required any meaningful amount of development work. Let's say this took a developer one week? The return on that effort would be $100k a year, or $2500/hour for the first year alone.
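The chain of numbers above can be replayed step by step. This is only a sketch of the commenter's arithmetic: the 30-day month and the 40-hour week for "one developer week" are my assumptions, and the variable names are made up for illustration.

```python
# Replays the savings estimate from the comment above.
HEADER_SAVINGS = 1500          # $/month saved by dropping headers
HEADER_SHARE = 0.12            # that saving was 12% of the egress bill
TIMEOUT_SAVINGS_SHARE = 0.08   # further 8% from the longer idle timeout
TLS_SAVINGS_PER_DAY = 200      # $/day from the lighter certificate chain
DAYS_PER_MONTH = 30            # assumption
HOURS_OF_WORK = 40             # assumption: one developer week

bill_before = HEADER_SAVINGS / HEADER_SHARE                   # ~12500
after_headers = bill_before - HEADER_SAVINGS                  # ~11000
after_timeout = after_headers * (1 - TIMEOUT_SAVINGS_SHARE)   # ~10120
after_tls = after_timeout - TLS_SAVINGS_PER_DAY * DAYS_PER_MONTH  # ~4120

monthly_savings = bill_before - after_tls
yearly_savings = monthly_savings * 12
return_per_hour = yearly_savings / HOURS_OF_WORK

print(f"bill before: ${bill_before:,.0f}/mo, after: ${after_tls:,.0f}/mo")
print(f"savings: ${monthly_savings:,.0f}/mo, ${yearly_savings:,.0f}/yr")
print(f"first-year return per hour of work: ${return_per_hour:,.0f}")
```

Under those assumptions the total is about $8,380 a month, which is where the "over $8000" and "$100k a year" figures come from.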
Considering they have enumerated this for others to pick up and execute quickly, they may have just saved the wider industry potentially 100s of thousands per month.
Give and take is an open source attitude. It doesn’t always have to be about source code, sometimes it can be about cost savings techniques such as this.
> consider that those savings probably won't even pay for more than a quarter of one of their developers
Although I have never run a business, I do believe this kind of optimization is quite meaningful even though it will never be the top priority of a business.
Those optimizations lower operational cost while being mostly maintenance free (except the one that switches away from AWS Certificate Manager, which may add some effort when renewing), risk free (unlike refactoring a large legacy system), and requiring little engineering effort (maybe 10 engineering days from investigation to writing the blog post?).
In addition this blog post itself brings intangible benefit on their branding, website ranking and hiring.
I think you're exactly right. It has become a HN trope that every cost optimization story gets a response like this: your infrastructure costs are trumped by the cost of your developers, so why spend the expensive resource (developers) on optimizing the comparatively cheap bit (infrastructure). I'm tired of the trope because it's such an oversimplification.
What matters is the return on investment, and as you state, one of the great things about cost optimization is that its returns come largely risk free. By my math the optimizations described here return $100k a year. On a risk-adjusted basis, what task could this developer have performed that would have returned more?
In this sub-thread about small businesses, another critical point is that the $24,000 (and certainly the $100k in the other estimate) might also come out of the owner's compensation or the business's profit. Sure, it pales next to the cost of five engineers, and yet it could easily be anywhere from 1/10 to 1/3 of the annual profit for a small business. If you're the owner, that's a big deal over time. You never know how tight a small business has to operate, but more often than not the margins are thin.
Let's say it is $2k/mo. I also run a small business. And when things are growing it's easy to think that way. But in the long run every business that faces competition needs to focus on the bottom line. How many developer hours do you think it took to save that $24,000 per year? Not much. And that is just one example. A culture that ignores efficiency is doomed to failure.
> Let's say it is $2k/mo. I also run a small business.
Your server bill should be $100/mo - $200/mo max for a "small" business. I've run a multi-tenant SaaS platform on a $200/mo DigitalOcean budget (server for Postgres, server for Redis, server for node.js apps) that brought in $30k/mo. If you're spending that much a month on cloud hosting, consider yourself got by the marketing of "serverless".
I think you missed GP's point. They aren't telling us how much their business makes, they are replying to the posts saying "that wasn't worth it" -> "maybe it was worth it because it was probably ~8k" with "even if it was 2k it was worth it".
I'm not sure this is really questioning anything more than "I wonder if there is something they could have done better in terms of business operations" to which I can't imagine the answer ever being anything other than "yes", especially in retrospect.
> those savings probably won't even pay for more than a quarter of a [developer]
So you're assuming that configuring nginx properly, once, takes 3 months, every year? If it takes the developer (or sysadmin) less long than that, you're already saving money.
If it saves $24,000 a year, and your developer cost is $100/hour, 240 hours or less spent a year on this effort is your breakeven. Pretty sure that's a win.
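The breakeven above is a single division, sketched here so the inputs are easy to swap for your own. The $24,000 and $100/hour figures come from the comment; the function name is mine.

```python
def breakeven_hours(yearly_savings: float, hourly_cost: float) -> float:
    """Hours of work per year at which the effort costs as much as it saves."""
    return yearly_savings / hourly_cost


hours = breakeven_hours(yearly_savings=24_000, hourly_cost=100)
print(f"breakeven: {hours:.0f} hours per year")
```

Anything under that many hours per year is a net saving, before counting the higher $8k/month estimate from elsewhere in the thread.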
If you fit inside the Cloudflare T&Cs, you can probably save a much larger amount by terminating there and having them peer with you using the same TLS connection every time, or failing that, try someone like BunnyCDN.
I've found that while AWS CloudFront is easy to instrument, it's neither very performant (lots of cache misses even when well configured) nor cost effective (very high per-byte cost).
This. If your service is collecting aggregated analytics data from users, bytes that those users would never care to send in the first place, you can get vastly vastly better pricing on traffic by going with providers that don't care too much about high-quality peering.
This is basically saying, use a 3rd party CDN (e.g. Cloudflare) to handle and terminate client connections, letting the CDN pipeline the actual requests through a handful of persistent connections to your server.
We went through something similar a couple of years ago, when TLS wasn't as pervasive as it is today and at first focused mostly on minimising the response size – we were already using 204 No Content, but just like the OP we had headers we didn't need to send. In the end we deployed a custom compiled nginx that responded with "204 B" instead of "204 No Content" to shave off a few more bytes. It turned out none of the clients we tested with cared about the string part of the status, just that there was a string part.
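To make the byte count concrete, here is a small sketch comparing the two status lines. It only measures the status line itself and assumes HTTP/1.1 framing with a CRLF terminator; `status_line` is a helper made up for this illustration, not anything from nginx.

```python
def status_line(reason: str, code: int = 204, version: str = "HTTP/1.1") -> bytes:
    """Builds the first line of an HTTP/1.x response as sent on the wire."""
    return f"{version} {code} {reason}\r\n".encode("ascii")


standard = status_line("No Content")
shortened = status_line("B")
saved = len(standard) - len(shortened)

print(f"standard: {len(standard)} bytes, shortened: {len(shortened)} bytes")
print(f"saved per response: {saved} bytes")
```

Nine bytes per response is tiny on its own, which is why this only pays off at very high request volumes.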
When TLS started to become more common we realised the same thing as the OP, that the certificates we had were unnecessarily large and cost us a lot, so we switched to another vendor. When ACM came we were initially excited about the convenience it offered, but after a quick look we decided it would be too expensive to use for that part of our product.
I was honestly expecting some kind of meh article that said to reduce headers, enable compression and other basic stuff. I was pleasantly surprised that wasn’t the case... and absolutely astounded that the handshake provided that much of a difference, it was the last thing I would have thought of.
At such a high volume of requests it probably makes sense to consider going one abstraction level lower by replacing HTTPS with plain SSL sockets based communication for further cost reduction.
I think using HTTPS is fine. But there is probably some value in using gRPC+proto by default instead of REST+JSON. With client-side streaming, you set up and tear down the connection less frequently, and that means you negotiate TLS and send initial headers less frequently. And the messages themselves are smaller, especially for small messages.
gRPC streaming is almost as efficient as just using a raw TCP stream, but saves you having to write the protocol glue code. There are already clients and servers that work, and you can just write your protocol definition in the form of a protocol buffer. Worth a look for this use case.
(Also, the clients know how to do load balancing, so you don't have to pay Amazon to do it for you. Unlike browsers, most languages' gRPC clients are happy to take a list of IP addresses from DNS and only send requests to the healthy endpoints. Browsers, if you're lucky, try opening a TCP connection but will happily keep the same IP address even if it 503s on every request. Chrome, Firefox, and Safari all do different things.)
That is of course true, but they won't be able to omit non-working, failed, or overloaded nodes, whereas a load balancer might be able to do so. On the other hand, the client might be programmed to just use another IP from the list and resend the request if one node fails to answer, but this would increase the total time the client needs to make a successful connection.
I also realise that non-responding nodes might be rare enough for this to be a negligible problem - just playing devil's advocate here.
No, you can do all that stuff with gRPC. You can use active health checks (grpc.health.v1) to add or remove nodes from the pool. (You can configure the algorithm that is used to select a healthy channel for the next request, too.) You can also talk to a central load balancer, which provides your client with a list of endpoints it's allowed to talk to.
When you control the client, you don't have to resort to L3 hacks to distribute load. You can just tell the client which replicas are healthy. (And both ends can report back to give the central load balancer information on whether or not the supposed healthy endpoints actually are.)
L3 load balancing actually works somewhat poorly for HTTP/2 and gRPC anyway. Such balancers only balance TCP connections, but you really want to balance requests. That is why people put proxies like Envoy in the middle; the client isn't smart enough to do that, but the proxy is. But if you control the client, you can skip all that and do the right thing with very few resources.
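To illustrate what "balancing requests in the client" means, here is a toy sketch of a per-request round-robin picker that skips endpoints marked unhealthy. It is not the gRPC API: `RoundRobinPicker` and the addresses are hypothetical, and a real client would feed `mark` from health checks or a central balancer as described above.

```python
from itertools import count


class RoundRobinPicker:
    """Picks an endpoint per request, skipping ones marked unhealthy."""

    def __init__(self, endpoints: list[str]) -> None:
        self.endpoints = list(endpoints)
        self.healthy = set(endpoints)   # everything starts out healthy
        self._counter = count()         # increments once per pick

    def mark(self, endpoint: str, is_healthy: bool) -> None:
        """Records a health-check result for one endpoint."""
        if is_healthy:
            self.healthy.add(endpoint)
        else:
            self.healthy.discard(endpoint)

    def pick(self) -> str:
        """Returns the endpoint for the next request."""
        candidates = [e for e in self.endpoints if e in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy endpoints available")
        return candidates[next(self._counter) % len(candidates)]


picker = RoundRobinPicker(["10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"])
picker.mark("10.0.0.2:443", is_healthy=False)
print([picker.pick() for _ in range(4)])
```

Because the choice is made per request rather than per TCP connection, a long-lived HTTP/2 connection to one backend no longer pins all traffic to it.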
> Also, the certificate contains lengthy URLs for CRL download locations and OCSP responders, 164 bytes in total.
If you're going down that path, it's probably best to avoid revocation altogether, since it doesn't really work, and go the Let's Encrypt way: certificates with shorter lifespans.
At that scale a 15-day cert on rotation is probably fine.
> We’re currently using an RSA certificate with a 2048-bit public key. We could try switching to an ECC certificate with a 256-bit key instead
Having just ruled out RSA on an embedded project for exactly this reason, definitely the first thing that came to mind.
If they’re getting down to the byte differences, then under their additional options they really should have had binary serialized data instead of JSON. Something like CBOR can be converted to JSON almost immediately, but it would mean an update to all of their endpoints, which might not be feasible, though it could be worked in for new projects over time.
I'm sad about the state of support for ed25519/curve25519 crypto in TLS.
If you could reasonably deploy a website that doesn't offer anything else for https, you'd instantly fix many session establishment-based CPU DoS attacks. It's multiple times faster than what you usually allow your server to negotiate.
I doubt it. AWS's certs are just another three-quarters baked AWS feature. They did the best they could with the resources they had.
At my last job we had a fun and exciting outage when AWS simply didn't auto-renew our certificate. We were given no warning that anything was broken, and it apparently began the internal renewal process at the exact instant the cert expired (rather than 30 days in advance as is common with ACME-based renewal). Ultimately the root cause was that some DNS record in Route 53 went missing, and that silently prevents certificate renewal.
We switched TLS termination from the load balancer to Envoy + cert-manager and the results were much better. You also get HTTP/2 out of the deal. We also wrote a thing that fetches every HTTPS host and makes sure the certificate works, and fed the expiration times into Prometheus to actually be alerted when rotation is broken. Both are features Amazon should support out of the box for the $20/month + $$/gigabyte you pay them for a TLS-terminating load balancer. Both are features Amazon says "you'll pay us anyway" to, and they're right.
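A certificate check like the one described can be sketched with the Python standard library alone. This is not the commenter's tool: the function names are mine, the Prometheus export is left out, and `fetch_not_after` needs network access to the host you pass in.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connects with full verification and returns the cert's notAfter string."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before expiry; negative if the certificate already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_alert(not_after: str, threshold_days: float = 14.0) -> bool:
    """True when the certificate is inside the alerting window."""
    return days_until_expiry(not_after) < threshold_days
```

Usage would be something like `needs_alert(fetch_not_after("example.com"))` from a cron job or exporter; because the handshake is fully verified, a broken or expired certificate surfaces as an `ssl.SSLError` rather than a silent pass.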
> it apparently began the internal renewal process at the exact instant the cert expired (rather than 30 days in advance as is common with ACME-based renewal).
> Q: When does ACM renew certificates?
>
> ACM begins the renewal process up to 60 days prior to the certificate’s expiration date. The validity period for ACM certificates is currently 13 months. Refer to the ACM User Guide for more information about managed renewal.
> We switched TLS termination from the load balancer to Envoy + cert-manager and the results were much better. You also get HTTP/2 out of the deal. We also wrote a thing that fetches every HTTPS host and makes sure the certificate works, and fed the expiration times into Prometheus to actually be alerted when rotation is broken. Both are features Amazon should support out of the box for the $20/month + $$/gigabyte you pay them for a TLS-terminating load balancer.
I don't see any current load-balancer priced at $20/month (ALB, NLB and Classic ELB are all ~ $8/month), so I can't guess which one you were using here ...
I have no memory of when this was but it was on the order of 9 months to a year ago.
"up to 60 days before" includes "five minutes after". What it excludes is the renewal starting 61 days before the cert expires, and, as documented, it sure didn't do that.
Stuff went wrong and we had no observability. That is the AWS way.
Not to be mean, but you definitely had observability into the expiration date of your certificate. You just weren't monitoring it yet. What you are doing now with Prometheus sounds good.
If you need to figure out for yourself what to monitor about the service, including things AWS says it handles, it brings into question the value of the service.
Interestingly, while ALBs support serving HTTP/2 client requests, they only proxy them to back ends using HTTP/1.1. This breaks some use cases like gRPC, unfortunately.
Funnily enough, Amazon.com uses a DigiCert certificate similar to the one mentioned in the article; they don't seem to use the ones they provide for free on AWS :)
You have to terminate TLS at their load balancers though as they don't hand out any private keys of course. Still a great service.
DigiCert is pretty expensive otherwise... always a shock when I look up prices... There is Let's Encrypt, but I never tested it with anything hosted on AWS.
Still, the article has great tips. And even if your app is some B2B service with <200 users, it still wouldn't hurt to implement the measures, even if the product owner doesn't care whether the solution costs $20 or $200 a month. Some of these tips are pretty low effort. Saves energy at least.
Big surprise. Contrary to popular belief, AWS wasn't/isn't built to support Amazon.com. Some fundamental pieces are designed for Amazon.com scale, but most other services are not (ACM in this case).
Of course it's true that they don't use all AWS services, either because they don't need them or because they had something built in house earlier which works for them.
Didn't see it mentioned: SSL tickets. If you are running an NLB and nginx in a pool of instances, you can use an OpenResty-based implementation of SSL tickets to dramatically speed up negotiation for reconnecting clients. You will need a Redis server to store the rotating ticket keys, but that's easy with AWS ElastiCache. You will also need to generate the random keys every so often and store them in Redis, removing the oldest ones as you do. This is a task that I accomplished by writing a small Go service.
If you serve a latency-critical service, tickets are a must.
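The rotation logic described above (generate a fresh key periodically, keep a few older ones for decryption, drop the oldest) can be sketched in memory as follows. This is not the commenter's Go service: `TicketKeyRing` is a hypothetical name, the Redis storage is omitted, and the 48-byte default key size is my assumption, so check what your TLS terminator expects.

```python
import secrets
from collections import deque


class TicketKeyRing:
    """Keeps the newest ticket key first and forgets the oldest on rotation."""

    def __init__(self, max_keys: int = 3, key_bytes: int = 48) -> None:
        self.key_bytes = key_bytes
        # A bounded deque drops from the right when we push on the left.
        self._keys: deque = deque(maxlen=max_keys)
        self.rotate()

    def rotate(self) -> bytes:
        """Generates a new random key and makes it the encryption key."""
        key = secrets.token_bytes(self.key_bytes)
        self._keys.appendleft(key)
        return key

    @property
    def encrypt_key(self) -> bytes:
        """Key used to issue new tickets."""
        return self._keys[0]

    @property
    def decrypt_keys(self) -> list:
        """All keys still accepted for resuming sessions, newest first."""
        return list(self._keys)
```

Every instance in the pool needs the same key list, which is why the commenter shares it through Redis; tickets encrypted with a key that has rotated out simply fall back to a full handshake.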
I guess this might be especially relevant for traffic patterns similar to the one described in the article; for other use cases those optimisations most likely will not translate into big savings.
How about the obvious solution of not having ANY data transfer out?
Encrypt and sign the data via NaCl or similar, send it via UDP duplicated 5-10 times, with no response at all from the server (it's analytics, so it doesn't matter if a few events are lost, and you can even estimate the loss rate).
As for the REST API, deprecate it and if still needed place it on whatever 3 VPS services have the lowest costs, and use low TTL DNS round-robin with something removing from DNS hosts that are down.
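The fire-and-forget UDP idea can be sketched like this. Encryption and signing are deliberately left out (the payload is assumed to be sealed already), the function names are mine, and the loss estimate assumes each copy is dropped independently, which real networks with bursty loss do not guarantee.

```python
import socket


def send_event(payload: bytes, address: tuple[str, int], copies: int = 5) -> None:
    """Sends the same datagram several times and never waits for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(copies):
            sock.sendto(payload, address)


def event_loss_probability(packet_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent packet loss."""
    return packet_loss ** copies


for copies in (1, 5, 10):
    print(copies, event_loss_probability(0.1, copies))
```

The server would need to deduplicate, for example on an event ID inside the payload, since most events will arrive several times.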
Fascinating article. I love posts with this type of in-depth investigation into what everyone else would just pass over and not even think about.
It's not surprising that it's related to the gaming industry. Some of the best AWS re:Invent videos I've seen are in the GAM (gaming) track. Even though I've never worked in that field, the problems they get hit with and are solving often are very relevant to any high-traffic site. Because of the extreme volume and spikiness of gaming workloads, they tend to find a lot of edge cases, gotchas, and what I'll call anti-best practices (situations where the "best practice" turns out to be an anti-pattern for one reason or another, typically cost).
I wonder what the cost is compared to terminating SSL at CloudFront? For my web tier architectures, I use CloudFront to reverse proxy both dynamic content (from the API) and static content (from S3). SSL is terminated only at CloudFront.
So for 10k HTTPS requests, the price is $0.01. If you serve 5 billion per day, that is $5000 a day. With such high traffic I believe you need to handle it using performant web servers (Go, Erlang?) to keep costs reasonable, and terminating SSL at the load balancer is probably the way to go.
I am not sure that math is right. Using the AWS cost calculator, it's only about $1100/mo for 5B HTTPS requests. However, I think if you consider data transfer it's still probably in the range of several thousand a day. Yikes.
Not sure what calculator you're using, but from the pricing page [1] it's pretty clear that 5B HTTPS requests cost at least (depending on the geographic origin) $5000. And that's per day and without data transfer.
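For what it's worth, the per-request arithmetic is easy to check. This sketch uses only the $0.01 per 10,000 HTTPS requests quoted in this sub-thread; actual pricing varies by region and is not verified here.

```python
def request_cost(requests: int, price_per_10k: float = 0.01) -> float:
    """Request charges only; data transfer is billed separately."""
    return requests / 10_000 * price_per_10k


daily = request_cost(5_000_000_000)
print(f"5B requests at $0.01 per 10k: ${daily:,.0f}")
```

At that rate 5 billion requests come to about $5,000, so the disagreement is really about whether 5B is the daily or the monthly volume fed into the calculator.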
This is an awesome article but if your egress costs are so high that you're deciding which HTTP headers to exclude, you should probably be moving to an unmetered bandwidth provider, or at least one that charges a reasonable amount for egress.
Is there any such thing? I don't know of any cloud service provider that offers unlimited bandwidth. There are very few providers who could handle five billion connections per day in the first place, regardless of bandwidth.
5B requests/day is ~60k/second, that's big but nothing insane. There are numerous frameworks/setups that can do far more than that on a single machine [1]
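The "~60k/second" figure is the daily average, assuming perfectly even traffic; real peaks will be higher. A one-line check, with a function name of my own choosing:

```python
def average_requests_per_second(requests_per_day: int) -> float:
    """Mean request rate over a 24-hour day."""
    return requests_per_day / (24 * 60 * 60)


rate = average_requests_per_second(5_000_000_000)
print(f"{rate:,.0f} requests/second on average")
```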
Popular unmetered options: he.net, OVH, Hetzner. You generally lose a lot of the "cloud" capability with these options, however.
Cloud options: DigitalOcean egress is $0.01/GB ($0.005/GB if you buy it via droplets), Linode is $0.02/GB, Vultr is $0.01/GB, etc.
I'm talking about actual unmetered where you pay for a dedicated amount of bandwidth, e.g. 1 Gbps / 10 Gbps / 20 Gbps. 10 Gbps usually goes for about $1k-$2k/mo in the US. This is how colo facilities have operated for decades.
10 Gbps fully saturated delivers about 3300TB for that $1-2k/mo, versus the $22k/mo you'd pay AWS for the same.
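The transfer volume for a saturated link follows directly from the line rate. This sketch assumes a 30-day month and decimal terabytes, and uses the $1k-$2k/mo range from the comment to get an effective per-GB price; it does not attempt to check the AWS figure.

```python
def monthly_transfer_tb(gbps: float, days: int = 30) -> float:
    """Terabytes moved in a month by a link running flat out."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * 86_400 * days / 1e12


def effective_price_per_gb(monthly_price: float, gbps: float, days: int = 30) -> float:
    """Flat monthly price divided by the gigabytes a saturated link delivers."""
    return monthly_price / (monthly_transfer_tb(gbps, days) * 1000)


tb = monthly_transfer_tb(10)
print(f"10 Gbps saturated for 30 days: {tb:,.0f} TB")
for price in (1000, 2000):
    print(f"${price}/mo -> ${effective_price_per_gb(price, 10):.5f} per GB")
```

A 30-day month gives about 3,240 TB; the "about 3300TB" above matches a slightly longer month. Nobody runs a link at 100% all month, so the real per-GB price is higher than this floor.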
I'm absolutely not talking about the "unlimited bandwidth" bullshit that discount hosts offer.
If your project gets featured on CNN and your bandwidth goes up 20x can these colo arrangements automatically scale up your dedicated bandwidth? I ask because having an outage when you get your first big break can cost you WAY more than your bandwidth bill ever would...
DI.FM uses Cloudflare + Bandwidth Alliance [1] for their streaming audio network, so I'd model after that. Cloudflare isn't exactly transparent about their egress pricing, but most discussion seems to indicate that once you start hitting about 50TB/mo, they'll strongly encourage you to upgrade to their $200/mo plan. But you can likely push tens of terabytes per month on their free or $20/mo plan.
Maybe also consider caching API responses in a cheaper non-AWS CDN where possible. APIs like "zip code to list of cities" where the output is the same for all users, and doesn't change often.
It was added as part of a bug fix five years ago: the server was looking at the Content-Type request header instead of Content-Encoding to determine whether the incoming payload was compressed. Not sure why the Accept-Encoding response header was added at the same time, but it went undetected since it didn't cause any problems (apart from costing money).
Protobuf is an option that could work for SDKs, but the API is also a publicly documented REST one: https://gameanalytics.com/docs/item/rest-api-doc. Also, it could be possible to just not include a body in the responses, and AWS does not charge for data transfer in, so the size of the request JSON is not relevant to the cost.
If you are running 5 billion daily requests where your outgoing response size is significantly less than the aggregate of the size of the headers, then yes.
Also, the article clearly articulates that the answer is yes.
> HTTPS is a complete waste of energy. Security should not be overarching, it should be precision.
A harmless meme in the US might get you executed in North Korea. Optimizing for energy usage (which is already pretty minor on modern hardware for HTTPS these days) over security is odd.
Electric use for the client on compressed vs not compressed isn't as clear cut as more/less cpu. You also need to consider the reduction in use of the network interface, since the data size will be smaller. Overall latency could improve as well if the compressed form is meaningfully smaller (depending on the tcp congestion window, just one packet smaller can mean a whole roundtrip time)
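The point about one packet costing a whole round trip can be illustrated with an idealised slow-start model. Everything here is an assumption on my part: an initial congestion window of 10 segments, a 1460-byte MSS, no loss, and the window doubling every round trip; real stacks differ.

```python
import math


def round_trips(payload_bytes: int, mss: int = 1460, init_cwnd: int = 10) -> int:
    """Round trips needed to deliver a payload under idealised slow start."""
    segments = math.ceil(payload_bytes / mss)
    cwnd = init_cwnd
    rounds = 0
    while segments > 0:
        segments -= cwnd   # one window's worth is sent per round trip
        cwnd *= 2          # window doubles while in slow start
        rounds += 1
    return rounds


for size in (14_600, 14_601, 50_000):
    print(size, round_trips(size))
```

In this model a payload that just fits the first window takes one round trip and a payload one byte larger takes two, which is the effect compression can sometimes avoid.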
No, that's not how it works, you cannot upgrade the routers in real-time without complexity and additional cost. So the cost for transfer is fixed with more latency. But if you subtract bad protocol design and the latency added by the compression/decompression I'm pretty sure you end up with the same deal just more complexity that costs even if you don't see the costs.
Just like wind power actually competes with nuclear because it takes 30 days to wind down a nuclear power plant.
Also data can be compressed with more efficient hardware on the backbone without you having to deal with it.
The biggest cost of the internet is idle things and synchronized CPUs; async never made it, unfortunately.
Still, there is going to be energy lost for very little in return. We need to go in the other direction: fewer machines, fewer IP addresses, less energy, less complexity, less code, etc.
The only thing we need more of is cores and we can't have that because memory is too slow.
I think this post shows that small things do indeed add up. Energy use might rise with HTTP/2, but that's not the OP's concern; they want to reduce their cost, not their energy footprint.
I think you are going to have a rough time if you separate energy and money like that. The only reason the dollar is worth anything is because of coal, oil, gas and nuclear.
What do you think happens to the value of the dollar when the physical supply of energy becomes unstable in the coming years?
Money is energy because debt needs energy to either have been spent in the past or energy being promised to be spent in the future.
The proportion between these is what makes debt money trustworthy or not. When states around the world privatize houses, that is old energy (so far, since the energy to heat a house is marginal compared to building it), but when the stock market goes up, that is a promise of new energy.
Globally all money for old energy is saturated (negative interest rates) and now all liquidity is being injected into the stock market that promises that the future will be rich with energy.
The stock market (and all companies) is a promise to spend energy we don't have.
The only energy that is added to earth is sunlight, and the only way to capture that energy is trees and plants.
All jobs are now meaningless because of the energy that we are wasting. And people now depend on wasting energy to have a job.
Do you see why energy is more important than money?