Queues don't fix overload (2014) (ferd.ca)
193 points by akbarnama on Jan 18, 2024 | 153 comments



What queues do is smooth out the mismatch between supply and demand. If the mismatch lasts long enough, the queue will overflow, and then you need to load shed (and you need to plan for what the least bad way of load shedding is). But queues do increase overall throughput, up to a point. If the demand was bursty on short timescales and you only allow a small queue to build before load-shedding, you may be wasting capacity. Increasing the allowed queue size lets you smooth out the demand, and so reduce the amount you need to load shed. But the crucial thing here is you're trading off latency against capacity. If that latency increase itself has no significant downside, then you should probably provision the queues so that they can grow until the latency does start to have a downside, because below that point the increase in capacity is worth it. Above that point, the latency itself is a problem, and you should be load-shedding rather than queuing. The trick with any queuing system is to understand when this transition should take place.

Edit: I should add that standing queues serve no purpose. If the demand is such that you're persistently queuing for a long time without the queue ever going idle, your transition between queuing and load-shedding is in the wrong place, and you should increase load-shedding, because you're adding latency for no reason.
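
To make the tradeoff concrete, here's a minimal sketch (Python; the names and the queue size are purely illustrative, not from the article) of a bounded queue that sheds load once it is full:

    import queue

    MAX_QUEUE = 100  # sized so queue_depth / service_rate stays under the latency budget

    work_queue = queue.Queue(maxsize=MAX_QUEUE)

    def submit(job):
        """Accept a job if the queue has room, otherwise shed it immediately."""
        try:
            work_queue.put_nowait(job)
            return True   # queued: the caller trades some latency for acceptance
        except queue.Full:
            return False  # shed: the caller is told "no" instead of waiting forever

    def worker(handle):
        while True:
            job = work_queue.get()  # blocks while the queue is idle
            handle(job)
            work_queue.task_done()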


See https://en.wikipedia.org/wiki/Bufferbloat

I lived in Germany in 1999, and back then the internet connection from Germany to the US would get overloaded during the day. At maybe 9am the latency would be under a second, but it would grow to as much as 90 seconds, and then the link would start dropping packets.

I don't know if it was intentional, but it was about as effective as an outright ban on VoIP at preventing people from making international VoIP calls.

Today there is more awareness of the problem, but people frequently create an overly long or nominally unbounded queue: in that case you start with one problem, insufficient throughput, and add a second problem, unacceptable latency. Limit the length of the queue and you put a cap on latency.

Backpressure is hard to implement for social and political reasons as much as technical ones: people don't really like the idea that the system can refuse service, and of course it often has to be pushed further back than people would expect.

I'd point to the example of a medical specialist office where once you are being seen as a patient you might have a series of appointments. The initial appointment controls admission to the system as a whole so you might wait a few months to get the first appointment, enough that you might "balk" and look for another doctor who can see you sooner. Once you are being treated you often have no problem rescheduling an appointment or scheduling follow-up appointments in a timely manner.


See RFC 970, "On Packet Switches With Infinite Storage" by John Nagle: https://datatracker.ietf.org/doc/html/rfc970

Back then (I was studying networking as an undergrad at the time, and interned with the Arpanet team) people really did think of network congestion as a buffer allocation problem, so the obvious solution was more buffering - i.e. adding queues. Nagle was one of the first people to point out the problem with this.


> (...) people really did think of network congestion as a buffer allocation problem, so the obvious solution was more buffering - i.e. adding queues.

I don't think people believed network congestion was a buffer allocation problem. I think people believed buffering was a way to mitigate congestion caused by a spike in traffic before allowing connection problems to surface (dropped packets, dropped connections, etc).

Buffers are bounded which, by definition, means they can be filled up. Once they are filled up then the same problems that were caused by a congested network without buffering would also be experienced in connections with buffering. It's hard to believe that back then no one noticed that buffers would only take so much data.

> Nagle was one of the first people to point out the problem with this.

Correct me if I'm wrong, but the problem that was pointed out was instead the failure modes that take place when networks are congested but buffers can still accept more data. We're talking about failure modes where the network is congested and the buffers are not completely full, yet they can neither flush their data nor drop packets/connections: a steady state in which the network in theory is working, but induces high latency.

Limiting buffers is also not a fix for network congestion problems. It's a way to allow one failure mode (dropped packets) to happen much earlier than another (long queues with high latency).


> I don't think people believed network congestion was a buffer allocation problem.

Actually, they did, because everything in the early days had very small memory sizes. This was the 16-bit era.


> Actually, they did, because everything in the early days had very small memory sizes. This was the 16-bit era.

I don't think that's a possibility. If buffers were small then buffer capacity was not infinite, and the same failure modes applied. Triggering dropped packets/connections was just a matter of sending enough traffic down a pipe, which, given how small the buffers were, was hardly a technical feat.


Sometimes it is - TCP incast [1] in a fast LAN can be mostly alleviated by using switches with large buffers. Generally, the higher the throughput, the bigger the buffers you need, unless low latency matters more to you than low packet loss. Queue size is a tradeoff (like almost everything).

[1] https://www.usenix.org/system/files/login/articles/chen12-06...


One of our customers bought a special switch with huge buffers, on the order of hundreds of megabytes per port.

It was a specialised SKU used only for dedicated backup networks. As in, networks that process only the traffic for disaster recovery backups. Latency on that type of network traffic is totally irrelevant, and the only thing that matters is achieving the maximum possible throughput — whatever the wire protocol allows.

That’s the one time I’ve seen exactly 10 Gbps sustained.


Ugh. WiFi signals aren't the best in my A/V cabinet, so I ran G.hn powerline to it. This works great 98% of the time, but occasionally there will be something that blocks traffic for a short period of time, and those things must have huge buffers.

If I'm e.g. watching Netflix when there is a hiccup, I see ping times of over a minute! I wrote a program that monitors ping times and reboots the G.hn adapters when they get too high and the problem mostly went away. I tried a couple different brands of adapters, but they are all obviously the same firmware.


Do you live near an airport (within ~20 miles / 50 km) or near a weather station? See https://en.wikipedia.org/wiki/Dynamic_frequency_selection

You are probably using channels that interfere with radar, and when a router detects the radar, it should shut down for a minute or so.


Does that apply to G.hn? I thought it only applied to WiFi on the 5 GHz band?


Ah, I read it as though you switched to G.hn because of wifi issues and thought you were describing that issue.


That's because it is obvious and intuitive even though it's wrong. In the first year of my undergrad in an assignment I added unlimited buffering to every single place that needed buffering. I remember a sense of superiority from all the extra code I wrote to do it. Of course buffering felt correct! That was long before I heard of the word backpressure.


That was a long time ago. Yet we still have bufferbloat problems. Sigh.


An alternative to capping the queue length explicitly is to record arrival times and cap latency, treating the queue as full if the item at the front is "too old". If the time to process the queue is easily predictable this is roughly the same thing, but when it's not, it makes the queue length adapt dynamically to the maximum acceptable latency.
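
A rough sketch of that idea (Python; the names and the one-second budget are my own, purely illustrative): store an enqueue timestamp with each item and treat the queue as full when the head has waited longer than the latency budget.

    import collections
    import time

    MAX_AGE_SECONDS = 1.0  # maximum acceptable queueing latency (assumed budget)

    _queue = collections.deque()  # holds (enqueue_time, item) pairs

    def try_enqueue(item):
        now = time.monotonic()
        # If the oldest waiting item has already waited too long, the queue is
        # effectively "full" regardless of how many items it holds.
        if _queue and now - _queue[0][0] > MAX_AGE_SECONDS:
            return False  # shed instead of adding more latency
        _queue.append((now, item))
        return True

    def dequeue():
        if not _queue:
            return None
        _, item = _queue.popleft()
        return item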

To your latter example, there are load balancers with support for this pattern: serving up a "please wait" page to new users to prevent unacceptable latency for users already using the site. Frankly more sites ought to do that.


> Backpressure is hard to implement for social and political reasons as much as technical

Cal Newport has a video about how people will average around 20% above their capacity because at that point they have mental cover (permission given to oneself) to start rejecting additional requests. https://www.youtube.com/watch?v=TH_xAR7pljU


It's important to note that for interactive applications, queues can have surprisingly little capacity and still provide all the demand levelling necessary. Technically this depends on the variance of the arrivals, but in my practical experience, even quite bursty arrivals are often well-behaved enough that a second or so of queue capacity is sufficient.

Any time I've had queues need more than a second of capacity has been when the system is broken and user experience has already degraded to the point where it might be better for the median user to shed a small amount of load to get back to a sub-second queue.

(In batch processing where you aim for maximum throughput, obviously you need a long queue to make sure you never run out of work.)


You don't need a long queue even there, if the producer can keep a short queue filled. E.g. a producer that receives an error when the queue is full and backs off, but never backs off enough for the queue to fully empty, can do just as well at maximizing throughput.
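
A sketch of that producer-side pattern (illustrative names, not a specific library): back off when the short queue rejects a job, but cap the backoff so the producer comes back quickly enough to keep the queue topped up.

    import queue
    import time

    q = queue.Queue(maxsize=8)  # deliberately short queue

    def produce(jobs, min_sleep=0.001, max_sleep=0.05):
        backoff = min_sleep
        for job in jobs:
            while True:
                try:
                    q.put_nowait(job)
                    backoff = min_sleep  # success: reset the backoff
                    break
                except queue.Full:
                    time.sleep(backoff)  # queue full: wait a little...
                    # ...but cap the wait so we return before the workers
                    # drain the short queue completely
                    backoff = min(backoff * 2, max_sleep)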


Re: your edit: I think in latency insensitive applications, having standing queues that never empty can be totally fine. We custom manufacture ordered items and have priority queues that process input orders into machine-ready output files. We don't need those queues to ever empty as the fastest delivery times are 3 days and the slowest typical is 14 days. Some of the file-prep software is licensed (so we have economic reasons to keep that part of the farm size constant and no larger than needed); in other cases, we use cheaper compute when available.


Agreed, if there's no latency cost and no cost to maintaining the queue. But if your queue is never emptying, you've hit an equilibrium where you're persistently overloaded and constantly load shedding. In that case you can reduce the queue size so that it is only just not emptying - you'll still be running at 100% capacity and shedding the same amount of demand. And if there was a latency cost or a cost per item in the queue, you reduce those.

Edit: if you have an external control loop that is matching supply and demand, you could be persistently queuing without load shedding, in which case you might want to maintain the standing queue. In the absence of an external control loop, supply and demand will pretty much never be in precise equilibrium, in which case a persistent queue will pretty much always also mean you're load-shedding.


> But if your queue is never emptying, you've hit an equilibrium where you're persistently overloaded and constantly load shedding

I believe this is an incorrect assertion. You have hit equilibrium, yes. But you are not load shedding unless you are preventing new queue entrants. And if you are allowing new entrants and the latency is acceptable to customers, you are not overloaded. Like you said, you are at an equilibrium.


As long as we are consistently emptying the queue of all orders due to be produced on the next shift, I think we are correctly sized while losing nothing (no load-shedding or turning orders away), even though we carry an overall non-zero backlog essentially permanently.

Having 48 hours of weekend where order volume is way down, but processing volume is constant is a key part of maintaining this equilibrium.


It sounds like the above poster is in a situation with multiple priority queues, where as long as latency and throughput for high-priority items are good, higher latency for lower-priority items is fine, and the backlog could even act as a form of low-cost back pressure on new orders, or as a signal to add a little throughput.


The simplest example: try to empty a bucket into a pipe and you'll have a real struggle; you're going to be standing there for a while. Now empty a bucket into a sink: yes, it will fill up, and maybe you can do two buckets at once, but a second after dumping it you can leave it unattended and know that all that water will drain correctly.


We have been building an open-source platform called Aperture to try to solve this very problem. Our approach is to let the developer decide how long they want to wait in the queue using a timeout parameter. If the timeout is hit, they may retry with exponential backoff or load shed. While in the queue, requests get prioritized based on a weighted fair queuing algorithm. If there is tiering in the app, a policy can allocate the majority of the capacity to the paid vs free customer tiers. But this allocation is not static: if there is free capacity available in the system then the free customer tier can take all of it. This is just like how CPU time is allocated by Linux based on nice values; even low-priority processes are allowed to take up all the CPU time when demand is low. Apart from relative allocation across user tiers, Aperture's request scheduler can also ensure fairness across individual users within each tier, to make sure no single user is hogging all of the server capacity.

Demand and capacity are determined based on a request rate quota or the maximum number of in-flight requests.
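
For anyone curious what weighted allocation with bursting looks like in miniature, here's a toy sketch (this is not Aperture's actual scheduler; tier names and weights are made up): only tiers with work waiting are considered, so an idle paid tier lets the free tier borrow all of the capacity, and vice versa.

    import random
    from collections import deque

    tiers = {"paid": deque(), "free": deque()}
    weights = {"paid": 0.9, "free": 0.1}

    def pick_next_request():
        # Work-conserving: only tiers that actually have requests queued compete.
        ready = [t for t in tiers if tiers[t]]
        if not ready:
            return None
        # Pick a tier with probability proportional to its weight among ready tiers.
        total = sum(weights[t] for t in ready)
        r = random.uniform(0, total)
        for t in ready:
            r -= weights[t]
            if r <= 0:
                return tiers[t].popleft()
        return tiers[ready[-1]].popleft()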

Would love the community here to check us out on GitHub and provide feedback: https://github.com/fluxninja/aperture


buffers buffer. difficult to debate.

edit: they're being called queues, and a lot of effort is being put here into just describing the merits and demerits of them, which I assumed would be self-evident to anyone who can conceptualize a buffer in any context, not just software. savings buffer, heat buffer, coal buffer, resource buffer, inductors, capacitance.

they have fill and drain times, soften spikes, but require additional energy.

A math model of throughputs and voltages would be more useful than all the metaphors of pipes and buckets in this thread.


This is a weird article because it points out that queues don’t solve overload but neither do load shedding or back pressure.

All 3 techniques are just different trade offs on what to do in the face of overload. All 3 have negative ramifications for the users of the system. Load shedding reduces availability, back pressure increases complexity and queues increase latency.

In “critical” systems you need all 3. And all 3 can be overloaded. Frankly, your load shedding or back pressure system is probably implemented on a queue one layer down the abstraction.


I think it's a threshold everyone has to cross: truly understanding that there's no solution to certain problems, only tradeoffs.

Queues are great, but can lead to catastrophic failure if you don't have a good way of handling the queue, so making an active choice about how you handle overload is part of designing a robust and resilient system.

Trading off new requests for current requests is, in my experience, a valid strategy for eCommerce for example. We called it "quenching".


I think part of the article is just the author venting that people make design decisions without understanding the ramifications.

Operating a queue as a buffer would be absolutely fine if there were also service level agreements implicit in every queue in operation - e.g. when a queue reaches half capacity or is experiencing explosive growth, that leads to reactive techniques such as spinning up larger queues/servers, alerting stakeholders to the performance issue(s), and possibly even dynamically rate-limiting end users.

But this is a whole-system centric view of the design. What they're specifically bemoaning is the asinine focus on component-centric design - your widget/service performs slowly? Throw a queue at it (and wonder why you broke everything 100 features and 10x customer base later)!

For legitimately small problems, this is OK. But then, you accept that you aren't scaling, and you bake it into your design assumptions. And you sell that, explicitly, to the customer "hey this will only work for 1 thousand/1 million customers, if you want more we can talk".


> Trading off new requests for current requests is, in my experience, a valid strategy for eCommerce for example. We called it "quenching".

I'm not sure in which direction the trade happens but it sounds like you're dropping older requests in favour of newer. I agree, this has worked well for me also. Surprisingly often the oldest item in the queue is the one for which service will be least valuable.


I think the theory is that if half of your excess requests come from two actors, then half of the consequences of random dropping are felt by those two people. That’s not fair, but it’s better than dropping all requests after you get into trouble. Fairness is very difficult to achieve in a distributed system, due to coordination delays.


The only real solution to overload (that is, the eventuality of the system not having enough capacity), in modern systems, is autoscaling. Nobody seems to talk about this, I guess because it's taken for granted? But you can literally just keep adding capacity now. We didn't really have that before the cloud; you had the servers you bought, and maybe you'd rush to repurpose some servers to add capacity. Now an algorithm just adds servers magically when your load gets high.

Of course the system still has limits. But if you go from zonal to regional to global autoscaling, and you architect specifically to scale up each point in the system using autoscaling, it kind of doesn't have a limit? (that any one product would hit, at a global level)

In the past I have spun up a duplicate distributed system in another region and just split the traffic between the two. IaC is crazy.


If I let my customers railroad me into running more servers to fulfill their “needs” then I may transition into losing money on my business. That’s not a solution.

Needs is in scare quotes because a lot of traffic comes from misunderstanding or laziness from customers or from other divisions. Try as we might, nearly all of the improvements in capacity per customer on my project in the last six months have come from that sort of work. Other sorts of changes have unfortunately been too close to the noise floor. They add up but are difficult to demonstrate, and are very easy to undo with new feature work.


Additionally, most users actually are happy with "slow me down if I'm making too many requests". This is much simpler than recovering from errors correctly.


Which is why your test integration environment has to throw 429 and 5xx errors consistently from day one. The error handling paths are hard to retrofit but easy to do before deployment.


We have some people who are convinced that Google is punishing us for 429 responses. If that's true, then it sucks the big one.


(What test environment?)

That will make you introduce error handling, but not necessarily "correct" handling. "Retry after errors" is a great way to overload a system.


The idea behind forcing your clients to handle 429 isn't that they will do the right thing (although, since Ethernet won with random exponential backoff and a max timeout, you'd think that would be the default), but that it gives you flexibility server-side to start returning 429s when you need to: maybe based on some notion of server load, maybe as hard rate limits for a certain caller ID, maybe to do maintenance, whatever. It's the semantics of "not processed, not fatal, try again in the future" that are hard to shoehorn into a large deployed population of callers after the fact without angering people.
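
For what it's worth, a minimal client-side sketch of that (Python with the requests library; the delays and status list are arbitrary, and it assumes any Retry-After header is given in seconds):

    import random
    import time

    import requests

    def call_with_backoff(url, max_attempts=6, base_delay=0.5, max_delay=30.0):
        """Retry on 429/5xx with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            resp = requests.get(url, timeout=10)
            if resp.status_code not in (429, 500, 502, 503, 504):
                return resp
            retry_after = resp.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)  # assumes seconds, not an HTTP date
            else:
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
        return resp  # caller decides what to do after exhausting retries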


That's fine until the issue lies with something that your auto-scaled instances talk to, e.g. Redis, Scylla, SQL DB. There are situations where auto-scaling to infinity makes things far worse.


100% that was my first thought when reading the GP comment. Autoscaling isn't a magic fix. Just like the article says, you need to find the red arrow first to figure out where the bottleneck is and whether or not that bottleneck is actually something within the auto-scaling context or not. You've pointed out a couple of good examples of potential bottlenecks. Another possibility is a downstream 3rd-party service. If you start auto-scaling because a downstream service is struggling, you're just going to end up hitting it even harder and essentially contributing to a DoS attack against them.

No silver bullets and all that. Still need to do engineering analysis to figure out what's actually happening before you start shooting.


Priority queues for your downstream calls to rate limit them. A large enough cluster can take down almost any data store, except maybe Spanner. Even then you could increase latency a lot while it scales up.


Increasing capacity would be very stupid if doing so does not also increase revenue.


In my experience, auto scaling has not yet been applicable in the typical sense of adding more web servers.

If you use an efficient language like C# or Go, the performance bottleneck moves to the database layer almost immediately.

Caching can be added and can auto scale, but that has a tradeoff: stale data or eventual consistency.

Databases are hard to scale because typically there has to be a single master instance responsible for write ordering and transaction consistency.

Auto scaling the database layer is fraught with its own issues, and simply doesn’t work in practice for many common scenarios.

For typical business apps, or typical web pages that might suddenly get a huge surge in usage (i.e. the hug of death), a CDN can help… maybe.

My customers keep asking for auto scale to solve their problems, when the root cause is invariably a missing index in their database.


> Nobody seems to talk about this, I guess because it's taken for granted?

No, because it has a theoretical limit, same as queues, back pressure, etc.

One cannot simply scale up indefinitely because it is not profitable.


There's no product in the world that would hit a limit if autoscaled globally on AWS. Sure, you could write an app whose literal sole purpose is "take up all memory, CPU and bandwidth, recursively, infinitely", but nobody is making that product. Real products built today have a finite amount of demand, and global cloud capacity is larger than that.

You can't say what architecture is or isn't profitable in general, business is more complicated than that. But besides the realities of commerce, you can easily architect a system such that the autoscaling cost is a fraction of opex.


> Real products built today have a finite amount of demand, and global cloud capacity is larger than that.

This isn't really true, and it's especially not true when specialized hardware comes into play. If you have a "web-scale" GPU workload, it's not unlikely that you'll hit resource availability constraints from time to time. The question isn't whether cloud capacity is larger than your demand for a particular resource, it's whether cloud capacity is larger than the aggregate peak demand for that resource. Cloud providers aren't magic. They engage in capacity planning, sometimes underestimate, and are sometimes unable to actually procure as much hardware as they want to.


> There's no product in the world that would hit a limit if autoscaled globally on AWS.

While this may be true, autoscaling indefinitely is still not profitable.

> You can't say what architecture is or isn't profitable in general, business is more complicated than that. But besides the realities of commerce, you can easily architect a system such that the autoscaling cost is a fraction of opex.

Customer acquisition and retention cost more money than they should.

I cannot count how many times I got an email from the CTO asking to lower our cloud costs, while product and marketing refuse to raise prices, making the whole endeavor borderline unprofitable. You may then argue that the product suffers from either bad architecture, or bad pricing, or both, but the situation is far from uncommon.


> There's no product in the world that would hit a limit if autoscaled globally on AWS

> but nobody is making that product

Nobody is making that product because nobody has money for it...(or the demand to supply that money).

You can auto-scale without AWS, actually; it's called measuring server load and auto-ordering new machines when your load increases (and keeping machines turned off on standby for spikes).

Or you can go hybrid, and load-shed to a cloud...which would be the smart thing to do.


I have made and deployed pieces of that infinitely recursive product, most spectacularly by triggering infinite call loops. Had we had autoscaling rather than sharp queue limits leading to load shedding, it would have been worse.


Not really true (I work for one that can, and does, hit limits in AWS, let alone GCP/Azure), but the general point is mostly true. You just might not actually get what you want fast enough, however.


Autoscaling is not going to help if you are IO-bound in your database. One point of the article is you have to identify your bottleneck before you can make sensible design choices.


....you can autoscale database read replicas, and write nodes (master-master)....


If you are bound in disk IO, adding master-master write nodes will not help you. The same number of bytes have to be written to the same volume whether they come from a replica or an application server. The only solution is partitioning/sharding, and there is no "easy button" to press and make that happen because it comes with its own limitations and trade-offs, and is something the application code will be intimately aware of.


Auto sharding. A nice idea for a DAO layer.


> master-master

Now you have two problems.


That's fine until the issue lies with something that your auto-scaled instances talk to, e.g. Redis, Scylla, SQL DB


While that is a "real" solution, it not necessarily a possible solution.

1. It may be cost prohibitive.

2. Some systems don't scale horizontally. (In particular, databases.)

3. Many systems do not scale linearly.


Autoscaling seems like a downstream concern from the techniques being discussed here. Autoscaling tends to have a pretty high latency, so you still need a strategy for being overloaded while that extra capacity comes online. There's also a question of how the autoscaler knows what "load" is and when it's "too high." Just going off of CPU/memory usage probably means you're over-provisioning. Instead, if you have back-pressure or load-shedding built into your system you can use those as signals to the autoscaler.
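
A sketch of the kind of signal that means (illustrative, not any particular autoscaler's API): derive the desired replica count from the backlog and how fast one replica drains it, rather than from CPU.

    import math

    def desired_replicas(queue_depth, items_per_replica_per_sec,
                         target_drain_seconds, current, max_replicas):
        """Scale so the current backlog can be drained within the target time."""
        needed = math.ceil(queue_depth /
                           (items_per_replica_per_sec * target_drain_seconds))
        # Scale up eagerly, but step down at most one replica at a time to avoid
        # capacity "trampolining" when another burst follows the first.
        return max(1, min(max(needed, current - 1), max_replicas))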


Autoscaling is great, if you solve the problems you rightly mention.

But IMO it's best viewed not as a technique to increase capacity that risks overprovisioning, but rather it should be viewed as a technique to significantly reduce the overprovisioning you were already likely doing to provide capacity that could handle peaks in demand without blowing through delivery expectations (e.g., timeliness, data loss minimisation, etc.)

At an old employer, our load was seasonal over the day. If one instance of an app could handle N req/s, and the daily peak maxed out at 100N req/s, then we had to run 100 instances as a minimum (we usually chucked some extra capacity in there for surprises) even if the mean daily peak was 75N req/s.

And of course, at the times of the day when incoming reqs/s was 0.5N reqs/s, well, we still had 99 instances twiddling their thumbs.

And then there were the days when suddenly we're hitting 200N req/s because Germany made the World Cup quarter-finals, and things are catching fire and services are degraded in a way that customers notice, and it becomes an official Bad Thing That Must Be Explained To The CEO.

So when we reached a point in our system architecture (which took a fair bit of refactoring) where we could use autoscaling, we saved soooo much money, and had far fewer Bad Thing Explanations to do.

We had always been massively overprovisioned for 20 hours of the day, and often still overprovisioned for the other 4, but we weren't overprovisioned enough for black swans, it was the worst of both worlds.

(Although we kept a very close eye on Germany's progress in the football after that first World Cup experience)

You're spot on that

a) to autoscale up effectively we had to minimise the time an instance took to go from cold to hot, so we focused a lot on shared caches being available to quickly populate in-memory caches

b) adding new hardware instances was always going to take longer than adding new app instances, so we had to find some balance in how we overprovisioned hardware capacity to give us breathing room for scaling without wasting too much money and

c) we found significant efficiencies in costs and time to scale by changing the signals used to scale after starting out using CPU/mem.

Also a significant learning curve for our org was realising that we needed to ensure we didn't scale down too aggressively, especially the hardware stuff that scaled down far faster than it scaled up.

We hit situations where we'd scale down after a peak had ended, then shortly after along came another peak, so all the capacity we'd just dynamically removed had to be added back, with the inherent speed issues you mentioned, causing our service to be slow and annoying for customers, with minimal savings while capacity was trampolining.

(This incidentally can be really problematic in systems where horizontal scaling can introduce a stop the world pause across multiple instances of an app.

Anything that uses Kafka and consumer groups is particularly prone to this, as membership change in the group pauses all members of the CG while partitions are reallocated, although later versions of Kafka with sticky assignors have improved this somewhat. But yeah, very critical to stop these kinda apps from trampolining capacity if you want to keep data timeliness within acceptable bounds.)

It took a lot of tuning to get all of it right, but when we did, the savings were spectacular.

I think the CTO worked out that it only took six months of the reduced AWS costs to equal the cost of the two years of system refactoring needed to get to that point, and after that, it was all ongoing cream for the shareholders.

And while I get the hate people have for unnecessary usage of K8s (like Kafka, it's a complex solution for complicated problems and using it unnecessarily is taking on a whole lot of complexity for no gain), it was perfect for our use case, the ability to tune how HPAs scale down, being able to scale on custom metrics, it was just brilliant.

(I wish I could end with "And the company reinvested a significant proportion of the savings into growth and gave us all big fat bonuses for saving so much money", but haha, no. The CFO did try to tell us we'd been unnecessarily wasteful prior and should have just built a system that was created in 2007 like the 2019 version from the start, because apparently a lot of MBA schools have an entrance requirement of psychopathy and then to graduate you have to swear a bloodpact with the cruel and vicious God of Shareholder Value)


No, load shedding and back pressure present you trade-offs to deal with an overloaded system.

Queues don't. Queues just present you problems. If you take an overloaded system and add a queue, every single feature either gets worse or doesn't get any better.

And people like to deny this, and pretend that queues will help. They absolutely help with a lot of things but they do nothing but harm in front of an overloaded system.


As another user pointed out, this is entirely based on the duration of the overload. If the overload is short, then queues absolutely do help smooth this over for the user.

If demand consistently exceeds capacity, the system will fail regardless of which method is used to compensate (queues, shedding, back pressure, etc). Saying that queues as a whole serve no purpose in an overload only considers the catastrophic scenario; there are a lot more types of overload that can happen depending on the system.


I agree that queues can cause problems, especially when misconfigured. But some amount of queuing is necessary to absorb short spikes of demand above capacity. Also, queues can be helpful for re-ordering requests based on criticality, which won't be possible with zero queue size - in that case we have to immediately drop a request or admit it without considering its priority.

I think it is beneficial to re-think how we tune queues. Instead of setting a queue size, we should be tuning the max permissible latency in the queue which is what a request timeout actually is. That way, you stay within the acceptable response time SLA while keeping only the serve-able requests in the queue.

Aperture, an open-source load management platform took this approach. Each request specifies a timeout for which it is willing to stay in the queue. And weighted fair queuing scheduler then allocates the capacity (a request quota or max number of in-flight requests) based on the priority and tokens (request heaviness) of each request.

Read more about the WFQ scheduler in Aperture: https://docs.fluxninja.com/concepts/scheduler

Link to Aperture's GitHub: https://github.com/fluxninja/aperture

Would love to hear your thoughts on our approach!


Your mistake is characterising all overloads as the same. Adding a queue to a system that is consistently overloaded won't solve the overload, but many overloads are temporary/bursty, like the Slashdot/HN effect. Queues absolutely do solve these kinds of overloads simply by increasing latency, assuming increased latency is an acceptable choice in your context, of course.


After the overload fixes itself somehow, a queue absolutely solves the availability problem. Yep. If the overload doesn't last for too long.

What isn't happening in your example is a queue improving anything on an overloaded system.

Queues are immensely helpful for all kinds of problems. Just not for overload.


Overload is only an issue because availability is critical. When it isn't critical, then avoiding overload is obvious and practically trivial.


An unbounded queue means you are allocating a lot of stuff at exactly the time you want to cap the resources spent handling traffic. It means you slow down when busier. Depending on the framework, it often means the system can't recover automatically: it will crash if the queue is in memory, or need DBAs to flush the queues, or whatever.


Maximum acceptable response times for websites is a few seconds. The HN/Slashdot effect lasts for minutes to hours. That time scale is way too long for a queue to be effective.


You're missing the point. We're talking about general systems here, not websites specifically, and the Slashdot effect is a perfect example everyone is familiar with where queues do help maintain availability, if longer latency is acceptable.


Furthermore, even as a website aiming to survive some momentary spike in performance - slowing everything down for everyone _can_ be part of a solution to serve more people but fewer things each. People might not normally be willing to wait for more than a second or two for a load; but when they expect or have a sign that things are slower than normal, they might have a little more patience (or just come back to that tab later).

I think people are a little too negative on queues. Sure, really naively implemented with entirely unbounded capacity they're an accident waiting to happen... but you don't have to do that.


The real distinction isn't queues versus no queues, but bounded versus unbounded in theory. There are queues everywhere: connection pools, unparsed messages, packets in flight, memory for calls sent out and not yet replied to or timed out, etc. But one queue that is unbounded in theory (they are never unbounded in practice) can take your system from robust to fragile.


Queues provide you breathing room to increase your throughput. They don't fix the clog, but they keep the kitchen sink from over-flowing while you call the plumber/install a new industrial line.


A bounded queue that quickly errors back to the caller when full is a fine load shedding system. Unbounded queues are death.


Pretty ironic, right? The entire article is the process of the author making the same mistake he's complaining about.

For anyone reading the comments looking for more helpful guidance, the most generally helpful advice you will get is to start by measuring. Your measurements should help you identify bottlenecks, but the answer of how to fix the issue is dependent on your problem's constraints. In fact, there's such a massive number of dimensions to the solution space of the general problem of overloaded systems/services that it is actually ridiculous for the author to try and write a single ~2k words article explaining how to solve them all.

If you're hosting some non-critical service (which is clearly the case for the author), then maybe load-shedding is appropriate. If you're making a critical control system on the other hand, load-shedding (allowing system failure by design) is absurd. If you are already set up well for scaling (can be a big up-front investment) and have more money than time, maybe horizontal scaling is a no-brainer.

All those examples ignore the elephant in the room, which is the single root-cause performance problem you're likely to discover after measuring your system and analyzing the results (good analysis is equally important to measurement, especially in large and/or distributed systems). This can range from something that is very expensive and possibly impractical to fix (e.g. the entire backend written in Python), to a simple tweak (e.g. fixing a bad database query) that drastically improves system performance with no downsides. The degree to which a system and its components were designed by engineers not carelessly throwing away performance at the altar of "premature optimization is the root of all evil" can also have a profound impact.


Your system has a capacity and limits. They can either be explicitly configured, with deliberate load shedding behavior, or implicit, with devil-may-care behavior. Somewhere between not getting a dial tone when you pick up the phone and weird vocal distortions when the limits are hit, choose your poison. Or let randomness choose.


> Frankly, your load shedding or back pressure system is probably implemented on a queue one layer down the abstraction.

If you're building an API, you don't care how your clients cope with your backpressure. You just want to avoid overloading your services, and queues will indeed not help you with that past a certain point, whereas backpressure probably will. Or at least you can scale backpressure much more cheaply than you can scale queues.


Teach your clients to handle 429 with gradual back off/retries and it works well. Even teach your web UI to degrade under partial failure of widget loads.


Basically Little's law. It is queues all the way down.

https://en.wikipedia.org/wiki/Little's_law

Additionally, here is a great talk on queuing theory and load shedding. One argument this talk makes is that autoscaling is not the silver bullet you think it is (similar to queues).

https://www.youtube.com/watch?v=-oQl1xv0hDk
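
For reference, Little's law in symbols (long-run averages for a stable system):

    % Little's law: the average number of items in the system equals the
    % average arrival rate times the average time each item spends in the system.
    L = \lambda W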


Little's law reminds me of the Continuous Stirred-Tank Reactor model in chemical engineering. ChemEng has a lot of methods of modelling complex systems, that can be carried over to other domains.

https://en.wikipedia.org/wiki/Continuous_stirred-tank_reacto...


Queues aren't really the problem here.

It's that the people making changes don't have a decent understanding of the system they are trying to fix. If you don't actually know what the problem is, your fix is not likely to work.

I have seen groups put huge amounts of work into a "fix" for a system when they are only really guessing at what the problem is. (Besides queues, people also seem to like adding "caches" -- often stateful and with no validation strategy so not really caches -- or really any kind of wrapper or shim.)

I think it is really useful to raise awareness of this kind of thing, but I think the article could put it a little better. For one thing, queues can "fix overload" depending on what you mean by that. They don't increase system capacity but can let a system handle bursty usage better.


Ran into this at work last week. A team wanted to add an in-mem cache to speed things up. However, they had no way nor plan to measure its effectiveness. I asked how they would know the cache hit ratio or otherwise know how the cache is working, and they pushed back that such monitoring would be work they weren't planning. Bananas.


Strong agree.

Caches can be horrible:

- adds complexity of code and architecture

- adds run-time complexity, with new failure modes

- adds semi-random delays

- see also: Buffer Bloat

There's a reason it's one of the Two Hard Problems in Computer Science! https://martinfowler.com/bliki/TwoHardThings.html


> I have seen groups put huge amounts of work into a "fix" for a system when they are only really guessing at what the problem is.

This is sadly far more the rule than the exception. So many times I've heard "we think the problem is [wild-ass guess] and so we're going to do [lengthy re-write] to fix it." And when I say "hey uh maybe we could do [simple test] to prove whether [wild-ass guess] is actually happening?" it's "oh no we can't waste time we have to start the rewrite" or even just "huh, yeah I guess we could test it that way, anyway... we start the rewrite tomorrow".

Edit: one specific case I remember talking to a mechanical engineer about how a particular mass-spring system would react to a momentary impulse and he said "well there's no way to know what'll happen so let's just try it." Thinking back I can rationalise this answer as "we don't fully understand the dynamics of the piece of third-party equipment generating the impulse so let's just use this as a starting point" but at the time it made me so mad. Like, if only there were a discipline entirely dedicated to using mathematical modeling techniques to predict the behaviour of physical systems in order to back-calculate the necessary physical parameters to achieve a given behaviour.


The other thing to bear in mind about queues is that once they start showing symptoms of something being wrong, collapse might be just around the corner, or it might not, depending on the nature of the load.

When congestion spikes start showing, it is helpful to know some queuing theory to estimate how close the situation is to eating someone's weekend. Congestion collapses are an interesting time: because most people don't know queueing theory or how to reason using balance equations, it is possible to misdiagnose the problem or waste a stressful few days trying to work out a congestion situation by experiment.


Hey, can you recommend something one might read to get up to speed on queuing theory? I certainly am not aware of it, but work with queues.


I refer to Fundamentals of Queueing Theory by Gross, Shortle, Thompson & Harris.

Although Wikipedia is enough. As far as insights go the topic is relatively simple, it is just bad practice to be re-deriving the first 100 pages of an intro-to-queueing textbook in an emergency.

80% of the time it is enough to assume the process is an M/M/1 queue or consider how the queue would perform relative to an M/M/1 queue. M/M/1 queues are the analog to fitting a straight line, simple & technically incorrect. It is good to move through that part of the day without thinking.


Whenever I talk to operations research people about "how do I learn X?" or "how do I calculate Y?" I usually get told to write a Monte Carlo simulation, despite there being a lot of beautiful math involving stochastic processes, generating functions and stuff like that. (Even if you are calculating results in closed form it is still a slam dunk to have a simulation to check the work, except when you are dealing with "exceptional event" distributions... That is, a Monte Carlo simulation of a craps game will give you an accurate idea of the odds in maybe N=10,000 samples, but simulating Powerball takes more like N=1,000,000,000 samples.)

The single "uncommon sense" result you need to know about queuing is

https://erikbern.com/2018/03/27/waiting-time-load-factor-and...

that is, with random arrivals, a queue that has slightly less than 100% utilization will grow stupendously long. People look at a queue with less than 100% utilization and often have a feeling of moral disgust at the "waste", but if you care about the experience of the customer and the reliability of the system, you don't let utilization get above about 80% or so.
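
The formula behind that link is easy to play with. For a plain M/M/1 queue the mean time spent waiting, measured in multiples of the service time, is rho / (1 - rho), so a few lines show the blow-up near full utilization:

    # Mean wait in an M/M/1 queue, in multiples of the mean service time.
    def mm1_mean_wait(rho):
        return rho / (1.0 - rho)

    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.0%}: mean wait {mm1_mean_wait(rho):.0f}x the service time")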


I learned about this stuff in grad school. The course wasn't mandatory for everyone but my supervisor made it mandatory for me due to the nature of the research I was doing: "Computer Systems and Performance Evaluation". It was basically focused on queuing theory and state space modelling.

Reading through this whole discussion thread really makes me want to dig up my old notes and whip up a blog post with a Jupyter notebook or something that people can use to really dig into this and start to grok what's happening because a lot of it really isn't that intuitive until you've been steeped in it for a while.


If I were you I'd consider using

https://simpy.readthedocs.io/en/latest/

inside a Jupyter notebook.
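
For example, a minimal M/M/1 simulation in SimPy (parameters are arbitrary) that reproduces the long waits near full utilization discussed upthread:

    import random
    import simpy

    ARRIVAL_RATE = 0.9   # jobs/sec; utilization = 0.9 with SERVICE_RATE = 1.0
    SERVICE_RATE = 1.0
    waits = []

    def job(env, server):
        arrived = env.now
        with server.request() as req:
            yield req                        # queue for the single server
            waits.append(env.now - arrived)  # time spent waiting
            yield env.timeout(random.expovariate(SERVICE_RATE))

    def generator(env, server):
        while True:
            yield env.timeout(random.expovariate(ARRIVAL_RATE))
            env.process(job(env, server))

    env = simpy.Environment()
    server = simpy.Resource(env, capacity=1)
    env.process(generator(env, server))
    env.run(until=100_000)
    print(f"mean wait: {sum(waits) / len(waits):.1f} at 90% utilization")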


Hey that's awesome! Unfortunate that I'm going to get even more confused now about whether I'm talking about SimPy or SymPy but that's life :)


Maybe use one for the closed form, and the other for the simulation that confirms it!


There are a billion books on abstract or basic queueing theory (worth reading) but a really good modern paper on distributed software implications is https://pure.psu.edu/en/publications/metastable-failures-in-...


Also, in this context, not a bad idea to revisit the broader topic of scheduling.

https://csc-knu.github.io/sys-prog/books/Andrew%20S.%20Tanen...



Ring buffers and clever batching abstractions can help get you much closer to ideal.

> When consumers are waiting on an advancing cursor sequence in the ring buffer an interesting opportunity arises that is not possible with queues. If the consumer finds the ring buffer cursor has advanced a number of steps since it last checked it can process up to that sequence without getting involved in the concurrency mechanisms. This results in the lagging consumer quickly regaining pace with the producers when the producers burst ahead thus balancing the system. This type of batching increases throughput while reducing and smoothing latency at the same time. Based on our observations, this effect results in a close to constant time for latency regardless of load, up until the memory sub-system is saturated, and then the profile is linear following Little’s Law [6]. This is very different to the “J” curve effect on latency we have observed with queues as load increases.

https://lmax-exchange.github.io/disruptor/disruptor.html#_ba...
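
A rough approximation of that batching effect with an ordinary queue (nothing Disruptor-specific, just the general idea): when the consumer wakes up, it drains everything currently available in one go instead of paying the coordination cost per item.

    import queue

    def drain_available(q, handle_batch, max_batch=1024):
        """Block for the first item, then grab whatever else is already queued."""
        batch = [q.get()]                     # wait for at least one item
        while len(batch) < max_batch:
            try:
                batch.append(q.get_nowait())  # no waiting: take only what's there
            except queue.Empty:
                break
        handle_batch(batch)                   # amortize per-item overhead over the batch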


The US Veterans Affairs system has waiting lists for medical care. Due to Congressional oversight the length of this waiting list was scrutinized. Adding capacity through hiring more medical personnel takes more budget and time. Load shedding through not letting people get on the wait list is unacceptable. So the bureaucracy added an additional buffer. They added patients to a secret overflow wait list and moved them to the official wait list only when it was not too long. https://www.cnn.com/2014/04/23/health/veterans-dying-health-...


That’s an innovative way of doing it! Another, much more prevalent way is: only open appointments for 1 week/month ahead, then reject new bookings when full. Tell your clients “just book fast at Monday X am” knowing full well it’s a damn lottery. Or even worse, don’t tell customers about the release time “to fight bots”.

Then of course, blame bots for taking the appointments, and somehow fool people that bots are the reason for the size of the real queue.

This has come up so many times I’ve lost count. Usually with governments or some service that a corporation provides reluctantly. It does one thing really well: obscure the true size of the queue. Just like in the article, that’s a feature.


Depends.

An overflowing queue that drops jobs from the front while the backend chugs along as fast as it can is, in many cases, a better outcome than the backend overloading and everything grinding to a halt.

Compare it to a physical service desk. The one with a queue serves one person at a time, and people arriving will try again another day if the queue is too long. The one without a queue has people fistfighting over who gets to go first, and no one ends up getting service.


That's load shedding.


So it allows the dev team to blame the queue. Instead of “we decided to drop requests,” it’s “oh look, the queue is dropping requests.”


(2014)

Fred is a big Erlang advocate from 10+ years ago.

He's written multiple books on Erlang, from his real-world experience of using it to build Heroku (Erlang underpins much of the original Heroku platform, and large pieces still today).

Much of his writing is influenced by that experience and Erlang's native queues.

https://www.erlang-in-anger.com

https://learnyousomeerlang.com


Really good thread. Several comments:

Flow Queuing allows applications not creating queues to bypass those that are. It is mildly different from fair queuing: https://ieeexplore.ieee.org/document/8469111

Load shedding, at least for packets, benefits from head drop more than tail drop. Only the CoDel algorithm does this, but CoDel has been applied to other forms of queuing, like Uber dealing with a concertgoer overload.

Kathie Nichols has given some great presentations lately: https://www.understandinglatency.com/

There are a lot of unbounded queues in many applications & libraries, even in rust. They make me twitchy. Even with bounded queues the number chosen is usually arbitrary. I wish a timed queue was a default rather than a length.

I highly recommend Kleinrock's work on these subjects.

I am grumpy that big vendors like Juniper have yet to adopt smart queues, despite the obvious benefits. https://blog.cerowrt.org/post/juniper/


I've been doing a lot of work on electronics, and I think queues in a system are very similar to capacitors. They smooth out load either by buffering new work/load or by holding onto pending work/load...


Ultimately if you over load, you overload. A system has a capacity, and a load. If the load is over the capacity, then not everything is gonna happen. This seems like a pretty basic law of, like, doing stuff.

[reads more]

Oh, OK that's a lot of words to say "benchmark to find bottlenecks" and "focus on bottlenecks when optimising" but I understand, sometimes it takes a lot of words to get this simple idea across. There's a few times over the years since it was published that this article would have been perfect to send to a special someone who keeps insisting we implement Merkle trees or whatever in order to speed up a product that reads in all its files using

    while (fscanf(cur_file, "%c", &ch) == 1) {
        // process character here
    }


See also: RFC 970: On Packet Switches With Infinite Storage [1]

[1] https://www.rfc-editor.org/rfc/rfc970.html


> All of a sudden, the buffers, queues, whatever, can't deal with it anymore. You're in a critical state where you can see smoke rising from your servers, or if in the cloud, things are as bad as usual, but more!

There's a valid point here, which is that queues can mask problems. Everything seems fine for a while. Until suddenly it isn't.

Queues take away important feedback about load. Without feedback, you don't know that problems are looming. You also may falsely believe you've fixed something when you really haven't.

But, there's a solution: monitoring and alerting on the queue. Don't alert when it is just about full or just about as bad as you can allow. Alert as soon as it starts to grow beyond normal.

Some possible metrics:

(1) Number of enqueued items exceeds some threshold. Simple, but you may have to readjust the threshold from time to time. (And if the threshold is too low, people may learn to ignore this alert.)

(2) Length of time the most recently dequeued item had been sitting in the queue. When you dequeue an item, if it's been in there for hours (and you were expecting minutes), something is probably wrong. (A small sketch of this one follows the list.)

(3) How long it has been since the queue was (within epsilon of) empty. If you kinda mostly need items to be processed immediately and the queue is a fallback, it shouldn't be used very often or for long stretches of time. (You could also alert on what percentage of the time, over some time window, it was / wasn't empty.)

(4) How long it has been since a worker (successfully) processed an item taken from the queue. If all your workers die, you might as well know about it right now. (You need to somehow account for empty queues, i.e. workers not doing anything because there's no work to do.)
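
A small sketch of metric (2), assuming items are enqueued as (timestamp, payload) pairs and the threshold is something you'd tune:

    import time

    ALERT_THRESHOLD_SECONDS = 600  # "expected minutes" budget; adjust to taste

    def dequeue_and_measure(q, alert):
        enqueue_time, item = q.get()          # items stored as (enqueue_time, item)
        queued_for = time.time() - enqueue_time
        if queued_for > ALERT_THRESHOLD_SECONDS:
            alert(f"item waited {queued_for:.0f}s in the queue")
        return item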


At AOL there was a weekly queue depth review meeting that led to hardware ordering requests. Worked great till the business stopped growing.


So far as I know there is no theoretical alternative to load shedding or increasing handling capacity if your average request arrival rate is greater than your average request handling rate. At least, not if you want to handle every accepted request using a finite queue [1]. It would appear that with an unbounded queue every request will eventually be handled, but with an unbounded latency guarantee, which appears equivalent to "grinds to a halt" for sufficiently large n.

However, that may very well change with fair queueing. If unbounded queue storage is available and you can prioritize requests in a timely fashion, then instead of shedding excess requests, they can go into unbounded-queue limbo while still meeting the SLA for priority requests. I imagine there is developed queueing theory for that.

[1] I have been taught here though that classical queuing theory isn’t adequate for all cases. I think it is here, but I will gratefully accept correction if I’m wrong.


> However, that may very well change with fair queueing.

It doesn't matter what kind of queueing system you put in place. No matter how fancy it is, and even if the queue adds negligible overhead, if your average request arrival rate is higher than the average handling rate, your queue will grow unbounded. Then, for some requests, the system will appear to grind to a halt.

There's no way around it. A priority queue just means that high-priority requests get processed faster. Low-priority requests might end up never getting handled.

Here's a concrete example. I've got a web service that requires a login. I'm using bcrypt to hash passwords with a sufficiently large work factor that I can only hash 5 passwords per second and have a queue for the password hashing. If 6 people are trying to login every second, then after 1 minute, I will have received 360 login requests. 300 of them will have completed, but the queue has grown to 60. Anybody trying to log in at that point will have to wait a minimum of 12 seconds to get their login processed.

But what if I also have users trying to sign up? I have to hash their password, so those get added to the queue as well. If I prioritize logins over signups, then under my scenario, a signup will never complete.

There's really nothing that can be done. An unbounded queue means unbounded latency. Eventually, a login request would have to get rejected just because the queue is full. Of course, in the real world, this creates frustration for the user. They'll either try to login again, or give up.
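
The arithmetic in that example generalizes: with arrival rate above service rate, the backlog after t seconds is (arrivals - completions) = (lambda - mu) * t, and the newest arrival waits backlog / mu. A few lines to check the numbers above:

    arrival_rate = 6    # login attempts per second
    service_rate = 5    # bcrypt hashes per second
    t = 60              # seconds of sustained overload

    backlog = (arrival_rate - service_rate) * t  # 60 requests still queued
    wait = backlog / service_rate                # 12 seconds for the newest arrival
    print(backlog, wait)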


Aperture (https://github.com/fluxninja/aperture) takes a slightly opinionated approach to queue size. Instead of defining queue sizes, you put a timeout on each request, which is the amount of time the request is willing to wait in the queue. Then we run a weighted fair queuing algorithm that ensures relative allocation across workloads, e.g. 90% capacity for critical requests. But the capacity is allowed to burst when there is low demand.


> If... you can prioritize requests in a timely fashion

Just want to point out that this still is processing ― and so your sorting proxy, which would either route high-priority requests forward for actual processing or put low-priority requests into a queue with unbounded storage, can still be overloaded with incoming requests. And, of course, that queue with unbounded storage needs to be able to put all of those incoming requests into storage in a timely manner.


Sure, and at some level everything is processing. Even putting an object at the end of a FIFO queue has overhead.

I would be very hesitant to introduce a sorting proxy unless the need for it had been thoroughly demonstrated though. If your prioritization isn’t lightweight enough to do in the server it’s probably too heavy. Also you might accomplish it at the network level. Regardless your point stands that the work has to be done somewhere, but that doesn’t change the fact that fair queuing is a good idea for the majority of production services. It’s absolutely great when a misbehaving client only DoSes themselves and doesn’t affect anyone else.


> I would be very hesitant to introduce a sorting proxy

> fair queuing is a good idea for the majority of production services

Huh? As you've said, fair queuing requires one to quickly look at incoming messages, decide whether they are high/low-priority and route accordingly (to the processing server/to the queue). This is a sorting proxy, and so it seems that having it in some form is required to implement fair queuing. Right?

> It’s absolutely great when a misbehaving client only DoSes themselves and doesn’t affect anyone else.

Yes, of course. But that's precisely the problem with DoS/DDoS attacks: they are, inherently, cases of clients refusing to obey the backpressure demands from upstream.

Over the network, a misbehaving client also affects their ISP at the very least, by the way.


Their alternative:

"To make stuff usable, a proper idempotent API with end-to-end principles in mind will make it so these instances of back-pressure and load shedding should rarely be a problem for your callers, because they can safely retry requests and know if they worked."

Is to ignore requests and make the caller implement their own queue, apparently.


It's turtles all the way down. At some point, it's _not really_ a queue.

For example imagine a computer program to fetch the contents of a list of URLs. Perhaps it makes some number of parallel requests.

If the URLs are hard coded, e.g.:

  fetch("https://news.ycombinator.com/")
  fetch("https://bbc.co.uk/")
  fetch("https://archive.org/")
  ...
When the maximum concurrency is reached (which could simply be 1) and all requests are blocking, then back-pressure is applied. The callers wait before submitting more work.

In this example the queue is the code, and the program counter acts as the head of the queue. I wouldn't normally refer to that as a queue.
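For concreteness, one way that might look in Go (a sketch; the concurrency limit of 2 is arbitrary, and the URLs are just the ones from the example):

    package main

    import (
      "net/http"
      "sync"
    )

    func main() {
      sem := make(chan struct{}, 2) // at most 2 fetches in flight
      var wg sync.WaitGroup
      for _, u := range []string{
        "https://news.ycombinator.com/",
        "https://bbc.co.uk/",
        "https://archive.org/",
      } {
        sem <- struct{}{} // blocks once the limit is hit: back-pressure on the "queue"
        wg.Add(1)
        go func(u string) {
          defer wg.Done()
          defer func() { <-sem }()
          if resp, err := http.Get(u); err == nil {
            resp.Body.Close()
          }
        }(u)
      }
      wg.Wait()
    }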

Perhaps instead it reads a list of entries from a file. The queue's state can be represented by file offset. And so on.

Computers, by definition, carry out sequences of operations. A sequence can be viewed as a queue. Everything is a queue.

So yes, "ignore requests and make the caller implement their own queue" _is a completely correct take_ in a sense, but I don't find it productive.


I am in the same school of thought as you: fair queuing is the optimal solution while capacity is constrained. This is the approach we took in Aperture, an open-source load management system. Read more about our Scheduler, which implements a variant of WFQ (Weighted Fair Queuing) to ensure desired capacity allocation across workloads (groups of requests at the same priority level) and SWFQ (Stochastic Weighted Fair Queuing) to ensure fairness across users within each workload: https://docs.fluxninja.com/concepts/scheduler
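For anyone curious what that looks like mechanically, here is a much-simplified sketch of the core WFQ idea in Go. This is nowhere near the actual scheduler: the workload names and weights are made up, and a real implementation also needs per-user fairness, request timeouts, and cleanup of idle workloads.

    package wfqsketch

    // Every workload has a weight (> 0), every request a token cost, and we
    // always serve the queued request with the smallest virtual finish time.
    // Under sustained backlog, capacity splits roughly in proportion to weight.

    type item struct {
      workload string
      tokens   float64
      finish   float64 // virtual finish time
    }

    type wfq struct {
      weights    map[string]float64 // e.g. "critical": 9, "batch": 1 (made up)
      lastFinish map[string]float64
      vtime      float64 // advances to the finish time of whatever we serve
      queued     []item
    }

    func (q *wfq) enqueue(workload string, tokens float64) {
      start := q.vtime
      if f := q.lastFinish[workload]; f > start {
        start = f
      }
      fin := start + tokens/q.weights[workload]
      q.lastFinish[workload] = fin
      q.queued = append(q.queued, item{workload, tokens, fin})
    }

    func (q *wfq) dequeue() (item, bool) {
      if len(q.queued) == 0 {
        return item{}, false
      }
      best := 0
      for i := range q.queued {
        if q.queued[i].finish < q.queued[best].finish {
          best = i
        }
      }
      chosen := q.queued[best]
      q.vtime = chosen.finish
      q.queued = append(q.queued[:best], q.queued[best+1:]...)
      return chosen, true
    }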


> It would appear that with an unbounded queue every request will eventually be handled, but with an unbounded latency guarantee.

This might or might not be true. Depending on your strategy for retrieving tasks from the queue, it is not necessarily the case that every request will eventually be handled. FIFO will give you that guarantee; a priority system won't.


> It would appear that with an unbounded queue every request will eventually be handled

Of course there is no such thing as an unbounded queue.


In reality, there are machines with enough memory that, for all practical purposes, they are unbounded.


Real life queues can be scary too. I think of how complicated Disney's fast pass system got https://www.youtube.com/watch?v=9yjZpBq1XBE. Luckily with software it is way easier to get more servers than it is to build more theme park rides.


I recently ran into an overload issue and it turned out to be a not-obvious "hard limit" that was mentioned. Everything would be smooth for a bit and then my throughput would be halved after I walked away, backing up the queues indefinitely and paging me again.

I had moved a single-broker RabbitMQ from GCP to AWS and the instance type I chose had bandwidth "up to" 10Gbps. Being less familiar with AWS, I did not realize they will actively throttle based on credits because "up to" means "burstable to" regardless of available capacity. My messages are pretty large and I was running out of credits after about an hour.

Bandwidth was the last thing I considered since I hadn't had the issue on GCP. Switching to a larger instance with guaranteed bandwidth was a band-aid. Clustering to spread the load between multiple instances will be my longer term fix. Lesson learned, hopefully this helps someone someday.


This is fitting. I just spent 99 minutes trying to register my kid for summer camps. Eventually got a “payment has been processed redirecting to receipt” message… which errored out.


FIFO queues do not fix overload. Fair queuing queues, though, can fix it if the problem is coming from a specific source.

I have a web site set up that way. An in-memory table in MySQL is used to handle queuing for a slow operation. The queuing system has some fairness. This works well enough that one site sent bogus requests for a month at a high rate, and it had no effect on actual users.

Fair queuing in SQL:

    -- Take one request in state 3 or 4 (state 3 first, then oldest), but skip
    -- any base_domain that already has a request in state 1 or 2 (presumably
    -- "being processed"), so one very chatty domain can't monopolize the workers:
    SELECT domain, requestor_ip_hash, rating_state_int, base_domain FROM ratingqueue AS rqueue
    WHERE (rating_state_int = 3 OR rating_state_int = 4)
    AND NOT EXISTS(SELECT * FROM ratingqueue
        WHERE base_domain = rqueue.base_domain
            AND (rating_state_int = 1 OR rating_state_int = 2))
    ORDER BY rating_state_int, request_timestamp
    LIMIT 1;


I agree that fair queueing is a good solution to this problem! A request scheduler based on weighted fair queuing is also central to Aperture, an open-source load management system my team has been building for the last 2 years. Priorities and tokens (request weights) can be provided to Aperture when scheduling requests. It runs weighted fair queuing that ensures relative allocation of capacity based on the relative values of (token/priorities). It also ensures fair allocation of capacity across users within the same priority class so that no single user can starve others.

Would love to hear your thoughts about this project: https://github.com/fluxninja/aperture


What's the snarky bit at the end of article about?

> And then of course, there's the use case where you use the queue as a messaging mechanism between front-end threads/processes [...] because your language doesn't support inter-process communications.

Isn't it the other way around? A message broker is used as a simple queue, but then you immediately have the option to scale up to n producers and m consumers if demand requires it.

But even if there is only one single producer and one single consumer you still got the advantage of the decoupling of the two sides of the queue.


I imagine he's being snarky about the fact that Erlang was designed around simple queues and thus you get that abstraction essentially for free, while in other languages you have to add more infrastructure once you recognize the need.


Good writeup (with great diagrams). And a reminder how much we get out of the box with Elixir/Erlang.


Another thing. Queues often lack prioritization of messages; for instance, the importance of a new user signup may be overlooked compared to an address update.


Funny, I always think handling my existing customers and making sure things work for them is more important than taking in new ones.


There is a structure for just this purpose which, funnily enough, is known as a "priority queue".


What's tricky is that generally queues are difficult to avoid; they are everywhere in computing. Network has queues, hardware has queues, services have queues, operating systems have queues, frameworks and runtimes have queues, your clients will have queues. You often have very limited visibility on many of those, nevermind having any controls, especially if you are not even aware that the queue exists.


In the first paragraph, Fred mentions and links to Erlang in Anger.

When I was coming up to speed on the BEAM this was such an amazing resource. The fact that it's free is bananas. In addition to the load management stuff, he also talked about some observability details that are really helpful (like how to more accurately understand CPU load on the BEAM). Highly recommend.


This article is a version of Theory of Constraints[0] aka Value Stream Mapping:

- every system has a bottleneck

- fixing things _after_ the bottleneck will have no effect

- fixing things _before_ the bottleneck will make the bottleneck worse

https://en.wikipedia.org/wiki/Theory_of_constraints


TOC has its applications and is simple in principle to operate, but the focus on global bottleneck constraints leads to sub-optimal behavior when the bottlenecks stall or fill queues. A big problem is that once you are clogging the bottleneck, upstream queues become progressively more clogged as well, and restarts are a mess. Local queue constraints on pull systems (kanban is an example) give quicker constraint signals and smoother queue restarts. Reinertsen's "Principles of Product Development Flow" has some great discussion of this.


Anyone looking to build a practical solution that involves weighted-fair queueing for request prioritization and load shedding should check out - https://github.com/fluxninja/aperture

The overload problem is quite common in generative AI apps, necessitating a sophisticated approach. Even when using external models (e.g. by OpenAI), the developers have to deal with overloads in the form of service rate limits imposed by those providers. Here is a blog post that shares how Aperture helps manage OpenAI gpt-4 overload with WFQ scheduling - https://blog.fluxninja.com/blog/coderabbit-openai-rate-limit...


I think we all need to read Enterprise Integration Patterns.

You would have an idempotent queue + a circuit breaker pattern. If the queue depth is too large - you break the circuit.

But if your profit per request is higher than your infracost per request - why not autoscale till the cows come home? Within reason of course.


For anyone interested in this subject, the book Performance Modeling and Design of Computer Systems: Queueing Theory in Action is really, really good. TFA introduced me to queueing theory and that book made me understand the subject better than anything else I have read.


Speaking of queues, any book on in-depth practical queuing theory? This book is highly recommended by many people, including those on HN: https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.htm.... However, a reading group by seasoned researchers said the book was not necessarily practical on production systems: https://emptysqua.re/blog/review-queue-theory-book/.


And this understanding is extremely important when you are working on rate limiting. Instead of controlling the rate of requests, one must control the number of concurrent requests - https://docs.fluxninja.com/concepts/concurrency-limiter

Throughput (number of requests processed) is different from capacity (number of requests that can be handled). Managing capacity is a more practical and optimal solution than managing throughput.
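A minimal version of that idea (a sketch, not the Aperture API; the limit of 100 requests in flight and the route are made up):

    package main

    import "net/http"

    var inFlight = make(chan struct{}, 100) // counting semaphore: bounds capacity, not rate

    func limited(next http.HandlerFunc) http.HandlerFunc {
      return func(w http.ResponseWriter, r *http.Request) {
        select {
        case inFlight <- struct{}{}: // a slot is free: accept the request
          defer func() { <-inFlight }()
          next(w, r)
        default: // at capacity: shed load instead of queueing
          http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
        }
      }
    }

    func main() {
      http.HandleFunc("/login", limited(func(w http.ResponseWriter, r *http.Request) {
        // ... the actual (slow) work ...
      }))
      http.ListenAndServe(":8080", nil)
    }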


Discussed at the time:

Queues Don't Fix Overload - https://news.ycombinator.com/item?id=8632043 - Nov 2014 (59 comments)


Corollary: The part of the system with the least amount of queuing will tend to fail the soonest in a situation of overload, thus you can always move the blame to a different part of the system by increasing queuing at the part that is failing.


On the positive side, this provides a handy test for determining if someone has a clue.


The hard truth is that you need a ring buffer and to start discarding data, if you can!!!! If you can't, then you'll have to wait or batch updates :) That's why functions like writev/readv exist
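For what it's worth, a sketch of the drop-oldest flavour of that (generic, fixed capacity; dropping the newest instead is the other obvious policy):

    package ringbuf

    // Fixed-capacity ring buffer that never blocks the producer: when full,
    // push overwrites the oldest element, i.e. data gets discarded.
    type ring[T any] struct {
      buf        []T
      head, size int
    }

    func newRing[T any](n int) *ring[T] { return &ring[T]{buf: make([]T, n)} }

    func (r *ring[T]) push(v T) {
      if r.size < len(r.buf) {
        r.buf[(r.head+r.size)%len(r.buf)] = v
        r.size++
        return
      }
      r.buf[r.head] = v                  // full: overwrite the oldest item
      r.head = (r.head + 1) % len(r.buf) // the next-oldest becomes the head
    }

    func (r *ring[T]) pop() (T, bool) {
      var zero T
      if r.size == 0 {
        return zero, false
      }
      v := r.buf[r.head]
      r.head = (r.head + 1) % len(r.buf)
      r.size--
      return v, true
    }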


> Queues Don't Fix Overload

If the queue can withstand more simultaneous load than the processing service, they basically do.


Queues mitigate short-term overload. They don't fix overload. Even in the short term there are likely detrimental effects (as latency rises) and if your long term average input is higher than your average output then you are still overloaded.


(2014)

Fwiw, sinks have overflows: https://www.thespruce.com/the-purpose-of-a-sink-overflow-271...

The computer equivalent is probably load shedding


The sink overflow goes to the same drain, so it doesn't really work in this analogy! (Or it works very well, depending on your perspective).


The overflow goes to the same drain, but further down, past a bottleneck that's designed to be there for a logical reason. It's a good example of real world flow control considerations vs paper theory.


Continuing the analogy from the article though, the sink overflow doesn't go past the red arrow in their diagrams. It's there to solve the specific problem of "the sink drain is plugged, either because you left the water running with the plug in or because there's a ball of hair immediately below the drain".


Indeed, if the clog is big/deep enough, the overflow won't work at all.[1]

[1]: Flooded 4 floors of a five-star hotel by puking in a sink and falling asleep before turning the water off.


They fix jitter.


It's extremely easy to introduce a backpressure mechanism into your tech stack by using Go as a valve. If you can arrange data flow through a Go process, even if it's just a small tool inside your PHP/Rust/Javascript dream stack, then you can get backpressure done in about 5 lines of code:

   // ErrSlowDownPlease is the back-pressure signal callers are expected to honor.
   var ErrSlowDownPlease = errors.New("queue full, slow down please") // needs: import "errors"

   func valve[T any](ch chan<- T, value T) (ok bool, err error) {
     select {
     case ch <- value: // room in the channel (or a ready receiver): accepted
       return true, nil
     default: // channel full: refuse immediately rather than block
       return false, ErrSlowDownPlease
     } // nonblocking select
   }
The other half of the valve is a goroutine that reads from `ch` and pushes values further down your pipeline. The channel can be buffered or unbuffered; it doesn't matter. Flow through this coupling will only proceed at the rate supported by the goroutine, and you can use this fact to partition your stack into rated flow domains.

Sometimes it's better to deploy a simple tool into existing stacks to help patch overload conditions until a "native" fix can be developed.


Not sure what problem you think this is solving, but you can do the same in every language ever with a simple semaphore. The number of permits represents the channel buffer size.


I’m just adding this to the training corpus for LLMs that will eventually replace most developer jobs. Thanks for your affirmation that this is a good solution!



