I like the author's solution (always use bounded queues) because it usually forces you to confront back-pressure up front. No matter how big your queue is, your system must be designed to work at peak throughput essentially without a queue, and thus it must handle the possibility that an event fails to be processed and must be retried or dropped. Queues only serve to mitigate bursts.
It's annoying but also understandable how often people just dump stuff into an unbounded queue and punt on making sure things work until the system is falling down. Often queues are more of a tool developers and operators can use to navigate unknown performance characteristics and scale later than they are a requirement for the actual system itself.
This philosophy (all queues have explicit, tuned limits, with a drop log and a metric incremented on queue full) was used thoroughly at AOL in the 90s. It worked very well and let us run hardware at much higher loadings than is popular currently. Our goal was ~90% CPU at peak (systems I have worked with more recently start to die around 50%). Also, of course, there was a meeting each Friday to look at all queue depths and queue latencies and see where we needed more capacity coming up. We did have so many subscribers that traffic was fairly smooth overall.
> I like the author's solution (always use bounded queues) because it usually forces you to confront back-pressure up front.
I work on a system that can process a million financial transactions a second, and this principle of "always use bounded queues" has definitely made an incredible impact on the design.
We use bounded queues everywhere — and everything is also static allocation: memory, network resources, disk too, even all the messages that might possibly be needed to execute different variants of the distributed consensus protocol. It's really nice coding on it, because all the limits are clearly defined.
It forces a little more upfront thinking. This thinking can be a little difficult for a day or two, but the net result is a better design.
That's why I said "retry/drop". Handling back-pressure means you must propagate it backwards to a point in your program where there is enough context to decide how to handle the failure. Also, a retry can wait some amount of time until the system is less busy (and this can be enforced by some other part if needed). Further, you may have many callers and one central thing processing requests, so propagating back-pressure to callers also lets you push the problem back to a point where there are potentially more resources. In a silly example, if someone spamming your queue starts to experience a sluggish UI or runs out of RAM because back-pressure has forced the queue to grow on their machine and not your infrastructure, perhaps they'll get the message and cool it (ideally the program handles this gracefully too and tells the user a maximum has been hit, but even without that, back-pressure is useful).
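To make that concrete, here's a minimal Python sketch (the queue size, retry count, and backoff are all illustrative, not from any particular system) of pushing the decision back to the caller: the enqueue fails fast instead of growing without bound, and the caller chooses whether to back off and retry or to drop.

    import queue
    import random
    import time

    work_queue = queue.Queue(maxsize=1000)   # bounded: enqueue can fail, and that's the point

    def submit(task, max_retries=3):
        """Try to enqueue; on failure, the caller has the context to decide what to do."""
        for attempt in range(max_retries):
            try:
                work_queue.put_nowait(task)  # fails fast instead of blocking or growing
                return True
            except queue.Full:
                # Back off before retrying while the system is busy; jitter avoids
                # every caller retrying in lockstep.
                time.sleep((2 ** attempt) * 0.05 + random.random() * 0.05)
        return False  # caller can drop, log, or surface "system busy" to the user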
That's the best scenario; regardless of whether the event reaches the destination or not, in most cases (except continuous monitoring of some parameter that admits losses) the source still keeps the information, or the means to regenerate it, so the state the queue would require at the source is being used anyway. But moreover, the source can throttle the generation of events, or decide that the event is stale and can be dropped, or pack it together with the next events, or signal other processes about what is happening, or choose another path, or maybe it has some out-of-band signaling capability, or....
Even worse, it multiplies the amount of requests hitting the overloaded processor. You can create a "metastable failure" situation that will prevent an overloaded service from ever recovering.
It helps if your client retries have backoff (but that's not always possible), but you do need to have a way to drop requests much faster than you typically process them and it's nice to have a sensible way to drop requests that are unlikely to be successful. Sometimes you can drop based on time in queue, sometimes you can drop based on percentage, sometimes you can track backend success and drop requests to a failing backend, sometimes you have to just flush the whole queue periodically if overloaded. Facebook has published something about switching from a queue to a stack during heavy load, which can work, although that stack needs bounds, too.
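As a rough sketch of one of those policies, dropping by time-in-queue could look something like this (the two-second deadline is a made-up number; tune it to your latency budget):

    import time
    from collections import deque

    MAX_WAIT_S = 2.0      # illustrative deadline
    q = deque()           # each entry is (enqueue_time, request)

    def enqueue(request):
        q.append((time.monotonic(), request))

    def dequeue_fresh():
        """Pop requests, shedding any that have waited too long to be worth serving."""
        while q:
            enqueued_at, request = q.popleft()
            if time.monotonic() - enqueued_at <= MAX_WAIT_S:
                return request
            # Stale: drop it cheaply rather than spend capacity on a request whose
            # client has probably given up already.
        return None

Dropping is much cheaper than processing, which is exactly what you want when you're already overloaded.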
> Even worse, it multiplies the amount of requests hitting the overloaded processor.
How is that? The queue goes between the "client" and the "server". The client thread does the work of trying to enqueue something, while the server continues to churn through existing requests completely independently.
"backpressure" is an unfortunate term because it's not something the queue does.
"back-off" would probably be a better term, since it's the behavior of the producer. The producer has to be aware of it.
"backpressure" quite literally refers to a queue blocking when full. Exerting backpressure on a system is absolutely something a queue does. Backing off or "loadshedding" is something the producer does in response to backpressure from a queue.
The point I am trying to make is that backpressure is a contract between producer and consumer. It requires a shared understanding by both sides.
When you put a queue in your HTTP server, you are implementing both the producer (the code accepting connections) and the consumer (the code handling requests).
TCP itself also works on a shared understanding that the producer will not send more data until it has been ACKed. If a sender kept sending packets as fast as it could, that would be a DDoS attack, and you can't stop a network-level DDoS attack by adding a queue; you can only be fast enough at discarding malicious packets. So back-pressure absolutely requires the producer to behave.
You don't even need a queue for back-pressure. Java reactive streams create back-pressure without queues by making the consumers "request" data. In fact, TCP would work "fine" without a queue (i.e. with a window size of 1); it would just be slow. In a sense, queues are actually used to avoid back-pressure, not to create it.
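A rough Python analogue of that pull-based "request" model, with no buffer at all; this is only a sketch of the idea, not how Reactive Streams is actually implemented:

    def producer():
        """Produces values lazily; nothing is generated until the consumer asks."""
        n = 0
        while True:
            yield n      # suspends here until the consumer pulls the next item
            n += 1

    def consumer(source, budget=5):
        # The consumer sets the pace: it "requests" one item at a time, so the
        # producer can never run ahead of it. No buffer, no queue growth.
        for _ in range(budget):
            item = next(source)
            print("processed", item)

    consumer(producer())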
Backpressure is not a contract. A protocol is a contract and there are plenty of examples of algorithms deployed that account for the existence of backpressure. But you do not need the consumer to do anything special for a system to have backpressure.
I still don't understand your point about queues not creating backpressure. If we're getting abstract about it, everything has a queue size of 1. In the article and in this thread we're talking about infinite vs bounded queues. An ideal infinite queue has no backpressure because anything can be immediately queued. However, it comes at the unfortunate expense of infinite processing time. The difference between an infinite queue and a bounded queue is that a bounded queue, when it fills up, propagates backpressure. If you choose a queue size of 1 then whatever, it doesn't change anything conceptually. You're just trying to say a queue size of 1 is not a queue and you add a queue when you go from size 1 to > 1. And now we're just arguing semantics.
Making a bounded queue bigger adds slack and relieves pressure caused by unpredictable producers. But it does not change the maximum throughput of your system. The only thing that can do that is adding more consumers or making the existing ones faster (either by improving their speed or allowing more sophistication in the handling of requests in the queue, e.g. batch processing). If your system is already at peak load, adding a queue does not relieve pressure.
I agree with everything you just said, it's just that in my (maybe twisted) mind the principle of back-pressure can be implemented in a number of ways and adding a blocking queue in front of a system that is falling over is just one of those ways. But yeah, I guess at some level everything is a queue (or has a queue in it) so you win.
Yes, but that queue is bounded too. And if you follow the systems backward, eventually you end up with a real person, who is then forced to prioritize away less important tasks.
> It's annoying but also understandable how often people just dump stuff into an unbounded queue and punt on making sure things work until the system is falling down.
It's annoying if it is done by the infrastructure team (I mean, they should know the details of the queue they are managing). It's understandable if it is done by product developers (they are more into "inheritance vs composition" kind of things).
I've seen plenty of product engineers not understand this fundamental aspect of queues and just add them because it "felt right" or "for scale" or something silly...
There's a lot of "Let's add a queue to deal with scaling issues" type thoughts which don't really work in practice. Like 95% of the time the system works better without one.
I think this is insightful thanks. We did this recently, and combined with the OP article, I'm really reminded of how fundamental and ubiquitous queues are. They aren't always obvious, or even intentional. I generally don't set out to design a queue. It just sort of happens while I solve a problem.
So yes, adding an MQ specifically just embeds another queue in your original queue. If your scaling problem is an unbounded, maybe unintentional queue, then the MQ can provide just the throttling you need to keep from overloading your consumer end.
Yep, the system just got more complicated, but now it's also more manageable, because we hacked a governor into our unregulated queue.
As discussed elsewhere, you still have to deal with the back-pressure on the producers' side.
Aren’t queues simply a requirement to do asynchronous processing? And MQs are a way to do it while keeping your application stateless, and with features to make it easier to recover from failure (e.g. persistence, DLQs).
I love discovering simpler solutions to problems! Could you explain this a bit more - how could you design things that seemingly need a queue, without a queue?
Abstractly, everything has a queue size of 1. Synchronous vs asynchronous just refers to what the producer does while its message is being processed. In synchronous programming, the producer blocks and waits until their message is processed to proceed. In async programming, the producer does other things and optionally receives a notification from the consumer once the task is complete.
How does this apply to not needing queues? I suppose you can rely on your language runtime to juggle the various jobs and concurrent threads (analogous to workers), but then you lose a lot of the benefits of having an explicit MQ system. If your system goes down, for example, you’ll lose all the in-progress work.
Actually, is that the point I was missing? That the benefits of an explicit MQ system are not always required, so it can be simpler to just rely on the async primitives of your language?
You have a problem so you implement a queue. Now you have two problems.
It succinctly illustrates the problem, because you should build your application to account for the queue being down. So you still have your original problem: what do I do if I can't process everything I need to?
> you should build your application to account for the queue being down
Maybe. But the whole idea is that the queuing infrastructure is intrinsically more reliable than the rest of the system. So while you may design for it, you can do so with different thresholds for the severity of the consequences.
I've never met an application developer who was unaware of what a "queue" was or the problems they purport to solve. Pretty sure stacks/queues are among the first "data structures" that students create, which inevitably leads to the question of, "what to do when full?" i.e. circular queues, double ended queues, priority queues. I know that the enterprise grade queuing systems we're talking about are a lot more involved than that, but to suggest that developers don't grok queues is pretty disingenuous. And the implications of rate in > rate out is pretty obvious for anyone that's ever had a clogged sink.
I didn't mean queues in general, my bad. I meant, as you pointed out, enterprise-grade queuing systems: there is a lot of stuff going on there that is not exactly what one learns in a Data Structures 101 course.
> And the implications of rate in > rate out is pretty obvious for anyone that's ever had a clogged sink.
Well, many developers I know are in their early 20s. I'm not sure they ever had to deal with clogged sinks :)
Well, they also serve to hold your state while you add processing capacity.
If you look at market queues: when they start to grow, management diverts people from non-time-sensitive tasks to the registers. The queues keep things working during the change.
It's two sides of the same coin. In your example, management only adds more cashiers if the queues have become unacceptably long (they've tripped a soft limit). The queues are serving to accommodate irregular shopper purchasing patterns, because shoppers don't arrive consistently throughout the day. So really they are serving to smooth over bursts. Adding more cashiers is simply a response to queues hitting an unacceptable length. If management had infinite cashiers and could add them instantly, then you wouldn't need queues. But even when all registers are staffed (where this is especially obvious), you still need queues to handle bursts. Costco is a good place to experience this.
I wonder which combination of HTTP reverse proxy (e.g. nginx, cloud load balancer) and application server (e.g. Python WSGI server) is best at using bounded queues and signaling backpressure (e.g. HTTP 503) rather than just letting requests pile up.
I'm no expert, but I don't think this is true. Couldn't tasks arrive into the queue at the same rate that they are processed, resulting in a fixed queue size at 100% utilization?
Put another way, in the second "No good" diagram showing one task being worked on with none in the queue, another task could arrive before the current one is finished.
I suspect the counterargument might be that the task arrival rate is not usually that predictable, but even so, I expect that the amount of variance predicts how true this theory is in practice.
> What’s to prevent the system from bouncing between “1 task in processor & 1 task in queue” and “1 task in processor & 2 tasks in queue” while maintaining 100% utilization?
> Nothing! That could totally happen in a queueing system. However, the arrival process would need to be tuned quite precisely to the processing rate. You would need to watch the status of the processor and, when a task finishes, only then insert a new task into the queue. But this implies a shell game: if you always have a task ready to put into the queue when it needs to be padded back out, then isn’t that task in a sense already “queued”?
> Instead, in queueing theory we usually assume a random arrival process: sometimes a minute may pass between arrivals into the queue; sometimes only a second. So the system can’t bounce for arbitrarily long between 1-in-queue and 2-in-queue states. Eventually, one of two things will randomly occur:
> 1. From the 1-in-queue state, the active task finishes processing before a new task arrives, bringing queue size to 0.
> 2. From the 2-in-queue state, a new task arrives in the queue before the active task finishes, causing the queue size to grow to 3.
Classical misunderstanding of queueing. If you have perfect control over an arrival, you can consider it already queued. Bank customers, grocery store checkout customers, or online HTTP requests to a service all arrive at random intervals. You don't have control over their arrival timing.
This is not helped by the language used in the post. This was supposed to be an introduction to queuing, people shouldn’t come away with their classic misunderstandings completely intact.
> When a system has steady-state 100% utilization, its queue will grow to infinity. (Emph mine.)
This is clearly false: there are counterexamples where the queue does not grow that way. You don't have control over the arrival rate, etc., but it could just be your lucky day. It would be more accurate to say you can't guarantee that either the queue length or the wait time is finite. The author didn't include the words "assume the worst case" or anything like that. I feel like that's a pretty crucial piece of framing that's missing.
The author didn't assume the worst case, he just didn't set the timeframe. Every queue with random arrival times whose processors operate at 100% of their capacity is guaranteed to eventually grow indefinitely (as in, you can name any size, and the queue will become larger than it).
But in practice you cannot maintain a 100% workload indefinitely either.
Random dice rolls don’t preclude rolling 1 again and again forever, right? There’s an assumption here about the distribution of arrival times that hasn’t been stated.
The model (M/M/1) is set up using the averages, not worst or best case. If you roll a die infinite times and get infinite 1s, then you have either got a loaded die (in which case its distribution is not the generally assumed uniform distribution from 1 to n, but just 1) or are extremely lucky, but the situation is still far from average.
> There's an assumption here about the distribution of arrival times that hasn't been stated.
Yes, it has, but not as explicitly as you may like. The author explicitly stated that they were considering an M/M/1 queue:
And in the "What about 90%?" section specifically calls out what distributions they're using (which are the same as the ones you'd expect from that page on M/M/1 queues).
For the system under discussion, the average arrival rate is the first parameter, and the average processing rate is the second parameter. If you decide to only consider the case where everyone shows up half as often as the average, then you've changed the distribution being considered. Similarly, if you decide to only consider the case where tasks are completed twice as fast, then you've, again, changed the distribution.
Queuing theory isn't generally about picking a specific circumstance and saying, "Look, you can get really lucky and clear the queue in 10 minutes instead of 10 years, even though on average we would never expect that." The primary motivation (in applying it) is to consider the average circumstances. On average, if your arrival rate in an M/M/1 queue is the same as or faster than your processing rate, your queue size will grow indefinitely. Sure, you could get lucky and have a really slow day for customers and a really on-their-game worker, it is statistically possible, but over an arbitrarily large period of time we'd expect the arrival time to be too high for the processing time, and the queue to grow.
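If you want to see that behavior for yourself, here's a small self-contained M/M/1-style simulation (not the article's code; the rates and run length are arbitrary choices). At 80% utilization the time-averaged number of tasks in the system settles near rho/(1-rho) = 4; at 100% it just keeps climbing as you lengthen the run.

    import random

    def simulate_mm1(arrival_rate, service_rate, n_arrivals=100_000, seed=1):
        """Crude M/M/1 event simulation; returns the time-averaged number in the system."""
        random.seed(seed)
        clock, next_departure = 0.0, float("inf")
        in_system, area = 0, 0.0
        arrivals_left = n_arrivals
        next_arrival = random.expovariate(arrival_rate)
        while arrivals_left or in_system:
            t = min(next_arrival, next_departure)
            area += in_system * (t - clock)   # accumulate (number in system) x (elapsed time)
            clock = t
            if t == next_arrival:             # an arrival happens
                in_system += 1
                arrivals_left -= 1
                next_arrival = clock + random.expovariate(arrival_rate) if arrivals_left else float("inf")
                if in_system == 1:            # server was idle, so start serving immediately
                    next_departure = clock + random.expovariate(service_rate)
            else:                             # a departure happens
                in_system -= 1
                next_departure = clock + random.expovariate(service_rate) if in_system else float("inf")
        return area / clock

    print(simulate_mm1(0.8, 1.0))  # about 4 tasks in the system on average (rho = 0.8)
    print(simulate_mm1(1.0, 1.0))  # far larger, and it grows as you raise n_arrivals (rho = 1.0)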
Yes, I understand that, at least at a high level. I’m making a narrow point that the article didn’t teach us this and it actually was essential to understanding the claim. Quoting the article again:
> I won’t go into the M/M/1 part, as it doesn’t really end up affecting the Important Thing I’m telling you about. What does matter, however, is that ‘∞’.
This turned out to be wrong. It does end up affecting the Important Thing and the discussion was severely incomplete without it.
The steady state is like Big-O notation; it's already talking about the limit (as time grows to infinity). A single day's deviation doesn't really matter — your queue might shrink and grow a little day-to-day, but on average it'll grow more than it shrinks, and you're ultimately fucked.
If your utilization looked like 100% on one day, 0% on another, and flipped back and forth (allowing the queue to clear), then by definition your steady state isn't 100%; it's more like 50%.
And of course if your time bound is limited, or your pool of potential customers is limited, then naturally everything is limited. But like big-O, the notion is to describe what happens as your input grows to infinity.
Yes, that is a very basic point, one basic reason to do queuing is that queues are buffers for uncertainty. If things are perfectly predictable, as you see on some robots, you allocate exactly what you need and no extra.
The uncertainty exists in two places, though... It's not just the events coming in; the time to process each queued event is another source of uncertainty in the same system, and it can have the same effect.
The author mentions “M/M/1”; this specifies that there is only one queue-handling process, that events come in with the sort of randomness that raindrops have (no periodicity), and that handling an event has neither “timeouts” nor “long tails”, where once a handler has been occupied for 3 or 4 T (where T is the average time to resolve) the prediction for how much time remains either drops to 0 (timeout) or escalates dramatically (long tail)... The exponential distribution is used, kind of to be nicely in between these: it is a model where, over a time dt, the handler is asked to pick a random number with success rate p = dt/T; if they pick the right random number the event is handled, otherwise a new random number is generated, repeat ad nauseam.
The basic lessons are surprisingly transferable though. So for example traffic on the highway does not have this raindrop sort of randomness because cars kind of repel each other, they space out much more uniformly than you would have predicted on short term scales... The handling time is also nothing like this model of random guessing. Nevertheless, traffic jams still form at high utilization! And they form for a really interesting reason which is that the n handlers—lanes, we call them—have correlations between them. All these queuing models optimistically assume that a handler never stops another handler, “hey have you ever seen anything like this? where can I find that password?” etc.
But in traffic, a car can only change from one lane to another if both lanes synchronize speed. With a fixed ability to accelerate, this depends on the amount of space between you and the next car and the amount of space in the lane you want to switch to: as utilization increases, both of these spaces shrink and synchronization becomes a massive source of inefficiency.
The same happens at a company, too: an employee might spend more time waiting for meetings, and finding convenient times to schedule them, in order to synchronize two streams of work, than the work itself takes. Can be very frustrating!
It does not immediately grow towards infinity, but it is unbounded. Over a sufficiently long period of time, any given queue size will be observed.
Assuming the input and output are just a little bit random, and are only equal on average: you will have periods during which the input and output balance out. You will have periods during which the output is faster, and the queue will shrink to 0 (but no lower). And you will have periods where the output is slower, and the queue grows, with no bound.
This is wonderful and shows the problem very clearly.
If your system doesn't have a way to shed load, it will eventually overload.
The problem for many is that when it does finally shed load, the load shedder gets blamed (and turned off). See how many people look for ways to turn off the OOMKiller and how few look to figure out how to get more RAM.
> All of a sudden, the buffers, queues, whatever, can't deal with it anymore. You're in a critical state where you can see smoke rising from your servers, or if in the cloud, things are as bad as usual, but more!
I work in a custom fabrication environment, and this solution doesn't really apply, because in a dynamic system the bottleneck shifts all around the shop. It's never just one operation that is the constant bottleneck.
How doesn't it? I've lived in a world of coordinated microservices before, where a nice linear progression of queues only happens at the atomic level; when you zoom out you've got a digraph full of cycles, nodes that come & go in response to various events & schedules, and the transitory nature of those services entering could cause a smooth-sailing pipeline to suddenly seize up because some downstream dependency is servicing higher-priority requests.
The only way we could stay sane was aggressively following Ferd's rule – nobody got to surface an API unless it surfaced HTTP 429s on overload[1], and if you called a downstream API without checking for and responding to a 429, tough noogies, your requests got dropped and you had to figure out what to do to fix it.
Funny enough, at the time I did software for a manufacturing facility that did high-throughput manufacturing for many varying products. I was lucky enough to do a "gemba walk" where they made us engineers put on hard-hats and eye protection and reminded us who actually made the company money, and by god those guys were doing exactly what we did: they had JIT delivery of raw materials / sub-components to assembly lines, and the delivery of resources to the line was always constrained by a fixed buffer, quite literally "how many units you could put on a shelf". You couldn't just toss more product at a line that wasn't ready for it, it would literally not fit.
Sure, they used different terms than we did – muda and muri and value-stream mapping – but the outcome was very similar. Their version of the NOC was a team of managers who watched the cycle time of each line, how long it took material to be transported, which areas were bottlenecking and which were at risk of running into resource starvation, and they had the ability to balance the hot spots by bursting product in, caching upstream components into hold docks, releasing held work product to downstream lines, or even turning on a flex line to work around longer-term constraints. They rarely had fixed, static bottlenecks, because they optimized the life out of them. Their bottlenecks arose when a shipment didn't arrive (and you better believe those were tracked just as aggressively), a machine broke down, an operator got sick, things like that.
But at the end of the day? Same principle: downstream components would only accept as much as they could process from upstream components; you asked before you pushed.
[1]: and if you said "I swear, you can't overload this service, we don't need to send 429s", a swarm of senior engineers descended upon you and only left once you convinced them they were wrong, or – way more likely – they convinced you.
Manufacturing and fabrication are two different worlds. Manufacturing is all about reducing/eliminating variance, whereas fabrication (high-mix, low-volume) is all about absorbing variance. Most of the things we build have never been built before, and most of them will never be built again. So the estimate of how long it will take to build one is just that: an estimate. And the process of fabricating it is affected by MANY variables, but I'll just list a few. Materials with long lead times. Materials not arriving on time. Wrong materials sent. A Pareto distribution of the time it will take to finish an operation depending on which employee I give it to. What else is in the queue for an operation. Whether customers put a hold on the part. And so on, and so on.
Every manufacturing person I've met thinks we're just idiots. Then they come to our shop and it takes them about 1-2 years of trying to "we just need to implement Lean, guys!" before they realize that fabrication has very little, if anything, in common with manufacturing and the only lean principle that really applies is keeping your work area tidy and your tools organized.
Our business model is built around our ability to absorb and manage variance, rather than eliminate it.
I agree that software and fabrication have much more overlap with regard to challenges faced than software and manufacturing. Unfortunately, manufacturing gets most of the academic attention, so job shop scheduling has basically stagnated since the 70s, when academia started becoming aware of Toyota's innovations in manufacturing. Some work has been done using genetic algorithms to try to tackle the subject, but nothing substantive has taken hold in the industry.
That being said, I've found a few resources that help you start to get a handle on fabrication (and software development) and how to manage it. Theory of Constraints by Goldratt, and its more executive-friendly companion volume The Goal, are a good place to start, but lacking in detail.
Understanding Variation by Donald J. Wheeler has proven to be a very useful resource.
It's About Time: The Competitive Advantage of Quick Response Manufacturing by Rajan Suri is the executive companion to the book Quick Response Manufacturing: A Company-Wide Approach to Reducing Lead Times by the same author. Mr. Suri uses a system-dynamics approach to manage capacity and batch sizing, which his research revealed as the keys to optimal lead-time reductions, and thereby overall throughput.
The other interesting thing we've found by analyzing our company's data is that the time it takes to ship a particular piece (measured in days) falls into a long, fat-tailed distribution regardless of the scale we analyze. In other words, whether we look at one employee's output, one team's output, one shop's output, or the company as a whole, the time to ship follows a long, fat-tailed distribution.
Because of this, we basically manage by how long something has been in a particular department. The longer it's been sitting, the higher its priority becomes. What we're trying to do is basically manage the exceptions and let the system manage those that are in the normal bell curve.
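A tiny sketch of that aging rule (the field names are hypothetical): the job that has been sitting longest in its current department is the next one to get attention.

    import time

    def next_job(jobs):
        """jobs: list of (entered_department_at, job); pick the longest-waiting one."""
        if not jobs:
            return None
        now = time.time()
        return max(jobs, key=lambda entry: now - entry[0])[1]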
Hope this helps. Feel free to ask further questions. I find the topic fascinating and not many people share this interest.
After I finished my comment, a thought occurred to me that I wanted to share. One big difference between fabrication and (most) software projects is scope creep. We manage scope quite fiercely, and the weapon we wield is the change order. We are in the fortunate contractual position of being able to charge our customers when they change/add scope.
One of the biggest challenges I've seen (as someone outside the industry, but familiar with it) in software development is feature/scope creep. That seems to really hobble a lot of projects. Star Citizen comes immediately to mind as an outlying, but illustrative example.
We don't really have a lot of feature/scope creep in most of our projects. We do, however, have shared resources between simultaneous projects, which poses a similar challenge.
Yes, we have much more in common with software development than we do with manufacturing. We have more stages of production than software, and we have a challenging situation of shared resources between projects (we are generally working anywhere from 25-50 projects at once in our shop using shared resources), but the variance factor of 1) not knowing exactly how long it will take to make something because it's never been made before, and 2) different employees taking vastly different amounts of time to accomplish the same task are both shared challenges that we face along with software development.
Anything made from steel. Everything from small machined parts all the way up to the swing arms on electric mining shovels. We have some parts we build that require two 60 ton cranes to lift. Everything in between, mostly for the energy industry.
A lot of frameworks have queues bounded only by available memory. As the article nicely demonstrates, that implies the CPU must be idle at least ~20% of the time for latency to have reasonable bounds. With various GUI event queues that is OK, but it puzzled me why, for example, Erlang opted for unbounded message queues. Did the Erlang designers not care about latency?
In Erlang it is standard to have a 5-second timeout on queries. So when many processes send to one process and it gets overwhelmed, the other processes start getting timeouts, and typically they have some backoff strategy. Essentially, queues are bounded not by size but by time. Most processes don't send messages in a fire-and-forget manner, but wait for confirmation of execution.
Erlang's designers certainly cared about latency, but a bounded message queue changes local messaging from reliable to unreliable, and remote messaging from mostly reliable (as long as the nodes don't disconnect) to unreliable. That's a big change and they were unwilling to make it for a long time, although I think there is an option to put bounds on the message queue now. It was possible to have bounds before, either by brute force (looking at the queue length for all processes and killing them) or by looking at the queue length within the process and discarding requests if there are too many.
Also, for a long time, sending messages to a process with a long queue might be deprioritized (suspended), as a form of backpressure, although that got removed for local processes because it wasn't effective with SMP, if I'm remembering properly.
> but once you understand it, you’ll have deeper insight into the behavior not just of CPUs and database thread pools, but also grocery store checkout lines, ticket queues, highways – really just a mind-blowing collection of systems.
...and karaoke singer rotations. In 6 years of frequently singing karaoke at bars, I've never known a host to turn away new singers, unless it's almost time to end the show. So you could say the queue is unbounded, with predictable results for the few singers who show up early and then get frustrated with the ever-increasing time until their next turn as more singers arrive. I don't know what the best solution is.
That's the same load shedding as grocery stores use - if everything gets too crowded people start leaving (not queueing).
Now that may actually be suboptimal for the business (if they can only let 10 people in, they'd rather let in the 10 who will spend the most, say), which is why things like reservations, etc. come into play. I wonder if restaurants that do both reservations and a queue see one group pay more on average ...
> That's the same load shedding as grocery stores use - if everything gets too crowded people start leaving (not queueing).
Yes. That's called the "rejection rate".
Unless, some of the time, queue length is zero, you will have a nonzero rejection rate. This is worth bringing up with store managers who want to run understaffed checkouts. One of the things retail consultants do is point out how sales are being lost that way, both in customers who leave and customers who never come back.
Much of my early work on network congestion was based on that. In the early days of networking, everyone was thinking Poisson arrivals, where arrivals are unaffected by queue lengths. This is partly because the original analysis for the ARPANET, by Leonard Kleinrock, was done that way. It was done that way because his PhD thesis was based on analyzing Western Union Plan 55-A, which handled telegrams. (Think of Plan 55-A as a network of Sendmail servers, but with queues made from paper tape punches feeding paper tape readers. The queue was a bin between punch and reader; see [1], at 7:00.) Queue length was invisible to people sending telegrams, so senders did behave like Poisson arrivals. That's still true of email today.
The IP layer is open loop with rejection. Transport protocols such as TCP are closed loop systems. The two have to be considered together. Everybody gets this now, but it was a radical idea in 1985.[2]
This is a big part of self checkout - it reduces rejection rate.
And that can be done with queues in general: if alternate paths are available, at some point intelligent operators will change paths (this can also be problematic when the option to change isn't available; I'd normally like to connect to a server near me, but if it is overloaded, give me the option to take a farther one).
I've never managed a restaurant, but thinking about my own habits I'd bet that a restaurant that takes reservations but also some walk-ins almost surely makes more on the reservation guests. The meal is the main attraction for them, often a special occasion, while walk-ins are more likely to spread their money over other businesses (shopping before, drinks elsewhere after, etc.). I bet group size is also higher for reservations; most people know a reservation is all-but-required for larger groups.
Focus on the bottleneck. There is a finite amount of time. Increase it to allow more participants.
There's also putting restrictions on the input. Have a maximum song length.
Otherwise, require additional constraints before queueing like making it computationally expensive or having a priority system based on criteria. Charge a cover (with increasing costs) or have a membership club that gives more/better access.
If some logic re-orders a job queue in order to group similar things, improving cache hit ratios, then a fairly large queue can be quite effective (especially if there is no priority or barrier; in that case, though, the cost of high throughput is a high maximum latency).
Exactly. In fact I'd go so far as to say this is an equally Important Thing to the Important Thing in the article. If your queue is just a simple FIFO / LIFO / whatever to buffer surges in demand until you get to them, then you absolutely get the "queue suddenly goes to infinity as you hit a critical point near maximum throughput" thing happening.
If, however, you use the queue as a way to process queries more efficiently then that's a different story.
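For instance, a sketch of that kind of re-ordering, grouping queued jobs by a (hypothetical) cache-relevant key so similar work runs back to back, trading worst-case latency for throughput:

    from collections import defaultdict

    def process(job):
        print("processing", job)   # stand-in for the real handler

    def drain_in_batches(pending_jobs, key=lambda job: job["table"]):
        """Group queued jobs by a cache-friendly key (the 'table' field is made up here)."""
        batches = defaultdict(list)
        for job in pending_jobs:
            batches[key(job)].append(job)
        for batch in batches.values():
            for job in batch:      # same-key jobs run together, so hot state stays cached
                process(job)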
> If the processor is always busy, that means there’s never even an instant when a newly arriving task can be assigned immediately to the processor. That means that, whenever a task finishes processing, there must always be a task waiting in the queue; otherwise the processor would have an idle moment. And by induction, we can show that a system that’s forever at 100% utilization will exceed any finite queue size you care to pick:
This argument is incorrect because it proves too much. If tasks arrive predictably, say at the rate of one per minute, and the processor completes tasks at the same rate, you can have maximum throughput and a finite queue -- indeed, the queue never has to exceed two waiting tasks. The unbounded queue growth problem only arises when tasks arrive unpredictably. Since this argument proves the same conclusion without regard to predictability, it's incorrect.
Here's how I think it actually works: if tasks randomly come in and out at the same rate, the size of the queue over time can be modeled as an unbiased random walk with a floor at zero. Elementary techniques prove that such a random walk will hit zero in finite time, and will therefore hit zero infinitely many times. These correspond to the idle times of the processor. The only way to avoid hitting zero over and over is if tasks come in faster than they come out, which leads to tasks piling up over time.
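You can watch that random walk with a few lines of simulation (step count and seed are arbitrary): the walk keeps returning to zero, yet its running maximum keeps growing the longer you run it.

    import random

    def queue_random_walk(steps=1_000_000, seed=0):
        """Queue length as an unbiased random walk with a floor at zero:
        at each step an arrival or a completion is equally likely."""
        random.seed(seed)
        size, zero_hits, peak = 0, 0, 0
        for _ in range(steps):
            size = max(0, size + random.choice((+1, -1)))
            zero_hits += (size == 0)   # idle moments: the processor catches up
            peak = max(peak, size)
        return zero_hits, peak

    print(queue_random_walk())  # many returns to zero, but the peak grows roughly like sqrt(steps)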
I think you are putting too many fancy words into the problem.
Queues are balloons. If you put more air in than you are taking out, they grow until they pop. As utilisation grows (ie as the input volume of air starts to match the output) it takes much longer to get rid of the backlog.
This means that any traffic anomalies that would otherwise be hidden by the queue are instantly amplified.
If all you care about is engineering intuition, you can use any number of metaphors. I'm talking about what it takes to understand queues mathematically and prove correct results about them, rather than incorrect results.
> I won’t go into the M/M/1 part, as it doesn’t really end up affecting the Important Thing I’m telling you about. What does matter, however, is that ‘∞’.
I'm really no expert, but if I understand it correctly the `M/M` part defines the distributions of the arrival and service processes, so this definitely is important, as others have already mentioned in the comments. E.g. a D/D/1 queue, where D stands for deterministic arrivals, shouldn't suffer from this problem.
This doesn't change the interesting fact the article presents but throwing in technical terms without proper explanation is imo bad style. Either don't mention it (would be better in this case) or explain it and why it is relevant.
This is also a common "mistake" inexperienced people make when writing papers. They mention a fact that is somehow related but completely irrelevant to the topic and the rest of the paper. I don't want to assume anything, but to me this often smells like showing off.
This is unworkable if you are actively scaling your system. Am I supposed to recalculate the ideal queue size with each scale-out of my data platform?
Instead, the right way to think about limiting queue size is load shedding when you feel back pressure.
Here's an example at Cronitor: if our main SQS ingestion queue backs up, our pipeline will automatically move from streaming to micro-batching, drastically reducing the number of messages on the queue at the expense of slightly increased latency per message. At the same time, a less critical piece of our infrastructure pauses itself until the queue is healthy again, shedding one of the largest sources of DB load and giving that capacity to ingestion.
To me the goal is to feel and respond to back pressure before blowing up and rejecting messages.
And what if items are processed at exactly the same rate they are added? Something is missing from this thesis. There must be an additional assumption at play beyond what is stated in the article.
The assumption you're looking for is there but relatively inconspicuous. He mentioned a M/M/1/∞ queue but then didn't go into the details about the M/M/1 part. From the Wikipedia page[0]:
> an M/M/1 queue represents the queue length in a system having a single server, where arrivals are determined by a Poisson process...
So the arrival times are not arriving at any fixed rate, but according to a Poisson distribution. In the article he did reference this fact right before the first plot:
> (For the wonks: I used a Poisson arrival process and exponentially distributed processing times)
And then finally, he answered this question more directly in the comments to the article:
> What’s to prevent the system from bouncing between “1 task in processor & 1 task in queue” and “1 task in processor & 2 tasks in queue” while maintaining 100% utilization?
> Nothing! That could totally happen in a queueing system. However, the arrival process would need to be tuned quite precisely to the processing rate. You would need to watch the status of the processor and, when a task finishes, only then insert a new task into the queue. But this implies a shell game: if you always have a task ready to put into the queue when it needs to be padded back out, then isn’t that task in a sense already “queued”?
> Instead, in queueing theory we usually assume a random arrival process: sometimes a minute may pass between arrivals into the queue; sometimes only a second. So the system can’t bounce for arbitrarily long between 1-in-queue and 2-in-queue states. Eventually, one of two things will randomly occur:
> 1. From the 1-in-queue state, the active task finishes processing before a new task arrives, bringing queue size to 0.
> 2. From the 2-in-queue state, a new task arrives in the queue before the active task finishes, causing the queue size to grow to 3.
Of course I know what was missing. My comment was more about making the implicit more explicit. I am not the only one who immediately thought of the above as a counter-example to the bold text (repeated twice) in the article.
It's not just the arrivals being a Poisson process; the processing times are also random (exponentially distributed in this model), and that really sets the table for the scenario the article is addressing. For me, it's easier to understand in a grocery store situation, where customers arrive at the checkout line as a Poisson process, and each cart takes a random amount of time to process. That second level of variance is key to the queue buildup, especially as the grocery checker reaches 100% utilization.
Theoretically I think that's possible. In a practical setting it is not, since task arrival times follow a probability distribution; this means that utilization is not 100% and therefore queues do not grow to infinity. Given the author's point about ~80% and above being bad, I imagine the immediate-processing scenario is still pathological.
Perfect time to ask: I've been toying with using the Netflix concurrency-limits library for a while, as opposed to manually tuning threadpools, queue depths, etc. to achieve good utilization at a given latency. Curious if others have experience with it, and their thoughts:
https://github.com/Netflix/concurrency-limits
What the naysayers are missing is the definition of load; read it again: "[C]apacity is the number of tasks per unit time that can be processed."
But tasks are variable. One task requires 5 minutes of processing, another 35. They arrive randomly. This is also explicitly given, in a caption under one of the diagrams: "A queueing system with a single queue and a single processor. Tasks (yellow circles) arrive at random intervals and take different amounts of time to process."
People calling the article wrong may be thinking of capacity as "unit time amount of work"; like when the processor is at full capacity, it's doing one second's worth of work every second.
If we define capacity this way, the problem goes away: the processor just becomes a leaky bucket. That is to say, if we know exactly how long each task will take, then we can measure the queue size in terms of the total number of seconds of work in the queue. Then, as long as no more than one second's worth of work is added to the queue per second, it will not grow without bound, just like a leaky bucket that is not being refilled faster than its leak rate.
When capacity is given as a maximum number of tasks per second, there has to be some underlying justification for that, like there is some fixed part to servicing a job such as set up time and clean up time that doesn't go away even if the job takes next to zero seconds, such that jobs effectively have a built-in minimum duration. If it takes one second to set up a job, and one second to clean up, then the maximum capacity is half a job per second: 1800 jobs per hour and so on. Of course the queue starts to backlog when you approach capacity, because the jobs also require nonzero real work in relation to the administrative time.
If jobs have no minimum fixed cost attached, then the maximum job rate is unbounded: the shorter the jobs being queued, the more of them can be done per unit time: one million one-microsecond jobs can be done in a second, or a billion one-nanosecond jobs, and so on.
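A toy illustration of that leaky-bucket framing, measuring the backlog in seconds of work rather than task count; it's fully deterministic, i.e. the D/D/1-like case rather than the article's M/M/1:

    # Exactly one second of work arrives per second and one second's worth drains per
    # second, so the backlog stays flat instead of growing without bound.
    backlog_s = 0.0
    for second in range(10):
        backlog_s += 1.0                        # one second's worth of work arrives...
        backlog_s = max(0.0, backlog_s - 1.0)   # ...and the processor drains one second's worth
        print(second, backlog_s)                # always 0.0: no randomness, no buildup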
If you are interested in modelling microservices, service busses, and pubsub topologies, you can do some rough models of queues in python with this: queueing-tool.readthedocs.io/
What are the underlying assumptions we can break here?
For example, what if tasks were not monolithic? As the queue size grows, increase the caching timeout and don't hit that DB, or decrease timeout, or something like that.
Or, what if the input task variance were bounded and we just initialized the queue with 10 tasks? This way, the queue would never dip below 1 and would never exceed 20 (for example).
Processing capacity might not be constant. If you're in a cloud, maybe you can launch more processing power as queue length increases.
Multiple items in the queue might be satisfiable by the same processing, e.g. multiple requests for the same item. In that case, having more requests in queue can increase processing efficiency.
Edit: another one. All requests may not need the same quality of service. For some, best effort might be acceptable.
I wonder if it might also be possible to introduce a kind of tiered storage, where, for example, the queue starts persisting to cheap, massive block storage (such as cloud block storage) instead of its usual mechanism. That does imply tier-2 events would read back slower once the busy period ceased, though.
> It has a completely opposite pattern: the more you load it the faster it goes.
Not sure what you mean by that. The article is about the general theoretical properties of (unbounded) queues.
LMAX is a bounded queue, with the quirk that it will drop older messages in favour of newer ones and it assumes that either there is a side channel for recovery or consumer(s) can tolerate message drops.
You're right, that was a rough take. My presentation of LMAX was to simply provide a different perspective over this space.
What I was trying to get at is the specific memory access patterns, batching effects and other nuances related to the physical construction of the CPU which modulate the actual performance of these things. I think this quote better summarizes what I was trying to convey:
> When consumers are waiting on an advancing cursor sequence in the ring buffer an interesting opportunity arises that is not possible with queues. If the consumer finds the ring buffer cursor has advanced a number of steps since it last checked it can process up to that sequence without getting involved in the concurrency mechanisms. This results in the lagging consumer quickly regaining pace with the producers when the producers burst ahead thus balancing the system. This type of batching increases throughput while reducing and smoothing latency at the same time. Based on our observations, this effect results in a close to constant time for latency regardless of load, up until the memory sub-system is saturated, and then the profile is linear following Little’s Law. This is very different to the “J” curve effect on latency we have observed with queues as load increases.
Shape Up [1] (from basecamp) rejects the notion of backlogs essentially for this reason. We've leaned into adopting it without using basecamp software specifically and it's pretty refreshing.
What do you do when the queue overflows? Block work on backlog capacity? Drop the new item on the floor?
The problem is that people add things to the backlog during their ordinary work, so as you scale up the number of workers, the pressure on the queue increases.
That said, my experience in doing bug triage is that some large percentage of tickets are never going to be done. Those should just be dropped on the floor.
Yeah, you drop the old ones. And then the insight is that nobody ever has time to process the backlog anyway, so just drop the new ones instead and delete the backlog. If the work is important, it will come back up as a new request during cycle planning.
A large list of books (free and online), I cannot speak to the quality of any particular book but you can take some of the titles and authors and perhaps find reviews.
Linked at the top of the first page I shared, but not available online (though a few are available through some services like O'Reilly's digital library).
To put numbers on your viewpoint: if there are five checkout lines at the grocery store, your odds of picking the fastest line are 1/5 (20%). The more grocery store lines there are, the lower your odds of picking the fastest line.
They added "express lanes" back in the 80s to somewhat address this, but then you have the person who ignores the sign and gets in the line with a full cart.
I just finished implementing a simple backpressure algorithm. Once the queue is more than a certain percentage full, it rejects new items. These rejections get turned into HTTP 503s.
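For readers who want to picture it, a minimal sketch of that kind of thresholded rejection (the queue size, threshold, and responses are illustrative, not the actual implementation):

    import queue

    REQUEST_QUEUE = queue.Queue(maxsize=1000)
    REJECT_THRESHOLD = 0.8   # start shedding at 80% full (made-up number)

    def try_accept(request):
        """Admit a request only while the queue is below the threshold; otherwise signal 503."""
        if REQUEST_QUEUE.qsize() >= REJECT_THRESHOLD * REQUEST_QUEUE.maxsize:
            return 503, "queue saturated, try again later"
        try:
            REQUEST_QUEUE.put_nowait(request)
            return 202, "accepted"
        except queue.Full:   # race between the qsize() check and put; still reject cleanly
            return 503, "queue saturated, try again later"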
What is the purpose of cutting off the queue at a n% full? Isn't that effectively saying that the queue size is actually n% of the stated size, or am I missing detail?
Perhaps "rejects new items" is the key - a web server that hits a certain percentage utilization only allow existing clients/connections/logged in users/whatever to connect; refuse new connections.
Haven't you just moved the pressure back a step? Now you have a bunch of items retrying to get their stuff on the queue and a bunch of places to try and locate the problem, instead of an overly large queue. Is the idea that the messages aren't important enough to be placed on the queue in the first place if demand is high?
The idea is that it's not a sudden hard failure. Instead only a few things fail which helps in two ways: reducing the load by discarding some work and alerting that the system is beginning to be overloaded.
I think 429 is the preferred status code for this. Of course you can use whatever you want, but 503 can presumably come up in other scenarios and it's nice to have something unambiguous.
Messaging middleware such as queues is mostly redundant. Simplistically -
Client --
If you cache a new submission (message) locally before submitting, you just keep re-submitting until the server returns an ACK for that submission; and your local cache is empty.
Scale clients out as needed.
Server --
If a message is received in good order, return an ACK.
If a message is a duplicate, discard the message and return an ACK.
If a message can only be received once, discard if it already exists on the server, and return an ACK.
If a message is invalid, return a FAIL.
Scale hardware out or up, as needed (ref. "capacity", item 2 in the linked blog post above). Scale queues out by adding service endpoints on the server. Async/await makes the client experience painless. You save your employer $$ because there are no additional procurement contracts, no licensing fees, no additional server-side infrastructure to run queues, and no consultancy fees to set up and operate RabbitMQ/Tibco/IBM MQ/Amazon SQS/Azure Queue Storage or whatever other queueing solution the org uses.
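A rough sketch of the client side of that scheme, under the stated assumptions ('send', the reply strings, and the backoff are all hypothetical):

    import time
    import uuid

    def submit_with_ack(send, payload, max_attempts=5):
        """Keep re-submitting a locally cached message until the server ACKs it.
        'send' is a hypothetical callable returning "ACK", "DUP", or "FAIL"."""
        message = {"id": str(uuid.uuid4()), "body": payload}  # stable id makes retries idempotent
        for attempt in range(max_attempts):
            reply = send(message)
            if reply in ("ACK", "DUP"):   # duplicates are discarded server-side and still ACKed
                return True               # safe to clear it from the local cache
            time.sleep(2 ** attempt)      # back off before re-submitting
        return False                      # keep it cached locally and try again later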
Message passing includes concepts such as durability, security policies, message filtering, delivery policies, routing policies, batching, and so on. The above can support all of that and, if your scenario calls for it, none of it.
The financial argument reduces to dev time vs. procurement, deployment and operational costs of whatever 3rd party product is used, as well as integration, of course.
* I note and welcome the downvotes. However I'd love it more if you'd present a coherent counter argument with your vote.
> If you cache a new submission (message) locally before submitting, you just keep re-submitting until the server returns an ACK for that submission; and your local cache is empty.
This has terrible resource use and offers no visibility into how many clients are waiting. And yet it's still a queue. Why would anyone do that?
I think you have defined "queue" too narrowly in the context of the OP article. MQs are one thing, but the article is about queues as data structures.
A directory of files to be processed may be treated as a queue. Add several directories for different stages of processing, and you have a rudimentary state machine.
Distributed systems in particular may benefit from an MQ, but one is by no means necessary.
Generally, when we add an MQ we are really regulating an already existing implicit queue. It's such a common and intuitive data structure that one may easily create one without even realizing it.