The way I see it is that there are two opposing things you can optimise for that depend on queue depth: utilisation or latency.
If you care about processing each message as quickly as possible then queues should be empty. This often requires a higher cost and lower utilisation as you inevitably have idle workers waiting for new messages.
If you care about utilisation, then you never want your queue to be empty while something is polling it. This might be some background task that runs on a GPU - every second that sits idle is wasted cash, and those tasks usually benefit heavily from batching inputs together.
In this case you want it to always have something to read from the queue and shut it down the moment this isn’t the case.
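To make that concrete, here is a minimal sketch of the second case (hypothetical names; Python's in-process `queue.Queue` stands in for whatever broker you actually use): the consumer greedily batches whatever is available and shuts itself down once it has been idle for a while.

```python
import time
from queue import Queue, Empty

def run_batch_consumer(q: Queue, process_batch, max_batch=32, idle_timeout=5.0):
    """Drain the queue in batches; return (so the worker can be shut down)
    once the queue has stayed empty for idle_timeout seconds."""
    last_work = time.monotonic()
    while True:
        batch = []
        try:
            # Wait briefly for the first item, then greedily fill the batch.
            batch.append(q.get(timeout=1.0))
            while len(batch) < max_batch:
                batch.append(q.get_nowait())
        except Empty:
            pass
        if batch:
            process_batch(batch)   # e.g. one GPU inference call for the whole batch
            last_work = time.monotonic()
        elif time.monotonic() - last_work > idle_timeout:
            return                 # idle long enough: stop paying for the worker
```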
> If a queue isn't approaching empty, then its input is exceeding its output. Therefore its size is trending to infinity.
This is literally the slippery slope fallacy. You aren't accounting for minor fluctuations or finite timescales. If you were driving a car, you might say "if you aren't steering into oncoming traffic then you're steering towards the ditch" and conclude that no cars will ever safely reach their destination.
If the only valid state for a queue is empty, then why waste time implementing a queue?
It assumes the velocity of items being added to the queue never changes. In real life it speeds up (approaching infinite items), and it slows down (approaching empty). A more meaningful measurement would be something like average number of items in the queue and average time in the queue for any given item.
If you feel like you have enough spare capacity and any given item isn’t taking too long to process then it doesn’t matter if you are ever empty.
You're on the right track, but I personally find averages to be not a useful measure...on average.
Corner conditions are where I start to care:
* What is the worst case latency? How long until the work can be done? How long until the work can finish from the point at which it enters the queue?
* What is the worst case number of items in the queue? How large does the queue need to be?
* What is the maximum utilization over some time period with some granularity? How spiky is the workload? A queue is a rate-matching tool. If your input rate exactly equals your output rate at all times, you don't need a queue.
* What is the minimum utilization over some time period? Another view on how spiky the workload is.
Minimums and maximums I find are much more illustrative than averages or even mean-squared-errors. Minimums and maximums bound performance; an average doesn't tell you where your boundary conditions are.
In general you don't want your queue to fully fill up, but like the other poster said, it's some tradeoff between utilization and latency, and the two are diametrically opposed.
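For what it's worth, the windowed extremes described above are cheap to compute if you already sample queue depth and worker counts periodically. A rough sketch (the field names are made up):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ts: float        # sample timestamp, seconds
    depth: int       # queue depth at that instant
    busy: int        # workers busy at that instant
    total: int       # workers provisioned at that instant

def window_extremes(samples, window=60.0):
    """Group periodic samples into fixed windows and report the extremes
    (worst-case depth, min/max utilization) rather than averages."""
    buckets = {}
    for s in samples:
        buckets.setdefault(int(s.ts // window), []).append(s)
    out = []
    for key in sorted(buckets):
        group = buckets[key]
        utils = [s.busy / s.total for s in group if s.total]
        out.append({
            "window_start": key * window,
            "max_depth": max(s.depth for s in group),
            "min_util": min(utils) if utils else 0.0,
            "max_util": max(utils) if utils else 0.0,
        })
    return out
```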
Sure, I guess by average I just meant shorthand for some sort of measurement for making decisions about resources allocated to your queues. Median, mean, P99, whatever is useful.
We use queues for various jobs, some of which are time-sensitive (e.g. emails), so the queue should be empty most of the time, some of which are background or re-processing (e.g. re-indexing), so queues can balloon and then take days to drain again.
The unifying aspects of using queues for us is that it allows us to load-balance jobs across workers, and allows us to monitor throughput in a centralised fashion. We can set alerts on the time-sensitive queues and then use different thresholds for the background queues, but we're using the same metric data and same alerting system.
M/M/1 Queue theory doesn't apply to real queues... Because the work input into the queue usually does depend on the latency of previous work completed by the worker.
If there is a 1 hour queue for the checkout at a grocery store, you probably won't join the queue - you'll go to a different store...
Little's law[1] offers considerable guidance here. In the shortish term it's safe to model the system as being stationary, and then you need to provision for seasonality[2]. Upward or downward trending demand after correcting for seasonality indicates that the system isn't stationary and your choices are either to scale up or reject requests. Nevertheless the optimal queue size is not necessarily zero except in cases where extremely low latency is necessary. The optimal queue capacity for most web service type applications is one that trends toward almost but not quite full when seasonal requests are higher.
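For reference, Little's law ties the long-run averages together regardless of the arrival distribution:

```latex
L = \lambda W
% e.g. \lambda = 50 requests/s arriving, W = 0.2 s average time in the system
%   => L = 10 requests resident in the system on average
```

So for a fixed arrival rate, any latency target directly implies how large the backlog is allowed to get.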
> Since we don't have infinite space, we can expect eventually to lose some messages in this scenario.
Sure, that's technically correct but applies to basically everything. It is very likely your users/orders/whatever table in your database is also "trending to infinity" over time, except this doesn't mean databases should always be empty or that we should expect to eventually lose some users/orders/whatever.
Or, more succinctly, if your queue capacity is 10 million messages and your queue messages represent "houses purchased through our website", then in reality your queue capacity is infinite because nobody will ever purchase 10 million homes through your application per small unit of time.
Your response is correct, but there's a more fundamental flaw in the preceding comment: although both paragraphs are true in isolation, the implication between them is false. "The queue is _currently_ trending to infinity" does not imply "The queue will _continue_ to monotonically trend to infinity". Periodicity is commonplace in real-life situations; and while parts of queueing theory produce helpful insights by ignoring it, care should be taken when applying them.
If you're just making people wait 12 hours, or five minutes, or whatever - you aren't saving on compute. You're just adding a buffer. Maybe that is what you want, but it should be intentional.
This is usually the approach of people with fixed compute. If I have a VPS then I'll let it churn through the backlog overnight.
But this is also where the disagreement comes in with flexible serverless workers. They generally do not cost more in total: you end up spending $24 in one hour rather than $1 an hour for 24 hours. If the serverless option costs more, then you are likely not working through your backlog.
I agree it should be a business decision, so you catch up on weekends instead for example, but non-technical folks don't always grasp those numbers and that they are getting a lag with their data.
It is a little counterintuitive, but "churn through overnight" still means you're decreasing the queue size. You're right that the cost really ends up being roughly the same, unless your service gives you a discount for compute overnight or something.
Completely agree that the queue still goes down, I just see comments on here about folks preferring a VPS because they run it 24/7 for the same price. That is probably fine for their scenario, but the non-technical members probably don't understand that they aren't saving money by doing it this way but are adding a delay. That is my only point.
Exactly. I think a lot of people in this thread don't understand they "aren't saving money by doing it this way but are adding a delay" with their complicated auto scaling worker setup lol
> If a queue isn't approaching empty, then its input is exceeding its output.
It depends what you mean by "approaching empty". If you mean "empty most of the time", then no, that's wasted resources. If you mean "its size is decreasing most of the time", then no, not possible. Even if you mean "it is empty some of the time", then maybe, but that is not a strong requirement either.
Of course more average input than average output is bad, but stating it as "should be empty" or "should be approaching empty" seems to suggest the wrong thing entirely. The delay it adds should be smaller than the target latency of the process.
Eh? It is possible for size to be decreasing most of the time. Just make traffic extremely bursty - add 60 on the hour every hour, and consume at one per minute.
What does average mean there? Increasing and decreasing are instantaneous events. If you mean the 1-minute average can be negative most of the time, that is true, but it would not work for other time horizons, whether larger or shorter.
The number of items in a queue is either growing or shrinking; there isn't really a concept of "staying the same", and as you approach the line from shrinking to growing, performance (queue time) degrades after roughly 70-75% utilization.
You can only assert that it's "shrinking" or "growing" over a given period of time. It may have been shrinking over the last 5min while at the same time it has grown over the last hour. For a time window small enough, it has not changed, because getting 1 input or 1 output is an instantaneous event. For a time window large enough, it has necessarily grown, if it started empty.
So I disagree entirely with "it's either shrinking or growing". It is both growing, shrinking, and stable, at the same time, over different horizons, and I have no idea what law you are trying to state.
The link you provided above is a really good primer based on serious math, I don't understand how you think it supports your vague claims.
I think you should read up on this a little bit as your assumptions are not totally correct, and if you're using queues understanding this theory is pretty important and will save you some headaches.
Queues are either growing (adding items) or getting smaller (consuming items). There is not a state where they stay the same length, unless you're doing something really weird like monitoring to make sure there are always 10 items in the queue, or something, and adding if that number drops, but then you'd need a second queue-type structure to support that, and the conversation kind of goes off the rails.
It's a point in time, but it is irrelevant to the problem we're discussing and I'm not sure what value there is in discussing that static points in time exist. It's like discussing a runtime and saying "well, at this static point in time it takes zero seconds to run" - I just can't wrap my head around how it is relevant to the issue at hand.
You're also being really rude for no good reason. My polite suggestion is that you reflect on yourself here about why you are being rude on the internet to strangers. It could lead to growth and hopefully help you with whatever hurt you're experiencing.
This is true only when you have relatively steady flows, or infinite multiprocessing. In that case, yes, the queue is in practice either "almost empty" or "infinite".
I think the reason you get the responses you do is people are used to systems with high variation in flow rates that can only process a limited number of items at the same time, and for these, queues can soak up a lot of variability by growing long and then being worked off quickly again.
What about a queue that is the ingress point for a fan-in between O(N) active continuous producers to a single active continuous consumer?
In such a case, you'd expect at least N messages in the queue at all times — and often a small multiple of N. Not because the consumer can't consume those messages in a timely manner, but because, given a reliable queue, messages can't be consumed until they've been 2PCed into the queue, and that means that there are always going to be some messages sitting around in the "acknowledged to the producer, but not yet available to the consumer" state. Which still takes up resources of the queueing system, and so should still be modelled as the messages being in the queue.
I think this is a valid concern for Future Us to worry about. If you have alerts for when this is going to happen, you can punt the issue until it starts to become a concern and who knows if your product will be alive when that issue happens.
Indeed, and this is what I meant but worded poorly.
Obviously you don't want it to be 100% full, but you do want it to be the kind of full you get after a nice meal when you still have room for dessert and a coffee.
You really don't want your queue more than 70-80% utilized. Once it gets to 75% you should consider that "red zone" and you need to fix it somehow. Once the queue is 80% you're really hitting into some issues. Above 80% and you will have significant problems.
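A rough illustration of where those thresholds come from: in the textbook M/M/1 model (Poisson arrivals, exponential service — real systems only approximate this), the average number of items in the system is ρ/(1−ρ), which hockey-sticks as utilization ρ approaches 1.

```python
# Average number in an M/M/1 system as a function of utilization rho.
# L = rho / (1 - rho); waiting time scales the same way (Little's law).
for rho in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    L = rho / (1 - rho)
    print(f"utilization {rho:.0%}: ~{L:.1f} items in system on average")
# 50% -> 1.0, 70% -> 2.3, 80% -> 4.0, 90% -> 9.0, 95% -> 19.0, 99% -> 99.0
```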
Thanks for this, that has helped me see something I missed.
If you’re dealing with something like SQS and are receiving variable sized batches of messages, the consumers can be classed as “100% utilised” (always processing a batch of messages with no empty receives) whilst the queue would not be considered 100% utilised (the average delivered batch size is always less than the max).
What issue are you trying to solve? The expense of polling?
Obviously there is a line where you do not want a bunch of expensive consumers hanging around - but this isn't the place IMO to micro-optimize your costs. You shouldn't have like 20x the consumers you need but you shouldn't worry about having EXACTLY the correct amount.
In your example, are you implying that if more jobs come in more workers get spawned? Is that specific to the Amazon service?
If a process is busy processing a task, it is busy.
A process could receive N tasks in a batch and process them at the same time.
In this scenario it is possible for all workers to be utilised whilst the queue is not “100% utilised” - which is a failure mode the link you shared explained:
> And by induction, we can show that a system that’s forever at 100% utilization will exceed any finite queue size you care to pick
The inflows to the queue are less than the outflows, despite every worker spending 100% of its time processing tasks.
It does depend on the type of utilisation though; I'm mainly thinking about ML inference, which has a fairly fixed initial cost per inference but scales quite nicely with >1 input.
You can't have 100% utilization of any processing capacity (ignore queues) without reaching this type of failure. 100% CPU utilization essentially just means you filled up the queue on your worker server.
> The inflows to the queue are less than the outflows, despite every worker spending 100% of its time processing tasks.
Right, but at some point they will have to have SOME idle time. If they are NEVER idle the queue by definition is growing, not shrinking, as you said.
Unless you have some system that monitors queue length and doesn't add jobs if it is like >n, which is what the article suggests (maximum queue size). But then you have to figure out what you do with those jobs that can't be added, which probably means just creating some kind of queue higher up in the pipe.
If they are processing more than is being added, eventually things will run out of the queue.
that's a great paper that clearly explains how "back pressure" [my term] at high utilization causes latency to go hockey-stick. (And thanks for the link bc I apparently forgot to bookmark it when I initially read it.)
Not all queues serve (soft) real time requests and not all queues are bounded (1). If the rate of incoming queue items is predictable but not constant (for example, average load in the weekend or at night tends to be lower for many systems) you can often get away with having enough capacity to serve the average load while using the queue as a buffer. This is fine if the work done is not time critical, like sending most types of marketing emails.
You might argue that a queue with many items in it which can still accept more items is not really "full", but it is clearly not empty either.
(1) Obviously all queues are bounded by the capacity of the hardware, but since most queueing software can persist to disk and/or lives in the cloud with elastic storage this is not usually a problem.
The difference is that the requests that are being served are getting extreme latency. You are serving the same number of requests, but every single one is X days old.
No, there are use cases where new entries are not ingested by requests - sometimes you know all the work that needs to be done upfront. And sometimes you have a few requests coming in over time, but you also occasionally need to go through everything you have seen earlier once more and flush it through the queue.
Queue isn’t full. It isn’t empty and things are being processed not in “real” time. I’ve always run my queues where they are never empty. If queue depth hits certain levels, workers are scaled up. When it drops, they are scaled down. Last thing I want is under utilized instances.
Unsure what you're saying. If your queue is NEVER empty wait times will grow to infinity. You shouldn't aim for this. You should aim for something reasonable like, "empty 50% of the time" etc..
They haven’t been empty for two years and my waiting times average 5 minutes and 30 minutes during high load times. I don’t understand how they will grow to infinity.
> I don’t understand how they will grow to infinity.
You either have to be consuming faster than the queue is growing, or you are adding faster than you consume. There is no equilibrium.
If you are consuming faster than it is growing, at some point you'll consume all jobs, meaning your queue is sometimes empty. If you are adding faster than you consume, your wait time approaches infinity (tho in reality you'll fill up your queue storage first, of course).
That is only the case if the consumption rate is fixed. If the rate of consumption is adjusted in a feedback loop based on queue depth and/or latency then clearly you can keep your queues in a balance of being neither empty nor infinite.
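A sketch of such a feedback loop, with made-up names and numbers — the point is only that the consumption rate becomes a function of the backlog:

```python
import math

def desired_workers(queue_depth, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=50):
    """Pick a worker count that should drain the current backlog within
    target_drain_seconds, clamped to sane bounds. Called periodically by
    whatever does the actual scaling (HPA, ASG, cron job, ...)."""
    needed = queue_depth / (per_worker_rate * target_drain_seconds)
    return max(min_workers, min(max_workers, math.ceil(needed)))

# e.g. 1200 queued jobs, each worker does 2 jobs/s, drain target of 5 minutes:
# desired_workers(1200, 2, 300) -> 2 workers
```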
No, it still applies with variable computing. All you're doing is adding a N length or X second buffer to your queue. You are just shifting your "zero" over by N items or X seconds, which is the time you wait until you scale up.
You may need that or want to add this buffer - but you're not solving any computation resource problem directly. If your jobs take XYZ CPU cycles they take XYZ CPU cycles - it doesn't matter if you make them wait a little bit.
You could even be spending more money, since it does take some resources and cost to scale things.
It doesn't mean it's not valid to do things this way, I'm just saying it's still the exact same problem.
Your comment only shows that if you value resource utilization, you shouldn't use queues.
Because, no, nothing you said changes the fact that queues only do anything when kept mostly empty. And the only thing they do is some short term trade of latency for reliability. They actually can't change utilization.
I’m currently waiting for a flight. Isn’t an airport a kind of queue that buffers people over time, then delivers them to the sky at steady, fixed intervals?
This airport seems busy but the planes seem full, how can we improve utilisation?
Bigger aircraft I guess. But the point is that we are at full utilisation despite the “queue” (airport) never being empty, and thousands of people awaiting “delivery”.
Which refutes the original point about queues not doing much unless they are empty and being unable to affect utilisation. If you have a somewhat inelastic output that benefits from batching, but has variable inputs, they are absolutely crucial to utilisation.
I'm being nitpicky but my intended prompt was to get you to define what needs to be improved, which you did. You need to improve (or increase) capacity, not utilization. The two are related, but they aren't the same. Utilization being at 100% means it only has two ways to go: Stay at 100% or drop below 100%. Which one constitutes an improvement? It depends, but for the airline utilization at 100% is already what they want (by the numbers, should be their maximum profit) so it can't be improved. Only capacity can be improved.
This has always been a thing, far far before the “latest in IaaS architecture”. The idle polling still exists except they manage it for you. Their margin is then improved if they intelligently reduce wasted polls through a variety of techniques.
However, yes, you don’t pay for this or care about it much. It’s good.
Seems like there's some implicit assumption in this comment that everything is rented by the minute and latency never matters. Renting things by the minute isn't really the cheapest way to buy lots of things though. Also, it's sometimes nice to respond to the content of the article instead of its title.
I think the implicit assumption in the comment is that it would be possible for readers to understand the underlying idea of the comment rather than only see the specific example. “resource cost per minute” is not the only thing helped by a non-empty queue.
Also please don’t accuse someone of not “responding to the content of the article” when the content is a bog standard list. If you want my response to that: “yes, that is indeed a list of 4 things queues help with”. Not sure it adds much to the discussion.
When will my users be happier because I took longer to do things on their behalf? I apologize for not understanding the unwritten portion of the comment.
I guess I should really ask, why do people care about utilization instead of throughput, latency, and cost? We could increase utilization by rewriting everything to be more wasteful, for example, but it doesn't seem like we should do it.
Utilization is choosing throughput and cost over latency.
People care about utilization when, eg, processing petabytes of data — something that takes hours to days, where you’re optimizing batch throughput and cost. By making sure you efficiently utilize your hardware.
> Utilization is choosing throughput and cost over latency.
For a given rate of incoming work, increasing throughput decreases utilization. So I guess it's not choosing throughput too much.
> People care about utilization when, eg, processing petabytes of data — something that takes hours to days, where you’re optimizing batch throughput and cost. By making sure you efficiently utilize your hardware.
If you have fixed resources and lots of work, it would be good to have nonempty queues to prevent them from idling. That is maybe the big unwritten part of the comment upthread. It doesn't seem worth saying "no, the article is wrong in all cases because of this particular case!" though, so I'm not sure.
If you define throughput as "processed items over time per processor", the tradeoff becomes clear. Given the rate of incoming requests, you can get more processors, which will increase cost and reduce latency and throughput-per-processor, or you can get fewer processors, which will decrease cost and increase latency and throughput-per-processor.
If you have fixed resources and fixed work, wanting your queues to be empty is a wish, not a strategy.
Any queue that fills up, say halfway, is an indication the system is in a precarious state.
It's simple. Queues have limited space; if the velocity of data going in is faster than data going out, the queue will fill up until it's full.
A half full queue indicates the velocity is in such a state and such a state is unsustainable. You need to build your system such that it stays out of unsustainable states as much as possible.
If you're finding that your queue fills up half way a lot that means you're building your system in such a way that it operates at the border of sustainability. In other words the system often ingests data at rates that cannot be sustained indefinitely.
If a system with a queue half full operates in that same state twice as long, you now have a full queue and you're losing data.
In essence your queues should mostly be empty most of the time. At best you can account for occasional spikes as the article stated but a queue even half full should never be the operational norm.
This response lacks nuance and focuses too much on the operational side of queues.
Not all queues are “realistically bounded”. Let’s take a hypothetical silly “order queue” where each entry represents a €10 item that’s been purchased. The queue feeds into a boxing machine that puts the item in a box and posts it to the user.
The point at which a queue hits an upper bound would represent billions of euros of orders. That’s not really a realistic failure mode to hit without noticing long beforehand.
Meanwhile, if you always want your queue empty then your boxing machine is going to be shipping thousands of small boxes because the moment an order arrives, it’s processed into a box.
It’s much more efficient to have a small, managed and time-bounded backlog of orders that the boxing machine uses to batch orders together in a smaller number of larger boxes.
Wrong. Your thinking is where the nuance is lacking. As I said, you need to think of the system in terms of its derivative. The rate of data consumed must on average be faster than the production. The net difference of these rates must equal a negative velocity, with the rate of data going in as a positive number and data consumed as negative.
There is no system on the face of the earth that can operate at an average positive velocity. That is categorical failure.
This is mathematically and logically true no matter how big your queue is.
Even for the situation you're describing - a really big ass queue - the average velocity of that system must still be negative, unless your queue is so big that it can withstand an average positive velocity over the lifetime of the system, which is kind of absurd.
Most of the time your queues should be empty. If your queues are observationally often filled with data but not full it means you're still operating at a negative velocity but you've tuned your system to operate at the border of failure.
It means your system hits unsustainable velocities often. You may want this for cost reasons, but such a system is not robust enough for my comfort. The queue should be empty most of the time with occasional spikes at most.
This is a very long-winded way of saying “you should have enough capacity to process all items in the queue before the queue fills up and everything explodes”.
I agree. But that’s not the argument the post is making, nor does it support the position that “queues should be empty”. No, they shouldn’t. It’s fine to let a queue build up (and thus not be empty), then process all the items in the queue in a short time frame. There are efficiency savings if you are able to do this.
No. That is not what I'm saying. I'm saying on average queues must be empty most of the time. And if you are seeing queues that are filled most of the time, your entire system is operating on the border of failure.
This is true even when you have a big ass queue. The system must on average be more empty than it is filled.
This is not to say that your system can never have spikes in data. Your system can often have spikes, but this cannot be the norm, in the sense that the spikes in data production cannot push the average production rate above the average consumption rate.
I don’t see how this holds true. Let’s say I have a process that takes 1 hour to process a batch of items up to 1000 (i.e. it doesn’t scale linearly with input size). So, 1 hour to process 1 item and the same 1 hour to process 1000 items.
If I have 1 message per minute being added to the queue, I can safely run 2 of these batch processes once a day to clear the queue. Or more generally can launch (queue_size / 1000) tasks once a day.
Because surely it doesn’t matter about the queue being empty, what matters is the rate at which messages can be processed.
With 2 processes running once a day the rate would be 2000/24/60 ≈ 1.4 messages per minute in aggregate.
The queue can be filled 23 hours per day without issue and without being on the brink of a failure state.
Edit: you edited your comment whilst I was replying to specifically mention consumption rate. The aggregate consumption rate is all that matters, but it doesn’t follow that the queue “has to be more empty than not empty”. Those are different issues.
You're talking about a different scenario. I'm talking about continuous inflow and outflow. Where outflow is processing (fixed rate) and inflow is traffic (variable). This is the classic assumption. In this state the space used in the queue is an indicator of mismatched consumption and production velocities.
You're talking about a single batch job that runs once a day and empties the queue at the same rate no matter how much data is in the queue. Thus the more data you grab in batch the more efficient the processing, so you grab data once per day instead of continuously. This is a clear exception from the assumed norm that needs to be mentioned.
That is not to say that your case is invalid. Many analytic databases possess your ingestion parameters. But the case is exceptional enough that if not explicitly mentioned it can be assumed that such a batch job is not a primitive in the model being discussed.
I disagree it’s exceptional even if my examples have been caricatures.
Very few technical problems don’t have any elasticity in outflows, but a lot of problems do benefit from a form of batching. Heck, even something as simple as inserting records into a non-analytic database benefits greatly from batching (inserting N rows rather than 1) and thus reducing TPS, or calling an external service, or writing files to some storage, or running inferences, or writing to a ledger, or anything.
Batching, in all forms, exists absolutely everywhere from super-scalar CPU uops to storage to networking to the physical world. The exception is to do something without batching, even if the whole stack you’re using from the ground up is built on it.
Given the common use case of batching, which inherently requires items to be queued for a period of time, how can you say that queues should always be empty? They would statistically rarely be empty. Which is a good thing. Because it enables batching.
If you want to queue and insert 1000 records into a database with 1000 transactions across 1000 workers, then say your workload is highly optimised because there are never any records in your queue, cool. Not very true though.
>Very few technical problems don’t have any elasticity in outflows,
They have elasticity but you don't care for it. You run it at the maximum. As you mentioned, the only case where you wouldn't is if you have some transaction that doesn't scale linearly with frames/events.
>Given the common use case of batching, which inherently requires items to be queued for a period of time, how can you say that queues should always be empty? They would statistically rarely be empty. Which is a good thing. Because it enables batching.
In my experience, while batching is not uncommon, streaming is the much more common use case. So we can agree to disagree here.
It's not really that exceptional. You can do similar things with e.g. batch flushes every 5 ms for synchronous web requests and the same ideas will apply. This can give a massive increase in throughput while possibly decreasing overall response time for OLTP.
It's common, but exceptional enough that it should be mentioned because it is a specialized database that does batch ingestion.
Let's say you flew from one city to another. People assume you took a plane. If you took a helicopter, though helicopters are common, it should be mentioned otherwise it won't be assumed.
It doesn't take a special database though. You can do this sort of thing in your application in like 10-20 lines of code (assuming you have a framework with queues and promises). It works great with mysql/postgresql which both benefit dramatically from batching queries.
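Roughly what that looks like, as a hedged sketch rather than anyone's actual implementation (asyncio-flavoured; `bulk_insert` stands in for whatever executemany/COPY wrapper you use):

```python
import asyncio

class InsertBatcher:
    """Coalesce individual insert requests into one bulk write every flush_ms,
    so callers still await a single logical insert but the database sees batches."""
    def __init__(self, bulk_insert, flush_ms=5, max_batch=500):
        self.bulk_insert = bulk_insert        # e.g. an executemany / COPY wrapper
        self.flush_ms = flush_ms
        self.max_batch = max_batch
        self.pending = []                     # list of (row, Future)

    async def insert(self, row):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((row, fut))
        if len(self.pending) == 1:
            asyncio.create_task(self._flush_later())   # first row starts the timer
        elif len(self.pending) >= self.max_batch:
            await self._flush()                        # don't let batches grow unbounded
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.flush_ms / 1000)
        await self._flush()

    async def _flush(self):
        batch, self.pending = self.pending, []
        if not batch:
            return
        try:
            await self.bulk_insert([row for row, _ in batch])
            for _, fut in batch:
                if not fut.done():
                    fut.set_result(True)
        except Exception as e:
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(e)
```

Callers still `await batcher.insert(row)` one row at a time; the database sees one statement every few milliseconds instead of one per row.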
The context here is external queues like kafka. If you're writing stuff to a queue in your web app then it must be a relatively small load. This is different.
For analytics databases the batch ingestion requirement are usually so big that you have to save it to the file system hence the need for an external queue.
Heap allocated queues are different, at that stage I would say that it's in "processing" already and popped out of the external queue.
I guess "relatively small load" is... relative, but I've written that kind of thing to be able to handle ~70k sustained OLTP requests per second (including persistence to postgres) when load testing locally on my laptop. In any case the same thing applies to external queues. Your workers will often be more efficient if you pull off chunks of work to process together.
By small load I mean size. The frames in your batch jobs are tiny and few in number if you're getting 70k batch jobs per second.
It's more similar to streaming... what you're doing here. In that case my velocity measurements are more applicable. You want your queues to be empty in general. A batch job is run every hour or something like that which is not what you're doing here.
If you ran your load test for 5 minutes and you see your queues are 50 percent full. Well that means 10 minutes in you'll hit OOM. Assuming your load tests are at a constant rate.
If your queues are mostly empty then it can handle the load you gave it and have room for spikes. It's just math.
Queues can be full most of the time without the system being in a precarious state. Take for example an extremely spiky pattern of processing: once a day, a big batch of jobs is delivered. You know ahead of time how many items there are and have a pretty good handle on how long it takes to process one item. The items should be processed before the next batch arrives. You can then tune your system to empty the queue just before the next batch arrives. The queue is non-empty the majority of time, yet the system is perfectly stable.
Now, that is an extreme example, but similar behavior exists. It’s fine to have queues partially filled most of the time, as long as your average rate of processing exceeds the rate of incoming items by some margin.
As I said in the other post, I am talking about the average and most likely use case: a flat and continuous rate of data processing and varied rates of data production.
Spiking the processing is not what people think about with queues and is most likely NOT what the author of the article is talking about.
Typically there's no reason why you would spike your processing unless such processing doesn't scale linearly with the amount of items processed (as the author in a sibling post to yours mentioned). Such a case must be deliberately introduced into the discussion as an exception to the model because it's not what people think about when discussing this. Or maybe it is given that 3 people brought the same exact exception up. Welp if that's the case then I can only say the author of this article and I certainly aren't referring to this use case.
edit: I'm rate limited. So I'm referencing another post in my response that you won't see until the rate limit dies and I post it.
My comfort is a measure of business aversion to risk. It depends on the business of our respective industries.
Either way you have admitted that you tuned the system to operate with an acceptance of risk. Likely for cost reasons. This is fine, if you deliberately chose to do this.
When humans communicate, we don't talk about exceptions where you deliberately choose to operate outside of the norm. If you deliberately choose to drive a car without a seat belt, knowing the risk, that is fine.
You want to sky dive while deliberately choosing not to use a parachute? That's fine. But don't expect me to adjust my communication to account for your exception. You chose to do this.
But let's not pretend that you don't know about this fact of human communication. It's an obvious thing you know. Yet you chose to characterize your post in a way that makes my statement look unreasonable and your actions look normal. Please don't do this. If you choose to sky dive without a parachute, that's fine, but admit you are operating with risk for the sake of cost. Do not try to mischaracterize my intentions, that is just rude.
> You need to build your system such that it stays out of unsustainable states as much as possible.
You assert that without justification. Typical counter-example would be system where you have predictable daily load-pattern, it can be perfectly reasonable to build the system that accumulates items during peak hours and clears queues during quiet hours.
You are making a somewhat subtle error here, which is assuming that the (output_rate-input_rate) is statistically independent of the size of the queue. Note that this does not necessarily imply that the output process changes based on the size of the queue. For example, consider a microphone that emits 100 raw samples to a buffer per ms and a real-time thread scheduled every 2ms that processes very quickly. a 300-sample buffer is more than enough here, even though the buffer will be over half full about a quarter of the time, and will never be empty. In this case, the velocity conditional on size being less than half is -100 samples/ms and the velocity conditional on size being greater than half is 400 samples/ms.
This example suffers from the same problem as the other counter examples. Why are you deliberately slowing down the processing to only happen at 2ms intervals? If you have continuous processing the queue will be empty always, assuming processing is faster than data production.
You're maybe referring to what the other person mentioned... processing that does not scale with the amount of data processed, so you want the processing to operate on the biggest batch of data possible. This is, while a common use case, not common enough that it should be assumed without being mentioned.
However, in your example it seems to be happening on the same computer (you mentioned a mic), and thus not the web, which is the assumed context here, so the data production rate doesn't seem to spike. If your processing is faster than data production you don't even need a queue, just stream each frame straight to processing in a single thread. But this also assumes the nature of your processing. The details of the processing can change things to require a queue, for something like averaging frames at discrete intervals... but that's different from the use case of queues like kafka or zmq, which I think is the main context here.
Edit: I reference other responses here, but you won't see them immediately. I'm currently rate limited and I have those replies in a queue so those will be posted later today.
The author has a good sentiment, but wrong expression. It's not that "queues should be empty", it's that "you should be emptying your queues". Whether they reach an empty state is irrelevant; it's just a Sisyphean task that should always be working towards empty and, ideally, never actually reaching it.
A good comparison is a restaurant's order queue.
If the order queue is always empty, that means no one is ordering food at the restaurant. That is an undesirable state. We want people to be ordering food, an empty queue shows that we have a problem in getting people to eat at our restaurant.
We don't want our queue to be always growing. If it's always growing, that means we're understaffed, and people waiting for their meals will get frustrated. Some will wait and leave a bad review, others might leave before they ever get their meal--and won't pay for it, either. Lost time and effort on our part.
But an order queue that almost always has two meals in the making is golden. It means we're getting consistent clientele to dine at our restaurant, that meals are being cooked and served in good time, diners are happy with the experience, and our system operations are healthy.
I think the author of TFA understands this, but stopped a little short of the optimal expression of their ideas. It's not about what you have, it's about what you should always be working towards.
You’ve got it all wrong. If you’re running a restaurant and orders never back up then you have too much staff, too big a kitchen, and are losing money by having too much capacity. You want to have a lunch rush, for some people to leave because you’re too busy. When the queue is backed up is when you’re the most profitable, both per hour and per unit sold. Then you want to find efficiencies to improve throughput at the times you’re backed up.
You need to intentionally design your bottlenecks not pretend that you can avoid them completely.
They should be empty if you wish to minimize latency. For example, when emergency room queues are full in hospitals, that is a very undesirable scenario. The same can be applied in software - there are parts of systems that should only use queues to handle spike workloads. And the longer these queues are on average, the worse the overall performance of the system is. Just like patient health outcomes drop with longer ER queues.
And really, we need more context to say whether a queue should be empty, trend towards being empty, or towards always having tasks. I don't think it's possible to make a general prescription for all queues.
They should be empty to reach absolute minimal latency (e.g. the only latency is the time it takes to process the task, which you could call 0 latency). Different services have different latency targets.
The right statement is not "they should be empty" but "they should be under the target latency", which might be 0 (then "the queues should be empty"), or might be some other target (then empty means overprovisioned resources).
Everyone in this thread seem to be talking over each other because they imagine vastly different services (e.g. shipping company vs hospital emergency rooms).
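One way to turn "under the target latency" into something you can alert on (a sketch with made-up numbers):

```python
def max_healthy_backlog(drain_rate_per_s: float, latency_target_s: float) -> float:
    """If consumers drain drain_rate_per_s items per second, a backlog larger
    than this means a newly enqueued item cannot meet the latency target."""
    return drain_rate_per_s * latency_target_s

# e.g. draining 200 items/s with a 30 s latency target -> alert above ~6000 items.
# A target of 0 s reproduces the "queues should be empty" stance as a special case.
```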
If your queues are empty, you're wasting money paying your cloud / hosting provider too much. What matters is utilization. Once utilization hits ~80%, your queue will grow rapidly. Keep utilization below 80% and you've got a decent tradeoff between throughput and latency. It's a good rule of thumb for networks, and it's a good rule of thumb for services.
If your queue should truly be (not heading toward) empty with the goal of minimizing latency, then you don't need a queue at all. You need more processors.
Distributions are useful for optimizing. But the decision whether a queue should generally be empty or have elements comes out of system requirements.
Then you can employ statistics to plan what resources you will allocate to processing the queue.
Though having capacity for Poisson event spikes isn’t nearly the only way to ensure queues trend to zero elements. For example, you can also scale the workload for each queue element depending on the length of the queue.
It’s very popular in video game graphics to allocate a given amount of frame time for ray tracing or anti aliasing, and temporally accumulate the results. So we may process a frame in a real time rendering pipeline for more or less time, depending on the state of the pipeline. Say, if the GPU is still drawing the previous frame, we can accumulate some ray traced lighting info on the CPU for the next frame for longer. If the CPU can’t spit out frames fast enough, we might accumulate ray traced lighting kernels/points for less time each frame and scale their size/impact.
Statistical distributions of frame timings don’t figure here much as workloads are unpredictable and may spike at any time - you have to adjust/scale rendering quality on the fly to maintain stable frame rates. This example assumes that we’re preparing one frame on the CPU while drawing the previous one on the GPU. But there may also be other threads that handle frames in parallel.
Other workload spikes could be predicted by things like Poisson distributions. But once again, the key takeaway is that you can’t generalise this stuff in any practical application. It all depends on system requirements and context.
Additionally, they must not be empty to maximize throughput of consumers which benefit from batching. If there's only ever one item to grab, consumers cannot amortize fetches across a batch, which impacts throughput, without any inherent benefit to latency.
(Imagine sending a TCP packet for every individual byte that was enqueued on a socket!)
Same applies if you have any sort of load shedding on the consumer side. If the queue is empty, it means you've shed load you didn't need to; when the next brief drop in load comes, you can't take advantage of it.
(Fun analogy: this is why Honda hybrids typically don't charge their battery all the way while driving, to allow the chance to do so for free while braking.)
This only works in very predictable workloads. Once your average input rate becomes slightly larger than consumption rate your system is screwed. Over provision ensures spikes can be processed along with average rate in a general increased load state for some time.
True, my example is an oversimplification. In reality you want to account for rapidly and slowly changing demands. This can be solved with elasticity of systems or by overprovisioning. You can get deep, with time series analysis, low pass filtering and various metrics, such as quantile distributions… however, saying zero queue length is desirable is a gross oversimplification.
Queues should have an ideal size based on your latency requirements.
Say you have 100 requests coming in per second. Each consumer of your queue requires 0.3 seconds per request. Now, you can optimise for the number of consumers. If you chose 40 consumers, the probability of serving the request with an empty queue in 0.3 seconds would be 17% (and > 0.3 seconds would be 84%). However, the total waiting time per request would be 0.1 seconds.
With 80 consumers, the probability of an empty queue would be higher at 58%. However, most of those consumers are doing nothing, while your waiting time is only reduced by 0.1 second! Only a 30% improvement.
(edit: probability of empty queue in second example is higher, not lower)
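The exact percentages depend on the model and its assumptions, but for anyone who wants to redo the arithmetic for their own parameters, the standard M/M/c (Erlang C) formulas are easy to code up. A sketch:

```python
import math

def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait (M/M/c queue)."""
    if offered_load >= servers:
        return 1.0
    a, c = offered_load, servers
    top = (a ** c / math.factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(servers: int, arrival_rate: float, service_time: float) -> float:
    """Average time spent waiting in the queue (excluding service)."""
    a = arrival_rate * service_time          # offered load in Erlangs
    return erlang_c(servers, a) * service_time / (servers - a)

# 100 requests/s at 0.3 s of work each is an offered load of 30 Erlangs.
for c in (40, 80):
    print(c, "consumers: P(wait) =", round(erlang_c(c, 30), 3),
          " mean wait =", round(mean_wait(c, 100, 0.3), 4), "s")
```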
I'm not awake/invested enough to make up numbers and do the math here, but shouldn't the cost for provisioning workers and value provided at various latencies matter? Like, each worker you provision costs a certain amount and shifts the latency distribution towards zero by a certain amount, and in principle there should be some function that converts each distribution to a dollar value.
Like, suppose you had a specific service level agreement that requires X% of requests to be served within Y seconds over some time interval, and breaking the SLA costs Z dollars. Each provisioning level would generate some distribution of latencies, which can get turned into likelihood of meeting the SLA, and from there you can put a dollar value on it. And crucially, this allows the "ideal" amount of provisioning to vary based off the relative cost of over and under-provisioning; if workers are cheap and breaking the SLA is costly, you would have more workers than if the SLA is relatively unimportant and workers are expensive.
Yes, that’s an interesting topic, especially given the prevalence of distributed compute nowadays and the rising awareness of cloud costs.
In the end, the distribution is not really poisson, of course. So, you might be interested in low pass filtering to elastically scale your provisioned workers. There is quite some theory about this, including sophisticated machine learned models to predict future load. But I digress.
Spiky workloads, safely persisting data, decoupling components, and enabling asynchronous processing are all things that can be provided by a stack or a heap as well. The call stack in any given programming language is a good example of a stack for handling spiky workloads, for instance. A min-heap is more flexible than a queue for handling async/decoupled/pubsub stuff (and arguably more correct for many use cases).
The thing that queues are good for is enforcing order: when I get to the front of the queue and it’s time to do my thing, I have a guarantee that anyone who entered the queue before me has already done their thing, and anyone who entered the queue after me has not yet had a chance to do their thing. This is a useful guarantee for tasks that have an order dependence, for example database operations looking to guarantee consistency.
> If the queue is not empty, it means that whatever job is consuming from the queue is not keeping up with messages coming into it.
You could conceivably have a non-zero queue size that asymptotically reaches some stable size over time. In such a case you are keeping up with messages but with some approximately static amount of added latency.
I don’t think I’ve seen this happen in practice but I think some systems can be engineered to be approximately stable (grows and shrinks but within predictable bounds)
This is very common when you can adjust the consume rate. This happens in practice: serverless platforms scale, Kubernetes has HorizontalPodAutoscaler, you can manually grab more VMs, or IRL this might mean calling in more staff or at larger time scales hiring more people in your organization.
Alternatively stated, queues are safest in a steady state. If they are growing in size, that is clearly bad if not bounded. If they are shrinking in size, that means you are going to be idling soon. I think it is fair that if you are picking between those states, you'd prefer the idling.
That said, consider a physical machine. The "hopper" on many is essentially a queue to feed in items to work on. If you let that go empty, you messed up. Hard.
If a queue is just being used to decouple a producer/consumer pair, then "it should be [almost] empty" is a reasonable notion, assuming consumer/producer are compute bound and there's no fundamental reason why one should be faster than the other...
Which brings us to the other type of use where things are being enqueued specifically as a means of waiting for resources that are in limited supply: maybe just compute if these are computationally heavy jobs, or maybe I/O bandwidth, etc. In that case we expect the queue to often be non-empty, and there's nothing wrong with that, even if there's potential to throw money/hardware at it to get the queue size down to zero!
I agree with the author on the point that queues are best to handle spikes. But when you are talking about scale, dealing with millions of messages in a short span of time for an extended period, queues can never remain empty. Empty queues in this situation would mean that your concurrency is high. When your concurrency is high, this would result in too many concurrent connections to your database, which would end up creating bottlenecks for your system.
The system should be designed to handle as many messages as possible ensuring that your resources don't get over burdened. Empty queues where your messages keep getting rejected because your resources are being hammered is not necessarily an efficient system.
> When your concurrency is high, this would result in too many concurrent connections to your database, which would end up creating bottlenecks for your system.
If you have a polling queue, as the number of workers grows too large, the overhead of maintaining connections from (worker) -> (stuff in the middle) -> (db or whatever keeps queue state) becomes too high. In the extreme case, you can get into a point of diminishing returns, where adding more workers hurts both throughput and latency in a bad way.
That is - initially you can decrease latency by sacrificing utilization and just adding more workers that usually sit idle. This increases throughput during heavy load, and decreases latency. Until you can't, because the overhead from all the extra workers causes a slowdown up the stack (either the controller that's handling messages, or the db itself that's got the queue state, or whatever is up there). Then when you throw more overhead, you increase latency and decrease throughput, because of the extra time wasted on overhead.
It depends on a lot of different variables.
If you have work items that vary significantly in cost, it can add even more problems (i.e. if your queue items can vary 3+ orders of magnitude in processing cost).
Throttling stuff is another valid use case. I've used a pub/sub for sending e-mail for a newsletter because we couldn't send too many of them at once, so the consumer would run on a fixed interval and always pick up a fixed number of messages.
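That pattern is simple enough to sketch (hypothetical names; `receive_batch` and `send_email` stand in for your broker client and mail provider):

```python
import time

def throttled_consumer(receive_batch, send_email, batch_size=100, interval_s=60):
    """Wake up on a fixed interval and send at most batch_size messages,
    regardless of how deep the backlog is, to respect a provider rate limit."""
    while True:
        started = time.monotonic()
        for msg in receive_batch(max_messages=batch_size):
            send_email(msg)
        # Sleep out the remainder of the interval (never a negative sleep).
        time.sleep(max(0.0, interval_s - (time.monotonic() - started)))
```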
There are entire languages built upon this model! Anything using the BEAM is a great example. Erlang, Elixir, Gleam, etc.
The runtime acts as an event loop and if a specific process is hogging too much time on the CPU, it will get paused and sent to the back of the run queue. This is all (relatively) transparent to the developer.
Except most times people don't implement the surrounding infra correctly.
Let's say you consume a job, do something, but that something fails. You have to set it to be rescheduled somewhere. Lets say your things writing jobs to the queue fails. That needs to be handled as well.
"They will be processed later [if I use a queue]" is not true by default (without work and thought) and assuming so will get you in trouble, and will also make certain people on your team who do understand that isn't the case upset.
I had a course about queue theory in my SWE degree. I left the course with a deep sense of dread because I knew most people studying CS and SWE in other universities would be oblivious to such a fundamental piece of knowledge.
Basically everything that processes data has a queue attached to it, from your router to your scheduler. Not knowing anything about queue theory is such an insult to the profession of SWE. It's like being a Mech engineer and not knowing about control of dynamic systems.
Why did you think you knew that? I'm also one of your supposed few to have studied it, I never thought that.
The dread should have been that most companies/teams won't care, won't engineer on that level, will just set it up however and make it bigger/faster/more-consumed if it's too slow.
In my uni, the class was more or less unadvertised and the professor largely ignored — you had to go out of your way to take it (I was actively looking for poorly rated professors, because I figured they were usually rated such because they graded hard… which probably meant they cared more). When I tried to google the subject/code examples afterwards, I largely only found textbooks and one-off blog posts.
Also a lot of GitHub repos with names like CS550…
Only place I’ve seen it referenced since that class was:
1. HN periodically
2. A YouTube documentary about Disneyland fast pass queues
3. A business optimization consultancy
Ultimately, a pretty niche subject area AFAICT. But implementing a fairly efficient queuing simulation seems to be pretty easy, and the subject seems to be stupid powerful and broadly applicable… so I’ve never understood why it’s not more discussed
Compared to what, as if say information theory comes up (explicitly) all the time? Seems like you could say that about any more theoretically oriented course.
An always empty queue means you could run into problems when traffic increases.
I've seen this come up recently, where our queue reader was (well, still is) overscaled, where it could handle 100x the normal queue throughput. When the input rate gets higher though, instead of the queue filling a bit and controlling the traffic, we hammer our dependencies.
I now see a continuously empty queue as a smell. It should be filling and emptying over and over again. Not all the time, but a queue that never fills is a ticking time bomb
What you're describing (empty queue = unprepared for load) is a code smell, but the solution doesn't have to be "fill the queue sometimes". It could also be:
* (Mandatory) Have a back-pressure system (see the sketch after this list).
* (Optional, depends on system requirements) have auto-scaling of queue readers.
* (Mandatory) Make sure the system is regularly load tested.
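The back-pressure item above can be as simple as a bounded queue whose producers are told "no" instead of being allowed to grow the backlog without limit. A minimal in-process sketch:

```python
import queue

jobs = queue.Queue(maxsize=1000)   # bound chosen from the latency budget

def submit(job):
    """Producer-side back-pressure: refuse new work instead of letting the
    backlog (and therefore latency) grow without bound."""
    try:
        jobs.put_nowait(job)
        return True                # accepted; will be processed later
    except queue.Full:
        return False               # caller should retry later or shed the request
```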
In a complex workflow that has queues acting as buffers for work items between processes, the intermediate queue sizes depends on the turnaround time promised to the end-customer (which is another way of saying rate of customer demand).
If overall customer demand is x widgets per unit of time, if you design your system to process y widgets per unit of time, you will have overprovisioning (and hence impact to profits) if y >> x (y very large relative to x). As a corollary, if you design your system where y < x (y smaller than x), then you will not meet SLAs for some customers.
This analysis can be done at each subsystem level that is a component workflow and in reality some subsystems may be overprovisioned while some are underprovisioned. The goal should be to match the overall rate of customer demand.
You also design the system to handle a certain level of spike in demand, but the system will fail when demand exceeds capacity. When that happens, you want one place where you want to manage flow and invariably that will be a critical queue.
Queues are great as a buffer to even out compute utilization. With an empty queue, most workers can be shut down after a while, saving on compute cost. When the queue grows and stays non-empty, more workers can be started up. The length of the queue can be used to estimate the number of workers to spin up, as a metric for real time capacity planning.
Queues can also be used to coalesce events. You can run through the queue and see if there are repeated events that have no additional effect when applied multiple times and can be combined into a single event. This reduces the work done by the consumer processing these events.
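A sketch of that coalescing step, assuming events carry some key (the `entity_id` field here is made up) and that only the latest event per key matters:

```python
def coalesce(events, key=lambda e: e["entity_id"]):
    """Collapse repeated events for the same entity into the most recent one,
    preserving the order in which each entity first appeared."""
    latest, order = {}, []
    for e in events:
        k = key(e)
        if k not in latest:
            order.append(k)
        latest[k] = e          # later events overwrite earlier ones
    return [latest[k] for k in order]

# e.g. ten "recompute user 42" events drained together become one unit of work.
```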
Think about the medical system, people hate to be put onto waiting lists (queues), but empty queues means that extremely expensive resources (operating theatres, staff, surgeons) are sitting idle which is very inefficient. Ideally queues mean that another patient is always there when the operating theatre becomes idle. Because medical need is not smooth and constant sometimes they fill up, sometimes they drain - they allow planners to match average need to resources.
So non-empty queues are a good thing, even for hospital waiting lists.
(To forestall arguments, of course the operating theatre example is not a pure queue example, some patients get put to the head of the queue according to triage - for the purpose of this example it doesn't matter)
> When in use, queues are typically always close to full or close to empty due to the differences in pace between consumers and producers. They very rarely operate in a balanced middle ground where the rate of production and consumption is evenly matched.
This has always been my experience. Has anyone ever had a perfect goldilocks scenario between 2 threads?
One good use-case for still using queues is if your producer or consumer has spiky characteristics on small timescales, but predictable performance if you zoom out enough. In that case your queue size spikes to some high value, but always returns to 0.
The reasons for this can be numerous, maybe you produce a lot of jobs at once because of a user interaction, or your consumer is running slower for five minutes because some cron job is stealing your resources.
>This way, the producer doesn’t need to know anything about the consumer and vice versa.
Both components need to agree on shared data, down to what each field actually means. The producer is just not saying What To Do with the message.
Take an example of processing a purchase order in multiple, asynchronous steps. While the webpage button spins there's a whole dance of agreed messages and expected outcomes that needs to happen in order for it to work. And there is no decoupling in sight.
My take is that Decoupling is a vague word. One should design distributed systems assuming that coupling is manageable, not nonexistent.
>you could submit the task and ACK immediately allowing the job to finish in the background.
Should this title be given some sort of limiter/qualifier, as in "Messaging queues should be empty"? I come from the video production world where you have things like render queues. These are designed to utilize a limited number (due to sheer expense) of hardware instances, where finishing jobs so fast that the queue was always empty would indicate you have too many very expensive machines. A queue that is never empty, but is just finishing as new big pushes to the queue happen, would be the ideal balance.
The most important reason for queues is the variability of job inter-arrival times and service times. If there was no variability there would be no need for queuing.
Queue can also allow batching. It might be that you need to consume or pack or ship N items at a time, otherwise you're losing money. Trying to aim for minimal latency would literally hurt your business.
This article obviously has a very specific kind of workload/service/target in mind that they fail to define and they overgeneralize. This is bad advice in many situations that they failed to imagine.
He's right. You guys need to see the system in terms of its derivative: the rate at which data is produced and consumed.
This article is saying you basically want an average negative velocity, where consumers eat data faster than producers on average.
Any queue that is non-empty means the velocity is positive. Producers are making data faster than consumers and you have a limited time window before the queue becomes full and your system starts dropping data.
If you observe that your system never has a full queue but is constantly loaded with partially filled queues it means you built and tuned your system such that it operates on the border of failure.
It means your system is continuously hitting unsustainable velocities all the time, but stops short of total failure (a full queue).
Remember, think about the system in terms of its derivative. The average derivative must be negative for your system to even function. Any time a queue is partially or fully filled it means there was a period of positive velocity. These periods cannot exceed the amount of time the system is in negative velocity.
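In standard queueing notation, the "average velocity must be negative" claim is just the usual stability condition:

```latex
\rho = \frac{\bar{\lambda}}{\bar{\mu}} < 1
% \bar{\lambda}: long-run average production (arrival) rate
% \bar{\mu}:     long-run average consumption (service) rate
```

If ρ ≥ 1 over the lifetime of the system, the backlog grows without bound no matter how large the queue is.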
This is a somewhat uninsightful observation that only has people responding to it because of pedantic arguing. I don't think anyone even disagrees with the author, they're just arguing about words.
If you look at a queue at any random time, it will have necessarily increased more than it has decreased over its lifetime, otherwise it would hold negative items.
You still haven't defined what "heading in the direction of empty" means. Like I said, your queue can't be decreasing in size over its history, and it cannot be decreasing in size most of the time either, if it started empty.
Right but in reality you can't actually hit "always having exactly one thing to process" without maxing out your queue. It's actually impossible! Because they are actually the same thing. "Always having something to process" means no idle time - which means your queue is forever long :) Aiming for that ideal is not a good idea because it's misguided (for lack of a better term).