In my experience, a queue system is the worst thing to discover isn't scaling, because once your queue can't scale architecturally there's no easy fix that avoids data loss. You talk about "several thousand background jobs", but queues are generally reasoned about via Little's Law [1], which is stated in terms of rates: namely the average task enqueue rate and the average time a task spends in the system. Raw counts don't mean that much on their own.
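As a rough illustration of thinking in rates (the numbers here are invented):

    L = λ × W
    λ ≈ 1,000 tasks/s   average enqueue (arrival) rate
    W ≈ 0.2 s           average time a task spends queued plus running
    L ≈ 200 tasks       sitting in the system at any given moment

The same raw backlog number means very different things depending on those two rates.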
In the beginning you can do a naive UPDATE ... SET, which locks far too much. You can make the locking more efficient by doing the dequeue as an UPDATE with a SELECT ... FOR UPDATE SKIP LOCKED subquery, but eventually the dequeue queries throttle each other's locks anyway and the queue grinds to a halt. You can try disabling enqueues at that point to give your DB more breathing room, but every rejected enqueue is data loss, and it's mostly the dequeues locking each other out in the first place.
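For reference, the "more efficient" dequeue mentioned here is usually a single statement along these lines (a minimal sketch, assuming a hypothetical tasks table with id, status and payload columns):

    -- Claim one pending task. SKIP LOCKED makes concurrent workers pass over
    -- rows that another transaction has already locked instead of blocking.
    UPDATE tasks
       SET status = 'running'
     WHERE id = (
             SELECT id
               FROM tasks
              WHERE status = 'pending'
              ORDER BY id
              FOR UPDATE SKIP LOCKED
              LIMIT 1
           )
    RETURNING id, payload;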
You can try, in a hurry, to shard out your task tables to avoid the locking, and that may work, but it's brittle to roll out across multiple workers mid-incident and can itself result in data loss. You can of course drop a random subset of tasks, but that is data loss by definition. Any of these options is not only highly stressful in a production scenario but also very hard to recover from without a ground-up rearchitecture.
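For concreteness, "sharding out the task tables" in Postgres could mean something like hash partitioning, sketched below with invented names and shown as if built from scratch:

    -- Spread the task rows (and their locks) across several physical tables.
    CREATE TABLE tasks (
        id      bigserial,
        status  text  NOT NULL DEFAULT 'pending',
        payload jsonb
    ) PARTITION BY HASH (id);

    CREATE TABLE tasks_p0 PARTITION OF tasks FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE tasks_p1 PARTITION OF tasks FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE tasks_p2 PARTITION OF tasks FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE tasks_p3 PARTITION OF tasks FOR VALUES WITH (MODULUS 4, REMAINDER 3);

The DDL is the easy part; the brittle part is migrating a hot table into that layout and redeploying every worker against it while the queue is already struggling.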
Is this kind of nightmare production scenario really worth choosing Boring Technology for? Maybe if you have a handful of customers and are confident you'll be working at tens of tasks per second forever. Having been in the hot seat for one of these, I will always choose a real queue technology over a database when possible.
> and are confident you'll be working at tens of tasks per second forever.
It's more like a few thousand per second, and enqueues win, not dequeues like you say... on very small hardware without tuning. If you're at tens of tasks per second, you have a whole lot of breathing room: don't build for 100x current requirements.
This link is simply raw enqueue/dequeue performance. Factor in workers that perform work or execute remote calls and the numbers change. Also, I find that when your jobs have high variance in run times, performance degrades significantly.
> This doesn't really make sense to me. To me, the main problem seems to be that you end up with having a lot of snapshots around.
The dequeuer needs to know which tasks to "claim", so this requires some form of locking. Eventually this becomes a bottleneck.
> don't build for 100x current requirements
What happens if you get 100x traffic? Popularity spikes can do it, so can attacks. Is the answer to just accept data loss in those situations? Queue systems are super simple to use. I'm counting "NOTIFY/LISTEN" on Postgres as a queue, because it is a queue from the bottom up.
> Factor in workers that perform work or execute remote calls and the numbers change.
These don't occur on the database server, though... This merely affects the number of rows currently claimed.
> The dequeuer needs to know which tasks to "claim", so this requires some form of locking. Eventually this becomes a bottleneck.
These are just try locks, though-- the row locks are not contended. The big thing you run into is having lots of snapshots around and having to skip a lot of claimed rows for each dequeue.
> What happens if you get 100x traffic? Popularity spikes can do it, so can attacks.
If you get 100x the queueing activity for batch jobs, you're going to have stuff break well before the queue. It's probably not too easy to get 100x the drain rate, even if your queue system can handle it.
This scales well beyond 100M batch tasks per day (north of 1,000 tasks per second on average), which gets you to 1M users with 100 tasks/day each.
Throttle the inputs. Rate-limiting doesn’t belong in the data layer.
While throttling due to organic popularity isn’t great, I’d argue the tradeoffs might be worthwhile. If it looks like the spike will last, stand up Redis during the throttling, double-write new tasks to both, and throttle down the Postgres queue until it’s empty. If you really need to, take a 15-minute outage and just copy the data over.
What happens when you get 500x the traffic or 50x?
How does the system behave when the traffic rate is higher than what it was designed for, or higher than it can currently handle? Because that threshold will always exist, even in a "scalable" system. You won't always be able to add capacity at the same rate the work increases.
NOTIFY/LISTEN isn't a queue; it has broadcast semantics. Postgres queueing is really just SELECT FOR UPDATE SKIP LOCKED; NOTIFY/LISTEN lets you shave a bit of latency off the polling, but it isn't essential.
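Roughly, the split looks like this (a sketch; channel and table names are invented):

    -- Enqueue: insert the task, then notify. NOTIFY is broadcast-only: every
    -- listener is woken, nothing is stored, and no listener "owns" the message.
    INSERT INTO tasks (status, payload) VALUES ('pending', '{"kind": "send_email"}');
    NOTIFY task_queue;

    -- Worker: LISTEN just replaces polling on a timer. The actual claim still
    -- has to go through a FOR UPDATE SKIP LOCKED dequeue against the table.
    LISTEN task_queue;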
If you find yourself in that situation, migrating to a more performant queuing solution is not that much of a leap. You already have an overall system architecture that scales well (async processing with a queue).
_Ideally_ the queuing technology is abstracted from the job-submitters/job-runners anyway. It's a bit more work if multiple services are just writing to the queue table directly.
I agree that the _moment_ the system comes to a screeching halt is definitely not fun.
You are going to have the same scaling issues with your datastore. I don't really understand why you say that your dequeue queries will throttle each other's locks and grind to a halt. Isn't that the whole point of SKIP LOCKED?
[1]: https://en.wikipedia.org/wiki/Little%27s_law