
> Anything but RabbitMQ.

Would you mind elaborating on this? I'd be happy for others to chime in with their experiences/opinions, too.




The software works excellently in a development environment and performs well when running as a single instance. However, I encountered issues when scaling it up for high availability in a clustered setup. The system would fail intermittently, with two masters consuming messages simultaneously, which wasn't ideal for my use case. Eventually, I switched to Kafka and haven't revisited the original solution since.

It's worth noting that these issues might have been due to my improper configuration. Nevertheless, if the configuration process is fraught with pitfalls, that's problematic in itself. I've had these experiences more than once.

Additionally, I found a critical race condition in the Python library, rendering it practically unusable for me. I submitted a bug report with a minimal example demonstrating the issue. I considered fixing it myself, but since using RabbitMQ wasn't crucial for my project, I switched to ZeroMQ, which didn't require a broker. The issue was acknowledged and fixed about a year later. At the time, I had to assume that nobody else was using the Python bindings.

Three years ago, I worked on a project that used the software for a Celery queue. Messages would occasionally go missing, although this could have been a configuration issue on our part. Ultimately, we replaced it with a Redis queue (not the best practice, I admit) and didn't look back. This was for a lower-availability use case where a single instance of Redis sufficed.


I used RabbitMQ for a while and had nothing but problems.

Admittedly I probably shouldn't have used it the way I did. I dumped many millions of tasks into it, then fanned out processes pulling from that queue that took a variable amount of time to run. Some ran in seconds, some hours.

I had picked RabbitMQ because I wanted that queue to be durable and resist workers dying or being restarted. However, long-lived tasks like these aren't really what it was designed for (in my opinion). I kept running into issues where it would take a long time to restart, or would stop answering connections and need a restart to continue. I ended up having to write monitoring code to check for this and handle it to make it even slightly reliable.

I'm sure it works well for smaller, short-lived messages, but considering the issues I bumped into I would be hesitant to try it again. I'd probably reach for Redis first, with wrappers allowing me to swap in any other queue as required.


I can share our experience with RabbitMQ/SQS/Sidekiq. Our two major issues have been around the retry mechanism and resource bottlenecks.

The key retry problem is "What happens when a worker crashes?".

RabbitMQ solves this problem by tying "unacknowledged messages" to a tcp connection. If the connection dies, the in-flight messages are made available to other connections. This is a decent approach, but we hit a lot of issues with bugs in our code that would fail to acknowledge a message and the message would get stuck until that handler cycled. They've improved this over the past year or so with consumer timeouts, but we've already moved on.
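
To make that failure mode concrete, here's roughly what the happy path looks like with the Python client (pika); the queue name and process() are placeholders. If process() throws and the handler never acks, the message sits unacknowledged on that connection until it closes (or, on newer brokers, until the consumer timeout fires).

  import pika

  def process(body: bytes) -> None:
      ...  # application-specific work; an uncaught exception here is the bug described above

  connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = connection.channel()
  channel.queue_declare(queue="tasks", durable=True)

  def handle(ch, method, properties, body):
      process(body)
      # If we never reach this line, the broker keeps the message reserved for
      # this connection instead of handing it to another consumer.
      ch.basic_ack(delivery_tag=method.delivery_tag)

  channel.basic_consume(queue="tasks", on_message_callback=handle)
  channel.start_consuming()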

The second problem we hit with RabbitMQ was that it uses one-erlang-process-per-queue and we found that big bursts of traffic could saturate a single CPU. There are ways to use sharded queues or re-architect to use dynamically created queues but the complexity led us towards SQS.

Sidekiq solves "What happens when a worker crashes?" by just not solving it. In the free version, those jobs are just lost. In Sidekiq Pro there are features that provide some guarantees that the jobs will not be lost, but no guarantees about when they will be processed (nor where they will be processed). Simply put, some worker sometime will see the orphaned job and decide to give it another shot. It's not super common, but it is worse in containerized environments where memory limits can trigger the OOM killer and cause a worker to die immediately.

The other issue with Sidekiq has been a general lack of hard constraints around resources. A single event thread in redis means that when things go sideways it breaks everything. We've had errant jobs enqueued with 100MB of json and seen it jam things up badly when Sidekiq tries to parse that with a lua script (on the event thread). While it's obvious that 100MB is too big to shove into a queue, mistakes happen and tools that limit the blast radius add a lot of value.

We've been leaning heavily on SQS the past few years and it is indeed Simple. It blocks us from doing even marginally dumb things (max message size of 256 KB). The visibility timeout approach for handling crashing workers is easy to reason about. DLQ tooling has finally improved so you can redrive through standard AWS tools. There are some gaps we struggle with (e.g. firing callbacks when a set of messages are fully processed) but sometimes simple tools force you to simplify things on your end, and that ends up being a good thing.
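
The nice thing is the whole consumer model fits in a few lines; a rough boto3 sketch (the queue URL and handle_message() are placeholders). If the worker dies before delete_message, the message just becomes visible again after the timeout and some other worker picks it up.

  import boto3

  sqs = boto3.client("sqs")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

  def handle_message(body: str) -> None:
      ...  # application-specific work

  while True:
      resp = sqs.receive_message(
          QueueUrl=QUEUE_URL,
          MaxNumberOfMessages=10,
          WaitTimeSeconds=20,      # long polling
          VisibilityTimeout=300,   # how long we get before the message is redelivered
      )
      for msg in resp.get("Messages", []):
          handle_message(msg["Body"])
          sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])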


That's very insightful, thanks for sharing.

Do you have any experience with NATS, and how would you compare it to RMQ/SQS?

The authors claim it guarantees exactly-once delivery with its JetStream component, and it looks very alluring from the documentation, but looks can be deceiving.


> The authors claim it guarantees exactly-once delivery

I find this definition has morphed from one that's meaningful to developers into one that queue implementations like to claim. I've learned this generally means "multiple inserts will be deduped into only one message in the queue".

The only guarantee this `exactly-once` delivery provides is that I won't have two workers given the exact same job. Which is a nice guarantee, but I still have to decide on my processing behavior and am faced with the classic "at most once or at least once" dilemma around partially failed jobs. If I'm building my system to be idempotent so I can safely retry partially failed messages it doesn't do much for me.
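
For what it's worth, the idempotence guard you end up writing is usually just a per-message key checked before the side effect. A Redis-flavored sketch (job_id and do_work() are made up); note that a hard crash between set() and do_work() still drops the job, which is exactly the at-most-once / at-least-once tradeoff the dedup guarantee doesn't solve for you.

  import redis

  r = redis.Redis()

  def do_work(payload: dict) -> None:
      ...  # the actual side effect

  def handle(job_id: str, payload: dict) -> None:
      # SET NX fails if the key already exists, i.e. this job was already handled.
      if not r.set(f"processed:{job_id}", 1, nx=True, ex=7 * 24 * 3600):
          return  # duplicate delivery, skip
      try:
          do_work(payload)
      except Exception:
          # Undo the marker so a clean retry can run the job again.
          r.delete(f"processed:{job_id}")
          raise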


It has multiple modes. One of them is an explicit-acknowledge mode. If the worker finishes processing the job but doesn't ack, the message will appear again.
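
Something like this with the official Python client (nats-py), if I remember the API right; the stream and subject names are made up. Fetch a message, crash before ack(), and JetStream redelivers it once the ack wait expires.

  import asyncio
  import nats

  def do_work(data: bytes) -> None:
      ...  # application-specific work

  async def main() -> None:
      nc = await nats.connect("nats://localhost:4222")
      js = nc.jetstream()
      await js.add_stream(name="jobs", subjects=["jobs.*"])

      sub = await js.pull_subscribe("jobs.*", durable="workers")
      for msg in await sub.fetch(1, timeout=5):
          do_work(msg.data)
          await msg.ack()  # skip this and the message comes back

  asyncio.run(main())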


SQS limits you further in other ways. For instance, scheduled tasks are capped to 15m (the DelaySeconds knob), so you'll be stuck when implementing the "cancel account if not verified in 7 days" workflow. You'll either re-enqueue a message every 15m until it's ready (and eat the SQS costs), or build a bespoke solution only for scheduled tasks using some other store (usually the database) and another polling loop (at a fraction of the quality of any other OSS tool). This is a problem well solved by Sidekiq, despite the other drawbacks you mention.
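
For the curious, the re-enqueue workaround ends up looking something like this (boto3; the queue URL and run() are placeholders). With DelaySeconds capped at 900, a 7-day delay means bouncing the message roughly 670 times before it actually runs.

  import json
  import time
  import boto3

  sqs = boto3.client("sqs")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
  MAX_DELAY = 900  # 15 minutes, the SQS limit

  def schedule(task: dict, run_at: float) -> None:
      delay = min(MAX_DELAY, max(0, int(run_at - time.time())))
      sqs.send_message(
          QueueUrl=QUEUE_URL,
          DelaySeconds=delay,
          MessageBody=json.dumps({"task": task, "run_at": run_at}),
      )

  def on_message(body: str) -> None:
      msg = json.loads(body)
      if time.time() < msg["run_at"]:
          schedule(msg["task"], msg["run_at"])  # not due yet: pay for another hop
      else:
          run(msg["task"])

  def run(task: dict) -> None:
      ...  # cancel the unverified account, etc.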

Bottom line, there is no silver bullet.


If you wanted to handle this scenario with the serverless AWS stack, my recommendation would be to push records to Dynamo with TTLs and then when they pop have a Lambda push them onto the queue. Would cost almost nothing to do this. If you had 10 million requests a month your Lambda cost would be ~$150 to run this (depending on duration, but just pushing to a queue should be quick). Dynamo would be another ~$50 to run, depending how big your tasks are.
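
The glue is a small Lambda on the table's stream that forwards expired items to the queue; a sketch, assuming the stream view includes old images (the queue URL is a placeholder). TTL expiries show up as REMOVE events on the stream.

  import json
  import boto3

  sqs = boto3.client("sqs")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

  def handler(event, context):
      for record in event["Records"]:
          if record["eventName"] != "REMOVE":
              continue  # only care about expired/deleted items
          item = record["dynamodb"].get("OldImage", {})
          sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))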

Granted now you need 3 services instead of 1. I personally don't find the maintenance cost particularly high for this architecture, but it does depend on what your team is comfortable with.


I've explored this space pretty thoroughly, including the Dynamo approach you've described. Dynamo does not have a strict guarantee on when items get deleted:

  TTL typically deletes expired items within a few days. Depending on the size and activity level of a table, the actual delete operation of an expired item can vary. Because TTL is meant to be a background process, the nature of the capacity used to expire and delete items via TTL is variable (but free of charge). [0]
Because of that limitation, I would not use that approach. Instead, I would use scheduled Lambdas to check a Serverless Aurora database every 15 minutes for items that are due, and then add them to SQS with delays.
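
Roughly, the scheduled Lambda becomes a small poll-and-enqueue loop (the table and queries are placeholders for whatever store you actually use):

  import time
  import boto3

  sqs = boto3.client("sqs")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

  def fetch_tasks_due_before(ts: int) -> list:
      return []  # e.g. SELECT ... FROM scheduled_tasks WHERE run_at <= ts AND NOT enqueued

  def mark_enqueued(task_id: str) -> None:
      ...  # e.g. UPDATE scheduled_tasks SET enqueued = true WHERE id = task_id

  def handler(event, context):
      now = int(time.time())
      for task in fetch_tasks_due_before(now + 900):  # everything due in the next 15 minutes
          sqs.send_message(
              QueueUrl=QUEUE_URL,
              DelaySeconds=max(0, min(900, task["run_at"] - now)),
              MessageBody=task["body"],
          )
          mark_enqueued(task["id"])  # so the next poll doesn't re-send it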

I've had my eye on this problem for a few years and keep thinking that a simple SaaS that does one-shot scheduled actions would probably be a worthy side project. Not enough to build a company around, but maintenance would be low and there's probably some pricing that would attract enough customers to be sustainable.

[0] https://docs.aws.amazon.com/amazondynamodb/latest/developerg...


You could probably use AWS EventBridge and schedule the message to be posted to SQS in 7 days.
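
EventBridge Scheduler supports one-shot schedules, so something like this should do it (ARNs, names, and the IAM role are placeholders; the role needs sqs:SendMessage on the queue):

  from datetime import datetime, timedelta, timezone
  import boto3

  run_at = datetime.now(timezone.utc) + timedelta(days=7)

  scheduler = boto3.client("scheduler")
  scheduler.create_schedule(
      Name="cancel-unverified-account-12345",
      ScheduleExpression=f"at({run_at:%Y-%m-%dT%H:%M:%S})",  # one-time schedule, UTC
      FlexibleTimeWindow={"Mode": "OFF"},
      Target={
          "Arn": "arn:aws:sqs:us-east-1:123456789012:example-queue",
          "RoleArn": "arn:aws:iam::123456789012:role/scheduler-to-sqs",
          "Input": '{"account_id": "12345"}',
      },
      ActionAfterCompletion="DELETE",  # remove the schedule once it has fired
  )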


> Sidekiq solves "What happens when a worker crashes?" by just not solving it. In the free version, those jobs are just lost.

I've been using Sidekiq for 11+ years in production and I've never seen this happen. Sidekiq (free version) has a very robust retry workflow. What are you talking about here?


He is talking about the case where the worker process itself dies for some reason, for example because it used too much memory or hit a segfault.


Yep. OOMs are the most common cause. It's definitely low frequency. On the order of one in a billion. For some systems that's once a year. For us that's once a week. If that's an important job and it just gets dropped, then you've got a problem.

With the paid features to keep it from getting dropped, things can still be painful. We have a lot of different workers, all with different concurrency settings and resource limits. A memory-heavy worker might need a few GB of memory and be capped at a concurrency of 2, while a lightweight worker might only need 512MB and have a concurrency of 20. If the big memory worker crashes, its jobs might get picked up by the lightweight worker (possibly hours later), which will then OOM, and its 19 other in-flight jobs all end up in the orphanage. And now your alerts are going off saying the lightweight worker is OOMing, and your team is scratching their heads because that doesn't make any sense. It just gets messy.

Sidekiq probably works great outside of containerized environments. Many swear to me they've never encountered any of these problems. And maybe we should be questioning the containerization rather than Sidekiq, but ultimately our operations have been much simpler as we've moved off of it.


Sidekiq will drop in-progress jobs when a worker crashes. Sidekiq Pro can recover those jobs but with a large delay. Sidekiq is excellent overall but it’s not suitable for processing critical jobs with a low latency guarantee.

https://github.com/sidekiq/sidekiq/wiki/Reliability



