Resque isn't so great after all. We used it at around 600-1,000 jobs per minute, and not without problems. The failure backends could be better: we needed retries, and for that you have to install another gem. The retry feature didn't really work that well and was really hard to debug (thanks to EventMachine and Resque's somewhat overcomplicated failure backend).
Our solution was to build yet another queueing and worker system. For queues, Beanstalkd is way faster and way more reliable than Redis. Instead of EventMachine I built the workers with Ruby threads, and building a retry system on top of them wasn't really a problem.
Our error percentage dropped from around 10% to 0.0001% on average. Maybe some day I'll release this as a gem, if the world really needs yet another queueing system.
I respect your opinion on the different messaging solutions (Resque vs. Beanstalkd), but since we spend a lot of effort making sure Redis is very reliable, I wonder what reliability problem you experienced with Redis. Thanks!
A different issue, but one that's burned me several times: when things go wrong with Resque (or Sidekiq), life goes to hell pretty quickly because Redis needs to fit everything in memory. E.g., if a third-party resource goes down, the jobs hitting it all of a sudden start throwing exceptions and get put into a retry queue, and all those stacktraces sitting around in memory accumulate very quickly. Redis fills up, the OS kills it, it starts back up loading an older DB, and then it reprocesses all those jobs that just failed, without any exponential backoff. And this process basically repeats ad nauseam.
I'm not sure a whole lot can be done, other than resurrecting the old debate over something like diskstore. And while EC2 is memory-constrained, dealing with this situation isn't just a matter of adding more memory, because that will fill up as well. Redis would either need to be able to spill over onto disk, or the processing model would need to change. We ended up doing the latter: eschewing stacktraces for retried jobs, adding monitoring to disable queues with a very high failure rate, etc.
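For reference, the exponential backoff those retried jobs were missing is only a few lines. A minimal sketch (the constants and function name are made up, not from any of the systems discussed here):

```ruby
# Exponential backoff with jitter: the delay grows as base * 2^attempt
# and is capped at a maximum attempt count, so jobs hitting a dead
# third-party service back off instead of hammering the retry queue.
MAX_ATTEMPTS = 8
BASE_DELAY   = 15 # seconds

def retry_delay(attempt)
  return nil if attempt >= MAX_ATTEMPTS   # give up (e.g. bury the job)
  delay = BASE_DELAY * (2**attempt)
  delay + rand(0..delay / 2)              # jitter spreads out the herd
end
```

The jitter matters as much as the exponent: without it, every job that failed in the same second retries in the same second again.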
I'll second this. It's the Resque backtraces that have caused us a few problems. And on a VPS, RAM is your most expensive resource, which can make Resque/Redis not such a good fit.
Well, of course we'll need a retrying system with delays, so that's a sorted set, then, together with the work-item list.
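With Redis that sorted set is usually scored by the run-at timestamp, and due jobs are swept onto the work list. A pure-Ruby sketch of the same idea (an in-memory toy, not the Redis implementation itself):

```ruby
# In-memory sketch of the "sorted set + work list" retry scheme:
# jobs are scored by their run-at timestamp, and anything whose score
# is <= now is due and moves to the ready list. With Redis this would
# be ZADD to schedule and ZRANGEBYSCORE + LPUSH to sweep.
class DelaySet
  def initialize
    @scored = [] # [run_at, job] pairs, kept sorted by run_at
  end

  def schedule(job, run_at)
    @scored << [run_at, job]
    @scored.sort_by!(&:first)
  end

  # Remove every due job from the set and return it, oldest first.
  def pop_due(now)
    due, @scored = @scored.partition { |run_at, _| run_at <= now }
    due.map(&:last)
  end
end
```

The sweep itself should be atomic (MULTI/EXEC or a Lua script in Redis) so a crash between the read and the push can't drop a job.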
I prefer simple solutions, and I know the data loss we had with Redis was our own fault. Still, Beanstalkd queues have caused less pain than Resque and Redis together. The integrated delays, tubes, and deadlines are really nice.
Above all, the reliability comes from the deadlines. If my worker gets stuck (well, it's Ruby), we don't lose the work item. A pop on Redis deletes the item from the queue, but in Beanstalkd you reserve the item and only delete it when you're done.
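The semantic difference is easy to model: a reserved job is leased for a time-to-run, not removed, and the lease expiring puts it back in the ready list. A toy model of that reserve/delete cycle (plain Ruby, not real Beanstalkd client code):

```ruby
# Toy model of Beanstalkd's reserve/TTR semantics: reserving a job
# leases it rather than removing it. If the worker dies and never
# calls delete, the lease expires and the job becomes ready again
# for other workers. A Redis RPOP, by contrast, removes the job
# immediately, so a crashed worker loses it.
class LeasedQueue
  Lease = Struct.new(:job, :expires_at)

  def initialize(ttr)
    @ttr = ttr          # time-to-run for each reservation
    @ready = []
    @reserved = {}      # reservation id => Lease
    @next_id = 0
  end

  def put(job)
    @ready.push(job)
  end

  def reserve(now)
    expire_leases(now)
    job = @ready.shift
    return nil unless job
    id = (@next_id += 1)
    @reserved[id] = Lease.new(job, now + @ttr)
    [id, job]
  end

  def delete(id)
    @reserved.delete(id)
  end

  private

  # Return timed-out jobs to the ready list.
  def expire_leases(now)
    expired, live = @reserved.partition { |_, lease| lease.expires_at <= now }
    @reserved = live.to_h
    expired.each { |_, lease| @ready.push(lease.job) }
  end
end
```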
We're actually using a combination of Beanstalkd and Redis (to make up for Beanstalk's lack of durability).
Before pushing a job to Beanstalk, we encode the entire job into a 'current-jobs' hashmap in Redis; our worker transfers it to a 'busy-jobs' hashmap before processing, then deletes it if it succeeds or transfers it back to 'current-jobs' if it fails gracefully.
This way we get all the useful features of Beanstalk (tubes, delays, etc.) while knowing that our jobs are always sitting in Redis too if the Beanstalk server dies (its persistence features are lacking), and our commit function is atomic (Redis MULTI/EXEC).
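A sketch of that job lifecycle, with plain Ruby hashes standing in for the two Redis hashmaps (in the real setup each transfer would be an HSET/HDEL pair inside MULTI/EXEC so the move is atomic; class and method names here are made up):

```ruby
require 'json'

# Stand-in for the two Redis hashmaps described above: a job moves
# current-jobs -> busy-jobs when a worker picks it up, is deleted on
# success, and moves back to current-jobs on a graceful failure, so
# every job is always recorded somewhere even if beanstalkd dies.
class JobLedger
  def initialize
    @current = {} # job id => encoded job ('current-jobs' in Redis)
    @busy    = {} # job id => encoded job ('busy-jobs' in Redis)
  end

  def enqueue(id, payload)
    @current[id] = JSON.generate(payload) # before the beanstalk put
  end

  def start(id)                           # worker reserved the job
    @busy[id] = @current.delete(id)
  end

  def finish(id, success)
    if success
      @busy.delete(id)
    else
      @current[id] = @busy.delete(id)     # graceful failure: requeue
    end
  end

  def pending
    @current.keys
  end
end
```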
Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang? Can you push back when queue lengths get too high at some part of a multi-stage pipeline? And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths. And how about master-slave replication, failover, sharding, and so on?
We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system. All of the features I mentioned above are things that we've had to build ourselves. Redis is more like a general-purpose building block for all kinds of data systems.
> Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang?
Our queue items are plain id values, which trigger a set of actions from the database to the internet. If there's a database failure or the process itself crashes, it's very nice to know our reserved work items won't be gone but will be released back to the other workers to process.
> And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths.
I have lots of graphs from the system in Graphite. Works perfectly.
> And how about master-slave replication, failover, sharding, and so on?
Sharding is easy to do: just specify an array of servers in the clients, and the clients will shard.
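Client-side sharding like that is typically just hashing a key over the server list, so every client with the same list picks the same server. A minimal sketch (the addresses are hypothetical):

```ruby
require 'zlib'

# Minimal client-side sharding: hash a stable key (a job id or tube
# name) over a fixed server list. Any client configured with the same
# list routes the same key to the same server, with no coordination.
SERVERS = ['10.0.0.1:11300', '10.0.0.2:11300', '10.0.0.3:11300']

def server_for(key, servers = SERVERS)
  servers[Zlib.crc32(key.to_s) % servers.size]
end
```

Note that plain modulo remaps most keys when the server list changes; consistent hashing avoids that, but for queues, where items are transient, modulo is usually good enough.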
Replication is a bit of a different problem. The server writes a binlog file to disk, which is then backed up. We have beanstalkd servers waiting on another machine, pointing at the same binlogs. In an error situation we just switch them on and set our routes differently.
Yeah, the biggest problem with Beanstalkd is the missing replication, but we can live without it.
> We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system.
We're not doing such heavy processing, but we rely on many third-party services, which can fail randomly. A retrying system is a must, and of course we can't afford to lose work items, so the deadlines help.
It is a blast. We had our concerns because you can't see what's inside a Beanstalkd queue (duh, it's a queue), but building a system for monitoring the failures was not a big deal.
And of course the tubes rock. I can build a separate tube for each retry, set different priorities per tube (so the more failure-prone tubes won't block all the workers), and set a deadline for a job to finish; if the worker dies in the middle, the job is returned to the other workers. And best of all, I can monitor all the tubes nicely with Graphite.
It's pretty straightforward. Mostly we have custom data not related to Beanstalkd to monitor, but basically there are two scenarios:
1. When doing something, increment a counter in Statsd, which will aggregate the values to Graphite. Works well for custom stuff.
2. Write a rake task to poll Beanstalkd with `stats` command, reduce the data and send it straight to Graphite.
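The polling side of scenario 2 can be very small: Beanstalkd's `stats` command returns a YAML body, and Graphite's plaintext protocol is just `metric value timestamp` lines written to TCP port 2003. A sketch of the reduce step (the metric prefix and the choice of keys are assumptions, and the actual socket write is left out):

```ruby
require 'yaml'

# Reduce a beanstalkd `stats` YAML body to Graphite plaintext lines.
# Sending them is then a single TCP write to Graphite's port 2003.
INTERESTING = %w[current-jobs-ready current-jobs-reserved current-jobs-buried]

def graphite_lines(stats_yaml, prefix, timestamp)
  stats = YAML.safe_load(stats_yaml)
  INTERESTING.select { |key| stats.key?(key) }.map do |key|
    metric = key.tr('-', '.') # Graphite uses dots as path separators
    "#{prefix}.#{metric} #{stats[key]} #{timestamp}"
  end
end
```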
I'm really planning to open source the queuing system we're using because it's so simple. It might need a prettier admin view and some nice way of plugging in more monitoring systems before I make it public.
We also use beanstalk a ton where I work. We have put multiple billions of jobs through it over the past few years. It's rock solid and one of my favorite pieces of software.
Why threads instead of EventMachine? I'm not an expert in this, but the usual how-to for workers is EventMachine.
EDIT: It seemed familiar, so I dug up an old thread of yours [1], where we discussed a similar thing. At that time it seemed you had narrowed your problems down to EventMachine, but you were still using Redis. Has that changed?
I made a gem called em-resque [1] for running Resque workers inside EventMachine. It was fast, but the lack of good evented libraries (nothing like Curl::Easy for evented code) was a big problem. Also, the exceptions EventMachine gave us (or the lack of exceptions when work items went missing) were just plain awful. We spent several months chasing these bugs and even made some progress together with the nice folks from the resque-retry and Resque teams.
Then I had a chance to rebuild the system, and I couldn't be happier with it. It scales well, it's pretty fault-tolerant, and it's almost as fast as my EventMachine solution. Threads are slow, but the real slowness comes from the IO. Now instead of em-http-request I have the much better-behaved Curl::Easy, and the whole codebase is much easier to understand.
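The thread-based worker shape is roughly a pool of threads blocking on a shared queue, which works well for IO-bound jobs because Ruby's GIL is released during IO waits. A minimal sketch (the pool function is made up; the block stands in for the real per-job work, e.g. a Curl::Easy fetch):

```ruby
# Minimal thread-pool worker: a fixed number of threads block on a
# shared Queue. The GIL is released while a thread waits on IO, so
# network-bound jobs overlap even though CPU work doesn't.
def run_pool(jobs, worker_count)
  queue   = Queue.new
  results = Queue.new
  jobs.each { |j| queue << j }
  worker_count.times { queue << :stop } # one poison pill per worker

  workers = Array.new(worker_count) do
    Thread.new do
      while (job = queue.pop) != :stop
        results << yield(job)           # the real work happens here
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```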