Resque isn't so great after all. We used it at around 600-1,000 jobs per minute, and not without problems. The failure backends could be better: we needed retries, and for that you have to install another gem. The retry feature didn't really work that well and was really hard to debug (thanks to EventMachine and Resque's somewhat overcomplicated failure backend).
Our solution was to build yet another queueing and worker system. For queues, Beanstalkd is way faster and way more reliable than Redis. Instead of EventMachine I built the workers with Ruby threads, and building a retry system on top of them wasn't really a problem.
Our error percentage dropped from around 10% to 0.0001% on average. Maybe some day I'll release this as a gem, if the world really needs yet another queueing system.
I respect your opinion on the different messaging solutions (Resque vs. Beanstalkd), but since we spend a lot of effort making sure Redis is very reliable, I wonder what reliability problem you experienced with Redis. Thanks!
A different issue, but one that's burned me several times: when things go wrong with Resque (or Sidekiq), life goes to hell pretty quickly because Redis needs to fit everything in memory. E.g., if a third-party resource goes down, the jobs hitting it all of a sudden start throwing exceptions and get put into a retry queue, and all those stacktraces sitting around in memory accumulate very quickly. Redis fills up, the OS kills it, it starts back up loading an older DB, and then it reprocesses all those jobs that just failed, without any exponential backoff. And this process basically repeats ad nauseam.
I'm not sure a whole lot can be done, other than resurrecting the old debate over something like diskstore. And while EC2 is memory-constrained, dealing with this situation isn't just a matter of adding more memory, because that will fill up as well. Redis would either need to be able to spill over onto disk, or the processing model would need to change. We ended up doing the latter: eschewing stacktraces for retried jobs, adding monitoring to disable queues with a very high failure rate, etc.
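For reference, the exponential backoff those retried jobs were missing is only a few lines. A minimal sketch (the constants and function name are made up, not from any of the systems discussed here):

```ruby
# Exponential backoff with jitter: the delay grows as base * 2^attempt
# and is capped at a maximum attempt count, so jobs hitting a dead
# third-party service back off instead of hammering the retry queue.
MAX_ATTEMPTS = 8
BASE_DELAY   = 15 # seconds

def retry_delay(attempt)
  return nil if attempt >= MAX_ATTEMPTS   # give up (e.g. bury the job)
  delay = BASE_DELAY * (2**attempt)
  delay + rand(0..delay / 2)              # jitter spreads out the herd
end
```

The jitter matters as much as the exponent: without it, every job that failed in the same second retries in the same second again.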
I'll second this. It's the Resque backtraces that have caused us a few problems. And on a VPS, RAM is your most expensive resource, which can make Resque/Redis not such a good fit.
Well, of course we'll need a retrying system with delays, so that's a sorted set, then, together with the work-item list.
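With Redis that sorted set is usually scored by the run-at timestamp, and due jobs are swept onto the work list. A pure-Ruby sketch of the same idea (an in-memory toy, not the Redis implementation itself):

```ruby
# In-memory sketch of the "sorted set + work list" retry scheme:
# jobs are scored by their run-at timestamp, and anything whose score
# is <= now is due and moves to the ready list. With Redis this would
# be ZADD to schedule and ZRANGEBYSCORE + LPUSH to sweep.
class DelaySet
  def initialize
    @scored = [] # [run_at, job] pairs, kept sorted by run_at
  end

  def schedule(job, run_at)
    @scored << [run_at, job]
    @scored.sort_by!(&:first)
  end

  # Remove every due job from the set and return it, oldest first.
  def pop_due(now)
    due, @scored = @scored.partition { |run_at, _| run_at <= now }
    due.map(&:last)
  end
end
```

The sweep itself should be atomic (MULTI/EXEC or a Lua script in Redis) so a crash between the read and the push can't drop a job.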
I prefer simple solutions, and I know the data loss we had with Redis was our own fault. Still, Beanstalkd queues have caused less pain than Resque and Redis together. The integrated delays, tubes, and deadlines are really nice.
Above all, the reliability comes from the deadlines. If my worker gets stuck (well, it's Ruby), we don't lose the work item. A pop on Redis deletes the item from the queue, but in Beanstalkd you reserve the item and only delete it when you're done.
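The semantic difference is easy to model: a reserved job is leased for a time-to-run, not removed, and the lease expiring puts it back in the ready list. A toy model of that reserve/delete cycle (plain Ruby, not real Beanstalkd client code):

```ruby
# Toy model of Beanstalkd's reserve/TTR semantics: reserving a job
# leases it rather than removing it. If the worker dies and never
# calls delete, the lease expires and the job becomes ready again
# for other workers. A Redis RPOP, by contrast, removes the job
# immediately, so a crashed worker loses it.
class LeasedQueue
  Lease = Struct.new(:job, :expires_at)

  def initialize(ttr)
    @ttr = ttr          # time-to-run for each reservation
    @ready = []
    @reserved = {}      # reservation id => Lease
    @next_id = 0
  end

  def put(job)
    @ready.push(job)
  end

  def reserve(now)
    expire_leases(now)
    job = @ready.shift
    return nil unless job
    id = (@next_id += 1)
    @reserved[id] = Lease.new(job, now + @ttr)
    [id, job]
  end

  def delete(id)
    @reserved.delete(id)
  end

  private

  # Return timed-out jobs to the ready list.
  def expire_leases(now)
    expired, live = @reserved.partition { |_, lease| lease.expires_at <= now }
    @reserved = live.to_h
    expired.each { |_, lease| @ready.push(lease.job) }
  end
end
```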
We're actually using a combination of Beanstalkd and Redis (to make up for Beanstalk's lack of durability).
Before pushing a job to Beanstalk, we encode the entire job into a 'current-jobs' hashmap in Redis; our worker transfers it to a 'busy-jobs' hashmap before processing, then deletes it if it succeeds or transfers it back to 'current-jobs' if it fails gracefully.
This way we get all the useful features of Beanstalk (tubes, delays, etc.) while knowing that our jobs are always sitting in Redis too if the Beanstalk server dies (its persistence features are lacking), and our commit function is atomic (Redis MULTI/EXEC).
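A sketch of that job lifecycle, with plain Ruby hashes standing in for the two Redis hashmaps (in the real setup each transfer would be an HSET/HDEL pair inside MULTI/EXEC so the move is atomic; class and method names here are made up):

```ruby
require 'json'

# Stand-in for the two Redis hashmaps described above: a job moves
# current-jobs -> busy-jobs when a worker picks it up, is deleted on
# success, and moves back to current-jobs on a graceful failure, so
# every job is always recorded somewhere even if beanstalkd dies.
class JobLedger
  def initialize
    @current = {} # job id => encoded job ('current-jobs' in Redis)
    @busy    = {} # job id => encoded job ('busy-jobs' in Redis)
  end

  def enqueue(id, payload)
    @current[id] = JSON.generate(payload) # before the beanstalk put
  end

  def start(id)                           # worker reserved the job
    @busy[id] = @current.delete(id)
  end

  def finish(id, success)
    if success
      @busy.delete(id)
    else
      @current[id] = @busy.delete(id)     # graceful failure: requeue
    end
  end

  def pending
    @current.keys
  end
end
```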
Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang? Can you push back when queue lengths get too high at some part of a multi-stage pipeline? And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths. And how about master-slave replication, failover, sharding, and so on?
We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system. All of the features I mentioned above are things that we've had to build ourselves. Redis is more like a general-purpose building block for all kinds of data systems.
> Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang?
Our queue items are plain id values, which trigger a set of actions from the database to the internet. If there's a database failure or the process itself crashes, it's very nice to know our reserved work items won't be gone but will be released back to the other workers to process.
> And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths.
I have lots of graphs from the system in Graphite. Works perfectly.
> And how about master-slave replication, failover, sharding, and so on?
Sharding is easy to do: just specify an array of servers in the clients, and the clients will shard.
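Client-side sharding like that is typically just hashing a key over the server list, so every client with the same list picks the same server. A minimal sketch (the addresses are hypothetical):

```ruby
require 'zlib'

# Minimal client-side sharding: hash a stable key (a job id or tube
# name) over a fixed server list. Any client configured with the same
# list routes the same key to the same server, with no coordination.
SERVERS = ['10.0.0.1:11300', '10.0.0.2:11300', '10.0.0.3:11300']

def server_for(key, servers = SERVERS)
  servers[Zlib.crc32(key.to_s) % servers.size]
end
```

Note that plain modulo remaps most keys when the server list changes; consistent hashing avoids that, but for queues, where items are transient, modulo is usually good enough.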
Replication is a bit of a different problem. The server writes a binlog file to disk, which is then backed up. We have beanstalkd servers waiting on another machine, pointing at the same binlogs. In an error situation we just switch them on and set our routes differently.
Yeah, the biggest problem with Beanstalkd is the missing replication, but we can live without it.
> We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system.
We're not doing such heavy processing, but we rely on many third-party services, which can fail randomly. A retrying system is a must, and of course we can't afford to lose work items, so the deadlines help.
It is a blast. We had our concerns because you can't see what's inside a Beanstalkd queue (duh, it's a queue), but building a system for monitoring the failures was not a big deal.
And of course the tubes rock. I can build a separate tube for each retry, set different priorities per tube (so the more failure-prone tubes won't block all the workers), and set a deadline for a job to finish; if the worker dies in the middle, the job is returned to the other workers. And best of all, I can monitor all the tubes nicely with Graphite.
It's pretty straightforward. Mostly we have custom data not related to Beanstalkd to monitor, but basically there are two scenarios:
1. When doing something, increment a counter in Statsd, which will aggregate the values to Graphite. Works well for custom stuff.
2. Write a rake task to poll Beanstalkd with `stats` command, reduce the data and send it straight to Graphite.
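The polling side of scenario 2 can be very small: Beanstalkd's `stats` command returns a YAML body, and Graphite's plaintext protocol is just `metric value timestamp` lines written to TCP port 2003. A sketch of the reduce step (the metric prefix and the choice of keys are assumptions, and the actual socket write is left out):

```ruby
require 'yaml'

# Reduce a beanstalkd `stats` YAML body to Graphite plaintext lines.
# Sending them is then a single TCP write to Graphite's port 2003.
INTERESTING = %w[current-jobs-ready current-jobs-reserved current-jobs-buried]

def graphite_lines(stats_yaml, prefix, timestamp)
  stats = YAML.safe_load(stats_yaml)
  INTERESTING.select { |key| stats.key?(key) }.map do |key|
    metric = key.tr('-', '.') # Graphite uses dots as path separators
    "#{prefix}.#{metric} #{stats[key]} #{timestamp}"
  end
end
```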
I'm really planning to open source the queuing system we're using because it's so simple. It might need a prettier admin view and some nice way of plugging in more monitoring systems before I make it public.
We also use beanstalk a ton where I work. We have put multiple billions of jobs through it over the past few years. It's rock solid and one of my favorite pieces of software.
Why threads instead of EventMachine? I'm not an expert in this, but the usual how-to for workers is EventMachine.
EDIT: It seemed familiar, so I dug up an old thread of yours [1], where we discussed a similar thing. At that time it seemed you had narrowed your problems down to EventMachine, but you were still using Redis. Has that changed?
I made a gem called em-resque [1] for running Resque workers inside EventMachine. It was fast, but the lack of good evented libraries (nothing like Curl::Easy for evented code) was a big problem. Also, the exceptions EventMachine gave us (or the lack of exceptions when work items went missing) were just plain awful. We spent several months chasing these bugs and even made some progress together with the nice folks from the resque-retry and Resque teams.
Then I had a chance to rebuild the system, and I couldn't be happier with it. It scales well, it's pretty fault-tolerant, and it's almost as fast as my EventMachine solution. Threads are slow, but the real slowness comes from the IO. Now instead of em-http-request I have the much better-behaved Curl::Easy, and the whole codebase is much easier to understand.
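The thread-based worker shape is roughly a pool of threads blocking on a shared queue, which works well for IO-bound jobs because Ruby's GIL is released during IO waits. A minimal sketch (the pool function is made up; the block stands in for the real per-job work, e.g. a Curl::Easy fetch):

```ruby
# Minimal thread-pool worker: a fixed number of threads block on a
# shared Queue. The GIL is released while a thread waits on IO, so
# network-bound jobs overlap even though CPU work doesn't.
def run_pool(jobs, worker_count)
  queue   = Queue.new
  results = Queue.new
  jobs.each { |j| queue << j }
  worker_count.times { queue << :stop } # one poison pill per worker

  workers = Array.new(worker_count) do
    Thread.new do
      while (job = queue.pop) != :stop
        results << yield(job)           # the real work happens here
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```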