Rescuing Resque: Let's do this

pimeys · on Sept 23, 2012

Resque isn't so great after all. We used it with around 600-1000 jobs per minute and not without problems. The failure backends could be better. We need the retry backend and for that you have to install another gem. The retry feature was kind of not working so well and really hard to debug (thanks to EventMachine and Resque's a bit too complicated failure backend).

Our solution was to build yet another queueing and worker system. For queues Beanstalkd is way faster and way more reliable than Redis. Instead of EventMachine I built the workers with Ruby threads and making a retry system on top of it wasn't really a problem.

Our error percentage dropped from around 10% to 0.0001% on average. Maybe some day I might release this as a gem, if world really needs yet another queueing system.

antirez · on Sept 23, 2012

> "way more reliable than Redis."

I respect your opinion on different messaging solutions (Resque VS Beanstalkd) however since we spend a lot of efforts to make sure Redis is very reliable I wonder what's the reliability problem you experienced with Redis. Thanks!

nirvdrum · on Sept 23, 2012

A different issue, but one that's burned me several times is when things go wrong with resque (or sidekiq), life goes to hell pretty quickly because redis needs to fit everything in memory. E.g., if a third party resource goes down and jobs hitting it all of a sudden start exceptioning and get thrown into a retry queue, all those stacktraces sitting around in memory accumulate very quickly. redis fills up, the OS kills it, it starts back up, loading an older DB, and then restarts processing all those jobs that just failed, without an exponential backoff. And this process will basically repeat ad nauseum.

I'm not sure a whole lot can be done, other than resurrecting the old debate over something like diskstore. And while EC2 is memory-constrained, dealing with this situation isn't a matter of just adding more memory, because that will be filled up as well. redis would either need to be able to spill over onto disk or the processing model would need to be changed. We ended up doing the latter, eschewing stacktraces for retried jobs, adding monitoring to disable queues that have a very high failure rate, etc.

rahoulb · on Sept 23, 2012

I'll second this. It's the Resque backtraces that have caused us a few problems. And on a VPS RAM is your most expensive resource, possibly making Resque/Redis not such a good fit.

pimeys · on Sept 23, 2012

I don't know was it our devops lack of knowledge on Redis, but we had couple of times when we just lost thousands of work items.

Redis is very good for different data structures, like hashes or lists, but for queuing I really prefer Beanstalkd over anything.

andrewvc · on Sept 23, 2012

That was a pretty big accusation to throw around, that red is just lost data. It was more likely resque or perhaps even your own fault.

As far as perf goes, redis queuing is just a linked list. It is ridiculously fast, I would really like to see numbers backing up your claims.

If you are going to make accusations like this you should at the very least understand why stuff broke.

pimeys · on Sept 23, 2012

Well, of course we'll need a retrying system with delay, so that's a sorted set then together with the work item list.

I prefer simple solutions and I know it was our fault we got a data loss with Redis. Still Beanstalkd queues have been causing less pain than Resque and Redis together. Integrated delays, tubes and deadlines are really nice.

Altogether the reliability comes from the deadlines. If my worker gets stuck (well, it's Ruby), we don't lose the work item. Doing pop on Redis will delete the item from the queue, but in Beanstalkd you reserve the item and delete it when you're done.

mercuryrising · on Sept 23, 2012

If I ever come to Italy, I'm buying you a beer.

qixxiq · on Sept 23, 2012

We're actually using a combination of Beanstalkd and Redis (to make up for beanstalks lack of durability).

Before pushing a job to beanstalk we encode the entire job a 'current-jobs' hashmap in redis, our worker transfers it to a 'busy-jobs' hashmap before processing, and then deletes it if successful or transfers back to current-jobs if failed gracefully.

This way we get the full useful features of Beanstalk (tubes, delays, etc) while knowing that our jobs are always sitting in Redis too if the beanstalk server dies (its persistence features are lacking) and our commit function is atomic (Redis MULTI/EXEC).

netvarun · on Sept 23, 2012

Pretty curious why you are beanstalkd instead of using Redis (BRPOP & BLPOP) for directly handling your job processing?

pimeys · on Sept 23, 2012

For our purposes Beanstalkd has tubes and built-in delays for jobs. We can also set up a deadline for work items to prevent loss if the worker fails.

pjscott · on Sept 23, 2012

Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang? Can you push back when queue lengths get too high at some part of a multi-stage pipeline? And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths. And how about master-slave replication, failover, sharding, and so on?

We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system. All of the features I mentioned above are things that we've had to build ourselves. Redis is more like a general-purpose building block for all kinds of data systems.

pimeys · on Sept 24, 2012

> Really, there are a lot of things you might want from a queueing system that Redis doesn't easily give you. What do you do when a worker crashes, or takes too long? What do you do to a queue item which reliably crashes a worker, or causes it to hang?

Our queue items are plain id values, which trigger a set of actions from the database to the internet. If there's a database failure or the process itself crashes it is very nice to know our reserved work items will not be gone but released back to other workers to process.

> And then there's all the front-end goodness for debugging: you could easily want things like graphs for flow and queue lengths.

I have lots of graphs from the system in Graphite. Works perfectly.

> And how about master-slave replication, failover, sharding, and so on?

Sharding is easy to do, just specify an array of servers in the clients and the clients will shard.

Replication is a bit different problem. The server will write a binlog file to the disk, which is then backed up. We have beanstalkd servers waiting in another machine pointing to the same binlogs. On an error situation we just switch them on and set our routes differently.

Yeah, the biggest problem with Beanstalkd is the missing replication, but we can live without it.

> We're using Redis for our big document processing-and-indexing pipeline at Cue, and it's great software, but it's not a ready-made queueing system.

We're not doing so heavy processing, but we're relying on many third party services, which can fail randomly. The retrying system is a must and of course we cannot miss our work items so often, so the deadlines help.

dools · on Sept 23, 2012

I'm curious to know what's lacking in the -b switch which maintains a binlog in beanstalkd

robotmay · on Sept 23, 2012

Wow, someone else using Beanstalkd; my hero! Beanstalkd may be one of my favourite but most neglected pieces of software.

pimeys · on Sept 23, 2012

It is a blast. Although we had our concerns because you can't see what's inside a Beanstalkd queue (duh, because it's a queue), but building a system for monitoring the failures was not a big deal.

And of course the tubes rock. Now I can build a separate tube for each retry, I can set different priorities for the tubes (so the more fail prone tubes won't block all workers), I can set a deadline for a job to finish and if the worker dies in the middle the job will be returned for the other workers. And best of all, I can monitor all tubes nicely with Graphite.

fullmoon · on Sept 23, 2012

Why do you feel that a queue is inherently unpeekable?

pimeys · on Sept 23, 2012

You can peek, but it's not built for looking through all the items in it. It is doable, but not very efficient.

ozataman · on Sept 23, 2012

Do you have a custom solution for monitoring via Graphite? If so, is it something you can share?

pimeys · on Sept 24, 2012

It's pretty straightforward. Mostly we have custom data not related to Beanstalkd to monitor, but basically there's two scenarios:

1. When doing something, increment a counter in Statsd, which will aggregate the values to Graphite. Works well for custom stuff.

2. Write a rake task to poll Beanstalkd with `stats` command, reduce the data and send it straight to Graphite.

I'm really planning to open source the queuing system we're using because it's so simple. It might need a prettier admin view and some nice way of plugging in more monitoring systems before I put it to public.

ericholscher · on Sept 23, 2012

We also use beanstalk a ton where I work. We have put multiple billions of jobs through it over the past few years. It's rock solid and one of my favorite pieces of software.

sandGorgon · on Sept 23, 2012

Why threads instead of Eventmachine ? I'm not an expert in this, but the usual how-to for workers is Eventmachine.

EDIT: It seemed familiar, so I dug up an old thread of yours [1], where we discussed a similar thing. At that time, it seemed you narrowed down your problems to Eventmachine, but were still using Redis. Has that changed ?

[1] http://news.ycombinator.com/item?id=3366470

pimeys · on Sept 23, 2012

I made a gem called em-resque [1] for running Resque workers inside an EventMachine. It was fast, but the lack of good libraries, like Curl::Easy for evented code was a big problem. Also the exceptions EventMachine gave us or the lack of exceptions when missing work items was just plain awful. We spent several months catching these bugs and made even some progress together with nice folks from resque-retry and Resque teams.

Then I had a chance to re-build the system. I couldn't be more happy with it. It scales well, it's pretty fault proof and almost as fast as my EventMachine solution. Threads are slow, but the real slowness comes from the IO. Now instead of em-http-request I have much better working Curl::Easy and the whole codebase is much easier to understand.

[1] https://github.com/SponsorPay/em-resque

robotmay · on Sept 23, 2012

I switched from Resque to Mike Perham's Sidekiq earlier in the year, and the improvement is just sensational. I'm curious to see how Resque can be improved to bring it up to the same level.

lucaspiller · on Sept 23, 2012

++ on Sidekiq. We've been using it since April and have had no issues at all. From the readme:

"Sidekiq is compatible with Resque. It uses the exact same message format as Resque so it can integrate into an existing Resque processing farm. You can have Sidekiq and Resque run side-by-side at the same time and use the Resque client to enqueue messages in Redis to be processed by Sidekiq.

At the same time, Sidekiq uses multithreading so it is much more memory efficient than Resque (which forks a new process for every job). You'll find that you might need 50 200MB resque processes to peg your CPU whereas one 300MB Sidekiq process will peg the same CPU and perform the same amount of work. Please see my blog post on Resque's memory efficiency and how I was able to shrink a Carbon Five client's resque processing farm from 9 machines to 1 machine."

Axsuul · on Sept 23, 2012

This, especially if you're doing any crawling or network-bottlenecked task. Those extra workers that you can fit within your memory footprint is money.

mrzor · on Sept 23, 2012

Hello everyone !

We recently released jobco, a simple Resque distribution. You can check it out at https://github.com/mrzor/jobco

We attempted to address some of the issues we had with our resque 1.x stack regarding productivity (ease to use both in development, and ease to deploy) and regarding API consistency (rewrote resque-status to keep the Resque API intact).

It doesn't address any of the high throughput issues that were mentioned earlier.

I'm definitely up to help fixing stuff within Resque when appropriate. However, the Jobco project is possibly more appropriate if the feature you're looking after can be provided using plugins, or involves providing better CLI interface to Resque.

You can find out my email address using `git log` on the jobco repo.

100k · on Sept 23, 2012

I didn't realize GitHub had abandoned Resque. What are they using now? Or is it in "works for me" status and they don't feel like changing it?

khangtoh · on Sept 23, 2012

Resque should just be left alone and let it retire. SideKiq is the present and future.

mrzor · on Sept 23, 2012

Sometimes the forking model have desirable properties.

Numerous past articles covered this in one way or another.

Choosing between SideKiq and Resque is primarily a matter of choosing between forked processes and threads. Retiring Resque and let SideKiq be the present and the future would destroy that choice.

SideKiq, as a threaded impl., can claim some cool benefits : - shorter job launch times (only interesting if you actually run thousands of second spanning jobs a minute) - reduced memory footprint (you probably have little to no amount of thread local storage)

I will venture that the reduced memory footprint is mostly the result of the unfriendly to copy-on-write GC we have to live with using 1.9.

Resque has a clean and simple API and a decent ecosystem. It can and hopefully will evolve into something cooler and more powerful - even if worthy alternatives exist.

cypherpunks01 · on Sept 23, 2012

This post convinced me to re-think using Resque in a new webapp of mine. I'm thinking about Celery + RabbitMQ now, does anyone have feedback about this or other recommendations? (i'm in a python wsgi stack)

pjscott · on Sept 23, 2012

Celery is very slick and low-effort, and should fit in nicely with the rest of your Python code base. Go for it!

alrs · on Sept 23, 2012

It may read as tangential, but this feels like the beginning of end of "forget free software, let's get Macs."

w1ntermute · on Sept 23, 2012

Someone needs to start a similar project with Ubuntu. Things are clearly getting a bit desperate for them (or so it seems) . If a good number of quality developers finds a good arrangement whereby they can donate a few hours of their time every week to significantly improving it, some real changes could be created.

There are definitely a lot of people who are not happy with the way that Apple has been acting as of late. What if we were to select one (1) laptop model and work on making Ubuntu on it a seamless experience?

alrs · on Sept 23, 2012

The best way to improve Ubuntu is to get involved with Debian. That's where the heavy lifting is.

emillon · on Sept 23, 2012

This. I started to get involved in Debian a few years ago (though I'm a long-time user) and this is really really interesting work. There are a lot of packages with too little workforce where you can go through a list of dozens or hundreds of bugs.

alrs is true ; in most cases your work is automatically replicated to Ubuntu (where the package-maintainer relationship is less strong than in Debian. Most packages don't have a dedicated maintainer there).

And everyone can do it ; no sign-up or complicated process required. You just have to find a sponsor for your packages which is just a fancy name for "every code needs review". Debian is often described as bureaucratic but most of the processes are very sane.

w1ntermute · on Sept 23, 2012

Well, wherever the help is needed. My point is that if we can get a large number of skilled developers involved in improving the code behind Ubuntu through a small weekly commitment, that could really solve a lot of the existing issues.

shardling · on Sept 23, 2012

I don't think there's any actual indication that Ubuntu is in trouble. Sure there is grumbling on the internet when they change UI, but that's pretty par for the course.

w1ntermute · on Sept 23, 2012

I was referring to the addition of advertising to the OS, not the move to Unity.

desas · on Sept 23, 2012

There is 100% hardware compatibility on many laptops already, I'm not sure that that is the problem.

Canonical need to be more profitable, indeed they're probably not or barely profitable yet. They seem to have two ways of increasing profit a) monetising their existing desktop user base and b) Getting more desktop users so that more businesses become comfortable with it and buy business support.

w1ntermute · on Sept 23, 2012

> There is 100% hardware compatibility on many laptops already, I'm not sure that that is the problem.

Then what is the problem exactly? Every time Ubuntu comes up, people ask if it is fully hardware compatible. Is that just a branding issue, where people think Ubuntu isn't hardware compatible with most modern laptops?

desas · on Sept 23, 2012

Hardware compatibility is a problem, having a group of people all targeting one particular laptop to fix is worthwhile but isn't a great way to fix it. There are various laptops that can be pointed to if you're persuading someone buying a new laptop to install ubuntu (though which is poorly documented).

There are multiple problems, the main one is that windows is known, comes "free" and already installed when you buy a computer, is less of a risk and is good enough. Most people will know a "computer guy" who can help them with windows or be comfortable that they can pay someone otherwise.

Software is a bigger problem in some ways than hardware now (though the web is fixing this), note the effort Ubuntu are putting into the software centre, allowing paid for apps, proposing to relax the restrictions to get new software into Ubuntu, running app competitions and so forth.

Selling operating systems isn't easy, look at how long apple have toiled in the wilderness and they have linked hardware, loads of good will via ipod/iphone, billions in the bank and have managed to climb to something like 5%-10% market share.

shardling · on Sept 23, 2012

I am unclear on why you think that is an indication of desperation.

In contrast, Mozilla gets most of its income in an almost identical way (making google a default search provider) and that's not considered desperate.