
A different issue, but one that's burned me several times: when things go wrong with resque (or sidekiq), life goes to hell pretty quickly because redis needs to fit everything in memory. E.g., if a third-party resource goes down and the jobs hitting it suddenly start raising exceptions and get thrown into a retry queue, all those stack traces sitting around in memory accumulate very quickly. redis fills up, the OS kills it, it starts back up, loading an older DB snapshot, and then restarts processing all those jobs that just failed, without any exponential backoff. And this cycle basically repeats ad nauseam.
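To make the "exponential backoff" part concrete, here's a minimal sketch of a retry-delay calculation. It's plain Python rather than anything resque- or sidekiq-specific, and `retry_delay` and all of its parameters (`base`, `cap`, `jitter`) are names I made up for illustration, not part of either library.

```python
import random

def retry_delay(attempt, base=5.0, cap=3600.0, jitter=0.25, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    Hypothetical helper: the delay doubles per attempt and is capped at `cap`.
    """
    # Clamp the exponent so a huge attempt count can't overflow the float math.
    exponent = min(max(attempt - 1, 0), 30)
    delay = min(cap, base * 2 ** exponent)
    # Add +/- jitter so a burst of failed jobs doesn't retry at the same instant.
    spread = delay * jitter
    return delay + rng.uniform(-spread, spread)
```

With the defaults, the first retry waits roughly 5 seconds, the second roughly 10, the third roughly 20, and so on up to about an hour, so a dead third-party service gets hit less and less often instead of being hammered in a tight loop.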

I'm not sure a whole lot can be done, other than resurrecting the old debate over something like diskstore. And while EC2 is memory-constrained, dealing with this situation isn't a matter of just adding more memory, because that will fill up as well. redis would either need to be able to spill over onto disk, or the processing model would need to change. We ended up doing the latter: dropping stack traces from retried jobs, adding monitoring to disable queues that have a very high failure rate, etc.
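For what it's worth, the two changes look roughly like the sketch below. This is Python for illustration only, not our actual code and not any resque/sidekiq API; `compact_failure`, `QueueBreaker`, and the thresholds are all hypothetical names and numbers.

```python
from collections import deque

def compact_failure(exc, max_message=200):
    """Keep only the exception class and a truncated message, no stack trace."""
    return {"error": type(exc).__name__, "message": str(exc)[:max_message]}

class QueueBreaker:
    """Pause a queue when too many of its recent jobs have failed."""

    def __init__(self, window=100, min_samples=20, max_failure_rate=0.5):
        self.results = deque(maxlen=window)  # True = failed, False = succeeded
        self.min_samples = min_samples
        self.max_failure_rate = max_failure_rate
        self.paused = False

    def record(self, failed):
        """Record one job outcome and pause the queue if the rate is too high."""
        self.results.append(bool(failed))
        if len(self.results) >= self.min_samples:
            rate = sum(self.results) / len(self.results)
            if rate > self.max_failure_rate:
                self.paused = True  # stays paused until resume() is called

    def resume(self):
        """Clear the history and let workers pull from the queue again."""
        self.results.clear()
        self.paused = False
```

The idea is that workers check `paused` before pulling from a queue, and that whatever gets written into the retry queue is the small dict from `compact_failure` rather than a full backtrace, so a flood of failures costs a few hundred bytes per job instead of several kilobytes.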