Generating Thousands of PDFs on EC2 with Ruby (railsdog.com)
31 points by ghotli on Dec 23, 2009 | 13 comments



It would have taken 30 hours to generate the files without using EC2. I wonder how long it took to set everything up to generate them in an hour, and whether it was worth it. It seems like quite a complex set of tools for a simple task.

Perhaps just splitting the 3600 files into four sets and starting them off by hand on manually created instances might have been quicker overall?


The client was slow in delivering the data and we needed it within 48 hours, with the possibility of generating the output multiple times to discover any errors or other issues.


That was my first thought as well. But I guess it depends on how often they need to update things. It might be quite nice to be able to deploy design tweaks in an hour rather than 30.

But if this was a one-off or a very infrequent job, then I think you're probably right. Just leaving it running over a weekend might have been better. But then they wouldn't have got to play with the shiny new tools :)


I'm really looking forward to the Chef Platform mentioned in this article. It seems like Chef is a great set of core tools, but putting them all together and getting an environment going is still more work than it should be.

The two times I've used Chef (first chef-solo, then setting up a chef-server), it ended up being far more work than just setting up a server manually (I haven't had to scale out tons of servers yet, which I assume is what makes the current Chef setup time worthwhile).

I'd like to see Chef evolve into something more like Heroku, where I set up a few config files and then run a simple command-line script.

The benefit of the Chef Platform is that it would (hopefully) be a simple monthly fee (say, $10/month) instead of a fee on top of every server resource I use. I also like the idea of writing my own recipes (instead of relying on Heroku to have an add-on, which is a problem when they lack one for something like MongoDB).
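
For reference, a recipe is just a Ruby DSL; a minimal sketch (the package name and paths here are made up for illustration) looks something like:

    # cookbooks/pdfgen/recipes/default.rb
    package "wkhtmltopdf"          # install a PDF renderer (hypothetical choice)

    directory "/var/pdf_jobs" do   # working directory for the workers
      owner "deploy"
      mode  "0755"
    end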


RightScale is a service that manages cloud deployments; they started with EC2 and have since expanded well beyond it, and they recently began using Chef.

http://blog.rightscale.com/2009/09/16/rackspace-rightlink-ch...

Although they are quite expensive for a small shop, if you're interested in deploying within Amazon's cloud and have some money to throw at the problem, RightScale is definitely worth a look. You can get pretty far with just their free developer accounts - as I've done - but the more powerful Chef functionality comes online once you become a paid customer.


The NYTimes did something similar with EC2 and Hadoop to generate a PDF for each article published between 1851 and 1980 from 4TB of TIFF files: http://open.blogs.nytimes.com/2007/11/01/self-service-prorat...


Aron Pilhofer at DocumentCloud also comes from the NY Times, and their CloudCrowd project is great for this purpose:

http://wiki.github.com/documentcloud/cloud-crowd


They've been working on a website for 2 months, but they can't wait 1 day for PDFs? That sounds silly.


This is the reality of many projects. We got the full dataset around 48 hours before the output was needed. See my comment above too.


I'm missing the one thing that's actually interesting (read: hard). How did they collect the generated files? Did they take any precautions in case a worker died while creating a PDF? (Does AMQP have some sort of transactional semantics? If so, how would one apply them?)


Amazon's own SQS service has a simple solution to this problem. When you get an item from the queue, it is temporarily hidden; once you've processed it successfully, you explicitly delete it from the queue. If the item is not deleted within the visibility timeout, it goes back into the queue, ready for another worker to have a go at it. I assume RabbitMQ has something similar.
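
A minimal sketch with today's aws-sdk-sqs gem (the queue name and generate_pdf are placeholders):

    require "aws-sdk-sqs"

    sqs = Aws::SQS::Client.new(region: "us-east-1")
    queue_url = sqs.get_queue_url(queue_name: "pdf-jobs").queue_url

    # The received message is hidden from other workers for 300 seconds.
    resp = sqs.receive_message(queue_url: queue_url, visibility_timeout: 300)
    resp.messages.each do |msg|
      generate_pdf(msg.body)  # placeholder for the actual work
      # Only an explicit delete removes the message; if the worker dies
      # before this line, the message reappears after the timeout.
      sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
    end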

Actually, I'm not sure why they used RabbitMQ rather than SQS for this task. SQS seems to me a perfect fit, and there's no extra effort involved in setting it up.


We uploaded the generated files to S3 at the end of each individual job. We could have used AMQP's 'ack' for each message, but it turned out not to be necessary.
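
For what it's worth, an acked consumer with the Bunny gem looks roughly like this (the queue name and helper methods are placeholders):

    require "bunny"

    conn = Bunny.new
    conn.start
    ch = conn.create_channel
    q  = ch.queue("pdf_jobs", durable: true)

    q.subscribe(manual_ack: true, block: true) do |delivery_info, _props, body|
      path = generate_pdf(body)          # placeholder worker step
      upload_to_s3(path)                 # placeholder upload step
      ch.ack(delivery_info.delivery_tag) # ack only after a successful upload
    end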


EC2 isn't bad. Under Rails, I use Rio to store a lot of remotely fetched images (it's one line to fetch a remote image in Rio), HTTParty to handle some JSONP, and EC2 bandwidth is free when using Heroku. Nice setup; the only disadvantage is that Heroku's file system is read-only, so you have to write temporarily to /tmp and then again to EC2.
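
The Rio one-liner is literally just this (the URL and path are placeholders):

    require "rio"

    # Copy a remote image straight to Heroku's writable /tmp
    rio("http://example.com/logo.png") > rio("/tmp/logo.png")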



