It would have taken 30 hours to generate the files without using EC2. I wonder how long it took to set up everything to do them in an hour, and whether it was worth it. It seems quite a complex set of tools for a simple task.
Perhaps just splitting the 3600 files into four sets and kicking them off by hand on manually created instances might have been quicker overall?
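Back-of-the-envelope, taking the figures above at face value (3600 files, 30 hours serially, about an hour with their setup) and assuming the work splits evenly:

    3600 files / 30 hours ≈ 30 seconds per file
    3600 files / 4 instances = 900 files each ≈ 7.5 hours per instance

So four hand-launched instances gets you from 30 hours down to roughly 7.5, which is a long way from the hour they managed, but with far less setup.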
The client was slow in delivering the data and we needed it within 48 hours, with the possibility of generating the output multiple times to discover any errors or other issues.
That was my first thought as well. But I guess it depends on how often they need to update things. It might be quite nice to be able to deploy design tweaks in an hour rather than 30.
But if this was a one-off or a very infrequent job, then I think you're probably right. Just leaving it running over a weekend might have been better. But then they wouldn't have got to play with the shiny new tools :)
I'm really looking forward to the Chef Platform mentioned in this article. It seems like Chef is a set of great core tools, but putting them all together and getting an environment going is still more work than it should be.
Both times I've used Chef (first chef-solo, then setting up a chef-server), it ended up being far more work than just setting up a server manually (I haven't had to scale out to tons of servers yet, which I assume is what makes the current Chef setup time worthwhile).
I'd like to see Chef evolve into something more like Heroku, where I set up a few config files and then run a simple command-line script.
The benefit of the Chef Platform is that it would (hopefully) be a simple monthly fee (say, $10/month) instead of a fee on top of every server resource I use. Also, I like the idea of creating my own recipes (instead of relying on Heroku to have an addon, which is problematic when they lack one for something like MongoDB).
Although it's quite expensive for a small shop, if you're interested in deploying within Amazon's cloud and have some money to throw at the problem, RightScale is definitely worth a look. You can get pretty far with just their free developer accounts - as I've done - but the more powerful Chef functionality comes online once you become a paid customer.
I'm missing the one thing that's actually interesting (read: hard):
How did they collect the generated files?
Did they take any precautions in case a worker died while creating a PDF? (Does AMQP have some sort of transactional semantics? If so, how would one apply them?)
Amazon's own SQS service has a simple solution to this problem. When you get an item from the queue it is temporarily hidden; once you've processed it successfully, you explicitly delete it from the queue. If the item is not deleted within the set timeout period, it goes back into the queue, ready for another worker to have a go at it. I assume RabbitMQ has something similar.
Actually, I'm not sure why they used RabbitMQ rather than SQS for this task. SQS seems to me to be a perfect fit, and there's no extra effort involved in setting it up.
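For anyone who hasn't used SQS, the receive/delete dance described above looks roughly like this with boto3 (the queue URL, the visibility timeout, and the render_pdf function are placeholders for illustration, not details from the article):

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pdf-jobs"  # placeholder
    sqs = boto3.client("sqs")

    def render_pdf(job_spec):
        """Stand-in for the actual PDF-generation step."""
        print("rendering", job_spec)

    while True:
        # Receiving a message hides it from other workers for VisibilityTimeout seconds.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,      # long polling
            VisibilityTimeout=300,   # long enough to render one file
        )
        for msg in resp.get("Messages", []):
            render_pdf(msg["Body"])
            # Delete only after success; if the worker dies first, the message
            # becomes visible again after the timeout and another worker picks it up.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])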
We uploaded the generated files to S3 at the end of each individual job. We could have used AMQP's 'ack' for each message, but it turned out not to be necessary.
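For anyone curious what the 'ack' route would look like, here's a rough sketch of such a worker using pika and boto3 (the queue name, bucket, and generate_pdf step are invented for illustration, not taken from their actual setup):

    import os
    import boto3
    import pika

    s3 = boto3.client("s3")

    def generate_pdf(job_spec):
        """Stand-in for the real rendering step; writes a dummy file and returns its path."""
        path = "/tmp/%s.pdf" % abs(hash(job_spec))
        with open(path, "wb") as f:
            f.write(b"%PDF-1.4 placeholder")
        return path

    def on_message(channel, method, properties, body):
        pdf_path = generate_pdf(body.decode())
        # Upload the result before acknowledging, so that if this worker dies
        # mid-job the unacked message is simply redelivered to another worker.
        s3.upload_file(pdf_path, "my-output-bucket", "pdfs/" + os.path.basename(pdf_path))
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="pdf_jobs", durable=True)
    channel.basic_qos(prefetch_count=1)   # hand each worker one job at a time
    channel.basic_consume(queue="pdf_jobs", on_message_callback=on_message)
    channel.start_consuming()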
EC2 isn't bad. Under Rails, I use Rio to store a lot of remotely fetched images (it's one line to fetch a remote image in Rio), HTTParty to handle some JSONP, and EC2 bandwidth is free when using Heroku. Nice setup; the only disadvantage is that Heroku's filesystem is read-only, so you have to write temporarily to /tmp and then again to EC2.