Abusing Amazon Images (aaugh.com)
159 points by chaosmachine on Sept 30, 2009 | 34 comments



This approach to generating images brings a lot of benefits when you're working at scale. Before, when we wanted an image that had been thumbnailed, cropped, or otherwise modified, we would generate the image (or check whether it already existed) at page render time and pass on the URL. That meant page rendering blocked on checking whether the image existed or on generating a new one. Further, when our flaky file store started freaking out, the entire site went down.

By encoding all the relevant information in the URL, like Amazon does, we offload that work asynchronously to separate web requests. If there's info in the URL we want hidden from users (spam-protected email addresses, for example), we encrypt it using Base64-encoded AES. If the file store were to go down (which hasn't happened since we switched to S3), users would only see broken images instead of the entire site crashing. Further, we store the images on S3 under the exact same URL we fetch them from, and have our static web server check there first. That means a cached image never hits our application stack.
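Roughly, the idea looks something like this (a hypothetical Python sketch; the names and the use of the cryptography library's Fernet, which is AES with URL-safe Base64 output, are illustrative, not our actual stack):

    from cryptography.fernet import Fernet

    # Fernet = AES-128-CBC + HMAC, emitted as URL-safe Base64, which is one
    # convenient way to get "Base64-encoded AES" into a URL component.
    SECRET_KEY = Fernet.generate_key()   # in practice, a fixed key from config
    fernet = Fernet(SECRET_KEY)

    def image_url(image_id, width, height, email=None):
        """Build a self-describing image URL; sensitive fields are encrypted."""
        path = f"/images/{image_id}/{width}x{height}.jpg"
        if email:
            token = fernet.encrypt(email.encode()).decode()  # URL-safe Base64 token
            path += f"?e={token}"
        return path

    # The same path doubles as the cache key on S3, so a hit on the static
    # server never reaches the app stack; a miss gets generated asynchronously.
    print(image_url("B000123", 100, 100, email="author@example.com"))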

Now, when we generate a page with one of these dynamic images, our app stack only has to generate a URL instead of relying on the vagaries of our file store and image processing libraries. It's made our site more reliable, easier to manage, and faster. If we suddenly start serving up many more images, we could easily replace our image generation service with something higher-performance or more reliable, since it's just a web service. Right now, it works great.

Clearly Amazon has seen the benefits of the same approach.


I use a really, really dirty hack for this exact problem: I handle the rescaling in the 404 handler.

It's the laziest possible approach, and it has one huge hidden benefit: if a bot hits your pages, you don't end up pre-generating a bunch of scaled images that nobody is ever going to use.

A simple list of allowed sizes in the handler makes sure that someone can't use URL injection to generate a large series of nonsense images at sizes we'd normally never use.

Typically a URL looks like http://host/resourceid_widthxheight.jpg, where width and height are replaced by the desired size.

I'll be the first to admit this is a nasty bit of code but I haven't found anything else that comes close in terms of efficiency and flexibility.
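The shape of it, as a hypothetical Python/Pillow sketch (paths and the regex are made up; the real handler is messier):

    import os, re
    from PIL import Image

    IMAGE_DIR = "/var/www/images"                     # hypothetical path
    ALLOWED_SIZES = {(100, 100), (200, 200), (640, 480)}

    def handle_404(request_path):
        """Called for /resourceid_WIDTHxHEIGHT.jpg requests that 404."""
        m = re.fullmatch(r"/(\w+)_(\d+)x(\d+)\.jpg", request_path)
        if not m:
            return 404
        resource_id, w, h = m.group(1), int(m.group(2)), int(m.group(3))
        if (w, h) not in ALLOWED_SIZES:               # stop URL-injected sizes
            return 404
        source = os.path.join(IMAGE_DIR, resource_id + ".jpg")
        if not os.path.exists(source):                # no original: a real 404
            return 404
        target = os.path.join(IMAGE_DIR, f"{resource_id}_{w}x{h}.jpg")
        img = Image.open(source)
        img.thumbnail((w, h))                         # scale in place, keep aspect ratio
        img.save(target)
        return 302, request_path                      # redirect back to the same URL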


I don't know why so many people consider this to be a "dirty hack". It makes total sense to overload ErrorDocument and generate the missing content. You get really good efficiency in the common case of the document existing, because it's in the filesystem and execution never leaves the Apache core to run a CGI except in the exceptional case of having to generate it (depending on your usage patterns, of course).

This is even documented in the Apache docs, albeit with rewrite rules, which are more heavyweight and less efficient. http://httpd.apache.org/docs/2.2/rewrite/rewrite_guide_advan... I believe that using ErrorDocument for this purpose is explicitly mentioned somewhere in the Apache docs, but I can't find it right now. I'm pretty sure that's where I first learned of it (maybe back in the 1.3 days).

I have a feeling that if ErrorDocument had been named (or aliased to) GenerateMissingContent, this would be a widely accepted method of dynamic on-the-fly generation with filesystem caching. I've even heard the rationale that having a script generate the content on error and write it into the filesystem, where Apache can then read it directly on later requests, is reimplementing a reverse caching proxy, and that you should just stick Squid in front of your web servers (with the, ahem, additional overhead of having to run and manage two services). That seems like a lot of effort just to avoid something that has "error" in the name for non-error cases.


If there had been a more appropriately named hook such as 'about_to_404_last_chance_to_make_content', I'm sure I would have used that instead :)

The resulting image is indeed written to the filesystem, right next to the non-scaled one (so when we remove a resource we can remove all of its images in one go without having to visit another directory).

The really nasty trick is that when all is done I redirect the browser to the same URL, and this time I'm sure it will find it.

And if the source file is missing it really does 404, of course.

The only tricky situation is when a whole pile of people request the same image at the same time and it hasn't been saved yet. The first party to do the 'fopen' gets dibs on doing the transform; everybody else simply gets redirected once the file exists, and by the time their browsers have processed the 3xx the file is ready for consumption.

On very rare occasions the race is so close that the image gets rescaled and saved twice. I've logged the occurrence of this over many millions of rescales and it only happened a handful of times; not worth improving on.
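One way to make that first-writer-wins check atomic is an exclusive create; a hypothetical Python sketch (not the actual handler, and the path is made up):

    import errno, os

    def try_claim(target_path):
        """Return a writable fd if we won the race to create the file, else None."""
        try:
            # O_EXCL makes the create atomic: exactly one request gets the fd.
            return os.open(target_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except OSError as e:
            if e.errno == errno.EEXIST:
                return None        # someone else is (or already was) writing it
            raise

    fd = try_claim("/var/www/images/12345_100x100.jpg")
    if fd is not None:
        pass  # rescale, write through fd, close, then redirect to the same URL
    else:
        pass  # just redirect; by the time the 3xx is followed the file is there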


> The really nasty trick is that when all is done I redirect the browser to the same URL, and this time I'm sure it will find it.

This is a pretty nasty trick (it seems like there's a risk of a redirect loop if the file can't be written), but it's unnecessary. You can emit content from the script used for ErrorDocument the same as you can with a CGI. So you generate the content, write it to the filesystem, emit a "Status: 200" header, which replaces the 404 status Apache was going to generate, then send a Content-Type header and the generated content in the response body.
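Something along these lines, as a hypothetical Python CGI sketch (the paths are invented; it assumes an ErrorDocument 404 directive pointing at the script):

    #!/usr/bin/env python3
    # Hypothetical ErrorDocument handler, e.g. wired up with:
    #   ErrorDocument 404 /cgi-bin/make_image.py
    import os, sys
    from PIL import Image

    # Apache exposes the originally requested path to the ErrorDocument script.
    requested = os.environ.get("REDIRECT_URL", "")
    target = "/var/www/html" + requested              # hypothetical docroot

    # ... parse the size out of `requested`, locate the source image ...
    img = Image.open("/var/www/source/original.jpg")  # placeholder source
    img.thumbnail((100, 100))
    img.save(target)                                  # cached for the next request

    # Replace the 404 Apache was about to send and stream the image directly.
    sys.stdout.write("Status: 200 OK\r\n")
    sys.stdout.write("Content-Type: image/jpeg\r\n\r\n")
    sys.stdout.flush()
    with open(target, "rb") as f:
        sys.stdout.buffer.write(f.read())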

Your way may be easier if you have complex cache-control or expires headers that Apache is configured to add to static files but not to dynamically generated content, since the client is only ever given the file by Apache directly from the filesystem. I've never used that, though.


We use this exact approach to generate all of our blog pages. Incredibly efficient.

The twist we use to handle race conditions is to create a lock file. If the 404 handler is called and the lock file exists, someone else is generating that page as we speak, so it goes into a timed loop checking for the file to become available. When it appears, we serve it directly with the 200 header, like you mentioned.

We have a maximum loop value (a second, I think). If you hit that, it means the other process died for some reason and never deleted the lock or the page; in that case, generate and write the page yourself and remove the lock. All future requests are served off of the filesystem.
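Roughly like this (a hypothetical Python sketch of the lock-file dance; the names and the one-second cap are just the ones described above):

    import os, time

    MAX_WAIT = 1.0      # seconds; the timeout described above
    POLL = 0.05

    def serve_or_generate(page_path, generate):
        lock = page_path + ".lock"
        if os.path.exists(lock):
            # Someone else is generating it: poll until the page shows up.
            deadline = time.time() + MAX_WAIT
            while time.time() < deadline:
                if os.path.exists(page_path):
                    return open(page_path, "rb").read()   # serve with the 200
                time.sleep(POLL)
            # Timed out: the other process probably died, so fall through.
        open(lock, "w").close()            # take the lock ourselves
        try:
            content = generate()
            with open(page_path, "wb") as f:
                f.write(content)
            return content
        finally:
            os.remove(lock)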

One nice part of this caching mechanism is that to clear the cache, you just rm -rf the blog directory. Tada!


No risk of a loop: the file is guaranteed to exist when it's done, because the redirect happens just after the fclose on the 'write' (and only if the file is written successfully).

That way the browser will see the exact same headers it would on a request where the file did exist.


We did something similar when we were running our own home-grown publishing system, though it was only for sizing purposes (no fun stuff like discount badges).

If you go this route, be very careful about making the URLs too human-readable. When it's obvious that you can hammer on URLs to burn server resources, it becomes a natural vector for a denial of service. If we went this route again, I'd go one step further than encrypting sensitive data and encrypt anything that can somehow result in a new image.

Unfortunately, doing this safely sorta kills the user friendliness of it. It was fantastic to be able to tell authors "if you want a 100x100 image, add /100w/100h/ before the file name". I think they'd be less inclined to respond to "if you want a 100x100 image, generate a hash from the width x height x secret key and add /100w/100h/the-hash/ before the image".
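For what it's worth, that hash step can be tiny; a hypothetical Python sketch (the key, the truncation, and the URL layout are all made up):

    import hashlib, hmac

    SECRET_KEY = b"not-the-real-key"       # hypothetical shared secret

    def signed_resize_path(filename, width, height):
        """Build /100w/100h/<hash>/filename, binding the size to the secret."""
        msg = f"{width}x{height}".encode()
        sig = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]
        return f"/{width}w/{height}h/{sig}/{filename}"

    def verify(width, height, sig):
        expected = hmac.new(SECRET_KEY, f"{width}x{height}".encode(),
                            hashlib.sha256).hexdigest()[:16]
        return hmac.compare_digest(sig, expected)    # reject forged sizes

    print(signed_resize_path("photo.jpg", 100, 100))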


We included a simple list of permitted sizes to protect against that. The velocity controls will block out a bot in a heartbeat anyway, but it seemed like a good thing to have.


This approach might be totally obvious to many people here, but it's quick to implement, it performs well, and it scales impressively:

1) The web server checks to see if the image exists in a file store and, if it does, serves it directly.

2) If not, the request goes to your web application stack (Rails, etc.), which fires off an ImageMagick process to do whatever magick the request calls for, operating either on a base image or on something you magick up at request time. (See the sketch after this list.)

3) Your application twiddles its thumbs for a second or two. (This has potentially negative consequences for other requests coming into the same mongrel, at least in Rails. Consider doing load balancing which is aware of the mongrel being tied up -- I think Pound does this fairly easily.)

4) After you've got the image, save it to the file store and tell the web server "Send the user the file at this path".

5) The next request for the same URL will hit the webserver, find the file on disk, and stream it automatically.
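A minimal sketch of steps 1, 2, and 4 (hypothetical Python shelling out to ImageMagick's convert, rather than the actual Rails code; paths are made up):

    import os, subprocess

    FILE_STORE = "/mnt/filestore"            # hypothetical shared file store

    def ensure_rendition(image_id, width, height):
        """Generate the requested rendition if it isn't in the file store yet."""
        target = os.path.join(FILE_STORE, f"{image_id}_{width}x{height}.gif")
        if not os.path.exists(target):       # step 1: is it already cached?
            source = os.path.join(FILE_STORE, f"{image_id}.gif")
            subprocess.run(                  # step 2: shell out to ImageMagick
                ["convert", source, "-resize", f"{width}x{height}", target],
                check=True)
        return target                        # step 4: hand this path to the web server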

In my case I use a cron script every day to blow away all the images that haven't been accessed in the last X days, but depending on your needs it might be easier to just persist them indefinitely. (I use GIFs to give users a live preview of the PDF they're essentially editing, and generate one every second, so if I didn't do this I would drown in half-written bingo cards. My users go through a couple of gigs a week 20kb at a time.)
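The cleanup can be a tiny nightly cron job; a hypothetical Python version (the path and the number of days are invented) might look like:

    #!/usr/bin/env python3
    # Hypothetical nightly cron job: delete cached images not accessed in X days.
    import os, time

    CACHE_DIR = "/mnt/filestore/previews"    # made-up path
    MAX_AGE = 7 * 24 * 3600                  # "X days" of inactivity; here, a week

    cutoff = time.time() - MAX_AGE
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        # st_atime is only meaningful if the filesystem isn't mounted noatime.
        if os.path.isfile(path) and os.stat(path).st_atime < cutoff:
            os.remove(path)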


Patrick, your observations, awareness, and ability / willingness to communicate them clearly continue to impress me. Cheers.


Thanks! Writing in English helps me take refreshing little breaks from, e.g., trying to figure out what the single comment left in a 3,000 line source file written in Japanese by our Korean outsourcing team means.

(Answer: it means today will be a long, long day.)


Re point 3: Don't use Pound. Use HAProxy; it's superior along pretty much every dimension. The idea is right though :-).


Amazon should be paying this guy for documenting their codebase for them.


Because Amazon has no internal documentation about how to create product images? Or because Amazon's terms of service explicitly prevent doing anything useful with these images?


Because this guy was very thorough in detailing how to manipulate Amazon's dynamic image generator.



See also: http://www.scene7.com/solutions/dynamic_imaging.asp - used by many huge e-commerce sites (including my >$1B employer).


Blurring a large image is really CPU-intensive. Someone could DoS Amazon using a blur URL.


They would probably cache the image, so a real DoS would require blurring multiple images at multiple levels.


You could just increment the blur amount. Or set a random amount on distributed clients.


IMDB uses a similar system for retrieving/resizing their images (not wholly surprising as they are owned by Amazon)...

IMDB's format for resizing images: [server hostname]/images/M/[image identifier]._V1._SX[width]_SY[height]_.[format]

eg: http://ia.media-imdb.com/images/M/MV5BNzEwNzQ1NjczM15BMl5Ban...

Unfortunately I didn't have any luck wringing meaning out of their image identification strings...

I also found out rather late in the game that they use an interesting cookie based scheme for defeating people trying to link to their images from external sites...


For http://www.movielandmarks.com I ended up caching the DVD images locally and serving them up myself. Some of the URLs served up by AWS go away, and I don't re-request them after the initial add of a movie.

I'm going to have to read this article in more detail and see if it can clean that up for me and move image serving back to Amazon.

If you are going to use Amazon's images, why not sign up for an associate account and provide links back to the product pages so you can generate some income from it?


You should be careful caching images and other data, since it's (or was) a TOS violation. :)


A Ruby DSL library should pop up on GitHub any moment now.


phpThumb[1] + mod_rewrite is a simple way to get up and running with a similar cache of your own. It supports ImageMagick so you've got a lot of options.

[1] http://phpthumb.sourceforge.net/


Thanks for sharing this. It'll help me a lot.


I don't understand why Amazon has to do image cropping/resizing on the fly. A good image resizing algorithm (cubic interpolation, seam carving) is time-consuming. Even good algorithms (Poisson blending?) for image modification (adding a logo) are computationally expensive. I just don't get why storing an image would be more expensive than computing one.


Surely they cache the results?

It seems pretty smart to me. Rather than having some external process that tries to figure out every permutation of image that might be necessary, they just have a simple HTTP API that always gives you exactly the image you need.

As long as they're caching the image after the first request I don't see any problem.


I imagine that because they have such a large volume of images and different combinations (20% off in one country, 30% off in another, ...), it makes sense for them to create images only when requested. They probably cache the generated images, though.

In addition, I think that on any given day only a small percentage of Amazon's products are viewed, so it may not make sense to generate an image for every possible combination.

They may also have a longer term view and plan on using this graphic manipulation in another product.


I'm sure their servers are caching the images after generation (or perhaps after 'n' requests or some other optimization in that vein).

Just because the URL describes the operations performed on the image doesn't necessarily mean those operations are performed on each request.


"efficiency" is a vector, not a scalar. I think rolling out deals, new products, and promotions is made a lot cleaner by having that be generative. There's definitely a lot of value to having images assigned to products and not to products+transient properties like discounts.


The issue is that generating all possible images would tie up a lot more resources than generating them on the fly -- only for the combinations that are actually requested. Also, caching helps if the same image is requested multiple times.


A meta-comment here: why downvote someone who is asking a legitimate question?



