For certain things, thumbnails make sense: online, articles about photography, etc. But for regular articles, instead of figuring out how to extract thumbnails, we should realize that in most cases the article would be much improved by having no thumbnail at all.
There's a trend lately to illustrate low-quality content with low-quality stock pictures (most likely acquired from a Google Image search without a proper license). For an example, just look at TechCrunch or Pando. We should strive to rid the internet of this plague.
Good articles and real journalism have standards when it comes to illustration. Open up nytimes.com and look at what's illustrated with photos versus illustrations versus nothing.
Sure, but that can be done with good layout and typography, or tasteful illustrations (à la The New Yorker). Simply using bad filler stock images is a really bad way to go about this.
"Unlike article extraction, it doesn’t seem anyone anywhere has ever put a lot of thought into getting thumbnails out of a website."
Incorrect. Diffbot does a visual analysis of the page to determine the best thumbnail.
[edit: I also get the impression that Prismatic does intelligent grokking of the thumbnail image, especially because I know the team, but I'm not aware of anything they published about their methodology.]
I just went through this exact exercise for a project I'm working on where I want to figure out the best image to display from a given Craigslist listing.
I used an approach most similar to Goose, where I download the image to get the metadata, then get rid of images with odd aspect ratios (I think I have it set to drop anything with an aspect ratio bigger than 3.0, but it needs to be tweaked). I also get rid of things like 1px-wide images (or anything smaller than the thumbnail I want to display).
So far it works "okay". It's far from perfect, but it's WAY better than nothing.
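For anyone who wants to try the filtering described above, here is a minimal Python sketch. It assumes the image dimensions have already been read from the downloaded files, so there is no fetching code here. The names `ImageInfo` and `pick_best_image` are made up for illustration, and picking the largest surviving image by area is my own assumption; the comment above doesn't say how the final choice is made.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record for one candidate image (dimensions in pixels)."""
    url: str
    width: int
    height: int


def pick_best_image(
    candidates: Iterable[ImageInfo],
    thumb_width: int,
    thumb_height: int,
    max_aspect: float = 3.0,  # the 3.0 cutoff comes from the comment; tune as needed
) -> Optional[ImageInfo]:
    """Return a plausible thumbnail source, or None if nothing survives the filters."""
    survivors = []
    for img in candidates:
        # Drop tracking pixels and anything smaller than the thumbnail we want.
        if img.width < thumb_width or img.height < thumb_height:
            continue
        # Drop banners and skyscrapers: long side / short side above the cutoff.
        aspect = max(img.width, img.height) / min(img.width, img.height)
        if aspect > max_aspect:
            continue
        survivors.append(img)
    if not survivors:
        return None
    # Assumption: prefer the largest remaining image by pixel area.
    return max(survivors, key=lambda img: img.width * img.height)
```

Returning `None` when every candidate is filtered out leaves the caller free to show no thumbnail rather than a bad one.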
I built fetchful.com a while ago, which is an attempt at this (as well as at generating preview text). After a lot of testing and a few hundred thousand generated previews, I can say it's quite hard to get consistent results for thumbnails. It is obviously very simple if developers plan for this and use appropriate metadata tags for their content.
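To illustrate the easy case, where the page declares its own thumbnail, here is a rough Python sketch using only the standard library. The comment above doesn't name specific tags; I'm assuming Open Graph's `og:image` and Twitter's `twitter:image`, and `find_declared_thumbnail` is a hypothetical helper, not anything fetchful.com is known to use.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Assumed tag keys, checked in order of preference.
PREFERRED_KEYS = ("og:image", "twitter:image")


class _MetaImageParser(HTMLParser):
    """Collects the first content value seen for each preferred <meta> key."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Open Graph uses property="..."; Twitter cards use name="...".
        key = (attr_map.get("property") or attr_map.get("name") or "").lower()
        content = attr_map.get("content")
        if key in PREFERRED_KEYS and content and key not in self.found:
            self.found[key] = content


def find_declared_thumbnail(html: str) -> Optional[str]:
    """Return the thumbnail URL the page declares, or None if it declares none."""
    parser = _MetaImageParser()
    parser.feed(html)
    for key in PREFERRED_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None
```

When this returns `None`, you are back to guessing from the images on the page, which is where the inconsistent results come from.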
The news aggregators use this ambiguity to their advantage -- plenty of times I've seen an innocuous headline shown with a bikini girl thumbnail because that image was a sidebar gallery preview on the source page (or sometimes even an ad!). Any guesses what effect this has on click-throughs?