The hardest thing I find about writing a scraper/miner/API consumer is the HUGE amount of irregular data that you have to check for.
I was attempting to write a crawler for a pretty big dynamic site. It worked, but the code ended up being so messy because of all the weird quirks I had to check for non-stop.
It is quite hard even then, but I think it is impossible if you are not doing TDD and asserting throughout the process.
Basically, write a whole bunch of rules, fail-safes, etc. So for a Twitter crawler: the tweet should always be there, the username should always be there, and the response should always be more than 400 characters.
Then any response that fails a check doesn't bring down your app; it just goes into the "needs review" pile and the app carries on, unless it gets 50 or so of these failures in a row, at which point it shuts down and SMSes or emails you to let you know.
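A minimal sketch of that pattern in Python, just to make it concrete (the names here, like check_tweet, REVIEW_DIR, and the email addresses, are all placeholders, not from any real crawler):

```python
import json
import smtplib
from email.message import EmailMessage
from pathlib import Path

REVIEW_DIR = Path("needs_review")   # failed responses pile up here for a human
MAX_CONSECUTIVE_FAILURES = 50       # shut down after this many failures in a row


def check_tweet(response: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the response looks sane."""
    problems = []
    if not response.get("tweet"):
        problems.append("missing tweet text")
    if not response.get("username"):
        problems.append("missing username")
    if len(json.dumps(response)) <= 400:
        problems.append("response suspiciously short")
    return problems


def alert(reason: str) -> None:
    """Email (or SMS via an email-to-SMS gateway) that the crawler shut itself down."""
    msg = EmailMessage()
    msg["Subject"] = "Crawler shut down"
    msg["From"] = "crawler@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(reason)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def process(response: dict) -> None:
    """Stand-in for whatever your real pipeline does with a good response."""
    pass


def run(responses) -> None:
    REVIEW_DIR.mkdir(exist_ok=True)
    consecutive_failures = 0
    for i, response in enumerate(responses):
        problems = check_tweet(response)
        if problems:
            # Don't crash: park the bad response for manual review and keep going.
            (REVIEW_DIR / f"{i}.json").write_text(
                json.dumps({"response": response, "problems": problems}))
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                alert(f"{consecutive_failures} bad responses in a row, stopping.")
                return
        else:
            consecutive_failures = 0
            process(response)
```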
It isn't like normal programming; it is dirty, and any analysis you plan to do has to start from the assumption that you will never get all of the data.
In terms of cleaner code, TDD will help here, but yeah, you will need dozens if not hundreds of methods while you're trying to classify a response.
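In practice that tends to look like lots of tiny classification methods, each with its own test, so a site change breaks loudly instead of silently. A rough pytest-style sketch, with every name here hypothetical:

```python
# Each quirk you discover becomes one small, boringly named check method...
def is_retweet(response: dict) -> bool:
    return response.get("tweet", "").startswith("RT @")


def is_truncated(response: dict) -> bool:
    return response.get("tweet", "").endswith("\u2026")  # trailing ellipsis


def is_protected_account(response: dict) -> bool:
    return response.get("error") == "Not authorized"


# ...and one test per quirk, written as you find it in the wild.
def test_is_retweet():
    assert is_retweet({"tweet": "RT @someone: hello"})
    assert not is_retweet({"tweet": "hello"})


def test_is_truncated():
    assert is_truncated({"tweet": "a very long tweet\u2026"})
    assert not is_truncated({"tweet": "a short tweet"})
```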
I've done likewise, but found that scraping at scale (even on something like EC2) is very expensive, which sort of precludes having any free tier/freemium model around the resulting site.
I would have had to charge my users a lot of money to pay for the kind of scraping required to power my product.
How so? I was able to scrape almost every UK newspaper with a single EC2 instance. It was taking a few hours to work through all the sources, but it was a viable way of gathering a huge amount of data.