The hardest thing I find about writing a scraper/miner/API consumer is the HUGE amount of irregular data that you have to check for.
I was attempting to write a crawler for a pretty big dynamic site. It worked, but the code ended up being so messy because of all the weird quirks I constantly had to check for.
It is quite hard either way, but I think it is impossible if you are not doing TDD and asserting throughout the process.
Basically, write a whole bunch of rules, fail-safes, etc. For a Twitter crawler, say: the tweet text should always be there, the username should always be there, and the response should always be more than 400 characters.
Then any response that fails doesn't bring down your app; it just goes into the "needs review" pile and your app continues, unless it gets 50 or so of these errors in a row, in which case it SMSes or emails you to let you know it shut down.
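The rules-plus-review-pile approach above can be sketched roughly as follows. This is a hypothetical illustration, not anyone's actual crawler: the field names, the `validate_tweet`/`crawl` helpers, and the `alert_operator` callback are all made up.

```python
# Sketch of the "rules + needs-review pile" idea: each response is checked
# against simple rules; failures are set aside, and a long run of failures
# shuts the crawler down and alerts the operator.

MAX_CONSECUTIVE_FAILURES = 50

def validate_tweet(response):
    """Return a list of rule violations; an empty list means the response passed."""
    problems = []
    if not response.get("tweet"):
        problems.append("missing tweet text")
    if not response.get("username"):
        problems.append("missing username")
    if len(response.get("raw", "")) <= 400:
        problems.append("raw response shorter than 400 characters")
    return problems

def crawl(responses, alert_operator):
    """Process responses, diverting failures to a review pile instead of crashing."""
    needs_review = []
    processed = []
    consecutive_failures = 0
    for response in responses:
        problems = validate_tweet(response)
        if problems:
            needs_review.append((response, problems))
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                # Too many failures in a row: assume something structural broke.
                alert_operator("crawler shut down after %d bad responses in a row"
                               % consecutive_failures)
                break
        else:
            consecutive_failures = 0
            processed.append(response)
    return processed, needs_review
```

The key design point is that a single bad response only resets nothing and costs nothing; only a sustained run of failures (which usually means the site's markup changed) stops the crawl.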
It isn't like normal programming: it's dirty, and any analysis you plan to do has to start from the assumption that you will never get all of the data.
In terms of cleaner code, TDD will help here, but yes, you will need dozens if not hundreds of small methods for classifying a response.
I've done likewise, but found that scraping at scale (even with something like EC2) is very expensive, which pretty much precludes having any sort of free tier/freemium model around the resulting site.
I would have had to charge my users a lot of money to pay for the kind of scraping required to power my product.
How so? I was able to scrape almost every UK newspaper with a single EC2 instance. It took a few hours to work through all the sources, but it was a viable way to gather a huge amount of data.
Be warned! Google App Engine pricing has gone up by as much as 500% since Nov 7. The code is based on GAE, so if you spend time building around it, be ready for some crazy billing. I'm speaking from personal experience :(
Yes!!! I was hoping to have pay-per-impression advertising that would pay the bill, but apps that don't offer wordy clutter are not allowed to have ads. I was rejected by AdSense, BuySellAds, and AdBrite. Sometimes I think we build the Internet for robots, not for humans.
I have ~100 visits per day. That costs me ~$0.10 per day.
The content is refreshed every 15 min - both database and UI.
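On App Engine, a periodic refresh like this is normally driven by a `cron.yaml` file; a minimal sketch follows. The `/refresh` URL is an assumed handler name for illustration, not taken from the hnpickup source.

```yaml
# cron.yaml - hypothetical scheduled refresh; /refresh is an assumed handler.
cron:
- description: refresh the scraped data and the cached UI
  url: /refresh
  schedule: every 15 minutes
```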
Probably the production version of the code will get memcache and some other features. Apart from a couple of suggestions from this discussion, the educational version of hnpickup will be frozen. That said, any simplifications or corrections to the code or documentation are very welcome.
If you click on the question mark in the top right corner you will get the definitions. More documentation is in the source code.
EDIT: As I say in the "warning" section of the readme file: usually it's not a good idea to show the raw data, because it will confuse the end user, and if you add more explanation you will cause clutter and cognitive overload. I show the raw data mostly for people who want to dig into the source code.
On the topic of confusion, I think your demo would be clearer if you renamed "News stories bottom average score" to "Front page stories bottom average score". News is just a little too close to Newest, so I had to reread the definitions of both a couple of times. Cool project otherwise!
The pickup ratio is supposed to measure your chance of getting from the "newest" page to the "news" (front) page. But the more I think about it, the more I'm inclined to think it's really just a very-hot-topic detection device.
A pickup ratio > 1 means that the "newest" page has a couple of very interesting stories that are also seen on the "news" page.
A pickup ratio > 2 means that there is some sort of hysteria on HN about some particular topic ...
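The thread doesn't spell out the formula, so here is one purely illustrative reading of a "pickup"-style metric: comparing average story scores on the front ("news") page against the "newest" page. The real hnpickup computation lives in its source code and may differ entirely.

```python
# Illustrative guess at a pickup-style metric; NOT the actual hnpickup formula.

def pickup_ratio(front_page_scores, newest_scores):
    """Ratio of the average score on the "news" page to the average on
    the "newest" page. Under this reading, a high ratio would suggest
    newest stories are being picked up quickly (or one topic is very hot)."""
    if not front_page_scores or not newest_scores:
        return 0.0
    front_avg = sum(front_page_scores) / len(front_page_scores)
    newest_avg = sum(newest_scores) / len(newest_scores)
    return front_avg / newest_avg if newest_avg else 0.0
```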
As an analyst who wants to start playing with making web apps and has no idea where to start, this is pretty much exactly what I was looking for. Thank you!
I just use it offline, on my laptop. The UI is just a nice frontend to an idea that you know will work. Then you might have a ton of ideas that you want to check before publishing. One question you may want to answer: is the behavior on HN predictable?
Google App Engine doesn't support R, but you can use NumPy:
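For instance, the kind of simple time-series smoothing you might reach for R for can be done with NumPy. This is a generic example, not hnpickup code:

```python
# A simple moving average of story scores with NumPy, as a stand-in for
# the sort of lightweight statistics R would otherwise provide.
import numpy as np

def moving_average(scores, window=4):
    """Smooth a score series with a simple moving average of the given window."""
    kernel = np.ones(window) / window
    # "valid" mode only emits points where the window fully overlaps the data.
    return np.convolve(scores, kernel, mode="valid")
```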