The hardest thing I find about writing a scraper/miner/API consumer is the HUGE amount of irregular data that you have to check for.
I was attempting to write a crawler for a pretty big dynamic site. It worked, but the code ended up being so messy because of all the weird quirks I constantly had to check for.
It is quite hard either way, but I think it is impossible if you are not doing TDD and asserting throughout the process.
Basically, write a whole bunch of rules, fail-safes, etc. For a Twitter crawler, say: the tweet text should always be there, the username should always be there, and the response should always be more than 400 characters.
Then any response that fails doesn't bring down your app; it just goes into the "needs review" pile and your app continues, unless it gets 50 or so of these errors in a row, in which case it SMSes or emails you to let you know it shut down.
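The rules-plus-review-pile approach above can be sketched roughly as follows. This is a hypothetical illustration, not anyone's actual crawler: the field names, the `validate_tweet`/`crawl` helpers, and the `alert_operator` callback are all made up.

```python
# Sketch of the "rules + needs-review pile" idea: each response is checked
# against simple rules; failures are set aside, and a long run of failures
# shuts the crawler down and alerts the operator.

MAX_CONSECUTIVE_FAILURES = 50

def validate_tweet(response):
    """Return a list of rule violations; an empty list means the response passed."""
    problems = []
    if not response.get("tweet"):
        problems.append("missing tweet text")
    if not response.get("username"):
        problems.append("missing username")
    if len(response.get("raw", "")) <= 400:
        problems.append("raw response shorter than 400 characters")
    return problems

def crawl(responses, alert_operator):
    """Process responses, diverting failures to a review pile instead of crashing."""
    needs_review = []
    processed = []
    consecutive_failures = 0
    for response in responses:
        problems = validate_tweet(response)
        if problems:
            needs_review.append((response, problems))
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                # Too many failures in a row: assume something structural broke.
                alert_operator("crawler shut down after %d bad responses in a row"
                               % consecutive_failures)
                break
        else:
            consecutive_failures = 0
            processed.append(response)
    return processed, needs_review
```

The key design point is that a single bad response only resets nothing and costs nothing; only a sustained run of failures (which usually means the site's markup changed) stops the crawl.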
It isn't like normal programming: it's dirty, and any analysis you plan to do has to start from the assumption that you will never get all of the data.
In terms of cleaner code, TDD will help here, but yes, you will need dozens if not hundreds of small methods for classifying a response.
I've done likewise, but found that scraping at scale (even with something like EC2) is very expensive, which pretty much precludes having any sort of free tier/freemium model around the resulting site.
I would have had to charge my users a lot of money to pay for the kind of scraping required to power my product.
How so? I was able to scrape almost every UK newspaper with a single EC2 instance. It took a few hours to work through all the sources, but it was a viable way to gather a huge amount of data.
Be warned! Google App Engine pricing has gone up by as much as 500% since Nov 7. The code is based on GAE, so if you spend time building around it, be ready for some crazy billing. I'm speaking from personal experience :(
Yes!!! I was hoping to have pay-per-impression advertising that would pay the bill, but apps that don't offer wordy clutter are not allowed to have ads. I was rejected by AdSense, BuySellAds, and AdBrite. Sometimes I think we build the Internet for robots, not for humans.
I have ~100 visits per day. That costs me ~$0.10 per day.
The content is refreshed every 15 min - both database and UI.
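On App Engine, a periodic refresh like this is normally driven by a `cron.yaml` file; a minimal sketch follows. The `/refresh` URL is an assumed handler name for illustration, not taken from the hnpickup source.

```yaml
# cron.yaml - hypothetical scheduled refresh; /refresh is an assumed handler.
cron:
- description: refresh the scraped data and the cached UI
  url: /refresh
  schedule: every 15 minutes
```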
Probably the production version of the code will get memcache and some other features. Apart from a couple of suggestions from this discussion, the educational version of hnpickup will be frozen. That said, any simplifications or corrections to the code or documentation are very welcome.
If you click on the question mark in the top right corner you will get the definitions. More documentation is in the source code.
EDIT: As I say in the "warning" section of the readme file: usually it's not a good idea to show the raw data, because it will confuse the end user, and if you add more explanation you will cause clutter and cognitive overload. I show the raw data mostly for people who want to dig into the source code.
On the topic of confusion, I think your demo would be clearer if you renamed "News stories bottom average score" to "Front page stories bottom average score". News is just a little too close to Newest, so I had to reread the definitions of both a couple of times. Cool project otherwise!
The pickup ratio is supposed to measure your chance of getting from the "newest" page to the "news" (front) page. But the more I think about it, the more I'm inclined to think it's really just a very-hot-topic detection device.
A pickup ratio > 1 means that the "newest" page has a couple of very interesting stories that are also seen on the "news" page.
A pickup ratio > 2 means that there is some sort of hysteria on HN about some particular topic ...
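The thread doesn't spell out the formula, so here is one purely illustrative reading of a "pickup"-style metric: comparing average story scores on the front ("news") page against the "newest" page. The real hnpickup computation lives in its source code and may differ entirely.

```python
# Illustrative guess at a pickup-style metric; NOT the actual hnpickup formula.

def pickup_ratio(front_page_scores, newest_scores):
    """Ratio of the average score on the "news" page to the average on
    the "newest" page. Under this reading, a high ratio would suggest
    newest stories are being picked up quickly (or one topic is very hot)."""
    if not front_page_scores or not newest_scores:
        return 0.0
    front_avg = sum(front_page_scores) / len(front_page_scores)
    newest_avg = sum(newest_scores) / len(newest_scores)
    return front_avg / newest_avg if newest_avg else 0.0
```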
As an analyst who wants to start playing with making web apps and has no idea where to start, this is pretty much exactly what I was looking for. Thank you!
I just use it offline, on my laptop. The UI is just a nice frontend to an idea that you know will work. Then you might have a ton of ideas that you want to check before publishing. One question you may want to answer: is the behavior on HN predictable?
Google App Engine doesn't support R, but you can use NumPy:
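For instance, the kind of simple time-series smoothing you might reach for R for can be done with NumPy. This is a generic example, not hnpickup code:

```python
# A simple moving average of story scores with NumPy, as a stand-in for
# the sort of lightweight statistics R would otherwise provide.
import numpy as np

def moving_average(scores, window=4):
    """Smooth a score series with a simple moving average of the given window."""
    kernel = np.ones(window) / window
    # "valid" mode only emits points where the window fully overlaps the data.
    return np.convolve(scores, kernel, mode="valid")
```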