How to Scrape Websites in Ruby on Rails using scRUBYt (dmix.ca)
16 points by dmix on Sept 7, 2008 | 6 comments



When I did http://zerodaydeals.com (not mine anymore), the biggest pain was scraping all the sites, since there were 50+ of them... and people would constantly change layouts, breaking it.

I tried using plugins but ended up resorting to a pile of regular expressions for each site. I wonder if this would be better, as I don't think it was around at the time.


[shameless plug] Our startup - Feedity - (http://feedity.com) provides custom RSS feeds for virtually any webpage, which helps many small-medium online services in data integration.


So, when I was creating Adonomics I considered using scRUBYt for scraping Facebook. Here's why I didn't go with it:

1. It was hard to get scRUBYt to learn the "correct" rules. It tends to be over-specific or over-broad.

2. It was slow. Really slow. Using Ruby Mechanize was at least 2-3x faster, and even that was pretty slow.

3. The learner doesn't like bad HTML, but as a practical matter you have to deal with poor markup all the time. scRUBYt makes it hard to get to the guts of the system.

YMMV.


I wrote a similar tutorial a while back using Python and BeautifulSoup (http://lethain.com/entry/2008/aug/10/an-introduction-to-comp...). BeautifulSoup doesn't learn in any sense of the word, but it plays very nicely with malformed (even extraordinarily malformed) HTML, and you can usually do things in a way that is resistant to changes (a combination of tag and id|class is usually fairly resistant to non-drastic changes).
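
To illustrate the tag-plus-class point, here is a minimal sketch in Ruby rather than Python, using REXML as a stand-in for BeautifulSoup. Unlike BeautifulSoup, REXML needs well-formed markup, so this only shows the selector idea, not the malformed-HTML handling. The two page layouts, the `price` class, and the `extract` helper are all made up for the example.

```ruby
require 'rexml/document'

# Two hypothetical versions of the same product page. The redesign adds a
# banner and nests the price inside extra wrapper elements.
OLD_LAYOUT = <<~HTML
  <html><body>
    <div id="header">Shop</div>
    <div id="item"><span class="price">19.99</span></div>
  </body></html>
HTML

NEW_LAYOUT = <<~HTML
  <html><body>
    <div id="header">Shop</div>
    <div id="banner">Sale!</div>
    <div id="wrapper">
      <div id="item"><p>Now only <span class="price">19.99</span></p></div>
    </div>
  </body></html>
HTML

# Full path from the root: depends on the exact nesting of the page.
FULL_PATH = '/html/body/div/span'
# Tag plus class: matches the element wherever it sits in the tree.
BY_TAG_AND_CLASS = '//span[@class="price"]'

# Returns the text of the first node matching xpath, or nil if none matches.
def extract(markup, xpath)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end
```

With these inputs, `extract(OLD_LAYOUT, FULL_PATH)` finds the price but `extract(NEW_LAYOUT, FULL_PATH)` returns `nil`, while `BY_TAG_AND_CLASS` finds the price in both layouts.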


I took a look at scRUBYt. It looked nice, but I ended up just using hpricot: fast and pretty easy. I would keep the site open on one screen with Firebug, grab the XPath expressions from it, slap them in a Ruby string, and then put in placeholders for the parameters.

Only minutes of work.
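
Roughly, the XPath-in-a-string approach could look like the sketch below. It uses REXML instead of hpricot so that it runs without extra gems; REXML requires well-formed markup, which real pages often are not. The template, the sample page, and the `scrape` helper are hypothetical, and the placeholder values are assumed not to contain quote characters.

```ruby
require 'rexml/document'

# An XPath expression as it might be copied from a browser inspector, with
# %{...} placeholders for the parts that vary. All names are hypothetical.
FIELD_XPATH = '//div[@id="%{section}"]/span[@class="%{field}"]'

# Hypothetical sample page standing in for a fetched document.
PAGE = <<~HTML
  <html><body>
    <div id="specs">
      <span class="weight">2 kg</span>
      <span class="color">Red</span>
    </div>
    <div id="pricing">
      <span class="amount">19.99</span>
    </div>
  </body></html>
HTML

# Fills the placeholders in the template and returns the text of every
# matching node. Values are inserted as-is, so they must not contain quotes.
def scrape(markup, template, params)
  doc = REXML::Document.new(markup)
  xpath = format(template, params)
  REXML::XPath.match(doc, xpath).map(&:text)
end

colors = scrape(PAGE, FIELD_XPATH, section: 'specs', field: 'color')
prices = scrape(PAGE, FIELD_XPATH, section: 'pricing', field: 'amount')
```

Keeping the expression in one template string means a layout change on the site only requires editing that string, not the code around it.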


Nice work, the script is a lot simpler than I thought.



