How to Scrape Websites in Ruby on Rails using scRUBYt

ejs · on Sept 7, 2008

When I did http://zerodaydeals.com (not mine anymore), the biggest pain was scraping all the sites, since there were 50+ of them... and people would constantly change layouts, breaking the scrapers.

I tried using plugins but ended up resorting to a pile of regular expressions for each site. I wonder if scRUBYt would have been better; I don't think it was around at the time.

nreece · on Sept 8, 2008

[shameless plug] Our startup, Feedity (http://feedity.com), provides custom RSS feeds for virtually any webpage, which helps many small and medium-sized online services with data integration.

jfarmer · on Sept 7, 2008

So, when I was creating Adonomics I considered using scRUBYt for scraping Facebook. Here's why I didn't go with it:

1. It was hard to get scRUBYt to learn the "correct" rules. It tends to be over-specific or over-broad.

2. It was slow. Really slow. Using Ruby Mechanize was at least 2-3x faster, and even that was pretty slow.

3. The learner doesn't like bad HTML, but as a practical matter you have to deal with poor markup all the time. scRUBYt makes it hard to get to the guts of the system.

YMMV.

lethain · on Sept 7, 2008

I wrote a similar tutorial a while back using Python and BeautifulSoup (http://lethain.com/entry/2008/aug/10/an-introduction-to-comp...). BeautifulSoup doesn't learn in any sense of the word, but it plays very nicely with malformed (even extraordinarily malformed) HTML, and you can usually write selectors that survive layout changes (a combination of tag and id|class is usually fairly resistant to non-drastic changes).

michaelneale · on Sept 8, 2008

I took a look at scRUBYt. It looked nice, but I ended up just using hpricot: fast and pretty easy. I would keep one screen with the site open in Firebug, grab the XPath expressions from it, slap them in a Ruby string, and then put in placeholders for the parameters.

Only minutes of work.
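
A minimal sketch of the placeholder idea, in case it helps. The XPath template, the element ids and classes, and the `price_xpath` helper are all made up for illustration; a real expression would be whatever Firebug shows for the site in question.

```ruby
# Hypothetical XPath template, of the kind one might copy out of Firebug.
# The ids and class names are illustrative and do not come from a real site.
PRICE_XPATH = "//div[@id='%<container>s']//tr[%<row>d]/td[@class='%<cell>s']"

# Fill the named placeholders to produce a concrete XPath expression.
def price_xpath(container:, row:, cell:)
  format(PRICE_XPATH, container: container, row: row, cell: cell)
end

xpath = price_xpath(container: "deals", row: 2, cell: "price")
puts xpath
```

The resulting string would then be handed to whatever search call the parser provides (with hpricot, something along the lines of `doc.search(xpath)`). Keeping the template in one constant per site means a layout change only touches that one string.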

nickvn7 · on Sept 7, 2008

Nice work, the script is a lot simpler than I thought.