Hacker News

So, when I was creating Adonomics, I considered using scRUBYt to scrape Facebook. Here's why I didn't go with it:

1. It was hard to get scRUBYt to learn the "correct" rules. It tended to produce rules that were either over-specific or over-broad.

2. It was slow. Really slow. Using Ruby Mechanize was at least 2-3x faster, and even that was pretty slow.

3. The learner chokes on bad HTML, but as a practical matter you have to deal with poor markup all the time, and scRUBYt makes it hard to get at the guts of the system to work around it.

YMMV.




I wrote a similar tutorial a while back using Python and BeautifulSoup (http://lethain.com/entry/2008/aug/10/an-introduction-to-comp...). BeautifulSoup doesn't "learn" in any sense of the word, but it plays very nicely with malformed (even extraordinarily malformed) HTML, and you can usually write selectors that are resistant to change: a combination of tag and id|class usually survives non-drastic changes to the page.
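A minimal sketch of that tag-plus-class approach, assuming the bs4 package is installed. The HTML snippet and the class/id names here are made up for illustration; note the unclosed <b> tag, which BeautifulSoup closes for you when the enclosing </p> arrives.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <b> tag is never closed.
html = ('<div id="post">'
        '<p class="body"><b>Hello world</p>'
        '<p class="meta">by anon</p>'
        '</div>')

# html.parser is the stdlib backend; no external parser needed.
soup = BeautifulSoup(html, "html.parser")

# Selecting by tag + class/id survives changes to surrounding layout,
# as long as the class or id itself is kept.
body = soup.find("p", class_="body").get_text()
meta = soup.find("div", id="post").find("p", class_="meta").get_text()
print(body)  # -> Hello world
print(meta)  # -> by anon
```

If the site later wraps these paragraphs in extra divs or reorders them, the lookups above still work, whereas a positional rule (third child of the second div) would break.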





