Hacker News
A Little Web Spider I Wrote Last Night (code included) (trailbehind.com)
24 points by andrewljohnson on March 12, 2009 | 7 comments



Here is a web spider in one line (wrapped here for readability):

  perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
      $ua = LWP::UserAgent->new;
      while (my $link = shift @ARGV) {
          print STDERR "working on $link";
          HTML::LinkExtor->new(sub {
              my ($t, %a) = @_;
              my @links = map { url($_, $link)->abs() }
                          grep { defined } @a{qw/href img/};
              print STDERR "+ $_" foreach @links;
              push @ARGV, @links;
          })->parse(do {
              my $r = $ua->simple_request(HTTP::Request->new("GET", $link));
              $r->content_type eq "text/html" ? $r->content : "";
          });
      }
  ' http://www.google.com
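For anyone who doesn't read Perl, the same loop (fetch a page, extract link targets, resolve them against the page's URL, push them onto the work queue) can be sketched in Python with only the standard library. The names here (`LinkExtractor`, `extract_links`, `crawl`) are mine, not from the one-liner, and unlike the one-liner this sketch adds a `seen` set and a page limit so it terminates:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href/src targets, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links, like url($_, $link)->abs() does.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, max_pages=10):
    """Breadth-first crawl mirroring the one-liner's @ARGV queue."""
    from urllib.request import urlopen

    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        print("working on", url)
        try:
            with urlopen(url) as resp:
                # Skip non-HTML responses, as the one-liner does.
                if resp.headers.get_content_type() != "text/html":
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        for link in extract_links(html, url):
            print("+", link)
            queue.append(link)
```

Like the original, it's breadth-first because new links go on the back of the queue while work is taken from the front.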



I tested the above script; it works. Nicely done.


Thanks, but I didn't write it. I remembered it from a magazine from 1999. The link to the author is in the comment below.


Here's a little web spider I just wrote: "wget -r".

I don't have a problem with people writing their own quickie scripts, but they aren't really worth putting online without some other compelling reason.


It's hard to tell whether you're after learning or something to use, but if you haven't seen it yet, check out http://scrapy.org/ for ideas (or even something to use, if that's what you're after).


For completeness, and to keep with the recent meme around here, here's a web spider I wrote a while ago in Erlang.

http://github.com/michaelmelanson/spider



