Web crawler using Perl

I want to develop a web crawler that starts from a start URL, crawls 100 HTML pages that it considers to belong to the same domain as the start URL, and keeps a record of the traversed URLs, avoiding duplicates. I wrote the following, but the value of $url_count never seems to increase, and the links found include URLs from other domains as well. How can I solve this? Here I used stackoverflow.com as the start URL.

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

##open file to store links
open my $file1, ">>", ("extracted_links.txt");
select($file1);

##starting URL
my @urls = 'http://stackoverflow.com/';

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;

while (@urls)
{
    my $url = shift @urls;
    if (exists $visited{$url}) ##check if URL already exists
    {
        next;
    }
    else
    {
        $url_count++;
    }

    my $request  = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);

    if ($response->is_error())
    {
        printf "%s\n", $response->status_line;
    }
    else
    {
        my $contents = $response->content();
        $visited{$url} = 1;
        @lines = split(/\n/, $contents);
        foreach $line (@lines)
        {
            $line =~ m@ (((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
            print "$1\n";
            push @urls, $$line[2];
        }

        sleep 60;

        if ($visited{$url} == 100)
        {
            last;
        }
    }
}

close $file1;
1 answer

A few points: your parsing of the URLs is fragile, so you certainly won't pick up relative links. Also, you are not checking for 100 links but for 100 matches of the current URL ($visited{$url} == 100), which is almost certainly not what you mean. Finally, I'm not too familiar with LWP, so instead I'll give an example using the Mojolicious suite of tools.

It seems to work; maybe it will give you some ideas.

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;

##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = $base;

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if exists $visited{$url};

    print "$url\n";
    print $log "$url\n";

    $visited{$url} = 1;
    $url_count++;

    # find all <a> tags and act on each
    $ua->get($url)->res->dom('a')->each(sub{
        my $url = Mojo::URL->new($_->{href});
        if ( $url->is_abs ) {
            return unless $url->host eq $base->host;
        }
        push @urls, $url;
    });

    last if $url_count == 100;
    sleep 1;
}
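One caveat about the example above: relative hrefs are queued as-is, so a later $ua->get on them has no host to request from. Below is a minimal, hypothetical variant of the same callback (assuming Mojo::URL's base, to_abs and is_abs behave as documented) that resolves relative links against the start URL and skips anchors without an href before queueing:

    # hypothetical variant of the callback above; reuses $ua, $url, $base
    # and @urls from the loop and resolves relative links before queueing
    $ua->get($url)->res->dom('a')->each(sub{
        my $href = $_->{href} // return;    # ignore <a> tags without an href
        my $link = Mojo::URL->new($href);
        $link = $link->base($base)->to_abs unless $link->is_abs;
        return unless $link->host && $link->host eq $base->host;
        push @urls, $link;
    });

Clearing the fragment before queueing (e.g. with $link->fragment(undef)) would also keep %visited from treating page#foo and page#bar as two different pages.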