Using regex to extract URLs from plain text with Perl

How can I use Perl regular expressions to extract all URLs of a specific domain (with possible variable subdomains) with a specific extension from plain text? I tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
print $1."\n";
}

This fails and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that this would not happen, because I used .*?, which is non-greedy and should give me the shortest match. Can someone tell me what I am doing wrong? (I don't want an uber-complex, canned regular expression for validating URLs; I want to know what's wrong with mine so I can learn from it.)

+5
7 answers

Use URI::Find. It locates URIs in arbitrary text for you, builds on the URI module, and copes with the odd edge cases that ad-hoc patterns miss; once you have the URIs, filter them by host and extension.

+16

CPAN: Regexp::Common::URI

Don't roll your own URL matching; it has a lot of edge cases, and these regexes are already written and tested.

To keep only the URLs you care about, match the captured URL against a second, simpler pattern, as below.

#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

while (<>) {
  if (m/$RE{URI}{HTTP}{-keep}/) {
    print $_ if $1 =~ m/what-you-want/;
  }
}
+5

If I understand correctly, you want to pull out every link that points to a file with one of the extensions *.htm, *.html, *.gif, *.jpeg. The script below covers *.html and *.htm with a single alternative, "html?", which makes the trailing "l" optional; ".jpg" and ".jpeg" are handled the same way via "jpe?g".

Input: a plain-text file containing the links (first argument).
Output: a file with the extracted URLs, one per line (second argument).

The script:

use strict;
use warnings;

# Expect exactly two arguments: the input text file and the output file.
if ( $#ARGV != 1 ) {
    print
      "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
    die $!;
}
open FILE_LINKS,  $ARGV[0]    or die $!;
open FILE_RESULT, ">$ARGV[1]" or die $!;

my @Links;
foreach (<FILE_LINKS>) {
    # The pattern has three capture groups, so in list context it returns
    # three entries per match; the full URL is every third one, from index 0.
    my @Matches = ( $_ =~ m/((https?|ftp):\/\/[^\s]+\.(html?|gif|jpe?g))/g );
    for ( my $i = 0 ; $i < @Matches ; $i += 3 ) {
        push( @Links, $Matches[$i] );
    }
}
print FILE_RESULT join( "\n", @Links );

Output:

http://homepage.com/woot.gif
http://shomepage.com/woot.gif
+2

A URL can't contain whitespace, so instead of .*? use \S*?: a non-greedy run of non-whitespace characters can never cross a space into the next word or URL.

+1
https?\:\/\/[^\s]+[\/\w]

+1

Non-greedy does not mean "shortest match anywhere in the string". The engine returns the leftmost match; *? only keeps that match as short as possible.

Matching therefore starts at the first http in the string, and from there .*? expands, across spaces and even other URLs, until the rest of the pattern can succeed.

On a side note, you don't need to escape the slashes if you pick a different delimiter for m//. Any of these are equivalent:

m|(http://.*?homepage.com\/.*?\.gif)|

m#(http://.*?homepage.com\/.*?\.gif)#

m<(http://.*?homepage.com\/.*?\.gif)>

See perlre for the details.

0

A quick (and dirty) way to get/extract URLs from a string, without fussing over every edge case, is:

m,(http.*?://([^\s)\"](?!ttp:))+),g

...which you can exercise like this:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -ne 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'


a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 

http://www.abc.com/dss.htm?a=1&p=2#chk
https://poi.com/a%20b
http://bbb.com
http://roch.com/abc

If you're a regex noob like me, you can watch the engine at work by enabling the re "debug" pragma:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -dne 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'

The pattern anchors on http(s):// (via http.*?://), then consumes characters that are not whitespace, ')' or '"'; after each character the (?!ttp:) lookahead checks that what follows is not "ttp:", so the run can never include an 'h' that starts another http:. That is what splits the glued http://bbb.comhttp://roch.com/abc into two URLs.


Hope this helps someone,
Hooray!

EDIT: Oops, the just-found URI::Find::Simple (search.cpan.org) seems to do the same thing (via a regex; see "getting the site title from a link in a string").
