Using regex to extract URLs from plain text with Perl

How can I use Perl regular expressions to extract all URLs of a specific domain (with possible variable subdomains) with a specific extension from plain text? I tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
print $1."\n";
}

This fails and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that this would not happen, because I used .*?, which is non-greedy and should give me the shortest match. Can someone tell me what I am doing wrong? (I don't want an uber-complex, canned regular expression for validating URLs; I want to know what's wrong with mine so I can learn from it.)

+5
7 answers

Use URI::Find. It locates URIs in arbitrary text for you, builds on the URI module, and copes with the odd edge cases that ad-hoc patterns miss; once you have the URIs, filter them by host and extension.

+16

CPAN: Regexp::Common::URI

Don't roll your own URL matching; it has a lot of edge cases, and these regexes are already written and tested.

To keep only the URLs you care about, match the captured URL against a second, simpler pattern, as below.

#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

while (<>) {
  if (m/$RE{URI}{HTTP}{-keep}/) {
    print $_ if $1 =~ m/what-you-want/;
  }
}
+5

If I understand correctly, you want to pull out every link that points to a file with one of the extensions *.htm, *.html, *.gif, *.jpeg. The script below covers *.html and *.htm with a single alternative, "html?", which makes the trailing "l" optional; ".jpg" and ".jpeg" are handled the same way via "jpe?g".

Input: a plain-text file containing the links (first argument).
Output: a file with the extracted URLs, one per line (second argument).

The script:

use strict;
use warnings;

# Expect exactly two arguments: the input text file and the output file.
if ( $#ARGV != 1 ) {
    print
      "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
    die $!;
}
open FILE_LINKS,  $ARGV[0]    or die $!;
open FILE_RESULT, ">$ARGV[1]" or die $!;

my @Links;
foreach (<FILE_LINKS>) {
    # The pattern has three capture groups, so in list context it returns
    # three entries per match; the full URL is every third one, from index 0.
    my @Matches = ( $_ =~ m/((https?|ftp):\/\/[^\s]+\.(html?|gif|jpe?g))/g );
    for ( my $i = 0 ; $i < @Matches ; $i += 3 ) {
        push( @Links, $Matches[$i] );
    }
}
print FILE_RESULT join( "\n", @Links );

Output:

http://homepage.com/woot.gif
http://shomepage.com/woot.gif
+2

A URL can't contain whitespace, so instead of .*? use \S*?: a non-greedy run of non-whitespace characters can never cross a space into the next word or URL.

+1
https?\:\/\/[^\s]+[\/\w]

+1

Non-greedy does not mean "shortest match anywhere in the string". The engine returns the leftmost match; *? only keeps that match as short as possible.

Matching therefore starts at the first http in the string, and from there .*? expands, across spaces and even other URLs, until the rest of the pattern can succeed.

On a side note, you don't need to escape the slashes if you pick a different delimiter for m//. Any of these are equivalent:

m|(http://.*?homepage.com\/.*?\.gif)|

m#(http://.*?homepage.com\/.*?\.gif)#

m<(http://.*?homepage.com\/.*?\.gif)>

See perlre for the details.

0

A quick (and dirty) way to get/extract URLs from a string, without fussing over every edge case, is:

m,(http.*?://([^\s)\"](?!ttp:))+),g

...which you can exercise like this:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -ne 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'


a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 

http://www.abc.com/dss.htm?a=1&p=2#chk
https://poi.com/a%20b
http://bbb.com
http://roch.com/abc

If you're a regex noob like me, you can watch the engine at work by enabling the re "debug" pragma:

$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -dne 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'

The pattern anchors on http(s):// (via http.*?://), then consumes characters that are not whitespace, ')' or '"'; after each character the (?!ttp:) lookahead checks that what follows is not "ttp:", so the run can never include an 'h' that starts another http:. That is what splits the glued http://bbb.comhttp://roch.com/abc into two URLs.


Hope this helps someone,
Hooray!

EDIT: Oops, the just-found URI::Find::Simple (search.cpan.org) seems to do the same thing (via a regex; see "getting the site title from a link in a string").
