How to remove external links from HTML using Perl?

I am trying to remove external links from an HTML document, but keep the bindings, but I have no luck. Next regex

$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match the start of the anchor tag and the end of the external link tag, for example.

<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->

so i get nothing instead

<a HREF="#FN1" name="01">1</a>
some other html

It so happened that all anchors have their href attribute in upper case, so I know that I can make a case-sensitive match, but I do not want to rely on it, as this always happens in the future.

Can I change something that only matches the tag a?

0
source share
5 answers

, , , ( , , <a class="external" href="...">), s///.

s///, , , href, , , .

: ;-), , HTML::TokeParser::Simple. , HTML::TokeParser.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

NB: , "", , , .html, .htm. , , , href, . , , href . , , , , .

+11

, SAX HTML::Parser:

use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;

my $omitted;

sub tag_handler { 
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) { 
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        $omitted = substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}

sub end_handler { 
    my $tag_name = shift;
    if ( $tag_name eq 'a' && $omitted ) { 
        $omitted = false;
        return;
    }
    print shift;
}

my $parser
    = HTML::Parser->new( api_version => 3
                       , default_h   => [ sub { print shift; }, 'text' ]
                       , start_h     => [ \&tag_handler, 'self,tagname,text,attr' ]
                       , end_h       => [ \&end_handler, 'tagname,text' ]
                       );
$parser->parse_file( $path_to_file ) or die $OS_ERROR;
+6

Another solution. I love HTML :: TreeBuilder and the family.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $a ($root->find_by_tag_name('a')) {
    if ($a->attr('href') !~ /^#/) {
        $a->replace_with_content($a->as_text);
    }
}
print $root->as_HTML(undef, "\t");

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>
+1
source

Why not just remove links for which the href attribute does not start with a pound sign? Something like that:

html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
0
source

Even simpler if you don't need tag attributes:

$html =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;
0
source

All Articles