Perl HTML parsing lib/tool

Are there any powerful tools or libraries for Perl like BeautifulSoup for Python?

Thanks

+5
3 answers

HTML::TreeBuilder::XPath is a decent solution for most problems.
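
A minimal sketch of what that can look like, assuming HTML::TreeBuilder::XPath is installed from CPAN. The sample markup and the variable names are made up for illustration; the list items are left unclosed on purpose to show that sloppy HTML is tolerated.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup (not from the question); the <li> tags are
# deliberately left unclosed.
my $html = <<'HTML';
<html><head><title>Sample page</title></head>
<body>
<ul>
  <li><a href="/one">First</a>
  <li><a href="/two">Second</a>
</ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is finalized

# findvalue returns the string value of the XPath expression
my $title = $tree->findvalue('//title');

# findnodes returns the matching elements; attr() reads an attribute
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a');

print "Title: $title\n";
print "Link:  $_\n" for @hrefs;

$tree->delete;    # free the tree explicitly when done
```

The elements returned by findnodes are ordinary HTML::Element objects, so the usual tree methods remain available next to the XPath queries.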

+9

I have never used BeautifulSoup, but from a quick look at its documentation, you might want HTML::TreeBuilder. It handles even broken documents well and lets you traverse the parsed tree or query for elements; see the look_down method in HTML::Element.
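
As a rough illustration of look_down, assuming HTML::TreeBuilder is installed from CPAN (the markup and variable names below are hypothetical):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the <p> is never closed.
my $html = '<html><body><div id="menu">'
         . '<a href="/a" class="nav">A</a><a href="/b">B</a>'
         . '</div><p>Unclosed paragraph</body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context: all elements matching the criteria
my @links = $tree->look_down( _tag => 'a' );

# Scalar context: only the first match; attribute criteria can be combined
my $nav = $tree->look_down( _tag => 'a', class => 'nav' );

# A criterion value may also be a regex
my @b_links = $tree->look_down( _tag => 'a', href => qr{^/b} );

# Copy out what we need before the tree is deleted
my @hrefs    = map { $_->attr('href') } @links;
my $nav_text = $nav->as_text;
my $b_count  = scalar @b_links;

print "Link: $_\n" for @hrefs;
print "Nav link text: $nav_text\n";

$tree->delete;    # free the tree explicitly when done
```

look_down searches the element it is called on and everything below it, so the same call works on any subtree, not only on the document root.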

If you like or know XPath, see daxim's answer. If you prefer CSS selectors, look at Web::... or Mojo::DOM.
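
For the CSS-selector route, a minimal Mojo::DOM sketch might look like this. It assumes the Mojolicious distribution is installed; the sample markup and variable names are invented for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup with unclosed <li> tags.
my $html = <<'HTML';
<html><head><title>Sample page</title></head>
<body>
<ul class="nav">
  <li><a href="/one">First</a>
  <li><a href="/two">Second</a>
</ul>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the CSS selector
my $title = $dom->at('title')->text;

# find() returns a Mojo::Collection; map() calls attr('href') on each
# element and each() flattens the collection into a Perl list
my @hrefs = $dom->find('ul.nav a[href]')->map( attr => 'href' )->each;

print "Title: $title\n";
print "Link:  $_\n" for @hrefs;
```

Note that at() returns undef when nothing matches, so real code should check the result before calling a method on it.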

+6

Since you are after power, you can use XML::LibXML to parse HTML. The advantage is that you then have the full power of the fastest and best XML toolchain available to Perl (except MSXML, which is MS-only) for processing your document, including XPath and XSLT (which would require a re-parse if you had used a parser other than XML::LibXML).

use strict;
use warnings;
use XML::LibXML;
# In 1.70, the recover and suppress_warnings options won't shut up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
    my( $subref, $bufref ) = @_;
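    # Open an in-memory filehandle that writes into the caller's buffer.
    open my $fhbuf, '>', $bufref or die "Cannot open in-memory buffer: $!";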
    local *STDERR = $fhbuf;
    $subref->(); # execute code that needs to be shut up
    return;
}
# ==== main ============================================================
my $url = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 ); # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024
my $dom; # receive document
shutup_stderr
    sub { $dom = $parser->load_html( location => $url ) }, # code
    \my $errmsg; # buffer
# Now process document as XML.
my @nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for @nodes;
printf "Length of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );
+1