Perl HTML parsing lib / tool

Question

Are there any powerful libraries or tools for Perl, like BeautifulSoup for Python?

Thanks.

+5

perl beautifulsoup

icn Mar 20 '11 at 6:09

3 answers

I have never used BeautifulSoup, but from a quick look at its documentation, you might want HTML::TreeBuilder. It handles even broken documents well and lets you traverse the parsed tree or query for elements; see the look_down method in HTML::Element.

If you know or like XPath, see daxim's answer. If you prefer to pick elements via CSS selectors, have a look at Web::… or Mojo::DOM.
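To make the look_down suggestion concrete, here is a minimal sketch. The HTML snippet and variable names are made up for illustration, and it assumes HTML::TreeBuilder is installed from CPAN. The markup is deliberately sloppy (unclosed li elements) to show that the parser copes with it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; note the unclosed <li> elements.
my $html = <<'HTML';
<html><head><title>Demo page</title></head>
<body>
<p class="intro">First paragraph</p>
<ul>
  <li><a href="/one">One</a>
  <li><a href="/two">Two</a>
</ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every descendant matching the criteria.
my @hrefs = map { $_->attr('href') } $tree->look_down( _tag => 'a' );
print "Link: $_\n" for @hrefs;

# Criteria can be combined; in scalar context only the first match is returned.
my $intro      = $tree->look_down( _tag => 'p', class => 'intro' );
my $intro_text = $intro ? $intro->as_text : '';
print "Intro: $intro_text\n";

# Older HTML::Tree releases need an explicit delete to free the tree.
$tree->delete;
```

look_down also accepts regular expressions and code references as criteria, which helps when a plain attribute match is not enough.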

+6

bvr Mar 20 '11 at 9:56

Since you are looking for power, you can use XML::LibXML to parse HTML. The advantage is that you then have the full power of the fastest and best XML toolchain available for Perl (except for MSXML, which is MS-only) for processing your document, including XPath and XSLT (which would require a re-parse if you had used a parser other than XML::LibXML).

use strict;
use warnings;
use XML::LibXML;
# In 1.70, the recover and suppress_warnings options won't shut up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
    my( $subref, $bufref ) = @_;
    open my $fhbuf, '>', $bufref or die "Cannot open in-memory buffer: $!";
    local *STDERR = $fhbuf;
    $subref->(); # execute code that needs to be shut up
    return;
}
# ==== main ============================================================
my $url = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 ); # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024
my $dom; # receive document
shutup_stderr
    sub { $dom = $parser->load_html( location => $url ) }, # code
    \my $errmsg; # buffer
# Now process document as XML.
my @nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for @nodes;
printf "Length of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );
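The script above only uses getElementsByLocalName, so here is a small sketch of the XPath side mentioned in the answer. It parses a made-up HTML string instead of fetching a URL; the sample markup and variable names are assumptions for illustration. Whether the suppress_* options actually silence everything depends on the XML::LibXML version, as noted above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample markup with an unclosed <p>.
my $html = <<'HTML';
<html><head><title>Demo page</title></head>
<body>
<p>See <a href="/one">One</a> and <a href="/two">Two</a>.
</body></html>
HTML

# recover => 1 keeps parsing despite errors; the suppress_* options
# ask the parser to stay quiet about them.
my $dom = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# findvalue evaluates an XPath expression and returns its string value.
my $title = $dom->findvalue('/html/head/title');

# findnodes returns the matching nodes, here the href attribute nodes.
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');

print "Title: $title\n";
print "Link: $_\n" for @hrefs;
```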

+1

Lumi Mar 20 '11 at 11:08

daxim · Accepted Answer · 2011-03-20T09:50:43+0000

HTML::TreeBuilder::XPath is a decent solution for most problems.
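A minimal sketch of what that looks like in practice, assuming HTML::TreeBuilder::XPath is installed from CPAN. The sample markup and variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup; the <li> elements are left unclosed on purpose.
my $html = <<'HTML';
<html><head><title>Demo page</title></head>
<body>
<ul>
  <li><a href="/one">One</a>
  <li><a href="/two">Two</a>
</ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue returns the string value of the XPath expression.
my $title = $tree->findvalue('/html/head/title');

# findnodes returns HTML::Element objects, so attr and as_text are available.
my @links = map { [ $_->attr('href'), $_->as_text ] }
            $tree->findnodes('//a[@href]');

print "Title: $title\n";
print "$_->[0] => $_->[1]\n" for @links;

# Older HTML::Tree releases need an explicit delete to free the tree.
$tree->delete;
```

This keeps the forgiving HTML::TreeBuilder parser and adds XPath queries on top of it, so look_down and XPath can be mixed on the same tree.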
