How to make pQuery work with slightly distorted HTML?

Question

pQuery is a pragmatic Perl port of the jQuery JavaScript framework, and it can be used for screen scraping.

pQuery is quite sensitive to malformed HTML. Consider the following example:

use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $page = pQuery($html_malformed);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

pQuery will not find the title tag in the above example because of the doubled ">>" at the end of the invalid HTML.

To make my pQuery-based applications more tolerant of malformed HTML, I need to pre-process the HTML by cleaning it up before passing it to pQuery.

Starting with the code snippet above, what is the most reliable pure-Perl way to clean up the HTML so that pQuery can parse it?

+5

jquery perl cpan screen-scraping

knorv Oct 9 '10 at 15:39


3 answers

Try HTML::Tidy, which fixes invalid HTML.
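A minimal sketch of how that suggestion might look, assuming HTML::Tidy is installed from CPAN. The helper name `tidy_html` is hypothetical and not part of either module. Note that HTML::Tidy wraps a C library, so it is not a pure-Perl solution, and whether pQuery parses the tidied output as expected has not been verified here.

```perl
use strict;
use warnings;
use HTML::Tidy;   # CPAN module; wraps the tidy C library (not pure Perl)
use pQuery;

# Hypothetical helper: run a string of HTML through HTML::Tidy
# and return the repaired markup as a single string.
sub tidy_html {
    my ($html) = @_;
    my $tidy = HTML::Tidy->new;
    return $tidy->clean($html);
}

# Same malformed input as in the question (note the trailing ">>").
my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";

# Clean first, then hand the result to pQuery.
my $page  = pQuery(tidy_html($html_malformed));
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";
```

Keeping the cleaning step in its own sub makes it easy to swap in a different cleaner later without touching the pQuery code.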

+2

lonesomeday Oct 9 '10 at 15:47


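Is this what you want?

# Strip stray "<" before and ">" after each tag (s/// substitution; "*" so either side may be absent)
$html_malformed =~ s|<*(<.*?>)>*|$1|g;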

-1

elektronikLexikon Oct 9 '10 at 16:00


cjm · Accepted Answer · 2010-10-09T19:27:03+0000

I would report this as a bug in pQuery. Here is a workaround:

use HTML::TreeBuilder;
use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $html_cleaned = HTML::TreeBuilder->new_from_content($html_malformed);
my $page = pQuery($html_cleaned->as_HTML);
$html_cleaned->delete;
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

It doesn't make much sense that this helps, since pQuery already uses HTML::TreeBuilder as its underlying parsing engine, but it works.
