How to make pQuery work with slightly malformed HTML?

pQuery is a pragmatic port of the jQuery JavaScript framework to Perl, which can be used for screen scraping.

pQuery is quite sensitive to malformed HTML. Consider the following example:

use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $page = pQuery($html_malformed);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

pQuery does not find the title tag in the example above because of the doubled ">>" at the end of the malformed HTML.

To make my pQuery-based applications more tolerant of malformed HTML, I need to clean up the HTML before passing it to pQuery.

Given the code snippet above, what is the most robust pure-Perl way to clean up the HTML so that pQuery can parse it?

+5
3 answers

I would report this as a bug in pQuery. Here is a workaround:

use HTML::TreeBuilder;
use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $html_cleaned = HTML::TreeBuilder->new_from_content($html_malformed);
my $page = pQuery($html_cleaned->as_HTML);
$html_cleaned->delete;
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

This doesn't make much sense, since pQuery already uses HTML::TreeBuilder as its underlying parsing engine, but it works.

+4

Try HTML::Tidy, which fixes invalid HTML.
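
A minimal sketch of how that could look with the snippet from the question, assuming HTML::Tidy and the tidy library it wraps are installed (so this is not a pure-Perl solution). It uses the module's default configuration, and the variable name $html_cleaned is only illustrative:

```perl
use strict;
use warnings;
use HTML::Tidy;
use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";

# Let tidy repair the markup first; clean() returns the corrected document as a string.
my $tidy = HTML::Tidy->new;
my $html_cleaned = $tidy->clean($html_malformed);

# Hand the repaired markup to pQuery exactly as in the original example.
my $page = pQuery($html_cleaned);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";
```

Whether the default tidy settings are enough for every kind of broken input is something you would need to check against your own pages.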

+2

Is this what you want?

# Strip stray extra angle brackets around a tag, e.g. "</html>>" becomes "</html>".
$html_malformed =~ s|<*(<.*?>)>*|$1|g;

-1
