Using xmllint and xpath with a less-than-perfect HTML document?

Question

I have an HTML page created using an existing tool - I cannot change the output of this tool.

However, I want to use xmllint with the --xpath option to extract several specific pieces of information from the loaded web page. The problem is that the page starts with:

<html lang=en><head>...

And xmllint produces errors almost immediately:

 html.out:2: parser error : AttValue: " or ' expected
 <html lang=en><head>
            ^

The problem is probably the missing quotes around the value of the lang attribute. The same problem occurs throughout the page, though only sporadically.

Almost every browser can parse this just fine - how can I convince xmllint to do this? I would like to avoid having to introduce an intermediate step to “fix” the file. Instead, I would like to:

1) Find a flag, check parameter, etc. that helps the parser, or:

2) Use another tool. (But which? xmllint has always been my go-to for XPath on the command line.)

The standalone xpath tool fails in the same way:

 > xpath html.out '//myquery...'
 not well-formed (invalid token) at line 2, column 11, ...

+8

html xml xpath xmllint

Craig otis Jan 31 '14 at 12:14

3 answers

You need to pre-process the HTML with a lenient parser. (This is the main difference: HTML has a much laxer syntax than XML.) For example, try HTML5-Tidy and let xmllint work on the result:

 input HTML
     |
     v
    Tidy
     |
     v
  xmllint
     |
     v
   result
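The pipeline above could be sketched roughly as follows. This is only an illustration under assumptions: it presumes the classic `tidy` command-line tool is installed, and `sample.html` is a made-up file that reproduces the unquoted attribute from the question. Because Tidy's XHTML output puts elements in the XHTML namespace, the query matches on `local-name()` instead of a plain element name.

```shell
# Hypothetical sample input with an unquoted attribute value, as in the question.
printf '<html lang=en><head><title>Example</title></head><body><p>Hi</p></body></html>\n' > sample.html

# tidy: -q = quiet, -asxml = emit well-formed XHTML, -numeric = numeric entities
#       (so xmllint does not need the DTD to resolve things like &nbsp;).
# xmllint: "-" reads the document from standard input.
title=$(tidy -q -asxml -numeric --show-warnings no sample.html 2>/dev/null \
  | xmllint --xpath '//*[local-name()="title"]/text()' - 2>/dev/null)

echo "$title"
```

Tidy exits with a non-zero status when it has emitted warnings, so a script using `set -e` or `pipefail` would need to tolerate that.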

+4

Boldewyn Jan 31 '14 at 12:26

If the errors do not abort the parsing, you can simply hide them:

 2>/dev/null

Then there is Xidel, which I wrote specifically to extract data from HTML pages. (It is not perfect, though: I have been told about two malformed documents that it could not process.)

 xidel html.out -e //yourquery...

+4

Benibela Jan 31 '14 at 12:33

Stefano sanfilippo · Accepted Answer · 2014-01-31T12:26:30+0000

You can enable the HTML parser in xmllint with the --html command-line option. This way you can process HTML documents.
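As a minimal sketch of combining `--html` with `--xpath` (assuming an xmllint build recent enough to support `--xpath`; `sample.html` is a hypothetical file that mimics the unquoted `lang=en` attribute from the question):

```shell
# Hypothetical sample input with an unquoted attribute value, as in the question.
printf '<html lang=en><head><title>Example</title></head><body><p>Hi</p></body></html>\n' > sample.html

# --html switches xmllint to libxml2's lenient HTML parser;
# 2>/dev/null hides any recoverable parser warnings printed on stderr.
title=$(xmllint --html --xpath '//title/text()' sample.html 2>/dev/null)
lang=$(xmllint --html --xpath 'string(//html/@lang)' sample.html 2>/dev/null)

echo "title: $title"
echo "lang:  $lang"
```

No intermediate clean-up step is needed here, since the HTML parser itself accepts the unquoted attribute value.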
