Using xmllint and XPath with a less-than-well-formed HTML document?

I have an HTML page created using an existing tool - I cannot change the output of this tool.

However, I want to use xmllint with its --xpath option to extract several specific pieces of information from the downloaded page. The problem is that the page starts with:

<html lang=en><head>...

And xmllint produces errors almost immediately:

 html.out:2: parser error : AttValue: " or ' expected
 <html lang=en><head>
            ^

The problem is probably the missing quotes around the value of the lang attribute. The same problem appears throughout the page, although only sporadically.

Almost every browser can parse this just fine - how can I convince xmllint to do this? I would like to avoid having to introduce an intermediate step to “fix” the file. Instead, I would like to:

1) Find a flag, validation parameter, etc. that makes the parser more lenient, or:

2) Use another tool. (But which one? xmllint has always been my go-to for XPath on the command line.)

Similarly, using the xpath tool on its own gives:

 > xpath html.out '//myquery...'
 not well-formed (invalid token) at line 2, column 11, ...
+8
html xml xpath xmllint
3 answers

You can enable the HTML parser in xmllint with the --html command-line option. This lets it process HTML documents that are not well-formed XML.
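As a minimal sketch of how --html combines with --xpath, the snippet below writes a small sample page with unquoted attribute values and queries it. The sample page, the temporary file, and the `title` variable are made up for illustration; substitute html.out and your own query. Redirecting stderr hides any warnings the HTML parser still prints.

```shell
# Hypothetical sample page with unquoted attribute values, like the one in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html lang=en><head><title>Example</title></head>
<body><p class=note>Hello</p></body></html>
EOF

# --html selects libxml2's lenient HTML parser instead of the strict XML parser;
# --xpath evaluates the expression and prints the result.
# 2>/dev/null hides parser warnings that may still be reported.
title=$(xmllint --html --xpath 'string(//title)' "$tmp" 2>/dev/null)
echo "$title"

rm -f "$tmp"
```

The HTML parser does not put elements into a namespace, so plain element names such as //title work in the query.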

+11

You need to pre-process the HTML with a lenient parser. (This is the main difference: HTML has a much looser syntax than XML.) That is, try HTML5-Tidy and let xmllint work on the result:

 input HTML
     |
     v
   Tidy
     |
     v
  xmllint
     |
     v
  result
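The pipeline above could look roughly like this on the command line. This is a sketch that assumes the `tidy` executable from HTML Tidy is installed; the sample page, the temporary file, and the `note` variable are hypothetical. Tidy's XHTML output places elements in the XHTML namespace, so the query matches on local-name() rather than on a bare element name.

```shell
# Hypothetical sample page with unquoted attribute values.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html lang=en><head><title>Example</title></head>
<body><p class=note>Hello</p></body></html>
EOF

# Step 1: tidy rewrites the page as well-formed XHTML (-asxml), quietly (-q),
#         using numeric entities (-numeric) so no DTD is needed to resolve them.
# Step 2: xmllint reads that XML from stdin ("-") and evaluates the XPath query.
note=$(tidy -q -asxml -numeric "$tmp" 2>/dev/null |
  xmllint --xpath "string(//*[local-name()='p'])" - 2>/dev/null)
echo "$note"

rm -f "$tmp"
```

Tidy exits with a non-zero status when it only emits warnings, so checking its exit code directly is not a reliable success test here; the pipeline's status is that of xmllint.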
+4

If the errors do not abort parsing, you can simply hide them by redirecting stderr:

 2>/dev/null 

Then there is Xidel, which I wrote specifically to select data from HTML pages. (It is not perfect, though: I have been told about two malformed documents that it could not process.)

 xidel html.out -e //yourquery... 
+4