PHP parsing of invalid HTML

Question


I am trying to parse some HTML that is not on my server:

$dom = new DOMDocument();
$dom->loadHTMLfile("http://www.some-site.org/page.aspx");
echo $dom->getElementById('his_id')->item(0);

but PHP returns an error like "ID his_id already defined in http://www.some-site.org/page.aspx, line: 33". I think this is because DOMDocument is dealing with invalid HTML. How can I parse the page even though it is not valid?


php html-parsing domdocument

kmunky Apr 24 '10 at 1:11


3 answers

cletus · Answer 1 · 2010-04-24T01:23:36+0000

You should run the HTML through HTML Tidy to clean it up before parsing it.

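$html = file_get_contents('http://www.some-site.org/page.aspx');

// Tidy options: replace presentational markup with style rules and emit HTML
$config = array(
    'clean' => 'yes',
    'output-html' => 'yes',
);

$tidy = tidy_parse_string($html, $config, 'utf8');
$tidy->cleanRepair();

// Cast the tidy object to a string to get the repaired markup
$dom = new DOMDocument;
$dom->loadHTML((string) $tidy);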

See this list of options.

Craig Francis · Answer 2 · 2011-04-21T09:21:39+0000

Take a look at libxml_use_internal_errors():

http://php.net/libxml_use_internal_errors
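A minimal sketch of how that function might be used here. It assumes a small inline HTML string with a duplicated id standing in for the remote page (the markup and variable names are made up for illustration). Parser complaints are collected in an internal buffer instead of being raised as PHP warnings, and the elements are then looked up with DOMXPath, which returns a node list, so item(0) is valid on the result.

```php
<?php
// Hypothetical sample markup with a duplicated id, standing in for the remote page.
$html = '<html><body><p id="his_id">first</p><p id="his_id">second</p></body></html>';

// Buffer libxml errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Grab whatever the parser reported (may be empty), then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// An XPath query returns a DOMNodeList containing every element with that id.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[@id="his_id"]');

$first = $nodes->item(0);
echo $first->textContent, PHP_EOL;
```

Each entry in $errors is a LibXMLError object, so the message and line properties can be logged if you want to know what was wrong with the page.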

Annika Backstrom · Answer 3 · 2010-04-24T01:24:41+0000

Reading the documentation, I see $dom->strictErrorChecking, which defaults to TRUE. What happens if you set $dom->strictErrorChecking = false?
