Parsing invalid HTML in PHP

I am trying to parse some HTML that is not on my server:

    $dom = new DOMDocument();
    $dom->loadHTMLFile("http://www.some-site.org/page.aspx");
    // getElementById() returns a single DOMElement (or null), not a node list
    echo $dom->getElementById('his_id')->textContent;

but PHP emits a warning along the lines of "ID his_id already defined in http://www.some-site.org/page.aspx, line: 33". I think this is because DOMDocument is dealing with invalid HTML. How can I parse the page even though it is not valid?

+7
php html-parsing domdocument
3 answers

You can run HTML Tidy to clean up the markup before parsing it.

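    $html = file_get_contents('http://www.some-site.org/page.aspx');

    // Tidy options: clean up the markup and emit HTML
    $config = array(
        'clean'       => 'yes',
        'output-html' => 'yes',
    );

    $tidy = tidy_parse_string($html, $config, 'utf8');
    $tidy->cleanRepair();

    // Cast explicitly: the tidy object converts to the repaired markup
    $dom = new DOMDocument;
    $dom->loadHTML((string) $tidy);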

See the list of Tidy configuration options.

+6

Take a look at libxml_use_internal_errors():

http://php.net/libxml_use_internal_errors
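
A minimal sketch of how that function might be used here, assuming the problem is a duplicated id in the remote page. The inline $html string is a made-up stand-in for the real page, and the XPath lookup is my own choice for picking the first matching element; it is not something the question prescribes.

```php
<?php
// Hypothetical sample markup with a duplicated id, standing in for the remote page.
$html = '<html><body><p id="his_id">first</p><p id="his_id">second</p></body></html>';

// Collect libxml diagnostics internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect whatever the parser complained about, then reset the error buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo 'libxml: ', trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}

// An XPath query returns every element carrying the id, duplicates included.
$xpath   = new DOMXPath($dom);
$matches = $xpath->query('//*[@id="his_id"]');
$first   = $matches->item(0);

echo $first !== null ? $first->textContent : 'not found', PHP_EOL;
```

Which diagnostics end up in $errors depends on the libxml version in use, so the loop only prints whatever was collected.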

+1

Reading the documentation, I see $dom->strictErrorChecking, which defaults to TRUE. What happens if you set $dom->strictErrorChecking = false?
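
One way to find out is to try it. The sketch below is only an experiment, not a claim about the result: it uses a made-up $html string with a duplicated id, turns strictErrorChecking off, and counts the PHP warnings raised while loading. As far as I can tell, the documentation describes this property as controlling whether DOMException is thrown on errors, so it may well leave parser warnings untouched.

```php
<?php
// Hypothetical sample markup with a duplicated id.
$html = '<html><body><p id="his_id">first</p><p id="his_id">second</p></body></html>';

// Capture any warnings raised during parsing instead of printing them.
$warnings = [];
set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
    $warnings[] = $errstr;
    return true; // mark the warning as handled
});

$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html);

restore_error_handler();

printf("%d warning(s) raised with strictErrorChecking = false\n", count($warnings));
foreach ($warnings as $warning) {
    echo ' - ', trim($warning), PHP_EOL;
}
```

If warnings are still listed, the property does not help with this problem and one of the other answers is the way to go.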

0