PHP XPath on an HTML document strips all tags; I want to keep them

I am parsing an HTML document with XPath and I want to keep all inner HTML tags.

The HTML in question is an unordered list with many list items.

<ul id="adPoint1"><li>Business</li><li>Contract</li></ul> 

I am parsing the document using the following PHP code:

 $dom = new DOMDocument();
 @$dom->loadHTML($output);
 $this->xpath = new DOMXPath($dom);
 $testDom = $this->xpath->evaluate("//ul[@id='adPoint1']");
 $test = $testDom->item(0)->nodeValue;
 echo htmlentities($test);

For some reason, the output always has its HTML tags stripped. I guess this is because XPath was not intended to be used this way, but is there any way around this?

I would really like to continue using XPATH, since I already use it to parse other areas of the page (individual href elements) without problems.

EDIT: I know there is a better way to get the data by iterating over the UL's children. There is a more complex part of the page that I also want to parse (a JavaScript block); the list above is just a more understandable example.

The actual block of code I want:

 <script language="javascript">document.write(rot_decode('<u7>Pbagnpg Qrgnvyf</u7><qy vq="pbagnpgQrgnvyf"><qg>Cu:</qg><qq>(58) 0078 8455</qq></qy>'));</script> 

The problem is that the output drops all closing tags but keeps the opening tags. I assume the parser is trying to parse the inner elements instead of treating the script content as a string.

If I try to select a script element with

 $testDom = $this->xpath->evaluate("//div[@id='businessDetails']/script");
 $test = $testDom->item(0)->nodeValue;
 echo htmlentities($test);

my output is the following; as you can see, all closing tags are missing.

 document.write(rot_decode('<u7>Pbagnpg Qrgnvyf<qy vq="pbagnpgQrgnvyf"><qg>Cu:<qq>(58) 0078 8455')); 
+4
3 answers

I decided that XPath was not suitable for what I wanted, and I now use PHP Simple HTML DOM Parser, which is much better suited to the task.

It gives direct access to an element's inner HTML.

 foreach ($this->simpleDom->find('script[language=javascript]') as $script) {
     echo htmlentities($script->innertext());
 }
+2

Yes, you are right: the DOM parses the children (because they are elements and not plain text), and the right way to get data from the children is to iterate over all of them. Implementing this is not difficult, though.
Instead of the XPath expression

 //ul[@id='adPoint1'] 

try

 //ul[@id='adPoint1']/li 

which selects the list items, each with its own string value.
If you post the expected output (for both the ul and the script), you might get more answers.
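
A minimal sketch of this suggestion, using the sample list from the question. The variable names ($html, $items) are illustrative and not taken from the asker's code:

```php
<?php
// Sample markup copied from the question.
$html = '<ul id="adPoint1"><li>Business</li><li>Contract</li></ul>';

$dom = new DOMDocument();
// Suppress parser warnings, as the question's code does.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select each <li> and read its own text value,
// instead of the flattened text of the whole <ul>.
$items = [];
foreach ($xpath->query("//ul[@id='adPoint1']/li") as $li) {
    $items[] = trim($li->nodeValue);
}

print_r($items);
```

This only returns the text of each item. It does not preserve markup nested inside an item.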

+1

Pass the node as the optional argument in a call to saveHTML() on its owner document object.

 string DOMDocument::saveHTML ([ DOMNode $node = NULL ] ) 

See:

http://php.net/manual/en/domdocument.savehtml.php
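
As a rough sketch of how this could look with the list from the question (assuming PHP 5.3.6 or later, where saveHTML() accepts a node; the variable names are illustrative). Serializing the node itself gives its outer HTML, and serializing each child gives the inner HTML:

```php
<?php
// Sample markup copied from the question.
$html = '<ul id="adPoint1"><li>Business</li><li>Contract</li></ul>';

$dom = new DOMDocument();
// Suppress parser warnings, as the question's code does.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$ul = $xpath->query("//ul[@id='adPoint1']")->item(0);

// Outer HTML: serialize the <ul> node itself, tags included.
$outer = $dom->saveHTML($ul);

// Inner HTML: serialize every child node and concatenate the results.
$inner = '';
foreach ($ul->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo htmlentities($outer), "\n";
echo htmlentities($inner), "\n";
```

The serializer may insert line breaks between elements, so the strings are not guaranteed to match the input byte for byte. This also does not address the script case from the question, where the closing tags are already missing from the parsed text.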

0
