How to parse actual HTML from a page using CURL?
I am trying to scrape a webpage that contains the following structure:

<p class="row">
  <span>stuff here</span>
  <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
  <div>Link Description Here</div>
</p>

I am fetching the page using cURL:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>

I did some research and found that I should not use a regex to parse the HTML returned from cURL, and that I should use the PHP DOM instead. Here is how I did it:
$newDom = new DOMDocument();
$newDom->loadHTML($html);
$newDom->preserveWhiteSpace = false;
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

Now I do not pretend to fully understand this, but I get the gist of it, and I do get the sections I want. The only problem is that I only get the text of the HTML page, as if I had copied it from my browser window. What I want is the actual HTML, because I want to extract the links and use them like this:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I can't get the link, because I am only getting the text of the web page and not the source, as I want. I know that curl_exec pulls the HTML, because I tried just that, so I believe that the DOM is somehow stripping out the HTML that I want.
According to the comments on the PHP DOM tutorial, you should use the following inside your loop:

$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());

This will set $innerHTML to the HTML content of the node.
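Put together with your existing loop, a minimal sketch might look like this (assuming $newDom and $sections are set up exactly as in your question; htmlspecialchars() is only there so the markup is displayed instead of rendered):

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    // Import the <p> node (and its children) into a throwaway document
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    // saveHTML() on the temporary document yields the node's markup
    $innerHTML = trim($tmp_dom->saveHTML());
    echo htmlspecialchars($innerHTML) . "<br>";
}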
But I think you really want to get the "a" nodes under the "p" node, so do the following:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}

This simply prints the text of each link.
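Since you ultimately want the href as well as the link text, a small extension of that loop could read the attribute with getAttribute() (a sketch, assuming the same $newDom as above):

foreach ($newDom->getElementsByTagName('p') as $sec) {
    foreach ($sec->getElementsByTagName('a') as $link) {
        // getAttribute() returns the raw href value from the source HTML
        $href = $link->getAttribute('href');
        $text = $link->nodeValue;
        echo '<a href="' . htmlspecialchars($href) . '">LINK</a> ' . htmlspecialchars($text) . "<br>";
    }
}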
You can pass the node to DOMDocument::saveXML(). Try the following:
$printString = $newDom->saveXML($sections->item($i));
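In context, that would look something like this (a sketch using the $newDom and $sections from the question; on PHP 5.3.6 and later, DOMDocument::saveHTML() also accepts a node argument if you prefer HTML serialization):

for ($i = 0; $i < $nodeNo; $i++) {
    // saveXML() with a node argument serializes just that node and its children
    $printString = $newDom->saveXML($sections->item($i));
    echo htmlspecialchars($printString) . "<br>";
}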
You can take a look at phpQuery for doing server-side HTML parsing. Basic example:
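A minimal sketch of what that might look like for your markup (assuming phpQuery is installed and its class file is included; the include path and the p.row selector are based on the structure in your question):

require_once 'phpQuery/phpQuery.php'; // path depends on where you put phpQuery

// $html is the string returned by curl_exec()
$doc = phpQuery::newDocumentHTML($html);

// pq() works like jQuery's $() against the loaded document
foreach (pq('p.row a') as $a) {
    $a = pq($a); // wrap the raw DOMElement to get the phpQuery helpers
    echo '<a href="' . htmlspecialchars($a->attr('href')) . '">LINK</a> '
        . htmlspecialchars($a->text()) . "<br>";
}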