How to parse actual HTML from a page using CURL?
I am trying to scrape a webpage that contains the following structure:

<p class="row">
  <span>stuff here</span>
  <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
  <div>Link Description Here</div>
</p>

I am fetching the page using cURL:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>

I did some research and found that I should not use a regex to parse the HTML returned from cURL, and that I should use the PHP DOM instead. Here is how I did it:
$newDom = new DOMDocument();
$newDom->loadHTML($html);
$newDom->preserveWhiteSpace = false;
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

Now I do not pretend to fully understand this, but I get the gist of it, and I do get the sections I want. The only problem is that I only get the text of the HTML page, as if I had copied it from my browser window. What I want is the actual HTML, because I want to extract the links and use them like this:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I can't get the link, because I am only getting the text of the web page and not the source, as I want. I know that curl_exec pulls the HTML, because I tried just that, so I believe that the DOM is somehow stripping out the HTML that I want.
According to the comments on the PHP DOM tutorial, you should use the following inside your loop:

$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());

This will set $innerHTML to the HTML content of the node.
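Put together with your existing loop, a minimal sketch might look like this (assuming $newDom and $sections are set up exactly as in your question; htmlspecialchars() is only there so the markup is displayed instead of rendered):

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    // Import the <p> node (and its children) into a throwaway document
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    // saveHTML() on the temporary document yields the node's markup
    $innerHTML = trim($tmp_dom->saveHTML());
    echo htmlspecialchars($innerHTML) . "<br>";
}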
But I think you really want to get the "a" nodes under the "p" node, so do the following:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}

This simply prints the text of each link.
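Since you ultimately want the href as well as the link text, a small extension of that loop could read the attribute with getAttribute() (a sketch, assuming the same $newDom as above):

foreach ($newDom->getElementsByTagName('p') as $sec) {
    foreach ($sec->getElementsByTagName('a') as $link) {
        // getAttribute() returns the raw href value from the source HTML
        $href = $link->getAttribute('href');
        $text = $link->nodeValue;
        echo '<a href="' . htmlspecialchars($href) . '">LINK</a> ' . htmlspecialchars($text) . "<br>";
    }
}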
You can pass the node to DOMDocument::saveXML(). Try the following:
$printString = $newDom->saveXML($sections->item($i));
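In context, that would look something like this (a sketch using the $newDom and $sections from the question; on PHP 5.3.6 and later, DOMDocument::saveHTML() also accepts a node argument if you prefer HTML serialization):

for ($i = 0; $i < $nodeNo; $i++) {
    // saveXML() with a node argument serializes just that node and its children
    $printString = $newDom->saveXML($sections->item($i));
    echo htmlspecialchars($printString) . "<br>";
}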
You can take a look at phpQuery for doing server-side HTML parsing. Basic example:
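A minimal sketch of what that might look like for your markup (assuming phpQuery is installed and its class file is included; the include path and the p.row selector are based on the structure in your question):

require_once 'phpQuery/phpQuery.php'; // path depends on where you put phpQuery

// $html is the string returned by curl_exec()
$doc = phpQuery::newDocumentHTML($html);

// pq() works like jQuery's $() against the loaded document
foreach (pq('p.row a') as $a) {
    $a = pq($a); // wrap the raw DOMElement to get the phpQuery helpers
    echo '<a href="' . htmlspecialchars($a->attr('href')) . '">LINK</a> '
        . htmlspecialchars($a->text()) . "<br>";
}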