PHP: how to get the correct HTML element closing tag

Suppose I have an HTML page as follows:

    <!-- This is the opening tag -->
    <div class="content_text">
        <div>Title</div>
        <div>Author Name</div>
        <div>Some complicated HTML elements correctly validated</div>
        <b>Some more text</b>
        <img ... />
        <div> more and more text </div>
    </div><!-- This is the correct closing tag -->

How can I get the content between the opening tag of the div with class="content_text" and its matching closing tag?

I tried regular expressions, but I could not find any way, simple or complicated, to do this.

I also tried XPath, but so far I have not been able to get the content. Instead, I only got the text inside the outer div.

4 answers
    // Fetch the page with cURL.
    $scrape_address = "http://www.al-madina.com/node/444862";
    $ch = curl_init($scrape_address);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_ENCODING, "");         // accept any encoding cURL supports
    $data = curl_exec($ch);

    // I couldn't get an element by attribute, so I just replaced the class with an id.
    $data = str_replace('class="content_text"', 'id="my_unique_id"', $data);

    // Parse the HTML, suppressing warnings about invalid markup.
    $domd = new DOMDocument();
    libxml_use_internal_errors(true);
    $domd->loadHTML($data);
    libxml_use_internal_errors(false);

    $div = $domd->getElementById("my_unique_id");
    if ($div) {
        // Copy the div and all its descendants into a fresh document and print it.
        $dom2 = new DOMDocument();
        $dom2->appendChild($dom2->importNode($div, true));
        echo $dom2->saveHTML();
    } else {
        echo "Nothing found";
    }

You can use PHP Simple HTML DOM Parser to parse HTML, much as you would use DOMDocument for XML.

Note: PHP also supports DOMDocument.


I would suggest PHP's DOMDocument. If the content is not always structured in the same way, regular expressions will not work, and even if it is, the result will not be pretty.

Also, here's a question about a similar situation that was resolved using SimpleXML; it may help.
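
As a rough illustration of the DOMDocument suggestion, here is a minimal sketch that finds the div by its class and returns only what sits between its opening tag and its matching closing tag. The sample markup, the "footer" div, and the helper name innerHtmlByClass are made up for this example; they are not from the question's page.

```php
<?php
// Hypothetical sample markup, loosely modelled on the question's page.
$html = '<div class="content_text"><div>Title</div><div>Author Name</div>'
      . '<b>Some more text</b><div> more and more text </div></div>'
      . '<div class="footer">not wanted</div>';

// Hypothetical helper: returns the inner HTML of the first <div> that
// carries the given class, or null if there is no such div.
function innerHtmlByClass(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('div') as $div) {
        // The class attribute may hold several space-separated names.
        $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
        if (in_array($class, $classes, true)) {
            $inner = '';
            // Serialize each child; the parser has already paired the tags,
            // so nested divs end at their own closing tags.
            foreach ($div->childNodes as $child) {
                $inner .= $doc->saveHTML($child);
            }
            return $inner;
        }
    }
    return null;
}

$content = innerHtmlByClass($html, 'content_text');
echo $content, "\n";
```

Because the parser builds a tree, there is no need to search for the "correct" closing tag by hand; the nested divs are simply children of the outer one.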


You already seem to be able to run XPath queries successfully, so I will skip the PHP code and go straight to the XPath part.

I am not sure what you mean by "content", so here are two alternatives.

If you want all text nodes inside the <div/>:

 //div[@class="content_text"]//text() 

If you want the whole XML subtree, including elements:

 //div[@class="content_text"] 

Both return a set of results, so be sure to loop over it.
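
For completeness, here is a sketch of how both expressions might be run from PHP with DOMXPath and how the result set can be looped over. The sample markup is invented for the example, and the variable names are arbitrary.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<div class="content_text"><div>Title</div><b>Some more text</b></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings on sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First alternative: every text node below the div.
$texts = [];
foreach ($xpath->query('//div[@class="content_text"]//text()') as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

// Second alternative: the div element itself, serialized with its markup.
$markup = [];
foreach ($xpath->query('//div[@class="content_text"]') as $element) {
    $markup[] = $doc->saveHTML($element);
}

print_r($texts);
print_r($markup);
```

Note that @class="content_text" only matches when the attribute is exactly that string; a div with several classes would need a different predicate.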
