PHP: how to get the correct HTML element closing tag

Question


Suppose I have an HTML page as follows:

    <!-- This is the opening tag -->
    <div class="content_text">
        <div>Title</div>
        <div>Author Name</div>
        <div>Some complicated HTML elements correctly validated</div>
        <b>Some more text</b>
        <img ... />
        <div> more and more text </div>
    </div><!-- This is the correct closing tag -->

How can I get the content between the opening tag of the div with class="content_text" and its matching closing tag?

I tried regular expressions, but I could not find any way to do this, simple or complicated.

I also tried XPath, but so far I have not been able to get the content. Instead, I only got the text inside the outer div.

+4

php regex xpath domdocument

Shehabix Apr 9 '13 at 10:07


4 answers

You can use PHP Simple HTML DOM Parser to parse HTML, much as you would use DOMDocument for XML.

Note: PHP's built-in DOMDocument can also parse HTML.

+5

Shoe Apr 9 '13 at 22:22


I would suggest PHP's DOMDocument. If the content is not always structured in the same way, regular expressions will not do, and even if it is, the result will not be pretty.

Also, here's a question about a similar situation that was resolved using SimpleXML; perhaps it will help.
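As a rough illustration of the DOMDocument suggestion, here is a minimal sketch. It relies on the parser to pair each opening tag with its closing tag, so nested divs need no special handling. The helper name `findDivByClass` and the sample markup are made up for this example, and the comparison assumes the class attribute holds exactly one class name.

```php
<?php
// Hypothetical helper (not from the original answer): returns the outer HTML
// of the first <div> whose class attribute equals $class, or null if none.
function findDivByClass(string $html, string $class): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') === $class) {
            // Serializing a node includes everything up to its matching closing tag.
            return $doc->saveHTML($div);
        }
    }

    return null;
}

// Made-up sample input, loosely based on the question's markup.
$sample = '<div class="content_text"><div>Title</div><b>Some more text</b>'
        . '<div>more text</div></div><p>outside</p>';

echo findDivByClass($sample, 'content_text'), "\n";
```

The serializer may insert line breaks between block elements, so the output is not guaranteed to match the input byte for byte.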

+2

pilsetnieks Apr 9 '13 at 22:22


You already seem to be able to run XPath queries successfully, so I'll skip the PHP code and go straight to the XPath part.

Not sure what you mean by "content", so I offer several alternatives:

You want all text nodes inside the <div/>:

 //div[@class="content_text"]//text()

You want all XML, including elements:

 //div[@class="content_text"]

Both return a result set, so be sure to iterate over it.
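For readers who want the PHP side as well, the two queries above could be run roughly as follows. This is only a sketch using DOMDocument and DOMXPath; the sample markup and variable names are assumptions made for the example, and it loops over each result set as the answer advises.

```php
<?php
// Made-up sample input, loosely based on the question's markup.
$html = '<div class="content_text"><div>Title</div><div>Author Name</div>'
      . '<b>Some more text</b></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Alternative 1: every text node below the div.
$texts = [];
foreach ($xpath->query('//div[@class="content_text"]//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

// Alternative 2: the matching element(s), serialized with their markup.
$outerHtml = [];
foreach ($xpath->query('//div[@class="content_text"]') as $element) {
    $outerHtml[] = $doc->saveHTML($element);
}

print_r($texts);
print_r($outerHtml);
```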

0

Jens erat Apr 9 '13 at 10:52


Shehabix · Accepted Answer · 2013-04-09T22:22:33+0000

    $scrape_address = "http://www.al-madina.com/node/444862";

    // Fetch the page and return the body as a string
    $ch = curl_init($scrape_address);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_ENCODING, "");
    $data = curl_exec($ch);

    // I couldn't get an element by attribute, so I just replaced the class with an id
    $data = str_replace('class="content_text"', 'id="my_unique_id"', $data);

    // Parse the HTML, suppressing warnings about invalid markup
    $domd = new DOMDocument();
    libxml_use_internal_errors(true);
    $domd->loadHTML($data);
    libxml_use_internal_errors(false);

    $div = $domd->getElementById("my_unique_id");
    if ($div) {
        // Copy the matched div into a fresh document and print it
        $dom2 = new DOMDocument();
        $dom2->appendChild($dom2->importNode($div, true));
        echo $dom2->saveHTML();
    } else {
        echo "Nothing found";
    }
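The class-to-id replacement can probably be avoided by selecting on the class attribute with DOMXPath. The sketch below is an alternative, not the accepted answer's method. It returns only the markup between the opening tag and its matching closing tag by serializing the div's child nodes. The helper name `innerHtmlByClass` and the sample input are hypothetical, and the class name is assumed to contain no quote characters because it is placed directly into the XPath expression.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML
// of the first <div> carrying the given class, or null if there is none.
function innerHtmlByClass(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match $class as a whole token, so class="content_text extra" also matches.
    // Assumption: $class contains no quote characters.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $div = $xpath->query($query)->item(0);
    if ($div === null) {
        return null;
    }

    // Concatenate the serialized children: everything between the tags.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    return $inner;
}

// Made-up sample input.
$page = '<div class="content_text extra"><div>Title</div><b>Bold</b></div>'
      . '<div>other</div>';

echo innerHtmlByClass($page, 'content_text'), "\n";
```

Unlike a plain string replacement on class="content_text", the token match also works when the div carries additional classes.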
