PHP DOMDocument: getting textContent while ignoring script tags and comments

I use DOMDocument to load HTML from the database as follows:

$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();

Then I get the body text by doing the following:

$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);

The text I get back includes everything inside <body>, including the contents of <script> elements. How can I remove those and keep only the real text content?

2 answers

You need to visit all the nodes and return their text. If some of them contain other nodes, visit them too.

This can be done using this basic recursive algorithm:

extractNode:
    if node is a text node or a cdata node, return its text
    if it is an element node, a document node, or a document fragment node:
        if it’s a script node, return an empty string
        return a concatenation of the result of calling extractNode on all the child nodes
    for everything else return nothing

Implementation:

function extractText($node) {
    // Text and CDATA nodes carry the actual text.
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    }
    // Containers: recurse into the children, except for <script> elements.
    if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        if ('script' === $node->nodeName) {
            return '';
        }

        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments and every other node type contribute nothing.
    return '';
}

This returns the text content of the given $node, ignoring script tags and comments.

$words = htmlspecialchars(extractText($bodyNodes->item(0)));

Try it here: http://codepad.org/CS3nMp7U
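For anyone who wants to run this without the codepad link, here is a minimal self-contained sketch. The sample markup and the variable names are my own assumptions, not the asker's data, and the function is the same recursive walk described above, repeated so the snippet stands alone.

```php
<?php
// Recursive walk as described in the answer, repeated so this runs on its own.
function extractText(DOMNode $node): string
{
    switch ($node->nodeType) {
        case XML_TEXT_NODE:
        case XML_CDATA_SECTION_NODE:
            return $node->nodeValue;
        case XML_ELEMENT_NODE:
        case XML_DOCUMENT_NODE:
        case XML_DOCUMENT_FRAG_NODE:
            if ('script' === $node->nodeName) {
                return ''; // drop script contents entirely
            }
            $text = '';
            foreach ($node->childNodes as $child) {
                $text .= extractText($child);
            }
            return $text;
        default:
            return ''; // comments and other node types add nothing
    }
}

// Hypothetical sample input, made up for illustration.
$data = '<html><body><p>Hello <b>world</b>. </p>'
      . '<script>var x = 1;</script><!-- a comment --><p>Bye</p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of using @
$doc = new DOMDocument();
$doc->loadHTML($data);

$body = $doc->getElementsByTagName('body')->item(0);
$text = extractText($body);

echo $text, "\n"; // prints: Hello world. Bye
```

The script body and the comment are both missing from the result. Apply htmlspecialchars() only at the point where the text is written back into HTML.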


You can also do this with XPath.

Using the HTML from arnaud's example:

$html = <<< HTML
<p>
    test<span>foo<b>bar</b>
</p>
<script>
    ignored
</script>
<!-- comment is ignored -->
<p>test</p>
HTML;

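The query below selects every text node under body that is not inside a script element. preserveWhiteSpace is set to false, and the normalize-space() test in the query leaves out text nodes that contain only whitespace.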

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHtml($html);

$xp    = new DOMXPath($dom);
$nodes = $xp->query('/html/body//text()[
    not(ancestor::script) and
    not(normalize-space(.) = "")
]');

foreach($nodes as $node) {
    var_dump($node->textContent);
}

Output:

string(10) "
    test"
string(3) "foo"
string(3) "bar"
string(4) "test"
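If a single string is needed rather than separate nodes, the matched text nodes can be trimmed and joined. This is only a sketch: the sample markup and the names $parts and $words are assumptions for illustration, and joining with a space is one possible choice.

```php
<?php
// Hypothetical sample input, made up for illustration.
$html = '<html><body><p>Hello <b>world</b></p>'
      . '<script>var x = 1;</script><!-- a comment --><p> Bye </p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);

$xp    = new DOMXPath($dom);
$nodes = $xp->query(
    '/html/body//text()[not(ancestor::script) and not(normalize-space(.) = "")]'
);

// Trim each text node and join the pieces with a single space.
$parts = [];
foreach ($nodes as $node) {
    $parts[] = trim($node->nodeValue);
}
$words = implode(' ', $parts);

echo $words, "\n"; // prints: Hello world Bye
```

One caveat with this approach: joining with a space also separates fragments that were adjacent in the markup, so foo<b>bar</b> comes out as "foo bar" rather than "foobar".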