You need to visit all the nodes and return their text. If some of them contain other nodes, visit them too.
This can be done using this basic recursive algorithm:
extractNode:
if node is a text node or a cdata node, return its text
if is an element node or a document node or a document fragment node:
if it’s a script node, return an empty string
return a concatenation of the result of calling extractNode on all the child nodes
for everything else return nothing
Implementation:
function extractText($node) {
if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
return $node->nodeValue;
} else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
if ('script' === $node->nodeName) return '';
$text = '';
foreach($node->childNodes as $childNode) {
$text .= extractText($childNode);
}
return $text;
}
}
This will return the textContent of the given $ node, ignoring the tags and comments of the script.
$words = htmlspecialchars(extractText($bodyNodes->item(0)));
: http://codepad.org/CS3nMp7U