XPath with html5lib in PHP

I have this basic code, which does not work. How can I use XPath with html5lib in PHP? Or is there any other way to use XPath with HTML5?

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);

$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');

foreach ($elements as $element)
{
    var_dump($element);
}

No elements are found. $xpath->query('.') does return the root element, so XPath itself works, and $dom->getElementsByTagName('h1') also finds the heading.

2 answers

It looks like html5lib sets a default namespace on the document.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
    echo $de->namespaceURI . "\n";
}

This outputs:

 http://www.w3.org/1999/xhtml

To query namespaced nodes with XPath, you need to register the namespace and use its prefix in the query.

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);

$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
{
    echo $element->nodeValue;
}

Outputs PHP.
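
The same behaviour can be reproduced without a network request. Below is a minimal sketch that uses only PHP's built-in DOM extension; the inline XHTML string is made-up sample markup standing in for the parsed page, and the prefix n is arbitrary. It shows that an unprefixed name matches nothing once the elements live in a default namespace, while the prefixed query finds the heading.

```php
<?php
// Made-up sample markup (an assumption for illustration), using the same
// default namespace that html5lib assigns.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// In XPath 1.0 an unprefixed name only matches elements in no namespace.
$plain = $xpath->query('//h1');

// Bind an arbitrary prefix to the document's namespace URI and use it in the query.
$xpath->registerNamespace('n', $dom->documentElement->namespaceURI);
$prefixed = $xpath->query('//n:h1');

echo $plain->length, "\n";                  // matches without a prefix
echo $prefixed->length, "\n";               // matches with the prefix
echo $prefixed->item(0)->nodeValue, "\n";   // text of the first match
```

The prefix only has to match inside the XPath expression; it does not need to appear anywhere in the document itself.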


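If you would rather not use a prefix in your XPath queries, you can remove the default namespace from the document and reload it.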

$de = $dom->documentElement;
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns

After that, the XPath query works without a prefix.

$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
    echo $element->nodeValue;
}

Outputs PHP.


Putting it all together:

Code:

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
    $de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
    $dom->loadXML($dom->saveXML());
}

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
    var_dump($element);
}

Output:

class DOMElement#11 (18) {
  public $tagName =>
  string(2) "h1"
  public $schemaTypeInfo =>
  NULL
  public $nodeName =>
  string(2) "h1"
  public $nodeValue =>
  string(3) "PHP"
  ...
  public $textContent =>
  string(3) "PHP"
}

You can disable the HTML namespace with the disable_html_ns option.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
    'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');

foreach ($elements as $element) {
    var_dump($element);
}

https://github.com/Masterminds/html5-php#options

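disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.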
