PHP DOMDocument Namespaces

I am writing a script that takes a web page and detects how many times things like the Facebook Like button are used. Since this is best done with a DOM, I decided to use PHP's DOMDocument.

The only problem I have run into is with elements like Facebook's Like button:

<fb:like send="true" width="450" show_faces="true"></fb:like> 

Since this element technically has a namespace prefix of "fb", DOMDocument issues a warning that this namespace prefix is undefined. It then strips the prefix, so when I get to the element, its tag is no longer fb:like but just like.

Is there a way to “pre-register” a namespace? Any suggestions?

+7
6 answers

I had the same problem and came up with the following solutions / workarounds:

There is no clean way to parse HTML with namespaces using a DOMDocument without losing namespaces, but there are some workarounds:

  • Use another parser that accepts namespaces in HTML code. Look here for a detailed list of HTML parsers. This is probably the most effective way to do this.
  • If you want to stick with DOMDocument, you basically need to pre- and post-process the code.

    • Before sending the code to DOMDocument->loadHTML, use regular expressions, loops, or whatever you want to find all namespaced tags, and add a custom attribute containing the namespace to the opening tags.

       <fb:like send="true" width="450" show_faces="true"></fb:like> 

      will result in

       <fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like> 
    • Now pass the edited code to DOMDocument->loadHTML. It will strip the namespaces, but it will keep the attributes, resulting in

       <like xmlNamespace="fb" send="true" width="450" show_faces="true"></like> 
    • Now (again using regular expressions, loops, or whatever you need) find all the tags with the xmlNamespace attribute and replace the attribute with the actual namespace. Remember to also add the namespace to the closing tags!
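If the goal is only to count the elements, as in the question, the marker attribute can be read directly from the DOM instead of being turned back into a prefix. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function names and the attribute names data-ns and data-local are hypothetical (lowercase names are used because the HTML parser may normalize attribute case), and the local name is stored as well so the lookup does not depend on whether the parser keeps or strips the prefix.

```php
<?php
// Hypothetical helper: tag every namespaced opening tag with two marker
// attributes so prefix and local name survive DOMDocument's HTML parsing.
function markNamespacedTags(string $html): string
{
    return preg_replace(
        '/<(\w+):(\w+)/',
        '<$1:$2 data-ns="$1" data-local="$2"',
        $html
    );
}

// Hypothetical helper: count elements that carried the given prefix and
// local name before parsing. $prefix and $local are assumed to be plain words.
function countNamespacedElements(string $html, string $prefix, string $local): int
{
    // Collect parser warnings about unknown tags instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML(markNamespacedTags($html));
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = sprintf('//*[@data-ns="%s" and @data-local="%s"]', $prefix, $local);
    return $xpath->query($query)->length;
}

$html = '<div><fb:like send="true" width="450" show_faces="true"></fb:like>'
      . '<fb:like></fb:like></div>';
echo countNamespacedElements($html, 'fb', 'like'), "\n";
```

Restoring the prefixes in the serialized output would still need the post-processing step described in the list above.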

I don’t think the OP is still looking for an answer; I’m just posting this for anyone who finds this post in their research.

0

You can use tidy to clean things up before using the XML parser.

    $tidy = new tidy();
    $config = array(
        'output-xml'   => true,
        'input-xml'    => true,
        'add-xml-decl' => true,
    );
    $tidy->parseString($htmlSoup, $config);
    $tidy->cleanRepair();
    echo $tidy;
+4

Since this was never "solved", I decided to go ahead and implement the regex solution for everyone who does not feel like working out the regular expressions.

    // Do this before you use loadHTML():
    // store any namespaced elements so we can re-add them later.
    $postContent = preg_replace('/<(\w+):(\w+)/', '<\1 data-namespace="\2"', $postContent);

    // Once you are done using DOMDocument, fix things up:
    // reconstruct any namespaced tags.
    $postContent = preg_replace('/<(\w+) data-namespace="(\w+)"/', '<\1:\2 ', $postContent);
+1

Is this what you are looking for?

You can try SimpleHTMLDOM. Then you can run something like:

    $html = new simple_html_dom();
    $html->load_file('fileToParse.html');
    $count = 0;
    foreach ($html->find('fb:like') as $element) {
        $count += 1;
    }
    echo $count;

That should work.

I looked a little further and found this. I took it from the DOMDocument documentation on PHP.net.

    $dom = new DOMDocument;
    $dom->loadHTMLFile('fileToParse.html');
    // or $dom->load('fileToParse.html') to parse the file as XML
    $likes = $dom->getElementsByTagName('fb:like');
    $count = 0;
    foreach ($likes as $like) {
        $count += 1;
    }

After that I got stuck, so I fell back to a regular expression:

    $file = file_get_contents("other.html");
    $search = '/<fb:like[^>]*>/';
    $count = preg_match_all($search, $file, $matches);
    echo $count;
    // Below is not needed
    print_r($matches);

It is, however, RegEx and rather slow. I tried:

    $dom = new DOMDocument;
    $dom->load("other.html");
    $xpath = new DOMXPath($dom);
    $rootNamespace = $dom->lookupNamespaceUri($dom->namespaceURI);
    $xpath->registerNamespace('fb', $rootNamespace);
    $elementList = $xpath->query('//fb:like');

But I got the same error as you.
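For comparison, registerNamespace does work when the input is well-formed XML that declares the prefix itself. A minimal sketch, assuming the document binds fb to http://www.facebook.com/2008/fbml (an assumption about the page; the function name countFbLikes is hypothetical). It does not help with loadHTML on pages that never declare the prefix.

```php
<?php
// Hypothetical helper: count fb:like elements in an XML string that
// declares the fb prefix. The namespace URI is an assumption about the input.
function countFbLikes(string $xml): int
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $xpath = new DOMXPath($dom);
    // The prefix used in the query only has to map to the same URI
    // that the document declares.
    $xpath->registerNamespace('fb', 'http://www.facebook.com/2008/fbml');
    return $xpath->query('//fb:like')->length;
}

$xml = '<root xmlns:fb="http://www.facebook.com/2008/fbml">'
     . '<fb:like send="true" width="450" show_faces="true"></fb:like>'
     . '<fb:like></fb:like>'
     . '</root>';
echo countFbLikes($xml), "\n";
```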

0

I could not find a way to do this using the DOM. I am surprised that regex is slower than DOMDocument, as that is usually not the case for me. strpos should be the fastest:

 strpos($dom, '<fb:like'); 

This only finds the first occurrence, but you can write a simple recursive function that advances the offset accordingly.
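A minimal sketch of that idea, written as a loop rather than a recursive function (the function name countOccurrences is hypothetical). It counts raw substring matches, so a needle like '<fb:like' would also match a longer tag name that starts the same way.

```php
<?php
// Hypothetical helper: count how often $needle occurs in $haystack by
// calling strpos repeatedly and moving the offset past each match.
function countOccurrences(string $haystack, string $needle): int
{
    $count = 0;
    $offset = 0;
    while (($pos = strpos($haystack, $needle, $offset)) !== false) {
        $count++;
        $offset = $pos + strlen($needle);
    }
    return $count;
}

$html = '<p><fb:like send="true"></fb:like></p><fb:like></fb:like>';
echo countOccurrences($html, '<fb:like'), "\n";
```

PHP's built-in substr_count($html, '<fb:like') gives the same count for non-overlapping matches.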

0

I tried the regex solution, but there is a problem with closing tags, since they cannot carry attributes:

 <ns namespace="node">text</ns> 

(On top of that, the regex did not look at closing tags at all.) So in the end I did some UGLY things like

 $output = preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3' , $output); 

and

 $output = preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3' , $output); 
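As a quick check that the two replacements are inverses, they can be wrapped in a pair of functions and run as a round trip. This is only a sketch: the function names are hypothetical, the DOM processing that would happen between the two calls is left out, and it assumes the marker string never occurs in the document's real tag names.

```php
<?php
// Hypothetical wrappers around the two replacements above.

// Fold "prefix:name" into one colon-free tag name, for opening and closing tags.
function hideNamespaces(string $html): string
{
    return preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3', $html);
}

// Split the marker back out into "prefix:name".
function restoreNamespaces(string $html): string
{
    return preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3', $html);
}

$original = '<fb:like send="true" width="450" show_faces="true"></fb:like>';
$hidden = hideNamespaces($original);

echo $hidden, "\n";
echo restoreNamespaces($hidden), "\n";
```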
-1
