How do you format DOM structures in PHP?

Question

How do you format DOM structures in PHP?

My first guess was the PHP DOM classes (with formatOutput). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are incorrect.

    $html = '
    <html>
    <body>
        <div>
            <div>
                <div>
                    <p>My Last paragraph</p>
                    <div>
                        This is another text block and some other stuff.<br><br>
                        Again we will start a new paragraph and some other stuff <br>
                    </div>
                </div>
                <div>
                    <div>
                        <h1>Another Title</h1>
                    </div>
                    <p>Some text again <b>for sure</b></p>
                </div>
            </div>
            <div>
                <pre><code>
                    <span>&lt;html&gt;</span>
                    <span>&lt;head&gt;</span>
                    <span>&lt;title&gt;</span>
                    Page Title
                    <span>&lt;/title&gt;</span>
                    <span>&lt;/head&gt;</span>
                    <span>&lt;/html&gt;</span>
                </code></pre>
            </div>
        </div>
    </body>
    </html>';

    header('Content-Type: text/plain');
    libxml_use_internal_errors(TRUE);
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->loadHTML($html);
    print $dom->saveHTML();

Update: I added a pre-formatted block of code to the example.

+7

dom html php

Xeoncross Nov 03 '11 at 15:56


2 answers

Here's a comment on php.net: http://ru2.php.net/manual/en/domdocument.save.php#88630

It seems that when you load HTML from a string (as you do), DOMDocument becomes lazy and does not format anything in it.

Here's the solution to your problem:

    // Clean your HTML by hand first
    $html = preg_replace('/>\s*</im', '><', $html);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $dom->formatOutput = true;
    $dom->preserveWhiteSpace = false;
    // Use saveXML(), not saveHTML()
    print $dom->saveXML();

Basically, you strip the whitespace between tags and use saveXML() instead of saveHTML(). saveHTML() just doesn't work in this situation. However, you will get an XML declaration as the first line of the output.

+4

hijarian Nov 04 '11 at 14:27


Alix axel · Accepted Answer · 2013-06-16T21:49:44+0000

Here are some improvements over @hijarian's answer:

LibXML Errors

If you do not call libxml_use_internal_errors(true), PHP will print every HTML error it finds. However, if you do call this function, the errors are not discarded; instead, they are collected in an internal buffer that you can inspect by calling libxml_get_errors(). The problem is that this buffer eats memory, and DOMDocument is known to be very picky. If you process a large number of files in a batch, you will eventually run out of memory. There are two solutions for this:

    if (libxml_use_internal_errors(true) === true) {
        libxml_clear_errors();
    }

Since libxml_use_internal_errors(true) returns the previous value of this setting (false by default), this only clears the errors if you run it more than once (as in a batch processing scenario).
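The batch scenario described above can be sketched as follows. This is a minimal illustration, not the answer's own code: the $documents array and the $errorCounts bookkeeping are hypothetical stand-ins for whatever input and logging a real batch job would use.

```php
<?php
// Hypothetical batch input: two small, deliberately sloppy HTML fragments.
$documents = [
    '<p>First <b>document</p>',
    '<div><span>Second</div>',
];

// Collect libxml errors in the internal buffer instead of printing them.
libxml_use_internal_errors(true);

$errorCounts = [];
foreach ($documents as $index => $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Inspect whatever libxml buffered for this document...
    $errorCounts[$index] = count(libxml_get_errors());

    // ...then empty the buffer so it does not grow across iterations.
    libxml_clear_errors();
}

print_r($errorCounts);
```

Clearing inside the loop keeps memory use flat regardless of how many documents are processed, at the cost of losing the errors unless you copy them out first.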

The other option is to pass the flags LIBXML_NOERROR | LIBXML_NOWARNING to the loadHTML() method. Unfortunately, for reasons unknown to me, this still lets a couple of errors through.
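A short sketch of the flag-based approach, assuming PHP 5.4 or later (where loadHTML() accepts an options argument). Because some errors reportedly still get through, the sketch keeps internal error handling switched on as a safety net; the sample markup is made up for illustration.

```php
<?php
$html = '<p>Unclosed <b>bold</p>';

// Safety net for anything the flags do not suppress.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The second argument of loadHTML() is a bitmask of libxml option constants.
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Whatever is left in the buffer was not silenced by the flags.
$leftovers = libxml_get_errors();
libxml_clear_errors();

echo count($leftovers), " error(s) still reported\n";
```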

Bear in mind that DOMDocument always raises an error (even when using libxml internal errors and setting the suppression flags) if you pass an empty (or blank) string to the load*() methods.
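One way to sidestep the empty-input problem is to check the string before handing it to DOMDocument. The helper below is a hypothetical sketch (the name loadHtmlOrNull is not from the answer) and assumes PHP 7.1+ for the nullable return type.

```php
<?php
/**
 * Hypothetical helper: returns a DOMDocument, or null when there is
 * nothing to parse or parsing fails.
 */
function loadHtmlOrNull(string $html): ?DOMDocument
{
    // Skip blank input up front: loadHTML() complains about an empty
    // string (a warning in PHP 5/7, a ValueError in PHP 8).
    if (trim($html) === '') {
        return null;
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $ok ? $dom : null;
}

var_dump(loadHtmlOrNull("  \n") === null);
var_dump(loadHtmlOrNull('<p>Hello</p>') instanceof DOMDocument);
```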

Regex

The regular expression />\s*</im does not make much sense. It is better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), replaces only when whitespace is actually present (+ instead of *), does not backtrack (the possessive ++, which is faster), and drops the overhead of case-insensitive matching (whitespace has no case).
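To make the difference between the two patterns concrete, here is a small comparison on made-up input. Both produce the same result when whitespace is present; the difference shows up on tags that are already adjacent, where the original pattern still performs a (pointless) replacement.

```php
<?php
$html = "<div>\n  <p>Text</p>\t<p>More text</p>\n</div>";

$loose  = preg_replace('/>\s*</im', '><', $html);          // original pattern
$strict = preg_replace('~>[[:space:]]++<~m', '><', $html); // suggested pattern

var_dump($loose === $strict); // bool(true)
echo $strict, "\n";           // <div><p>Text</p><p>More text</p></div>

// On already-adjacent tags, only the original pattern finds a match.
$adjacent = '<b></b>';
preg_replace('/>\s*</im', '><', $adjacent, -1, $looseCount);
preg_replace('~>[[:space:]]++<~m', '><', $adjacent, -1, $strictCount);
echo $looseCount, ' vs ', $strictCount, "\n"; // 1 vs 0
```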

You may also want to normalize newlines to \n and strip other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for example.
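Newline normalization can be done with the PCRE \R escape, which matches any line-break sequence (including \r\n as a single unit). A minimal sketch on a made-up string:

```php
<?php
$html = "<p>line one\r\nline two\rline three</p>";

// \R matches \r\n, \r, \n and other line-break sequences; each becomes "\n".
$normalized = preg_replace('~\R~u', "\n", $html);

var_dump($normalized === "<p>line one\nline two\nline three</p>"); // bool(true)
```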

DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the regular expression above.

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.

Extra Flags for `loadHTML()`

LIBXML_COMPACT - "it can speed up your application without having to change the code"
LIBXML_NOBLANKS - more tests need to be done on this
LIBXML_NOCDATA - more tests need to be done on this
LIBXML_NOXMLDECL - documented but not implemented =(

UPDATE: Setting any of these options has the side effect of not formatting the output.

On `saveXML()`

The DOMDocument::saveXML() method emits an XML declaration. We have to strip it manually (since LIBXML_NOXMLDECL is not implemented). To do so, we could use a combination of substr() + strpos() to look for the first line break, or use a regular expression.
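Both ways of stripping the declaration can be sketched as below. The example uses loadXML() on a tiny made-up document to keep the output short; the same idea applies to a document loaded with loadHTML(). The exact regular expression is an assumption, not the answer's own.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><child>text</child></root>');

// saveXML() without arguments starts with an "<?xml ...?>" line.
$xml = $dom->saveXML();

// Option 1: substr() + strpos(): drop everything up to the first line break.
$viaSubstr = substr($xml, strpos($xml, "\n") + 1);

// Option 2: a regular expression anchored at the start of the string.
$viaRegex = preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml);

echo trim($viaSubstr), "\n";
var_dump(trim($viaSubstr) === trim($viaRegex));
```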

Another option, which seems to have added benefits, is simply:

 $dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as b, i or li in:

    <b class="carret"></b>
    <i class="icon-dashboard"></i> Dashboard
    <li class="divider"></li>

The saveXML() method will seriously mangle them (by placing the following element inside the empty one), messing up your whole HTML. Tidy has a similar problem, except that it just drops the node.

To fix this, you can use the LIBXML_NOEMPTYTAG flag along with saveXML() :

 $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option converts empty (aka self-closing) tags into inline tags (an explicit opening and closing pair) and allows empty inline tags as well.
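The effect of the flag is easiest to see side by side. The following sketch serializes the same made-up fragment with and without LIBXML_NOEMPTYTAG; the variable names are illustrative only.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<div><i class="icon-dashboard"></i> Dashboard<br></div>');
libxml_clear_errors();

// Default XML serialization: empty elements are collapsed to <tag/>.
$collapsed = $dom->saveXML($dom->documentElement);

// With the flag: every empty element gets an explicit closing tag.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n";
echo $expanded, "\n";
```

In the second output the empty i element keeps its closing tag, which is what we want, but br is expanded as well, which is the side effect the next section deals with.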

Fixing HTML[5]

With everything we have done so far, our HTML output has two main problems:

no DOCTYPE (it was stripped when we used $dom->documentElement)
empty tags are now inline tags, meaning that one <br /> has turned into two (<br></br>), and so on

Fixing the first one is pretty simple, as HTML5 is pretty permissive:

 "<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

area
base
basefont (deprecated in HTML5)
br
col
command
embed
frame (deprecated in HTML5)
hr
img
input
keygen
link
meta
param
source
track
wbr

We can either use str_[i]replace in a loop:

    foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag) {
        // Collapse "<tag ...></tag>" back into "<tag ... />"
        $html = str_ireplace('></' . $tag . '>', ' />', $html);
    }

Or regex:

    $html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', '/>', $html);

This will be an expensive operation. I have not benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Also, I'm not sure whether the case-insensitive version is needed; I am under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
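As a quick check of the regular-expression variant, the sketch below runs it over a made-up string that mimics LIBXML_NOEMPTYTAG output. It uses ' />' as the replacement and omits the i modifier, on the assumption stated above that tags are always lowercased.

```php
<?php
// Mimics saveXML(..., LIBXML_NOEMPTYTAG) output: void elements are expanded.
$html = '<p>Line one<br></br>Line two</p><hr></hr><p></p>';

$void = 'area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen'
      . '|link|meta|param|source|track|wbr';

// Turn "<br></br>" back into "<br />"; non-void tags such as <p> are untouched.
$fixed = preg_replace('~></(?:' . $void . ')>~', ' />', $html);

echo $fixed, "\n"; // <p>Line one<br />Line two</p><hr /><p></p>
```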

On `<script>` and `<style>` Tags

These tags will always have their content (if any) wrapped in (uncommented) CDATA blocks, which will probably break their meaning. You will have to strip those tokens with a regular expression.
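A minimal sketch of stripping the CDATA markers, on a made-up document. It uses str_replace() rather than a regular expression, since the markers are fixed strings; note that this naive approach would also remove the markers if they happened to appear inside the script itself.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><head><script>if (a < b) { run(); }</script></head><body><p>Hi</p></body></html>');
libxml_clear_errors();

// saveXML() wraps the script content in a CDATA section.
$xml = $dom->saveXML($dom->documentElement);

// Remove the CDATA delimiters, leaving the script body as it was.
$clean = str_replace(['<![CDATA[', ']]>'], '', $xml);

echo $clean, "\n";
```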

Implementation

    function DOM_Tidy($html)
    {
        $dom = new \DOMDocument();

        // Collect libxml errors internally; clear the buffer on repeated calls
        if (libxml_use_internal_errors(true) === true) {
            libxml_clear_errors();
        }

        // Encode non-ASCII characters as entities, normalize newlines,
        // and strip the whitespace between tags
        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
        $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

        if ((empty($html) !== true) && ($dom->loadHTML($html) === true)) {
            $dom->formatOutput = true;

            if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false) {
                // Strip CDATA markers and collapse empty tags back to their self-closing form
                $regex = array(
                    '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                    '~' . preg_quote(']]>', '~') . '~' => '',
                    '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
                );

                return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
            }
        }

        return false;
    }
