Here are some improvements over @hijarian's answer:
LibXML Errors
If you do not call libxml_use_internal_errors(true), PHP will print every HTML error it finds. If you do call it, the errors are not suppressed; instead, they go to a pile that you can inspect by calling libxml_get_errors(). The problem is that this pile eats memory, and DOMDocument is known to be very picky. If you process a large number of files in a batch, you will eventually run out of memory. There are two solutions for this:
if (libxml_use_internal_errors(true) === true) { libxml_clear_errors(); }
Since libxml_use_internal_errors(true) returns the previous value of this setting (false by default), this only clears the errors if you run it more than once (as in batch processing).
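As a minimal sketch of how the error pile behaves, the snippet below parses some deliberately sloppy markup, counts what libxml collected, and then clears it. The sample markup and the variable names are assumptions for illustration only; how many errors are reported depends on the libxml version.

```php
<?php
// Collect libxml errors internally instead of printing them as warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p>Unclosed <b>bold<p>and a <bogus>tag</bogus></p>');

// Errors stay on the pile until they are explicitly cleared.
$pending = count(libxml_get_errors());

libxml_clear_errors();
$afterClear = count(libxml_get_errors());

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);
```

In a batch job, clearing after every document keeps the pile from growing across files.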
The other option is to pass the flags LIBXML_NOERROR | LIBXML_NOWARNING to the loadHTML() method. Unfortunately, for reasons unknown to me, this still lets a couple of errors through.
Bear in mind that DOMDocument always emits an error (even with internal libxml errors enabled and the suppression flags set) if you pass an empty (or blank) string to the load*() methods.
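One way to combine both points is a small wrapper that refuses blank input before it ever reaches loadHTML(). This is only a sketch: load_html_quietly() is a hypothetical helper name, not part of PHP or of the original answer.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when the input is
// blank or cannot be parsed.
function load_html_quietly(string $html): ?DOMDocument
{
    // Guard first: load*() complains about empty input regardless of flags.
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);

    $ok = $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    // Drop anything that slipped past the flags, then restore the setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}
```

Calling load_html_quietly('   ') returns null without touching the parser, so the batch loop can simply skip that file.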
Regex
The regular expression />\s*</im does not make much sense. It is better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), replaces only when whitespace actually exists (+ instead of *), does not backtrack (++), which is faster, and drops the case-insensitive overhead (whitespace has no case).
You may also want to normalize newlines to \n, along with other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for example.
DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the regular expression above.
Oh, and I don't see the need to protect blank pre-like tags here. Fragments containing only whitespace are useless.
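To make the two replacements concrete, here is a short sketch run against a made-up fragment with mixed line endings and a vertical tab. The sample string is an assumption; the patterns are the ones discussed above (without the m modifier, which has no effect when the pattern contains no anchors).

```php
<?php
// Sample input with CRLF line endings, a tab and a vertical tab (\x0B).
$html = "<ul>\r\n\t<li>One</li>\x0B <li>Two  words</li>\r\n</ul>";

$html = preg_replace(
    array('~\R~u', '~>[[:space:]]++<~'), // normalize newlines, then strip gaps between tags
    array("\n", '><'),
    $html
);

// $html is now '<ul><li>One</li><li>Two  words</li></ul>'
```

Whitespace inside text nodes ("Two  words") is left alone, because the second pattern only matches whitespace sitting directly between > and <.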
Extra Flags for loadHTML()
- LIBXML_COMPACT - "it can speed up your application without having to change the code"
- LIBXML_NOBLANKS - more tests need to be done on this
- LIBXML_NOCDATA - more tests need to be done on this
- LIBXML_NOXMLDECL - documented but not implemented =(
UPDATE: Setting any of these options has the side effect of not formatting the output.
On saveXML()
The DOMDocument::saveXML() method emits the XML declaration. We have to strip it manually (since LIBXML_NOXMLDECL is not implemented). To do that, we could use a combination of substr() + strpos() to find the first line break, or even a regular expression.
Another option, which seems to have an added benefit, is simply:
$dom->saveXML($dom->documentElement);
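A quick sketch of the difference, assuming a tiny well-formed document as input (the markup and variable names are illustrative only):

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><p>Hi</p></body></html>');

// Whole document: output begins with the XML declaration.
$full = $dom->saveXML();

// Root element only: no XML declaration (and no DOCTYPE either).
$rootOnly = $dom->saveXML($dom->documentElement);
```

Serializing only the root element sidesteps the declaration entirely, at the cost of losing the DOCTYPE.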
Another issue arises if you have inline tags that are empty, for example b, i or li in:
<b class="carret"></b> <i class="icon-dashboard"></i> Dashboard <li class="divider"></li>
The saveXML() method will seriously mangle them (by placing the following element inside the empty one), ruining all your HTML. Tidy has a similar problem, except that it just drops the node.
To fix this, you can use the LIBXML_NOEMPTYTAG flag along with saveXML() :
$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
This option converts empty (aka self-closing) tags to open/close tag pairs, and it keeps empty inline tags as pairs as well.
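The effect of the flag can be seen with a small sketch like the following, where the list markup is just an assumed sample:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><ul><li class="divider"></li><li>Item</li></ul></body></html>');

// Default serialization collapses the empty element to a self-closing tag.
$default = $dom->saveXML($dom->documentElement);

// With LIBXML_NOEMPTYTAG the empty element keeps its closing tag.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
```

A browser parsing the default output as HTML would treat the self-closing li as an opening tag, which is how the following element ends up nested inside it.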
Fixing the HTML[5]
With everything we have done so far, our HTML output has two main problems:
- no DOCTYPE (it was removed when we used $dom->documentElement)
- empty tags are now inline tags, which means that one <br /> has turned into two (<br></br>), etc.
Fixing the first one is pretty simple, as HTML5 is pretty permissive:
"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
To get our empty tags back, which are the following:
- area
- base
- basefont (deprecated in HTML5)
- br
- col
- command
- embed
- frame (deprecated in HTML5)
- hr
- img
- input
- keygen
- link
- meta
- param
- source
- track
- wbr
We can either use str_[i]replace in the loop:
foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag) { $html = str_ireplace('></' . $tag . '>', ' />', $html); }
Or regex:
$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);
This is an expensive operation. I haven't benchmarked the two, so I can't tell you which one performs better, but I would guess preg_replace(). Also, I'm not sure whether the case-insensitive version is needed; I'm under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
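As an illustration of the regex variant, here is a sketch applied to a hand-written string that mimics LIBXML_NOEMPTYTAG output. The sample markup and the $void variable are assumptions for the example.

```php
<?php
// Mimics saveXML() output produced with LIBXML_NOEMPTYTAG.
$html = '<p>Line<br></br>break<img src="a.png"></img><b class="caret"></b></p>';

// Void elements whose closing tag should be folded back into " />".
$void = 'area|base(?:font)?|br|col|command|embed|frame|hr|img|input'
      . '|keygen|link|meta|param|source|track|wbr';

$fixed = preg_replace('~></(?:' . $void . ')>~', ' />', $html);

// $fixed is now:
// '<p>Line<br />break<img src="a.png" />' . '<b class="caret"></b></p>'
```

Only the void elements are collapsed; the empty b keeps its closing tag, which is the whole point of using LIBXML_NOEMPTYTAG in the first place.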
On <script> and <style> Tags
These tags will always have their contents (if any) wrapped in (uncommented) CDATA blocks, which will probably break their meaning. You will have to strip those tokens with a regular expression.
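Stripping the markers could look like the sketch below. It uses str_replace() rather than a regular expression, since the two tokens are fixed strings; that is a swapped-in simplification, and the sample input is assumed.

```php
<?php
// Sample of what a serialized <script> element may look like.
$xml = '<script><![CDATA[if (a < b) { run(); }]]></script>';

// Remove the CDATA opening and closing markers, keeping the content.
$clean = str_replace(array('<![CDATA[', ']]>'), '', $xml);

// $clean is now '<script>if (a < b) { run(); }</script>'
```

Note that this removes the markers everywhere in the string, so script content that itself contains a literal ]]> would be altered too.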
Implementation
function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    // Collect libxml errors internally; clear the pile on repeated calls.
    if (libxml_use_internal_errors(true) === true) {
        libxml_clear_errors();
    }

    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');

    // Normalize newlines, then drop whitespace between tags.
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true)) {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false) {
            // Strip CDATA markers and restore self-closing void elements.
            $regex = array(
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}