Looking for a PHP script that can clean up bad HTML

I'm writing a command-line PHP script that converts hundreds of HTML fragments to Markdown using the Markdownify library. However, some of my HTML is not well-formed enough for Markdownify to handle, so I first need to run it through a library that can clean it up, add missing closing tags, and so on. I'm working with HTML fragments, not complete HTML documents, so the returned HTML should also be a fragment (without a doctype, etc.).

Do you know of a PHP script that can convert HTML to XHTML?

Solution:

Use the PHP class DOMDocument. It will repair your HTML even if it is broken. You can then extract the cleaned HTML:

    libxml_use_internal_errors(true); // suppress the warnings that the bad HTML would otherwise trigger
    $doc = new DOMDocument();
    $doc->loadHTML($badHtml);
    $goodHtml = $doc->saveHTML();

This returns a full HTML document (with the cleaned markup inside the body tag) even though I passed it only a fragment, so I extract the cleaned fragment with this regular expression:

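    // ereg_replace() was removed in PHP 7; preg_replace() with the "s" modifier lets "." match newlines
    $goodHtmlPartial = trim(preg_replace('~.*<body>(.*)</body>.*~s', '$1', $goodHtml));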
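As an alternative to the regular expression, the fragment can also be pulled out through the DOM itself. The following is only a sketch under a few assumptions: `cleanHtmlFragment` is a hypothetical helper name (not from the post), and the input is assumed to be in an encoding that `loadHTML()` reads correctly without extra hints.

```php
<?php
// Hypothetical helper (not part of the original post): repairs an HTML
// fragment with DOMDocument and returns only the contents of <body>.
function cleanHtmlFragment(string $badHtml): string
{
    if (trim($badHtml) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($badHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> instead of the whole document
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}

echo cleanHtmlFragment('<p>Hello <b>world</p><ul><li>one<li>two</ul>'), "\n";
```

Serializing node by node avoids depending on the exact shape of the wrapper that `saveHTML()` puts around the fragment, which is what the regular expression has to guess at.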
+6
php html-parsing
5 answers

You can load the HTML into a DOM, then save it as XML.
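A minimal sketch of that idea, assuming the goal is an XHTML-style fragment rather than a full document. `htmlFragmentToXhtml` is a hypothetical helper name; it parses leniently with `loadHTML()` and serializes each child of the body with `saveXML()`.

```php
<?php
// Hypothetical helper (illustration only): parse as HTML, serialize as XML.
function htmlFragmentToXhtml(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // saveXML($node) writes well-formed XML for that node only,
    // so void elements such as <br> come out self-closed.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return trim($out);
}

echo htmlFragmentToXhtml('<p>line one<br>line two<img src="a.png">'), "\n";
```

One caveat with this approach: the XML serializer also self-closes empty non-void elements (an empty `div` becomes `<div/>`), which may or may not suit whatever consumes the output.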

+5

Any reason not to use Tidy?

http://php.net/manual/en/book.tidy.php

It can clean up your HTML and return only the body section.

    $tidy = tidy_repair_string($content, array(
        'indent' => true,
        'output-html' => true,
        'wrap' => 80,
        'show-body-only' => true,
        'clean' => true,
        'input-encoding' => 'utf8',
        'output-encoding' => 'utf8',
        'logical-emphasis' => false,
        'bare' => true,
    ));
+8

Try an HTML cleaner; such a library is great at cleaning up bad HTML and can also act as a filter for potentially malicious code.

+4

I suggest you use the DOMDocument->loadHTML() method. It will repair your HTML even if it is broken. You can then save it as XML to get XHTML.

+2

Not PHP, but the BeautifulSoup library for Python has parsers that are good at producing valid HTML from almost any old crap.

0
