I'm writing a command-line PHP script that converts hundreds of HTML fragments to Markdown using the Markdownify library. However, some of my HTML is not well-formed enough for Markdownify. So I first need to run my HTML through a library that can clean it up, add omitted closing tags, and so on. I work with partial HTML blocks, not full HTML documents, so the returned HTML should also be partial (and not include a doctype, etc.).
Do you know a PHP script that can convert HTML to XHTML?
Answer:
Use the PHP class DOMDocument. It will parse and repair your HTML even if it is broken. Then you can extract the cleaned HTML:
// Prevent warning messages caused by the bad HTML from being displayed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($badHtml);
$goodHtml = $doc->saveHTML();
This returns a full HTML document (with the cleaned version inside the body tag), even though only a partial HTML block was passed in, so the cleaned fragment can be extracted with this regular expression:
// The "s" modifier lets "." match newlines; "i" makes the tag match case-insensitive.
$goodHtmlPartial = trim(preg_replace('~^.*<body[^>]*>(.*)</body>.*$~is', '$1', $goodHtml));
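As an alternative to the regular expression, the children of the body element can be serialized directly through the DOM. The following is only a minimal sketch: the function name cleanHtmlFragment is hypothetical (it is not part of the original answer), and it assumes PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument.

```php
<?php
// Hypothetical helper, not from the original answer: repairs a partial
// HTML block and returns only the cleaned fragment (no doctype/html/body).
function cleanHtmlFragment(string $badHtml): string
{
    // loadHTML() rejects empty input, so return early for blank strings.
    if (trim($badHtml) === '') {
        return '';
    }

    // Collect libxml parse warnings internally instead of displaying them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($badHtml);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps the fragment in <html><body>...</body></html>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> on its own, skipping the wrapper tags.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return trim($out);
}

// Usage: the unclosed <p> and <b> tags come back closed.
echo cleanHtmlFragment('<p>One<p>Two <b>bold'), "\n";
```

Working on the node list avoids depending on how the serialized body tag is written, which is the fragile part of a regex-based extraction. Note that loadHTML() may mis-handle non-ASCII input unless the encoding is declared, so that is worth checking against real fragments.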
php html-parsing
Andrew