I realize this question is old, but the other answers never actually answered the question. If you don't mind writing PHP code, the CubicleSoft Ultimate Web Scraper Toolkit has the TagFilter class:
https://github.com/cubiclesoft/ultimate-web-scraper/blob/master/support/tag_filter.php
You pass in two things: an array of options and the data to parse as HTML.
For cleaning up broken HTML, the default options from TagFilter::GetHTMLOptions() are a good starting point. Those options form the basis of valid HTML content and, without doing anything else, will clean up whatever input you feed in into something that another tool, such as Simple HTML DOM, can accurately parse into a DOM model.
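For illustration, here is a minimal sketch of that basic cleanup pass. It assumes the repository's support/tag_filter.php is available locally and that TagFilter::Run() is the entry point, per the repository's documentation; the sample input is made up:

```php
<?php
// Minimal sketch, assuming TagFilter::Run() as the entry point
// (check the repository docs for the exact signatures).
require_once "support/tag_filter.php";

// Messy input, e.g. pasted from Word/LibreOffice or scraped off the web.
$html = "<p>Unclosed paragraph<b>bold <i>overlapping</b> tags</i>";

// The defaults describe valid HTML structure (void tags, nesting rules, etc.).
$htmloptions = TagFilter::GetHTMLOptions();

// The result is cleaned-up HTML that a DOM parser can handle.
$cleanhtml = TagFilter::Run($html, $htmloptions);

echo $cleanhtml . "\n";
```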
However, the other, more powerful way to use the class is to modify the default options and add a "callback" option to the options array. For every tag in the HTML, the specified callback function will be called. The callback is then expected to return what to do with each tag, which is where the real power of TagFilter comes in. You can keep any tag and some or all of its attributes (or alter them), dump the tag but keep the inner content, keep the tag but dump the content, alter the content (for closing tags), or dump both the tag and its inner content. This approach allows for tremendously fine-grained control over even the messiest HTML and processes the input in a single pass. See the test suite in the same repository for an example that uses TagFilter.
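A hedged sketch of the callback approach follows. The callback signature and the "keep_tag"/"keep_interior" return keys follow the repository's documented example; recent versions of the library register the callback under the "tag_callback" option key (older releases used "callback"), so verify against the version you have:

```php
<?php
// Sketch only; signature and option key assumed from the repository docs.
require_once "support/tag_filter.php";

function MyTagCallback($stack, &$content, $open, $tagname, &$attrs, $options)
{
	// Dump <script> tags and their inner content entirely.
	if ($tagname === "script")  return array("keep_tag" => false, "keep_interior" => false);

	// Dump <font> tags but keep the text they wrap.
	if ($tagname === "font")  return array("keep_tag" => false, "keep_interior" => true);

	// On opening <a> tags, keep only the href attribute ($attrs is by reference).
	if ($open && $tagname === "a")
	{
		foreach ($attrs as $key => $val)
		{
			if ($key !== "href")  unset($attrs[$key]);
		}
	}

	// Default:  keep the tag and its content.
	return array("keep_tag" => true, "keep_interior" => true);
}

$htmloptions = TagFilter::GetHTMLOptions();
$htmloptions["tag_callback"] = "MyTagCallback";

$html = "<p><font size=\"7\">Hi</font> <a href=\"/x\" onclick=\"evil()\">link</a><script>evil();</script></p>";
$cleanhtml = TagFilter::Run($html, $htmloptions);
```

Because every decision is made as the tag streams past, the whole document is transformed in that single pass, which is the point the paragraph above is making.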
The only drawback is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects items against a DOM-like model. But that is only a drawback if the document being processed actually has reliable hooks such as "id" and "class" attributes... most Word/LibreOffice HTML output amounts to a giant blob of unrecognizable, untrackable markup as far as DOM processing tools are concerned.
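To make that state-tracking point concrete, here is a hypothetical callback that carries its own position information between calls with a static counter, since there is no DOM to query (same assumed signature as above):

```php
<?php
// Hypothetical illustration of the drawback described above.
function TableAwareCallback($stack, &$content, $open, $tagname, &$attrs, $options)
{
	static $tabledepth = 0;

	// The callback fires for closing tags too ($open is false there).
	if ($tagname === "table")  $tabledepth += ($open ? 1 : -1);

	// Example rule:  strip inline styles, but only on elements inside tables.
	if ($open && $tabledepth > 0)  unset($attrs["style"]);

	return array("keep_tag" => true, "keep_interior" => true);
}
```

A DOM tool would express the same rule as a selector; with TagFilter you encode it as streaming state instead, which is exactly the trade-off just described.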