You load the HTML into a DOMDocument and import that into SimpleXML. Then you run an XPath query for all p elements and loop through them. In each iteration, you set the class attribute to a marker value such as "killmeplease".
When this is done, serialize the SimpleXML object back to a string with asXML() (which, by the way, can change the HTML, but usually only for the better), and you will have an HTML string in which every p has a class of "killmeplease". Use str_replace to remove them.
Example:
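$html_file = "somehtmlfile.html";
$dom = new DOMDocument();
$dom->loadHTMLFile($html_file);
$xml = simplexml_import_dom($dom);
$paragraphs = $xml->xpath("//p");
// Overwrite each p's class with a marker value that is easy to find later.
foreach ($paragraphs as $paragraph) {
    $paragraph['class'] = "killmeplease";
}
$new_html = $xml->asXML();
// The leading space is part of the search string so no stray space is left inside <p>.
$better_html = str_replace(' class="killmeplease"', "", $new_html);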
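A variation on the example above, in case the marker-and-replace step feels roundabout: DOMDocument can drop the attribute itself via removeAttribute, so no string replacement is needed. This is only a sketch and not the method described above; the inline $html string is a made-up stand-in for somehtmlfile.html, and the //p[@class] query is my own narrowing to p elements that actually carry a class.

```php
<?php
// Hypothetical inline markup standing in for somehtmlfile.html.
$html = '<html><body><p class="intro">One</p><p id="k" class="note">Two</p>'
      . '<div class="keep">Three</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// Select only the p elements that have a class attribute and remove it in place.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p[@class]') as $p) {
    $p->removeAttribute('class');
}

// Serialize back to HTML; other elements keep their class attributes.
$better_html = $dom->saveHTML();
echo $better_html;
```

Like asXML(), saveHTML() may normalize the markup a little (for instance by adding a doctype), so compare the output against your original if exact formatting matters.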
Or, if you want shorter code and don't mind wrestling with preg_replace, you can go with:
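$html_file = "somehtmlfile.html";
$html_string = file_get_contents($html_file);
// $1 and $3 capture the attributes before and after class; $2 is the class value being dropped.
// [^>] keeps the match inside a single tag, even when the tag spans several lines.
$bad_p_class = '/<p\b([^>]*?)\s+class\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)([^>]*)>/i';
$better_html = preg_replace($bad_p_class, '<p$1$3>', $html_string);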
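The hard part with regular expressions is that quantifiers are greedy by default, and the dot does not match a line break unless you add the s modifier, so a p tag that spans several lines can trip up a naive pattern. Matching with [^>] instead of the dot sidesteps both issues. But give one of them a shot.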
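To make the greediness and line-break point concrete, here is a small sketch that runs three patterns against the same input. The $tag string is a made-up sample, not taken from any real file.

```php
<?php
// Hypothetical sample: the first p tag has its attributes split across two lines.
$tag = "<p id=\"a\"\n   class=\"old\">text</p><p class=\"other\">more</p>";

// Greedy .* with the s modifier crosses the line break and runs to the LAST > in the string.
preg_match('/<p .*>/s', $tag, $greedy);

// Lazy .*? without the s modifier cannot cross the line break,
// so the first tag is skipped and only the second, single-line tag matches.
preg_match('/<p .*?>/', $tag, $lazy);

// [^>]* matches line breaks too and stops at the first >, so it captures exactly the first tag.
preg_match('/<p [^>]*>/', $tag, $negated);

var_dump($greedy[0] === $tag);
var_dump($lazy[0]);
var_dump($negated[0]);
```

The negated character class is usually the least surprising of the three for matching a single opening tag, though any regex approach stays fragile on unusual markup.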
Anthony