I need to quickly remove a set of classes from an arbitrary HTML string.

The HTML is run through a cleaner first (TinyMCE + WordPress), so it should conform to a fairly standard form. All script and style tags are stripped, and all data inside tags is HTML-encoded, so there are no extraneous characters to worry about.

I know that the usual stance on parsing HTML with regular expressions is "don't," but in this particular case the problem seems less like parsing and more like simple string processing ... am I missing some hidden level of complexity?

As far as I can tell, the pattern in question can be broken down into these logical components:

  • /<[a-zA-Z][^>]+ - matches the start of any HTML tag and any attributes inside it, but not the closing bracket
  • (?i:class)=\" - start of a case-insensitive class attribute
  • (?: - open a non-capturing subpattern
  • (?: *[a-zA-Z_][\w-]* +)* - any number of class names (or none), but if there are any, there must be a space before the capture
  • ( *".implode('|', $classes)." *) - the set of classes to capture, preg_quoted
  • (?: +[a-zA-Z_][\w-]* *)* - any number of class names (or none), but if there are any, there must be a space after the capture
  • )+ - close the non-capturing subpattern, and repeat it in case several matching classes are in the same attribute
  • \"(?: [^>]*)>/ - the end of the class attribute, and everything up to the end of the HTML tag

giving the final regex:

 $pattern = "/<[a-zA-Z][^>]+ (?i:class)=\"(?:(?: *[a-zA-Z_][\w-]* +)*( *".implode('|', $classes)." *)(?: +[a-zA-Z_][\w-]* *)*)+\"(?: [^>]*)>/"; 

I have not tried running this yet, because I know that if it works I will be very tempted to use it. Running it through preg_replace seems like it should do the job, except for one minor problem: I think it will leave extraneous spaces around the place where the captured class was. This is not a significant issue, but it would be nice to avoid if someone knows how.

It should also be noted that this is not a critical process, and if my capture occasionally does not delete classes, no one dies.

So, in essence: can someone explain what makes this a bad idea in this case?

+5
2 answers

OK, so you have a list of classes that you want to remove from the given HTML?

What I mean is: there is a given list of classes that you want to delete. Can you give an example of typical HTML, showing what it looks like now and what you want it changed to? Example:

Before

 <div class="someClass"> <i class="dontchange doChange"></i> <a class="hello john"></a> </div> 

Change to

 <div> <i class="dontchange"></i> <a></a> </div> 
0

This will remove every class attribute from the HTML.

 // replace() returns a new string; myHtml itself is not modified
 myHtml.replace(/class="[^"]*"/g, '');

Is this what you are looking for? Or something more specific?
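If something more specific is needed, that is, removing only a given set of class names rather than every class attribute, one possible approach is to match each class attribute as a whole and filter its names in a callback. This avoids the stray-space problem, because the attribute value is rebuilt from the names that remain. The following is only a rough sketch, not a tested solution for every input: the function name `removeClasses` is made up for illustration, and it assumes cleaned HTML in which attribute values are always double-quoted and each tag has at most one class attribute.

```javascript
// Hypothetical helper: removes only the listed class names from class="..." attributes.
// Assumption: the HTML has been cleaned, so attribute values are always double-quoted.
function removeClasses(html, classesToRemove) {
  const remove = new Set(classesToRemove);
  // Capture the start of a tag, then a class attribute with its leading whitespace.
  const classAttr = /(<[a-zA-Z][^>]*?)\s+class="([^"]*)"/gi;
  return html.replace(classAttr, (match, tagStart, value) => {
    // Split the attribute value into names and keep those not marked for removal.
    const kept = value.split(/\s+/).filter((name) => name && !remove.has(name));
    // Drop the whole attribute when no class names are left.
    return kept.length ? `${tagStart} class="${kept.join(' ')}"` : tagStart;
  });
}

const sample = '<div class="someClass"> <i class="dontchange doChange"></i> <a class="hello john"></a> </div>';
console.log(removeClasses(sample, ['someClass', 'doChange', 'hello', 'john']));
// prints: <div> <i class="dontchange"></i> <a></a> </div>
```

Because the class names are compared as plain strings rather than spliced into the pattern, no regex quoting of the names is needed. Note that the attribute name is always written back in lowercase.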

-1