Question about the PHP function preg_replace

I want to dynamically delete certain tags and their contents from an HTML file and have been thinking about using preg_replace, but I cannot get the syntax right. Basically, it should do something like: replace everything between (and including) "" with nothing.

Can someone help me with this please?

+4
5 answers

Easy dude.

To get an ungreedy regex, use the U modifier. To let the dot match across line breaks, use the s modifier. With that in mind, to delete all paragraphs, use this pattern:

#<p[^>]*>(.*)?</p>#sU 

Explanation:

  • I use # as the delimiter so that there is no need to escape the / characters (for a more readable pattern)
  • <p[^>]*> : the opening paragraph tag (possibly with attributes such as a style)
  • (.*)? : everything (in ungreedy mode)
  • </p> : the closing paragraph tag
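
To see the pattern in action, here is a minimal sketch. The sample markup and the variable names are made up for illustration; the first paragraph spans two lines on purpose, to exercise the s modifier.

```php
<?php
// Hypothetical sample input (not from the question).
$html = "<h1>Title</h1>\n<p class=\"intro\">First\nparagraph</p>\n<p>Second</p>\n<div>Keep me</div>";

// s: the dot also matches newlines; U: quantifiers are ungreedy by default.
$clean = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);

// Both paragraphs are gone; the heading, the div and the line breaks remain.
echo $clean;
```

Be aware that <p[^>]*> also matches other tags whose names start with "p", such as <pre>. If that matters, <p\b[^>]*> is a tighter variant.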

Hope that helps!

+5

I would suggest not trying to do this with regex. A safer approach is to use something like Simple HTML DOM (see its API documentation).

Another option is to use DOMDocument.

The idea here is to use a real HTML parser to parse the data; you can then walk the tree and delete any elements, attributes, or text that you need. This is a much cleaner approach than trying to use a regular expression to replace data in HTML.

 <?php
 $doc = new DOMDocument;
 $doc->loadHTMLFile('blah.html');
 // Take the first <table> anywhere in the document.
 $table = $doc->getElementsByTagName('table')->item(0);
 if ($table !== null) {
     // removeChild() must be called on the node's direct parent.
     $table->parentNode->removeChild($table);
 }
 echo $doc->saveHTML();
 ?>
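
If you need to drop every occurrence of a tag rather than just the first one, something along these lines might do. This is only a sketch: removeTags is a hypothetical helper name, and it assumes the HTML is already in a string rather than a file.

```php
<?php
// Hypothetical helper: removes every element with the given tag name.
function removeTags(string $html, string $tag): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so copy it to an array before removing anything.
    $nodes = iterator_to_array($doc->getElementsByTagName($tag));
    foreach ($nodes as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}

echo removeTags('<div><table><tr><td>x</td></tr></table><p>Keep</p></div>', 'table');
```

Note that saveHTML() returns a whole document, so a fragment like the one above comes back wrapped in a doctype, <html> and <body>.
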
+2

If you are trying to sanitize your data, it is usually better to use a whitelist than a blacklist of certain terms and tags. A whitelist is easier to maintain and better at preventing XSS attacks. There is a well-known library called HTML Purifier which, although large and somewhat slow, does an excellent job of cleaning your data.

+2

If you do not know what is between the tags, Phill's answer will not work.

This will work if there are no other tags between them, which is certainly the simpler case. You can replace div with any tag you need, obviously.

 preg_replace('#<div>[^<]+</div>#','',$html); 

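If there may be other tags in the middle, this should work, but it may cause problems: the greedy .+ matches through to the last </div> it can reach, and without the s modifier it does not cross line breaks. You should probably go with the DOM solution above in that case.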

 preg_replace('#<div>.+</div>#','',$html); 

These are not tested.
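
To illustrate the difference greediness makes, here is a small comparison. The sample string is made up, and the lazy pattern with the s modifier is my own variant, not one of the two patterns above.

```php
<?php
// Hypothetical sample input: two divs with a paragraph between them.
$html = "<div>one</div><p>keep</p><div>two</div>";

// Greedy .+ runs from the first <div> to the last </div>, taking the <p> with it.
$greedy = preg_replace('#<div>.+</div>#', '', $html);

// Lazy .*? stops at the nearest </div>; s lets the dot cross line breaks.
$lazy = preg_replace('#<div>.*?</div>#s', '', $html);

var_dump($greedy); // string(0) ""
var_dump($lazy);   // string(11) "<p>keep</p>"
```

Even the lazy version goes wrong on nested divs, because it stops at the first </div> it finds, and neither pattern matches a <div> that carries attributes. That is one more reason to prefer the DOM approach for anything non-trivial.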

+2

PSEUDO CODE

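 function replaceMe($html_you_want_to_replace, $html_dom) {
     // preg_quote() escapes regex metacharacters; ^ anchors the match at the start.
     $pattern = '#^' . preg_quote($html_you_want_to_replace, '#') . '#';
     return preg_replace($pattern, '', $html_dom);
 }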

HTML Before

 <div>I'm Here</div><div>I'm next</div>

 <?php
 $html_dom = "<div>I'm Here</div><div>I'm next</div>";
 $get_rid_of = "<div>I'm Here</div>";
 echo replaceMe($get_rid_of, $html_dom);
 ?>

HTML After

 <div>I'm next</div> 

I know this is a hack job.
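
When the text to remove is a known literal string, as it is here, a regular expression may not be needed at all. A minimal sketch using str_replace instead (a swapped-in alternative to preg_replace; removeLiteral is a hypothetical name):

```php
<?php
// Hypothetical helper: removes every literal occurrence of $needle.
function removeLiteral(string $needle, string $html): string
{
    // str_replace treats $needle as plain text, so nothing needs escaping.
    return str_replace($needle, '', $html);
}

echo removeLiteral("<div>I'm Here</div>", "<div>I'm Here</div><div>I'm next</div>");
// prints: <div>I'm next</div>
```

Unlike the anchored pattern, this removes the string wherever it appears, not only at the start.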

+1