Regular expression that matches whitespace but skips certain sections

I understand that since regex is essentially stateless, complex matching is hard to achieve without supplementing it with application logic, but I am curious whether the following is possible.

Match all whitespace, easy enough: \s+

But skip the whitespace between specific delimiters, in my case <pre> and </pre> (written as the word NOSTRIP in the example below).

Are there any tricks to achieve this? I was thinking along the lines of two separate matches, one for all whitespace and one for the NOSTRIP sections (<pre> blocks), and somehow subtracting the latter from the former.

    "This is some text NOSTRIP this is more text NOSTRIP some more text."
    // becomes
    "ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext."

Nesting of the NOSTRIP sections (<pre> tags) doesn't matter, and I'm not trying to parse an HTML tree or anything like that. I'm just tidying a text file while preserving the whitespace inside the NOSTRIP sections (<pre> blocks), for obvious reasons.

(Is that better?)


This is ultimately what I went with. I'm sure it can be optimized in a few places, but it works well for now.

    public function stripWhitespace($html, array $skipTags = array('pre')){
        // Turn each tag name into a pattern that matches the whole element
        foreach($skipTags as &$tag){
            $tag = "<{$tag}.*?/{$tag}>";
        }
        unset($tag); // break the reference left over from the loop

        // Swap each skipped element for a numbered placeholder
        $skipped = array();
        $buffer = preg_replace_callback(
            '#(?<tag>' . implode('|', $skipTags) . ')#si',
            function($match) use(&$skipped){
                $skipped[] = $match['tag'];
                return "\x1D" . (count($skipped) - 1) . "\x1D";
            },
            $html
        );

        // Collapse whitespace runs, then drop whitespace next to tags
        $buffer = preg_replace('#\s+#si', ' ', $buffer);
        $buffer = preg_replace('#(?:(?<=>)\s|\s(?=<))#si', '', $buffer);

        // Put the skipped elements back
        for($i = count($skipped) - 1; $i >= 0; $i--){
            $buffer = str_replace("\x1D{$i}\x1D", $skipped[$i], $buffer);
        }
        return $buffer;
    }
2 answers

If you are using a scripting language, I would use a multi-pass approach:

  • pull out the NOSTRIP sections, save them in an array, and replace them with markers (### or something similar)
  • strip all the whitespace
  • reinsert all your saved NOSTRIP snippets
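
The three passes above could be sketched roughly as follows. This is only an illustration, assuming PHP 7 or later: the function name stripOutsideNostrip is made up for the example, and the \x1D placeholder character is assumed never to appear in the input (a "###" marker would work the same way if it cannot occur in the text).

```php
<?php
// Hypothetical helper; NOSTRIP is the delimiter from the question's example.
function stripOutsideNostrip(string $text, string $delimiter = 'NOSTRIP'): string
{
    $saved  = [];
    $quoted = preg_quote($delimiter, '#');

    // Pass 1: replace each delimited section with a numbered placeholder.
    // Assumption: the \x1D control character does not occur in the input.
    $buffer = preg_replace_callback(
        "#{$quoted}.*?{$quoted}#s",
        function (array $match) use (&$saved): string {
            $saved[] = $match[0];
            return "\x1D" . (count($saved) - 1) . "\x1D";
        },
        $text
    );

    // Pass 2: remove all whitespace outside the saved sections.
    $buffer = preg_replace('#\s+#', '', $buffer);

    // Pass 3: put the saved sections back in place of their placeholders.
    return preg_replace_callback(
        '#\x1D(\d+)\x1D#',
        function (array $match) use ($saved): string {
            return $saved[(int) $match[1]];
        },
        $buffer
    );
}

echo stripOutsideNostrip(
    'This is some text NOSTRIP this is more text NOSTRIP some more text.'
);
```

Because the placeholders contain no whitespace, the second pass leaves them alone, and an unpaired trailing delimiter is simply treated as ordinary text.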

I once created a set of functions to reduce whitespace in HTML output:

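    function minify($html) {
        if (empty($html)) {
            return $html;
        }
        // The /e modifier was removed in PHP 7, so a callback is used instead.
        // $m[1] is the text before the first <pre> block, $m[3] the block
        // itself, and $m[4] everything after it (minified recursively).
        return preg_replace_callback(
            '/^(.*)((<pre.*<\/pre>)(.*?))?$/Us',
            function ($m) {
                $pre  = isset($m[3]) ? $m[3] : '';
                $rest = isset($m[4]) ? $m[4] : '';
                return parse($m[1]) . $pre . minify($rest);
            },
            $html
        );
    }

    function parse($html) {
        // Replace runs of whitespace with a single space
        $html = preg_replace('/\s+/', ' ', $html);
        // Remove a space that is followed by either > or <
        $html = preg_replace('/ ([<>])/', '$1', $html);
        // Remove a space that directly follows >
        $html = str_replace('> ', '>', $html);
        return $html;
    }

    $html = minify($html);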

You may need to tweak this a bit to fit your needs.

