I understand that since regex is essentially stateless, complex matches are hard to achieve without supplementary application logic, but I'm curious whether the following is possible.
Match all whitespace, easy enough: \s+
But skip the whitespace between specific delimiters, in my case <pre> and </pre> tags (or a marker word such as NOSTRIP).
Are there any tricks to achieve this? I was thinking along the lines of two separate matches, one for all whitespace and one for the <pre> blocks, and somehow negating the latter from the former.
"This is some text NOSTRIP this is more text NOSTRIP some more text."
Nested <pre> tags don't matter, and I'm not trying to parse an HTML tree or anything like that, just tidying up the text file while preserving the whitespace inside <pre> blocks, for obvious reasons.
(Is that better?)
This is ultimately what I went with. I'm sure it can be optimized in several places, but it works well for now.
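public function stripWhitespace($html, array $skipTags = array('pre'))
{
    // Turn each tag name into a lazy pattern spanning the whole block.
    foreach ($skipTags as &$tag) {
        $tag = "<{$tag}.*?/{$tag}>";
    }
    unset($tag); // break the reference left behind by foreach

    // Swap every protected block for a numbered placeholder.
    $skipped = array();
    $buffer = preg_replace_callback(
        '#(?<tag>' . implode('|', $skipTags) . ')#si',
        function ($match) use (&$skipped) {
            $skipped[] = $match['tag'];
            return "\x1D" . (count($skipped) - 1) . "\x1D";
        },
        $html
    );

    // Collapse whitespace runs, then drop whitespace next to tags.
    $buffer = preg_replace('#\s+#si', ' ', $buffer);
    $buffer = preg_replace('#(?:(?<=>)\s|\s(?=<))#si', '', $buffer);

    // Restore the protected blocks, highest index first.
    for ($i = count($skipped) - 1; $i >= 0; $i--) {
        $buffer = str_replace("\x1D{$i}\x1D", $skipped[$i], $buffer);
    }

    return $buffer;
}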
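For comparison, the "one match negating the other" idea from the question can probably be done in a single pattern, since PHP's PCRE engine supports the backtracking verbs (*SKIP) and (*FAIL). This is only a minimal sketch and not what I actually went with: the function name collapseOutsidePre is made up for illustration, it only handles <pre>, and it only collapses whitespace runs to a single space (it does not do the extra trimming next to tags).

```php
<?php
// Hypothetical helper (not from the solution above): collapse whitespace
// runs to a single space everywhere except inside <pre>...</pre> blocks.
function collapseOutsidePre(string $html): string
{
    // Left alternative: consume a whole <pre> block, then (*SKIP)(*FAIL)
    // rejects that match and stops the engine from retrying inside it.
    // Right alternative: any whitespace run still matched lies outside <pre>.
    $pattern = '#<pre\b[^>]*>.*?</pre>(*SKIP)(*FAIL)|\s+#si';

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, ' ', $html) ?? $html;
}

// Usage: whitespace inside the <pre> block is left untouched.
$input = "<p>a   b</p>\n\n<pre>x   y\n  z</pre>\n<p>c\t d</p>";
echo collapseOutsidePre($input), "\n";
```

The placeholder approach above is still easier to extend to several tags and to the second trimming pass. The single-pattern version mostly just avoids the bookkeeping of swapping blocks out and back in.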