Nested regular expressions and lookbehind

Question

I'm having trouble with nested positive/negative lookahead/lookbehind in regex.

Let's say that I want to replace each '*' in a string with '%', and that '\' escapes the next character. (Turning a regex into an SQL LIKE command ^^.)

So the strings

'*test*' should be changed to '%test%' ,
'\\*test\\*' → '\\%test\\%' , but
'\*test\*' and '\\\*test\\\*' must remain unchanged.

I tried:

 (?<!\\)(?=\\\\)*\*       but this doesn't work
 (?<!\\)((?=\\\\)*\*)     ...
 (?<!\\(?=\\\\)*)\*       ...
 (?=(?<!\\)(?=\\\\)*)\*   ...

Which regular expression matches the unescaped '*' characters in the examples above?

What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\* or, if they are both essentially wrong, what is the difference between regular expressions that have such a visual structure?

+7

regex perl lookbehind regex-lookarounds lookahead

bliof Oct 23 '11 at 15:45

5 answers

Well, since Tim decided not to update his regex with my suggested mods (and Tomalak's answer is not so optimized), here is my recommended solution:

Replace: ((?<!\\)(?:\\\\)*)\* with $1%

Here it is in the form of a commented PHP snippet:

 // Replace all non-escaped asterisks with "%".
 $re = '% # Match non-escaped asterisks.
     (               # $1: Any/all preceding escaped backslashes.
       (?<!\\\\)     # At a position not preceded by a backslash,
       (?:\\\\\\\\)* # Match zero or more escaped backslashes.
     )               # End $1: Any preceding escaped backslashes.
     \*              # Unescaped literal asterisk.
     %x';
 $text = preg_replace($re, '$1%', $text);

Addendum: JavaScript non-lookbehind solution

The above solution requires lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:

 text = text.replace(/(\\[\S\s])|\*/g, function(m0, m1) { return m1 ? m1 : '%'; });

This solution replaces each instance of a backslash followed by any character with itself, and each remaining * asterisk with a % percent sign.

Edit 2011-10-24: The JavaScript version has been fixed to handle cases such as **text** correctly. (Thanks to Alan Moore for pointing out the bug in the previous version.)
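
As a quick check, the callback version above can be run against the strings from the question. This is only a sketch: the function name `starsToPercent` is made up for illustration, and `String.raw` is used just so the backslashes in the sample inputs stay literal.

```javascript
// Hypothetical wrapper around the regex from the answer above.
function starsToPercent(text) {
  // Alternative 1 captures a backslash plus whatever follows it (kept as is).
  // Alternative 2 is a bare asterisk, which can only be an unescaped one,
  // because every escaped character was already consumed by alternative 1.
  return text.replace(/(\\[\S\s])|\*/g, function (m0, m1) {
    return m1 ? m1 : '%';
  });
}

// Sample inputs taken from the question.
const samples = [
  String.raw`*test*`,       // unescaped stars
  String.raw`\\*test\\*`,   // escaped backslash, then an unescaped star
  String.raw`\*test\*`,     // escaped stars
  String.raw`\\\*test\\\*`, // escaped backslash, then an escaped star
  '**text**',               // consecutive unescaped stars
];

for (const s of samples) {
  console.log(s, '->', starsToPercent(s));
}
```

Only the first, second and last samples come back changed; the two inputs whose stars are escaped are returned as they were.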

+8

ridgerunner Oct 23 '11 at 16:46

Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution:

 s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;

The bulk of the regular expression, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than an asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. This lets it avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.

\G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping past an escaping backslash and matching the asterisk after it anyway. Thus, each iteration of the /g-controlled match consumes everything up to the next unescaped asterisk, capturing everything except the asterisk in group #1. Then that is plugged back in and the * is replaced with %.

I think this is at least as readable as the lookaround approaches, and easier to understand. It requires \G support, so it won't work in JavaScript or Python, but it works fine in Perl.
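
JavaScript still has no \G, but engines that implement the ES2015 sticky flag (y) offer a similar kind of anchoring: with both g and y set, each match has to start exactly where the previous one ended, and replacing stops at the first position where that fails. The following is a minimal sketch of the same unrolled-loop pattern under that assumption; the function name is hypothetical, and [\s\S] stands in for the dot so that an escaped newline is consumed as well.

```javascript
// Hypothetical helper: the unrolled-loop pattern, anchored with the sticky flag.
function starsToPercentSticky(text) {
  // Group 1: runs of ordinary characters interleaved with backslash pairs.
  // The trailing \* is then necessarily an unescaped asterisk.
  const re = /([^*\\]*(?:\\[\s\S][^*\\]*)*)\*/gy;
  return text.replace(re, '$1%');
}

console.log(starsToPercentSticky(String.raw`\\*test\\*`)); // -> \\%test\\%
console.log(starsToPercentSticky(String.raw`\*test\*`));   // -> \*test\*
```

Any text after the last unescaped asterisk is never matched, so it is left in place untouched.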

+5

Alan Moore Oct 23 '11 at 23:39

So, you essentially want to match * only if it is preceded by an even number of backslashes (or, in other words, if it is not escaped)? Then you don't need lookahead at all, because you are only looking back, right?

Search for

 (?<=(?<!\\)(?:\\\\)*)\*

and replace with % .

Explanation:

 (?<=          # Assert that it's possible to match before the current position...
   (?<!\\)     # (unless there are more backslashes before that)
   (?:\\\\)*   # an even number of backslashes
 )             # End of lookbehind
 \*            # Then match an asterisk
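
For readers who want to try this pattern: it needs an engine with variable-length lookbehind. JavaScript had no lookbehind at all when these answers were written, but engines that implement ES2018 accept this pattern, so a minimal sketch under that assumption could look like the following (the function name is hypothetical).

```javascript
// Hypothetical helper; assumes an ES2018+ engine with lookbehind support.
function starsToPercentLookbehind(text) {
  // The asterisk matches only if an even number (possibly zero) of
  // backslashes sits directly in front of it.
  return text.replace(/(?<=(?<!\\)(?:\\\\)*)\*/g, '%');
}

console.log(starsToPercentLookbehind(String.raw`*test*`));
console.log(starsToPercentLookbehind(String.raw`\\\*test\\\*`));
```

Since the lookbehind consumes nothing, the replacement is a plain '%' with no capture group to put back.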

+3

Tim Pietzcker Oct 23 '11 at 16:01

The problem of detecting escaped backslashes in a regular expression has fascinated me for some time, and only recently did I realize that I was completely overcomplicating it. A couple of things make it simpler, and as far as I can tell, no one here has noticed them yet:

Backslashes escape any character after them, not just other backslashes. Thus (\\.)* will eat a whole chain of escaped characters, whether they are backslashes or not. You do not need to worry about even or odd numbers of backslashes; just check for a lone \ at the beginning or end of the chain (ridgerunner's JavaScript solution uses this).
Lookarounds are not the only way to make sure that you start with the first backslash in the chain. You can simply look for a non-backslash character (or the start of the string).

The result is a short, simple pattern that does not require backreferences or callbacks, and is shorter than anything else I see so far.

 /(?<!\\)((?:\\.)*)\*/g

And the replacement string:

 "$1%"

This works in .NET, which allows lookbehinds, and it should work for you in Perl. It can be done in JavaScript too, but without lookbehinds or the \G anchor I see no way to do it in a one-liner. Ridgerunner's callback should work, as should a loop:

 var regx = /(^|[^\\])((?:\\.)*)\*/g;
 // Repeat until nothing matches: a single pass can miss a star that
 // directly follows a star it has just replaced.
 while (input.match(regx)) {
   input = input.replace(regx, '$1$2%');
 }

There are a lot of names here that I recognize from other regex questions, and I know that some of you are smarter than me. If I have made a mistake, please say so.

0

Justin Morgan Oct 16 '12 at 20:38

Tomalak · Accepted Answer · 2011-10-23T16:16:09+0000

To find an unescaped character, you must look for a character preceded by an even number (or zero) of escape characters. This is relatively straightforward.

 (?<=(?<!\\)(?:\\\\)*)\* # this is explained in Tim Pietzcker's answer

Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute a look-ahead:

 (?=(?<!\\)(?:\\\\)*\*)(\\*)\* # also look at ridgerunner's improved version

Replace this with the contents of group 1 followed by a % sign.

Explanation

 (?=           # start look-ahead
   (?<!\\)     # a position not preceded by a backslash (via look-behind)
   (?:\\\\)*   # an even number of backslashes (don't capture them)
   \*          # a star
 )             # end look-ahead. If found,
 (             # start group 1
   \\*         # match any number of backslashes in front of the star
 )             # end group 1
 \*            # match the star itself

The look-ahead makes sure that only even numbers of backslashes are taken into account. Either way, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
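
To see the group-1 replacement at work, here is a small sketch of the look-ahead variant. It is written in JavaScript purely for illustration and assumes an engine with ES2018 lookbehind support, since the pattern still contains a fixed-length look-behind; the function name is hypothetical.

```javascript
// Hypothetical helper; assumes an ES2018+ engine (fixed-length lookbehind).
function starsToPercentLookahead(text) {
  // The look-ahead checks "even run of backslashes, then a star" without
  // consuming anything, so group 1 has to match the backslashes again
  // in order to put them back in front of the '%'.
  return text.replace(/(?=(?<!\\)(?:\\\\)*\*)(\\*)\*/g, '$1%');
}

console.log(starsToPercentLookahead(String.raw`\\*test\\*`)); // -> \\%test\\%
console.log(starsToPercentLookahead(String.raw`\*test\*`));   // -> \*test\*
```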
