Why is this regex hanging with two backslashes but works fine with one?

I have the following code using regex that hangs on the last line at startup:

const char PathSeparator = '\\'; const string PathPrefix = "\\\\"; var reg = new Regex(string.Format("^{0}(([^{1}\r\n\v\f\u2028\u2029]+){1}?)+$", Regex.Escape(PathPrefix), Regex.Escape(PathSeparator.ToString()))); var test = @"\\Inbox\Test123\Intermediate\3322FB69-FE3F-407E-9E15-584382A407A2\\"; var matches = reg.Matches(test); var group = matches[0].Groups[2]; 

It works fine if I remove one of the last backslashes in test , for example.

 var test = @"\\Inbox\Test123\Intermediate\3322FB69-FE3F-407E-9E15-584382A407A2\"; 

Can someone please help me figure out why it hangs? I know I can set a timeout; This is not my question.

+6
source share
1 answer

Now you have two problems

I have to start this answer with the obligatory mention of the famous quote . While regex is a powerful tool that can solve many problems (including this one) effectively, it can also become a big source of problems if you are not an expert on this issue and do not hurt, which could be completely avoid technical solutions.

In this case, you can trivially perform the same task by dividing the input string by slashes, and then checking each marker created by itself (yes, maybe even with a regular expression). And in fact, this can be a reasonable step in this, because it will significantly reduce the coefficient of complexity of the solution; something that will be very useful in the future when a β€œlittle setup” (which increases complexity) is required.

However, let's look at interesting things: what happened?

Why does it work forever?

Due to excessive backtracking caused by nested quantifiers + . This will be easier to illustrate using a more manual sample entry.

 \\Foo\Bar\Baz\\ 

See what the regex engine will try to match when submitting this input.

1) First try: the innermost group matches Foo\ and Bar\ and Baz\ . Further repetition is not performed due to the final \ , then the end-of-entry marker does not fit, so the attempt is rejected and the engine backs off.

Note that if the trailing slash was not there, the regex engine will declare success and will return at that point.

2) Since the slash in the inner group is optional, the following attempt: Foo\ , Bar\ , Baz , which also obviously fails. More backtracking.

3) The next candidate corresponds to four repetitions of the innermost group: Foo\ , Bar\ , Ba and z\ . Error, more digressions.

From now on, I simply enumerated attempts at matches, dividing the repetitions of the innermost group into one space:

 Foo\ Bar\ Ba z (ie 4 repetitions of length 4, 4, 2, 1, and fails) Foo\ Bar\ Ba Foo\ Bar\ B az\ Foo\ Bar\ B az Foo\ Bar\ B a Foo\ Bar\ 

And so on.

It should be noted here that an unreasonable number of attempts were made to even exclude that the final segment of Baz could be part of a successful match: we had to consider Baz, Ba + z, Ba, B + az "," B "+" a "and" B "; in the general case, for a segment of length N, the number of possibilities is N! (factorial). The growth of factor function goes beyond the intuitive understanding of man as Wikipedia illustrates .

Since your sample input contains a final segment of length 36, it is obvious that the attempt to match this regular expression with an incorrectly formatted input will never end.

How to prevent this?

In this case, it is quite simple. Since you know this, for example, Baz, it makes no sense to try to break it into smaller pieces if its coincidence as a whole fails (because these pieces should be separated by a slash, which we already know, this is not because the slashes are not a permitted part of an attempt to match), use the atom capture group:

 var reg = new Regex(string.Format( "^{0}(?>([^{1}\r\n\v\f\u2028\u2029]+{1}?))+$", Regex.Escape(PathPrefix), Regex.Escape(PathSeparator.ToString()))); 

This will only cancel 1 time instead of N! for each segment of the path in a bad match, reducing the time to failure to zero.

+3
source

All Articles