Regular Expression Freeze Program (100% CPU Usage)

Java hangs with 100% CPU usage when I use the line below as input for regular expression.

Used RegEx:

Here is the regular expression used for the description field in my application.

^([A-Za-z0-9\\-\\_\\.\\&\\,]+[\\s]*)+ 

String used for testing:

SaaS Service VLAN from Provider_One
The second attempt with Didier SPT, because the first one he gave me was wrong: - (

It works correctly when I split the same line in different combinations. Like the "SaaS Service VLAN from Provider_One", "the first one he gave me was wrong :-(" etc. Java only hangs for the above line.

I also tried to optimize the regex as shown below.

 ^([\\w\\-\\.\\&\\,]+[\\s]*)+ 

Even so it doesn’t work.

+25
java regex
Jul 17 2018-12-17T00:
source share
3 answers

Another classic case of catastrophic backtracking .

You have nested quantifiers that cause a gigantic amount of permutations to be checked when the regular expression comes in : in your input line, which is not part of your character class (if you use the .matches() method).

Let the task to this regular expression be simplified:

 ^([^:]+)+$ 

and this line:

 1234: 

The regex engine should check

 1234 # no repetition of the capturing group 123 4 # first repetition of the group: 123; second repetition: 4 12 34 # etc. 12 3 4 1 234 1 23 4 1 2 34 1 2 3 4 

... and it's just for four characters. On your sample line, RegexBuddy is aborted after 1 million attempts. Java will happily continue to interrupt ... before finally recognizing that none of these combinations allow matching the following :

How can you solve this?

You can forbid regex with return using possessive quantifiers :

 ^([A-Za-z0-9_.&,-]++\\s*+)+ 

will allow regular expressions to work faster. By the way, I removed all unnecessary backslashes.

Edit:

Several measurements:

In the line "was wrong :-)" to calculate the discrepancy, you must follow the steps of RegexBuddy 862.
For "me was wrong :-)" this is 1,742 steps.
For "gave me was wrong :-)" , 14014 steps.
For "he gave me was wrong :-)" , 28,046 steps.
For "one he gave me was wrong :-)" , 112,222 steps.
For "first one he gave me was wrong :-)" ,> 1,000,000 steps.

+58
Jul 17 '12 at 13:11
source share

First, you need to understand that your regular expressions CANNOT match the input line you entered. Strings contain numeric characters ( '<' '>' '/' ':' and ')' ), which are not characters of the word.

So why is it taking so long?

Basically a "catastrophic rollback." More specifically, the repetitive patterns of your regex provide an exponential number of alternatives for the regex tracing algorithm to try!

Here is what your regular expression says:

  • One or more characters of the word
  • Following zero or more space characters
  • Repeat the previous 2 patterns as many times as you like.

The problem is that the part is “zero or more spaces”. The first time a match will match everything until the first unexpected character (i.e. '<' ). Then he will back off a bit and try again with another alternative ... which includes “zero spaces” before the last letter, and then when that fails, it will move the “zero spaces” back one position.

The problem is that for a String with N non-spatial characters, there are N different places that can be mapped to “null spaces” and that makes 2^N different combinations. This quickly turns into a HUGE number when N grows, and the end result is difficult to distinguish from an infinite loop.

+15
Jul 17 '12 at 13:11
source share

Why do you match spaces separately from other characters? And why do you set the match at the beginning, but not at the end? If you want the line not to start or end with a space, you should do something like this:

 ^[A-Za-z0-9_.&,-]+(?:\s+[A-Za-z0-9_.&,-]+)*$ 

Now there is only one “path” that the regex engine can take through a string. If a character that matches [A-Za-z0-9_.&,-] runs through before reaching the end, and the next character does not match \s , it will work immediately. If it reaches the end, still matching the space characters, it fails because for each space run it is necessary to combine at least one character without spaces.

If you want to make sure that there is only one whitespace separating runs without spaces, just remove the quantifier from \s+ :

 ^[A-Za-z0-9_.&,-]+(?:\s[A-Za-z0-9_.&,-]+)*$ 

If you don't care where the space refers to the non-space, just map them all to the same character class:

 ^[A-Za-z0-9_.&,\s-]+$ 

I assume that you know that your regular expression will not match the specified input due to : and ( in the emoticon, and you just want to know why it takes so long to fail.

And of course, since you are creating a regular expression in the form of a Java string literal, you should write:

 "^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$" 

or

 "^[A-Za-z0-9_.&,-]+(?:\\s[A-Za-z0-9_.&,-]+)*$" 

or

 "^[A-Za-z0-9_.&,\\s-]+$" 

(I know that you had a double backslash in the original question, but that was probably just to make them display correctly, since you did not use SO's excellent code formatting function.)

+4
Jul 17 '12 at 15:34
source share



All Articles