Regular expression for username increases CPU usage

Question

Regular expression for username increases CPU usage

We have the following rules for verifying a username:

Username may contain alphanumeric characters
The username may have an underscore, hyphen, or period.
Now suppose the username is in ASCII
Username cannot start or end with a period
The username cannot begin, end or have spaces.

To do this, we have the following regular expression:

^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$

Now, trying to match a specific string, CPU usage is growing exponentially. For example, for example:

 M45766235H.M96312865E@EXAMPLE.COM

Obviously, the String match, as stated above, should end instantly, but I want to know why it takes so many CPU cycles. Final code:

 import java.util.regex.*; public class R { static final Pattern namePattern = Pattern.compile("^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$"); public static void main(String... args) { final String userName = "M45766235H.M96312865E@EXAMPLE.COM"; Matcher matcher = namePattern.matcher(userName); System.out.println(matcher.matches()); } }

On different lines, I rewrote the regex as shown below and it works well:

 ^[\\w]+[\\w-_\\.]*[\\w]+$

+8

java regex

user320599 Apr 29 '13 at 20:16

source share

1 answer

dasblinkenlight · Accepted Answer · 2013-04-29T21:03:29+0000

When regex engines use backtracking a lot, the matching process becomes very slow. Rollback happens very often when you allow different parts of your expression to match the particles of your input, especially when there is no match: the engine tries to separate the possibilities of separation between parts of your regular expression.

Consider this simple example from your regular expression: use [a-zA-Z0-9]+[_-]*[a-zA-Z0-9]* to match M45766235H. . Please note that there are two subexpressions that can cover all characters, starting from the second: the engine must consider all these features:

 [a-zA-Z0-9]+ [a-zA-Z0-9]* ------------ ------------ M45766235H <nothing> M45766235 H M4576623 5H M457662 35H M45766 235H M4576 6235H M457 66235H M45 766235H M4 5766235H M 45766235H

Given that there is no coincidence, these are ten useless repetitions right there. But this is not the end! When there are other subexpressions in the mix that can cause ambiguous coverage, these ten possible matches will be checked for each of the possible matches in the string.

To say that the consequences of retreating quickly add up is an understatement: they multiply exponentially, eventually consuming a lot of your processor.

The moral of this story is to try to reduce the number of digressions. For example, your second expression

 ^[\\w]+[\\w-_\\.]*[\\w]+$

can be rewritten as follows:

 ^\\w[\\w-_\\.]*\\w$

The two expressions will match the same set of inputs, but if there is a match, the second expression will have a unique match, while the original expression will have approximately (N-2)^3 different ways to split the matched string among the three sub-expressions corresponding to the characters of words.

Here are some more related articles: Efficiency of greedy versus lazy quantifiers .

Regular expression for username increases CPU usage

More articles: