Java.lang.StackOverflowError on java.util.regex.Pattern $ BmpCharProperty.match (Pattern.java:3715)

I get a StackOverflowError when I use after Reg Ex:

 "([AZ][AZ]\\d\\d[AZ]\\[(\\*|(((\\d|\\d\\d)-(\\d|\\d\\d))|(\\d|\\d\\d)))\\](,|$))+"; 

to map something like this String :

 RA01D[1-1],RA01D[17-17],RA01D[2-2],RA01D[18-18] 
+5
source share
2 answers

What stribizhev's answer indicated and recorded inefficiency in regular expression. There is no catastrophic retreat. The change only slightly delays the StackOverflowError without its permission (see Appendix ).

In the original regular expression, if the first branch (\d|\d\d)-(\d|\d\d) fails, the second branch will again do the additional work corresponding to (\d|\d\d) , which is prefix of the first branch.

 ( ( (\d|\d\d)-(\d|\d\d) ) | (\d|\d\d) ) 

When re-writing (as shown in his answer) the prefix (\d|\d\d) will be matched only once, and the engine should check only two different sequels (match -(\d|\d\d) or just an empty string )

 (\d|\d\d)(?:-(\d|\d\d))? 

Here's how his answer improves the effectiveness of regular expressions. The same method applies to \d|\d\d .


Back to the StackOverflowError problem. If you run a regular expression in a string with 1000 elements, any of the above expressions will raise a StackOverflowError . This is due to the implementation of the Pattern class in Sun / Oracle / OpenJDK, which uses recursion for the greedy and lazy quantifier.

Since the regular expression is not ambiguous , you can fix this problem by doing a quantifier on an external large group. The regular expression is copied from the stribizhev response with some changes:

 "(?:[AZ][AZ]\\d\\d[AZ]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))++" ^^ 

Because the implementation uses a loop to implement a quantifier (since there is no need to return), a StackOverflowError cannot occur, no matter how long the input line takes. Using a stack is just one repetition, unlike the case in the question where it grows linearly with the number of elements in a row.

application

Testing program

The following is a test program showing the number of elements that a regular expression can handle. On my system (Oracle JRE, version 1.8.0_25), the regular expression in the question manages to reach 104 * 4 = 416 elements before the failure, stribizhev the answer manages to reach 137 * 4 = 548, the stribizhev answer is changed to remove unnecessary group commands up to 150 * 4 = 600, and the version with possessive quantifier passes all tests (500 * 4 = 2000 elements).

 public class SO29758814 { public static void main(String[] args) { String s = ""; for (int i = 1; i <= 500; i++) { s += "RA01D[1-1],RA01D[17-17],RA01D[2-2],RA01D[18-18],"; System.out.print(i); try { // Question System.out.print(" 1 " + s.matches("([AZ][AZ]\\d\\d[AZ]\\[(\\*|(((\\d|\\d\\d)-(\\d|\\d\\d))|(\\d|\\d\\d)))\\](,|$))+")); } catch (Throwable e) { } try { // stribizhev answer System.out.print(" 2 " + s.matches("([AZ]{2}\\d{2}[AZ]\\[(\\*|((\\d{1,2})(?:-(\\d{1,2}))?))\\](?:,|$))+")); } catch (Throwable e) { } try { // stribizhev answer, remove unnecessary groups System.out.print(" 3 " + s.matches("(?:[AZ][AZ]\\d\\d[AZ]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))+")); } catch (Throwable e) { } try { // stribizhev answer, remove unnecessary groups, use possessive quantifier System.out.print(" 4 " + s.matches("(?:[AZ][AZ]\\d\\d[AZ]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))++")); } catch (Throwable e) { } System.out.println(); } } } 
+4
source

Your regex contains alternative lists that have similar patterns that often lead to catastrophic back-bounce and can affect performance. Take a look at this template:

  ( ( (\d|\d\d)-(\d|\d\d) ) | (\d|\d\d) ) 

He is equal

 ( ( (\d|\d\d)(?:-(\d|\d\d))? ) ) 

In addition, you are better off using quantifiers, (\d|\d\d) equals \d{1,2} . I also doubt that you need to grab a comma or end of line, so add a group without grabbing (?:,|$) .

So try using this regex ( see demo here )

 ([AZ]{2}\d{2}[AZ]\[(\*|((\d{1,2})(?:-(\d{1,2}))?))\](?:,|$))+ 

Or as a Java string:

 String pattern = "([AZ]{2}\\d{2}[AZ]\\[(\\*|((\\d{1,2})(?:-(\\d{1,2}))?))\\](?:,|$))+"; 

You can also customize the number of capture groups.

+1
source

All Articles