Find out which group matches a Java regular expression without a linear search?

I have a programmatically built huge regex like this

(A)|(B)|(C)|... 

Each sub-matrix is ​​in its capture group. When I get a match, how do I determine which group matches without linear testing each group(i) to see that it returns a non-empty string?

+4
source share
5 answers

If your regular expression is programmed generated, why not programmatically generate n separate regular expressions and check each of them in turn? If they do not have a common prefix, and the Java regex mechanism is smart, all alternatives will still be tested.

Update: I just looked at the source of Sun Java, specifically java.util.regex.Pattern $ Branch.match (), and also just doing a linear search on all alternatives, each of which tries in turn. Other places where the branch is used do not offer any optimization of common prefixes.

+4
source

You can use non-exciting groups rather than:

(A) | (B) | (C) | ...

replace

((?: A) | (: B) | (: C))

Insufficient groups (? :) will not be included in the number of groups, but the result of the branch will be fixed in the external () group.

+1
source

Divide your regex into three:

 String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" }; for(int i = 0; i < regexes.length; i++) { Pattern pattern = Pattern.compile(regexes[i]); Matcher matcher = pattern.matcher(inputStr); if(matcher.matches()) { //process, optionally break out of loop } } public int getMatchedGroupIndex(Matcher matcher) { int index = -1; for(int i = 0; i < matcher.groupCount(); i++) { if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) { index = i; } } return index; } 

Alternative option:

 for(int i = 0; i < matcher.groupCount(); i++) { if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) { //process, optionally break out of loop } } 
0
source

I don't think you can get around the linear search, but you can make it much more efficient by using start(int) instead of group(int) .

 static int getMatchedGroupIndex(Matcher m) { int index = -1; for (int i = 1, n = m.groupCount(); i <= n; i++) { if ( (index = m.start(i)) != -1 ) { break; } } return index; } 

Thus, instead of generating a substring for each group, you simply request an int value representing its starting index.

0
source

From various comments, it seems that the simple answer is no, and that using separate regular expressions is the best idea. To improve this approach, you may need to figure out common template prefixes when creating them or use your own regular expression mechanism (or another). But before you go to all these efforts, you must be sure that this is a significant bottleneck in your system. In other words, compare it and see if the performance is acceptable for realistic input, and if there is no profile to see where the real bottlenecks are.

0
source

All Articles