Java regex efficiency: is it better to use one somewhat complex pattern or many simple ones?

Question

I do fairly extensive string manipulations using regular expressions in Java. I currently have many blocks of code that look something like this:

    Matcher m = Pattern.compile("some pattern").matcher(text);
    StringBuilder b = new StringBuilder();
    int prevMatchIx = 0;
    while (m.find()) {
        b.append(text.substring(prevMatchIx, m.start()));
        String matchingText = m.group(); // sometimes group(n)
        // manipulate the matching text
        b.append(matchingText);
        prevMatchIx = m.end();
    }
    text = b.toString() + text.substring(prevMatchIx);

My question is which of the two alternatives is more efficient (primarily time, but space to some extent):

1) Keep the many existing blocks as described above (assuming there is no better way to handle such blocks; I cannot use a simple replaceAll() because I need access to the groups).

2) Consolidate the blocks into one large block. Use a "some pattern" that combines all the old blocks' patterns with the | (alternation) operator, then use if / else if inside the loop to handle each of the matching patterns.

Thank you for your help!

+4

java performance regex

Noah_r-c Jul 22 '10 at 22:20

5 answers

I would suggest caching the patterns and having a method that uses the cache.

Patterns are expensive to compile, so at least this way you compile each one only once, and you get code reuse by using the same method for every instance. It's a shame about the lack of closures, though; they would make things a lot cleaner.

    private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();

    static Pattern findPattern(String patStr) {
        if (!patterns.containsKey(patStr))
            patterns.put(patStr, Pattern.compile(patStr));
        return patterns.get(patStr);
    }

    public interface MatchProcessor {
        public void process(String field);
    }

    public static void processMatches(String text, String pat, MatchProcessor processor) {
        Matcher m = findPattern(pat).matcher(text);
        // find() resumes after the previous match by itself; calling
        // find(m.end()) would loop forever on a zero-length match
        while (m.find()) {
            processor.process(m.group());
        }
    }
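On Java 8 or later, the closure this answer wishes for can be written as a lambda. The following is only a sketch of the same caching idea under that assumption; the names PatternCacheSketch, cached and forEachMatch are made up for illustration and are not part of the answer above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternCacheSketch {
    // Hypothetical cache: regex source text -> compiled Pattern.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles the regex on first use and returns the same Pattern afterwards.
    static Pattern cached(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Hands every match of regex in text to the given callback.
    static void forEachMatch(String text, String regex, Consumer<String> action) {
        Matcher m = cached(regex).matcher(text);
        while (m.find()) {
            action.accept(m.group());
        }
    }

    public static void main(String[] args) {
        forEachMatch("a1 b22 c333", "\\d+", match -> System.out.println(match));
    }
}
```

A ConcurrentHashMap is used here so the cache can be shared between threads; a plain HashMap, as in the answer, is fine for single-threaded code.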

+2

Don mackenzie Jul 22 '10 at 23:01

The last time I was in your position, I used a product called jflex .

Java regex does not provide the traditional O(N log M) performance guarantees of true regex engines (for input strings of length N and patterns of length M). Instead, it inherits exponential time for some patterns from its Perl roots. Unfortunately, these pathological patterns, although rare in normal use, are all too common when combining regular expressions the way you propose to do (I can attest to this from personal experience).
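To see the kind of pattern meant here, a small experiment can help. The sketch below is an illustration, not taken from the answer: it uses the textbook nested-quantifier pattern (a+)+b against inputs that cannot match, next to the equivalent pattern a+b. On a backtracking engine the first one is expected to slow down sharply as the input grows; the exact timings depend on the JDK and the machine, so none are claimed here.

```java
import java.util.regex.Pattern;

final class BacktrackingSketch {
    // Nested quantifier: many ways to split the run of 'a's between iterations.
    static final Pattern NESTED = Pattern.compile("(a+)+b");
    // Accepts the same strings (one or more 'a' followed by 'b') without nesting.
    static final Pattern FLAT = Pattern.compile("a+b");

    // Builds n copies of 'a' followed by 'c', which neither pattern can match.
    static String input(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append('c').toString();
    }

    // Crude wall-clock timing of a single failed match attempt.
    static long timeNanos(Pattern p, String s) {
        long start = System.nanoTime();
        p.matcher(s).matches();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int n = 8; n <= 20; n += 4) {
            String s = input(n);
            System.out.println("n=" + n
                    + " nested=" + timeNanos(NESTED, s) + " ns"
                    + " flat=" + timeNanos(FLAT, s) + " ns");
        }
    }
}
```

The input sizes are kept small on purpose; raising n much further can make the nested pattern take a very long time.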

Therefore, my advice is as follows:

a) pre-compile your patterns as "static final Pattern" constants, so they are initialized once during class initialization (<clinit>); or

b) switch to a lexer package, for example jflex , which will provide a more declarative and much more readable syntax for approaching such cascading / sequential processing of regular expressions; and

c) seriously consider using a parser generator package. My current favorite is Beaver , but CUP is also a good option. Both of them are great tools, and I highly recommend them both, and since they both sit on top of jflex, you can add them as / when you need them.
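Option (a) could look roughly like the sketch below. The class name, the two patterns and the normalize method are placeholders chosen for illustration; the point is only that each Pattern is compiled once, when the class is initialized, and then reused on every call.

```java
import java.util.regex.Pattern;

final class PrecompiledPatterns {
    // Compiled once during class initialization and shared by all calls.
    private static final Pattern SPACES = Pattern.compile("\\s+");
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Example manipulation: collapse whitespace runs, then mask numbers.
    static String normalize(String text) {
        String collapsed = SPACES.matcher(text).replaceAll(" ");
        return DIGITS.matcher(collapsed).replaceAll("#");
    }

    public static void main(String[] args) {
        System.out.println(normalize("a  1\t22 b"));
    }
}
```

Pattern instances are immutable and safe to share between threads; the Matcher objects created from them are not.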

If you have not used a parser generator before and you are in a hurry, you will find it easier to get up to speed with JavaCC. It is not as powerful as Beaver / CUP, but its parsing model is easier to understand.

Whatever you do, do not use Antlr. It is very fashionable and has a large fan base, but its online documentation is crap, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc (v1).

Note. Yes, I used everything that I mentioned above, and more than that; therefore, this advice comes from personal experience.

+1

Recurse Jul 23 '10 at 1:42

First, does this need to be efficient? If not, don't bother: the added complexity won't help code maintainability.

Assuming it does, running the patterns separately is usually the most efficient. This is especially true if the expressions contain large blocks of literal text: without alternation, that text can be used to speed up matching, whereas with alternation it can't help at all.

If performance is really important, you can code it several ways and test with sample data.
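A rough way to run such a comparison is sketched below. It reuses the XML-escaping task discussed elsewhere on this page as stand-in sample data; the class and method names are invented for the example, and the timing loop is deliberately crude (one warm-up phase, one measurement), so treat the numbers as a hint only. A proper benchmark harness such as JMH would give more trustworthy results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTimingSketch {
    private static final Pattern AMP = Pattern.compile("&");
    private static final Pattern LT = Pattern.compile("<");
    private static final Pattern GT = Pattern.compile(">");
    private static final Pattern COMBINED = Pattern.compile("(&)|(<)|(>)");

    // Technique 1: one pass over the text per simple pattern.
    static String separatePasses(String text) {
        String s = AMP.matcher(text).replaceAll("&amp;");
        s = LT.matcher(s).replaceAll("&lt;");
        return GT.matcher(s).replaceAll("&gt;");
    }

    // Technique 2: a single pass with one combined pattern.
    static String singlePass(String text) {
        Matcher m = COMBINED.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String repl = m.start(1) != -1 ? "&amp;"
                        : m.start(2) != -1 ? "&lt;"
                        : "&gt;";
            m.appendReplacement(sb, repl); // safe: no '$' or '\' in repl
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sample.append("One < two & four > three. ");
        }
        String text = sample.toString();

        // Warm-up so the JIT has a chance to compile both paths.
        for (int i = 0; i < 20; i++) {
            separatePasses(text);
            singlePass(text);
        }

        long t0 = System.nanoTime();
        String a = separatePasses(text);
        long t1 = System.nanoTime();
        String b = singlePass(text);
        long t2 = System.nanoTime();

        System.out.println("same output:     " + a.equals(b));
        System.out.println("separate passes: " + (t1 - t0) / 1_000 + " us");
        System.out.println("single pass:     " + (t2 - t1) / 1_000 + " us");
    }
}
```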

0

Charles Jul 22 '10 at 22:25

Option #2 is by far the best way to go, assuming it's not too difficult to combine the regular expressions. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e. appendReplacement() and appendTail()) is also available for your use.

Taking the example that @mangst used, here is how you can process some text that needs to be inserted into an XML document:

    import java.util.regex.*;

    public class Test {
        public static void main(String[] args) {
            String test_in = "One < two & four > three.";
            Pattern p = Pattern.compile("(&)|(<)|(>)");
            Matcher m = p.matcher(test_in);
            StringBuffer sb = new StringBuffer(); // (1)
            while (m.find()) {
                String repl = m.start(1) != -1 ? "&amp;" :
                              m.start(2) != -1 ? "&lt;" :
                              m.start(3) != -1 ? "&gt;" : "";
                m.appendReplacement(sb, ""); // (2)
                sb.append(repl);
            }
            m.appendTail(sb);
            System.out.println(sb.toString());
        }
    }

In this very simple example, all I need to know about each match is which capturing group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.

Note (1): As of JDK 1.6, we have to use StringBuffer here because StringBuilder did not exist when the Matcher class was written. StringBuilder overloads of appendReplacement() and appendTail() were added later, in Java 9.

Note (2): appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the nth capturing group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.
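Another way to keep appendReplacement() from interpreting the replacement text is Matcher.quoteReplacement(), which escapes any $ and \ characters so they are taken literally. The sketch below shows that variant; the PRICE placeholder and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteReplacementSketch {
    // Hypothetical placeholder token to be filled in.
    private static final Pattern PLACEHOLDER = Pattern.compile("PRICE");

    // Replaces every placeholder with the literal text, even if it contains '$'.
    static String fill(String text, String literal) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Without quoting, "$1" in the literal would be read as "group 1".
            m.appendReplacement(sb, Matcher.quoteReplacement(literal));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fill("Cost: PRICE", "$1.00"));
    }
}
```

Both approaches produce the same text; quoting keeps the replacement in a single call, while the empty-string trick avoids the extra escaping step.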

0

Alan moore Jul 23 '10 at 0:46

Michael · Accepted Answer · 2010-07-22T22:39:05+0000

If the order in which the replacements occur is important, you have to be careful when using technique #1. Let me give you an example: if I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (for example, < to &lt;). Using technique #2, you don't have to worry about this because you are making all the replacements in one pass.
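The ordering problem is easy to reproduce. The sketch below is illustrative only: it uses the literal String.replace() method instead of regular expressions to keep it short, and the class and method names are made up. Escaping & last re-escapes the ampersands that the earlier replacements introduced.

```java
final class EscapeOrderSketch {
    // Correct order for sequential passes: '&' is handled first.
    static String ampFirst(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrong order: the '&' produced by "&lt;" and "&gt;" gets escaped again.
    static String ampLast(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String in = "a < b & c";
        System.out.println(ampFirst(in)); // a &lt; b &amp; c
        System.out.println(ampLast(in));  // a &amp;lt; b &amp; c
    }
}
```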

In terms of performance, I think #2 will be quicker because you will be doing fewer String concatenations. As always, you can implement both techniques and measure their speed and memory consumption to find out for sure. :)
