Java regular expressions using Pattern and Matcher

Question

My question is about regular expressions in Java, and in particular about finding multiple matches for a given search pattern. All the information I need is on one line, which contains aliases (for example, SA), each mapped to an IP address. The entries are separated by commas, and I need to extract each of them.

SA "239.255.252.1", SB "239.255.252.2", SC "239.255.252.3", SD "239.255.252.4"

My regex looks like this:

 Pattern alias = Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");
 Matcher match = alias.matcher(lineInFile);
 while (match.find()) {
     // do something
 }

This works, but I'm not quite happy with it: introducing this small piece of code slowed my program down a little (<1 s), enough to notice the difference.

So my question is: am I doing this right? Is there a more efficient or possibly simpler solution that does not need a loop over match.find() and/or the Pattern/Matcher classes?

java regex matcher

Wilko Sep 29 '10 at 9:03

6 answers

Joachim Sauer · Answer 1 · 2010-09-29T09:28:46+0000

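If a string cannot contain anything other than a single alias definition, then using .matches() instead of .find() can speed things up, since the matcher only attempts a match at the start of the string instead of at every character position.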
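To make the difference concrete, here is a minimal sketch that contrasts the two calls. The class and method names are made up for illustration; the pattern is the one from the question. Note that matches() only succeeds when the whole input is one alias definition, so it does not fit a line holding several comma-separated entries.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class MatchesVsFind {
    // Pattern from the question, compiled once.
    static final Pattern ALIAS =
            Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

    // True only if the entire input is exactly one alias definition.
    static boolean isSingleDefinition(String input) {
        return ALIAS.matcher(input).matches();
    }

    // Counts alias definitions found anywhere in the input.
    static int countDefinitions(String input) {
        Matcher m = ALIAS.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String one = "SA \"239.255.252.1\"";
        String two = "SA \"239.255.252.1\", SB \"239.255.252.2\"";
        System.out.println(isSingleDefinition(one)); // true
        System.out.println(isSingleDefinition(two)); // false
        System.out.println(countDefinitions(two));   // 2
    }
}
```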

Sean Patrick Floyd · Answer 2 · 2010-09-29T09:36:44+0000

I'm afraid your code looks pretty efficient. Here is my version:

 Matcher match = Pattern
     .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
     .matcher(lineInFile);
 while (match.find()) {
     // do something
 }

There are two micro-optimizations:

No need to keep the pattern in an additional variable; I inlined it
For the alias, match word characters (\w) rather than non-whitespace characters (\S)

In fact, if you do a lot of this processing and the pattern never changes, you should keep the compiled pattern in a constant:

 private static final Pattern PATTERN = Pattern
     .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

 Matcher match = PATTERN.matcher(lineInFile);
 while (match.find()) {
     // do something
 }
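As a fuller illustration of the constant-pattern approach, the following sketch collects the matches into a map. The class name AliasParser and the choice of a LinkedHashMap are assumptions made for the example; the question only says "do something" with each match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class AliasParser {
    // Compiled once when the class is loaded and reused for every line.
    private static final Pattern ALIAS =
            Pattern.compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

    // Returns alias -> IP pairs in the order they appear on the line.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ALIAS.matcher(line);
        while (m.find()) {
            result.put(m.group(1), m.group(2)); // group 1: alias, group 2: IP
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "SA \"239.255.252.1\", SB \"239.255.252.2\", "
                + "SC \"239.255.252.3\", SD \"239.255.252.4\"";
        System.out.println(parse(line));
    }
}
```

Pattern instances are immutable and safe to share between threads; the Matcher objects created from them are not.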

Update: I spent some time on RegExr to create a much more specific pattern, which as a bonus should only identify valid IP addresses. I know this is ugly as hell, but I assume it is quite efficient, as it eliminates most of the backtracking:

 ([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
 {3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))

(Wrapped for readability; all backslashes must be escaped in Java, but you can test it on RegExr as is, with the OP's test string.)

dogbane · Answer 3 · 2010-09-29T09:38:46+0000

You can improve the regular expression to: "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying the IP address more explicitly.

Try using StringTokenizer. It does not use regular expressions. (If you are concerned about using a legacy class, take a look at its source and see how it is done.)

 StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
 while (st.hasMoreTokens()) {
     String key = st.nextToken();
     String ip = st.nextToken();
     System.out.println(key + " ip: " + ip);
 }

Tim Pietzcker · Answer 4 · 2010-09-29T10:00:01+0000

I don't know if this will bring a big performance advantage, but you could also first do

 string.split(", ") // separate groups

and then

 string.split(" ?\"") // separate alias from IP address

on each of the resulting groups.
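The two-step split could be sketched as follows. The class name SplitParser and the map return type are assumptions for illustration. Keep in mind that String.split takes a regular expression itself, so this variant is not regex-free and would need measuring before assuming it is faster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative only.
class SplitParser {
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        // Step 1: separate the comma-delimited groups.
        for (String group : line.split(", ")) {
            // Step 2: separate the alias from the quoted IP address.
            // For 'SA "239.255.252.1"' this yields ["SA", "239.255.252.1"].
            String[] parts = group.split(" ?\"");
            if (parts.length >= 2) {
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "SA \"239.255.252.1\", SB \"239.255.252.2\"";
        System.out.println(parse(line));
    }
}
```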

Stephen C · Answer 5 · 2010-09-29T11:13:28+0000

Precompiling and reusing the Pattern object is (IMO) likely to be the most effective step. Compiling a pattern is potentially expensive.

Reusing a Matcher instance (e.g. using reset(CharSequence) ) may help, but I doubt it will make a big difference.
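A minimal sketch of reusing one Matcher through reset(CharSequence) might look like this. The class ReusedMatcher and its method are invented for the example. Because a Matcher holds mutable state, an instance like this must not be shared between threads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class ReusedMatcher {
    private static final Pattern ALIAS =
            Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

    // One Matcher per instance, created against an empty input.
    private final Matcher matcher = ALIAS.matcher("");

    // Counts alias definitions in the line, reusing the same Matcher.
    int countAliases(String line) {
        matcher.reset(line); // point the existing Matcher at the new input
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ReusedMatcher counter = new ReusedMatcher();
        System.out.println(counter.countAliases("SA \"239.255.252.1\", SB \"239.255.252.2\""));
        System.out.println(counter.countAliases("SC \"239.255.252.3\""));
    }
}
```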

The regular expression itself cannot be optimized significantly. One possible speedup would be to replace (\d+\.\d+\.\d+\.\d+) with ([0-9\.]+) . This may help, as it reduces the number of potential backtracking points ... but you will need to do some experimentation to be sure. And the obvious drawback is that it also matches character sequences that are not valid IP addresses.
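The drawback can be seen with a small sketch that runs both variants against a malformed address. The class and method names are hypothetical, and the snippet says nothing about speed, which would still need to be measured.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LoosePattern {
    // Original pattern: four dot-separated digit runs.
    static final Pattern STRICT =
            Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");
    // Looser variant: any run of digits and dots between the quotes.
    static final Pattern LOOSE =
            Pattern.compile("(\\S+)\\s+\"([0-9.]+)\"");

    static boolean strictFinds(String input) {
        return STRICT.matcher(input).find();
    }

    static boolean looseFinds(String input) {
        return LOOSE.matcher(input).find();
    }

    public static void main(String[] args) {
        String good = "SA \"239.255.252.1\"";
        String bad = "SA \"1..2\"";
        System.out.println(strictFinds(good) + " " + looseFinds(good)); // true true
        System.out.println(strictFinds(bad) + " " + looseFinds(bad));   // false true
    }
}
```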

splash · Answer 6 · 2010-09-29T11:42:13+0000

If you notice a difference of <1 second from this piece of code, then your input line must contain around a million (or at least some 100 thousand) entries. I think that is pretty good performance, and I can't see how you could optimize it significantly without writing your own specialized parser.
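For comparison, a hand-written parser for this exact line format could be sketched as below. It assumes every entry has the form alias, whitespace, quoted value, optional comma, and it does no validation of the IP address. The class name HandParser is made up for the example, and whether it is actually faster than the regex would have to be measured.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; assumes the line format shown in the question.
class HandParser {
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        int pos = 0;
        while (true) {
            int open = line.indexOf('"', pos);       // opening quote of the IP
            if (open < 0) {
                break;                               // no further entries
            }
            int close = line.indexOf('"', open + 1); // closing quote of the IP
            if (close < 0) {
                break;                               // unterminated entry: stop
            }
            String alias = line.substring(pos, open).trim();
            String ip = line.substring(open + 1, close);
            result.put(alias, ip);
            pos = close + 1;
            // Skip the separating comma, if there is one.
            if (pos < line.length() && line.charAt(pos) == ',') {
                pos++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "SA \"239.255.252.1\", SB \"239.255.252.2\"";
        System.out.println(parse(line));
    }
}
```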
