Java regex offers any performance benefit?

In Java, when we try to match patterns using a regular expression. for example, take an input line and use a regular expression to find out if it is numeric. If not, throw an exception. In this case, I understand that using regex makes the code less verbose than if we took each character of the string, check if it is a number and if it does not throw an exception.

But I was on the assumption that regular expression also makes the process more efficient. It's true? I can not find any evidence on this. How does regex perform a match backstage? Isn't it iterating over a line and checking each character one by one?

+8
java performance regex
source share
8 answers

Just for fun, I ran this micro test. The results of the last run (i.e., Post-JVM warm-up / JIT) are lower (the results in any case are quite consistent from one run to another):

regex with numbers 123 chars with numbers 33 parseInt with numbers 33 regex with words 123 chars with words 34 parseInt with words 733 

In other words, characters are very efficient, Integer.parseInt is as efficient as char IF a string is a number, but terribly slow if the string is not a number. Regex is in between.

Conclusion

If you parse a string into a number and you expect the string to be the whole number, using Integer.parseInt is the best solution (efficient and readable). The penalty you get when a string is not a number should be low if it is not too frequent.

ps: my regex may not be optimal, feel free to comment.

 public class TestNumber { private final static List<String> numbers = new ArrayList<>(); private final static List<String> words = new ArrayList<>(); public static void main(String args[]) { long start, end; Random random = new Random(); for (int i = 0; i < 1000000; i++) { numbers.add(String.valueOf(i)); words.add(String.valueOf(i) + "x"); } for (int i = 0; i < 5; i++) { start = System.nanoTime(); regex(numbers); System.out.println("regex with numbers " + (System.nanoTime() - start) / 1000000); start = System.nanoTime(); chars(numbers); System.out.println("chars with numbers " + (System.nanoTime() - start) / 1000000); start = System.nanoTime(); exception(numbers); System.out.println("exceptions with numbers " + (System.nanoTime() - start) / 1000000); start = System.nanoTime(); regex(words); System.out.println("regex with words " + (System.nanoTime() - start) / 1000000); start = System.nanoTime(); chars(words); System.out.println("chars with words " + (System.nanoTime() - start) / 1000000); start = System.nanoTime(); exception(words); System.out.println("exceptions with words " + (System.nanoTime() - start) / 1000000); } } private static int regex(List<String> list) { int sum = 0; Pattern p = Pattern.compile("[0-9]+"); for (String s : list) { sum += (p.matcher(s).matches() ? 1 : 0); } return sum; } private static int chars(List<String> list) { int sum = 0; for (String s : list) { boolean isNumber = true; for (char c : s.toCharArray()) { if (c < '0' || c > '9') { isNumber = false; break; } } if (isNumber) { sum++; } } return sum; } private static int exception(List<String> list) { int sum = 0; for (String s : list) { try { Integer.parseInt(s); sum++; } catch (NumberFormatException e) { } } return sum; } } 
+4
source share

I don't have a technical answer yet, but I could write the code and see. I don't think regular expressions would be a way of converting a string to a number. In many cases, they can be more effective, but if they are poorly written, it will be slow.

May I ask why you are not using: Integer.parseInt("124") ? This will throw a NumberFormatException. Should be able to handle this, and it leaves the number definition up to basic Java.

+3
source share

About regex behind the scenes ...

A finite state machine (FSM) is equivalent to a regular expression. FSM is a machine that can recognize a language (in your case, numbers). FSM has an alphabet, states, an initial state, N-terminal states, and transition functions from one state to another. The string must contain the alphabet (for example, ASCII). FSM starts in the initial state. When you enter a string, the char to char process goes from state to state depending on the state of the function (state, char) =>. When it reaches the final state, you know if the string is numeric or not.

See FSM and Automata-based_programming for details.

+1
source share

I do not see how it could be simpler or easier to read than:

Integer.parseInt()

or

Double.parseDouble()

They do exactly what you described, including throwing exceptions for invalid input.

Regarding performance: I expect the regex to be less efficient than the above.

+1
source share

Only my 5 cents :) In general, the regular expression language is not intended only for parsing integers or strings. This is a pretty powerful tool that allows you to recognize any "regular expression". It reminds me of my university time (remember the course of the theory of automatism? :), but here is a link that describes that in fact ordinary language

Now, since he creates FSM, he introduces some overhead, so maybe for Integer.parseInt engine is not a good replacement, moreover, Java introduced a more specific API. However, regular expressions have an advantage when working with more complex expressions and when there are many of them.

A regular expression should be used wisely. The template should always be compiled (otherwise it cannot be reused efficiently, since compiling the template each time reduces performance)

I would suggest running a test on a more complex input and see what happens.

+1
source share

Well, it's hard to say for sure, but in general, regular expressions are less likely to be more efficient than explicit character checking. REs are finite state machines; therefore, there is some overhead for the creation and maintenance of machines. In my practice, explicit code is always faster (and therefore more efficient) than regular expressions.

But here is the dilemma. Regular expressions are almost always more effective in terms of time to delivery and are more readable when used properly. And here is another dilemma. I rarely see the correct use of regular expressions ...

In your scenario, I suggest using the guava library:

 boolean isValid = DIGIT.matchesAllOf("1234"); 
0
source share

In the end, it really does iterate over the string and check each character trying to find a match for the provided pattern. In addition, it uses reverse tracking (if there are many ways that could match, the engine will try all of them), which can lead to very poor performance for some unusual cases (you are unlikely to come across this, but theoretically possible). In the worst case, the performance of the java regex engine is O (2 N ), where N is the length of the input string.

Algorithms exist for faster pattern matching, providing O (N) performance, but with fewer features than Java regular expressions.

Here is an article discussing this issue in detail.

But in most cases, the regex engine will not be the performance bottleneck in your application. It's fast enough, so usually donโ€™t worry about it unless your profiler points it out. And this gives a declarative description of the algorithm, which is very useful because almost always the implementation of the iterative algorithm will be much more verbose and much less readable.

0
source share

To answer your question specifically:

Why don't you apply the regular expression pattern matching to some complex text, and then try to write the same matching code yourself.

See which is faster.

Answer: Regular expression.

0
source share

All Articles