Efficient parsing of substring integers in Java

Question

Efficient parsing of substring integers in Java

AFAIK there is no efficient way in standard Java libraries for parsing an integer from a substring without actually creating a new string containing the substring.

I am in a situation where I process millions of integers from strings, and I don’t really want to create new strings for each substring. Copy overhead I do not need.

Given the string s, I need a method like:

parseInteger(s, startOffset, endOffset)

with semantics like:

 Integer.parseInt(s.substring(startOffset, endOffset))

Now I know that I can do this quite trivially:

 public static int parse(String s, int start, int end) { long result = 0; boolean foundMinus = false; while (start < end) { char ch = s.charAt(start); if (ch == ' ') /* ok */; else if (ch == '-') { if (foundMinus) throw new NumberFormatException(); foundMinus = true; } else if (ch < '0' || ch > '9') throw new NumberFormatException(); else break; ++start; } if (start == end) throw new NumberFormatException(); while (start < end) { char ch = s.charAt(start); if (ch < '0' || ch > '9') break; result = result * 10 + (int) ch - (int) '0'; ++start; } while (start < end) { char ch = s.charAt(start); if (ch != ' ') throw new NumberFormatException(); ++start; } if (foundMinus) result *= -1; if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE) throw new NumberFormatException(); return (int) result; }

But this is not so. I would rather get this from a trusted, supported third-party library. For example, parsing lengths and correctly processing Long.MIN_VALUE is a bit subtle, and I cheat on above by parsing ints in longs. And above there is still an overflow problem if the integer integer is greater than Long.MAX_VALUE.

Is there such a library?

My search has changed a bit.

+7

java string int parsing

Barry kelly Oct 15 '13 at 10:17

source share

3 answers

Have you profiled your application? Do you have a source of your problem?

Since Strings are immutable, there is a good chance that very little memory is required and very few operations are performed to create a substring.

If you really have problems with memory, garbage collection, etc., just use the substring method. Do not look for complex solutions to problems that you do not have.

In addition: if you implement something yourself, you can lose more than you win in terms of efficiency. Your code does a lot and is quite complicated - as for the default implementation, however you can be absolutely sure that it is relatively fast. And no mistakes.

+5

Dariusz Oct 15 '13 at 12:17

source share

I could not resist to measure the improvement of your method:

 package test; public class TestIntParse { static final int MAX_NUMBERS = 10000000; static final int MAX_ITERATIONS = 100; public static void main(String[] args) { long timeAvoidNewStrings = 0; long timeCreateNewStrings = 0; for (int i = 0; i < MAX_ITERATIONS; i++) { timeAvoidNewStrings += test(true); timeCreateNewStrings += test(false); } System.out.println("Average time method 'AVOID new strings': " + (timeAvoidNewStrings / MAX_ITERATIONS) + " ms"); System.out.println("Average time method 'CREATE new strings': " + (timeCreateNewStrings / MAX_ITERATIONS) + " ms"); } static long test(boolean avoidStringCreation) { long start = System.currentTimeMillis(); for (int i = 0; i < MAX_NUMBERS; i++) { String value = Integer.toString((int) Math.random() * 100000); int intValue = avoidStringCreation ? parse(value, 0, value.length()) : parse2(value, 0, value.length()); String value2 = Integer.toString(intValue); if (!value2.equals(value)) { System.err.println("Error at iteration " + i + (avoidStringCreation ? " without" : " with") + " string creation: " + value + " != " + value2); } } return System.currentTimeMillis() - start; } public static int parse2(String s, int start, int end) { return Integer.valueOf(s.substring(start, end)); } public static int parse(String s, int start, int end) { long result = 0; boolean foundMinus = false; while (start < end) { char ch = s.charAt(start); if (ch == ' ') /* ok */; else if (ch == '-') { if (foundMinus) throw new NumberFormatException(); foundMinus = true; } else if (ch < '0' || ch > '9') throw new NumberFormatException(); else break; ++start; } if (start == end) throw new NumberFormatException(); while (start < end) { char ch = s.charAt(start); if (ch < '0' || ch > '9') break; result = result * 10 + ch - '0'; ++start; } while (start < end) { char ch = s.charAt(start); if (ch != ' ') throw new NumberFormatException(); ++start; } if (foundMinus) result *= -1; if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE) throw new NumberFormatException(); return (int) result; } }

Results:

 Average time method 'AVOID new strings': 432 ms Average time method 'CREATE new strings': 500 ms

Your method is about 14% more efficient over time and presumably in memory, although quite complex (and error prone). From my point of view, your approach does not pay off, although it may be in your case.

+2

Fernando miguélez Oct 15 '13 at 13:34

source share

Thomas · Accepted Answer · 2013-10-15T12:13:02+0000

Don't worry too much about objects unless you run into performance issues. Use the current JVM, there are continuous improvements in terms of performance and memory size.

You can take a look at "ByteString" from the Google protocol buffers if you want the substring to use the following line:

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString#substring%28int,%20int%29

Efficient parsing of substring integers in Java

More articles: