Java: the fastest way to check string size

Question

I have the following code inside a loop.
On each iteration, strings are appended to sb (a StringBuilder), and I check whether sb has reached 5 MB.

    if (sb.toString().getBytes("UTF-8").length >= 5242880) {
        // Do something
    }

This works, but the size check is very slow.
What would be the fastest way to do this?

+7

java java-8 utf-8

d -_- b Apr 24 '17 at 11:12

3 answers

If you loop 1000 times, you will generate 1000 Strings and then convert each one to a UTF-8 byte array just to get its length.

I would reduce the conversion work by computing the initial length once. Then, on each iteration, get only the length of the value being added, so that updating the total is just an addition.

    int length = sb.toString().getBytes("UTF-8").length;
    for (String s : list) {
        sb.append(s);
        length += s.getBytes("UTF-8").length;
        if (...) {
            ...
        }
    }

This will reduce memory usage and conversion cost.

+8

Axelh Apr 24 '17 at 11:17

Consider using ByteArrayOutputStream and OutputStreamWriter instead of StringBuilder. Use ByteArrayOutputStream.size() to check the size.
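The suggestion above could look roughly like the following. This is only a sketch under assumptions: the class name ByteBufferSketch, the method fragmentsUntil and the fragment list are hypothetical, and the writer is flushed after every fragment because OutputStreamWriter buffers internally, so size() would otherwise lag behind.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ByteBufferSketch {

    /**
     * Writes fragments as UTF-8 and returns how many fragments had been
     * written when the encoded size first reached limit, or -1 if it never did.
     */
    static int fragmentsUntil(List<String> fragments, int limit) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
        int count = 0;
        try {
            for (String s : fragments) {
                writer.write(s);
                writer.flush(); // push buffered characters through so size() is current
                count++;
                if (bytes.size() >= limit) {
                    return count; // the question's "Do something" would go here
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream, but write/flush declare it.
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    public static void main(String[] args) {
        // "abc" is 3 bytes and "d\u00e9f" is 4 bytes in UTF-8, so 7 is reached after 2 fragments.
        System.out.println(fragmentsUntil(Arrays.asList("abc", "d\u00e9f", "ghi"), 7));
    }
}
```

The data is encoded once as it is written, so the size check itself is a cheap field read; the per-fragment flush is the price for an exact count.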

+2

Maurice Perry Apr 24 '17 at 13:38

Holger · Accepted Answer · 2017-04-24T13:01:07+0000

You can calculate the UTF-8 length quickly, without encoding anything, using

    public static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp<=0x7ff? cp<=0x7f? 1: 2: cp<=0xffff? 3: 4)
            .sum();
    }

If ASCII characters dominate the content, it might be slightly faster to use

    public static int utf8Length(CharSequence cs) {
        return cs.length()
             + cs.codePoints().filter(cp -> cp>0x7f).map(cp -> cp<=0x7ff? 1: 2).sum();
    }

instead.
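As a quick sanity check, both variants can be compared against the length that getBytes reports. The sketch below is not part of the answer; the class name Utf8LengthCheck, the method name utf8LengthAsciiHeavy (renamed so both variants can coexist) and the sample strings are chosen here for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthCheck {

    // First variant: 1 to 4 bytes depending on the code point's range.
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Second variant: start from the char count and add only the extra bytes
    // needed by non-ASCII code points.
    static int utf8LengthAsciiHeavy(CharSequence cs) {
        return cs.length()
             + cs.codePoints().filter(cp -> cp > 0x7f).map(cp -> cp <= 0x7ff ? 1 : 2).sum();
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",        // ASCII only
            "h\u00e9llo",   // one 2-byte character (U+00E9)
            "\u20ac",       // one 3-byte character (U+20AC)
            "\uD83D\uDE00"  // one 4-byte character (U+1F600, a surrogate pair)
        };
        for (String s : samples) {
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(encoded + " " + utf8Length(s) + " " + utf8LengthAsciiHeavy(s));
        }
    }
}
```

For well-formed strings such as these samples, the three numbers on each line should agree.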

But you should also consider the optimization of not recalculating the entire size, and instead calculating only the size of the new fragment you are adding to the StringBuilder, something like

    StringBuilder sb = new StringBuilder();
    int length = 0;
    for(…; …; …) {
        String s = … //calculateNextString();
        sb.append(s);
        length += utf8Length(s);
        if(length >= 5242880) {
            // Do something

            // in case you're flushing the data:
            sb.setLength(0);
            length = 0;
        }
    }

This assumes that if you append fragments containing surrogate pairs, the pairs are always complete and never split in half. For ordinary applications, this should always be the case.
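To illustrate why that assumption matters, here is a small sketch (the class name SplitSurrogateDemo is hypothetical) that measures one supplementary character both as a whole string and as two fragments, each holding half of the surrogate pair. When measured alone, each half is seen as a code point in the three-byte range, so the incremental total overcounts.

```java
final class SplitSurrogateDemo {

    // Same calculation as the first utf8Length variant above.
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 as a complete surrogate pair

        int whole = utf8Length(emoji); // 4: one supplementary code point
        int halves = utf8Length(emoji.substring(0, 1))  // 3: lone high surrogate
                   + utf8Length(emoji.substring(1));    // 3: lone low surrogate

        System.out.println(whole + " vs " + halves); // prints "4 vs 6"
    }
}
```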

An additional optimization suggested by Didier-L is to postpone the calculation until the length of your StringBuilder reaches the threshold divided by three, because before that point the UTF-8 length cannot exceed the threshold. However, this only helps if some runs never reach threshold / 3.
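One way the delayed check could be combined with the incremental count is sketched below. It relies on the fact that a single char never needs more than three UTF-8 bytes (a surrogate pair is two chars for four bytes). The class name DelayedCheckSketch, the method fragmentsUntil and its return convention are assumptions made for this illustration, not part of the answer.

```java
import java.util.Arrays;

final class DelayedCheckSketch {

    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    /**
     * Appends fragments and returns how many had been appended when the UTF-8
     * size first reached threshold, or -1 if it never did.
     */
    static int fragmentsUntil(Iterable<String> fragments, int threshold) {
        StringBuilder sb = new StringBuilder();
        int byteLength = -1; // -1 means "not computed yet"
        int count = 0;
        for (String s : fragments) {
            sb.append(s);
            count++;
            if (sb.length() < threshold / 3) {
                continue; // at most 3 bytes per char: the threshold cannot be reached yet
            }
            // First time past threshold / 3: measure everything once.
            // Afterwards: add only the new fragment.
            byteLength = (byteLength < 0) ? utf8Length(sb) : byteLength + utf8Length(s);
            if (byteLength >= threshold) {
                return count;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Each fragment is 2 chars and 6 bytes; a threshold of 12 is reached after 2 fragments.
        System.out.println(fragmentsUntil(Arrays.asList("\u20ac\u20ac", "\u20ac\u20ac", "\u20ac\u20ac"), 12));
    }
}
```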
