Java: fastest way to check string size

I have the following code inside a loop.
On each iteration a string is appended to sb (a StringBuilder), and I then check whether sb has reached 5 MB.

    if (sb.toString().getBytes("UTF-8").length >= 5242880) {
        // Do something
    }

This works, but the size check is very slow.
What would be the fastest way to do this?

java java-8 utf-8
3 answers

You can compute the UTF-8 length quickly, without actually encoding the string, using

    public static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp<=0x7ff? cp<=0x7f? 1: 2: cp<=0xffff? 3: 4)
                 .sum();
    }

If ASCII characters dominate the content, it might be slightly faster to use

    public static int utf8Length(CharSequence cs) {
        return cs.length()
             + cs.codePoints().filter(cp -> cp>0x7f).map(cp -> cp<=0x7ff? 1: 2).sum();
    }

instead.
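As a sanity check, here is a minimal sketch that compares both variants against the byte count produced by real encoding. The class name Utf8LengthCheck, the method name utf8LengthAsciiBiased and the sample strings are made up for illustration; only the two method bodies come from the answer.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthCheck {
    // Variant 1: classify every code point by its encoded size (1 to 4 bytes).
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Variant 2: start from the char count and add only the extra bytes
    // needed by non-ASCII code points.
    static int utf8LengthAsciiBiased(CharSequence cs) {
        return cs.length()
             + cs.codePoints().filter(cp -> cp > 0x7f).map(cp -> cp <= 0x7ff ? 1 : 2).sum();
    }

    public static void main(String[] args) {
        // Hypothetical samples: empty, ASCII, 2-byte, 3-byte and 4-byte code points.
        String[] samples = { "", "plain ASCII", "caf\u00e9", "\u20ac100", "smile \uD83D\uDE00" };
        for (String s : samples) {
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            // All three numbers on a line should be equal for well-formed strings.
            System.out.println(encoded + " " + utf8Length(s) + " " + utf8LengthAsciiBiased(s));
        }
    }
}
```

The second variant works because a char that is part of a surrogate pair already contributes 1 to cs.length(), so a 4-byte code point needs only 2 extra bytes on top of its 2 chars.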

But you should also consider a bigger optimization: instead of recalculating the size of the entire content each time, calculate only the size of the new fragment appended to the StringBuilder, something like

    StringBuilder sb = new StringBuilder();
    int length = 0;
    for(…; …; …) {
        String s = … //calculateNextString();
        sb.append(s);
        length += utf8Length(s);
        if(length >= 5242880) {
            // Do something

            // in case you're flushing the data:
            sb.setLength(0);
            length = 0;
        }
    }

This assumes that appended fragments never split a surrogate pair, i.e. any surrogate pairs they contain are complete. For ordinary applications, this should always hold.
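To see why that assumption matters, here is a small sketch (the class name SurrogateSplitDemo is made up for illustration) that counts a supplementary character once as a whole pair and once split across two fragments. codePoints() reports an unpaired surrogate as its own code point in the 3-byte range, so the split fragments add up to 3 + 3 = 6 instead of the correct 4.

```java
final class SurrogateSplitDemo {
    // Same counting logic as the utf8Length method in this answer.
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";        // U+1F600: one code point, two chars
        String high = emoji.substring(0, 1);  // lone high surrogate
        String low = emoji.substring(1);      // lone low surrogate

        // Whole pair counted in one fragment.
        System.out.println(utf8Length(emoji));
        // Pair split across two fragments: each half is counted as a 3-byte code point.
        System.out.println(utf8Length(high) + utf8Length(low));
    }
}
```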

An additional optimization, suggested by Didier-L, is to postpone the calculation until the length of your StringBuilder reaches the threshold divided by three, since before that point the UTF-8 length cannot reach the threshold (each char encodes to at most three bytes). However, this is only beneficial if some runs never reach threshold / 3.
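A minimal sketch of that deferral, applied to the variant that scans the whole StringBuilder. The class name DeferredSizeCheck and the helper reached are hypothetical names introduced here, not part of the original suggestion.

```java
final class DeferredSizeCheck {
    static final int THRESHOLD = 5 * 1024 * 1024; // 5242880 bytes, as in the question

    // Same counting logic as the utf8Length method in this answer.
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Hypothetical helper: true once the UTF-8 size of sb has reached limit.
    static boolean reached(StringBuilder sb, int limit) {
        // A char never needs more than 3 bytes in UTF-8, so with fewer than
        // limit / 3 chars the encoded size must still be below limit.
        if (sb.length() < limit / 3) {
            return false; // cheap exit, no scan needed
        }
        return utf8Length(sb) >= limit;
    }
}
```

Inside the loop the check would then read if (DeferredSizeCheck.reached(sb, DeferredSizeCheck.THRESHOLD)) { ... }. The cheap length comparison skips the scan entirely while the builder is still small.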


If you loop 1000 times, you create 1000 Strings and convert each of them to a UTF-8 byte array just to get its length.

I would reduce the conversion work by computing the initial length once and keeping it. Then, on each iteration, compute only the length of the appended value and add it to the running total.

    int length = sb.toString().getBytes("UTF-8").length;
    for(String s : list){
        sb.append(s);
        length += s.getBytes("UTF-8").length;
        if(...){
            ...
        }
    }

This will reduce memory usage and conversion cost.


Consider using a ByteArrayOutputStream wrapped in an OutputStreamWriter instead of a StringBuilder. Use ByteArrayOutputStream.size() to check the size.
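A rough sketch of how that could look. The class name ByteBufferAccumulator, the method collect and its parameters are assumptions made for illustration. One detail to watch: OutputStreamWriter buffers encoded bytes internally, so the writer has to be flushed before size() reflects what was written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ByteBufferAccumulator {
    // Hypothetical helper: writes fragments until the encoded size reaches
    // limit, then returns the UTF-8 bytes collected so far.
    static byte[] collect(Iterable<String> fragments, int limit) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
        try {
            for (String s : fragments) {
                writer.write(s);
                // Push the encoder's buffered bytes into the stream so that
                // size() is up to date before it is checked.
                writer.flush();
                if (bytes.size() >= limit) {
                    break; // "Do something" would go here
                }
            }
            writer.flush();
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With this approach the data is already encoded when the threshold is hit, so no separate length calculation or later getBytes call is needed.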

