You can quickly calculate the UTF-8 encoded length of a string using
public static int utf8Length(CharSequence cs) { return cs.codePoints().map(cp -> cp<=0x7ff? cp<=0x7f? 1: 2: cp<=0xffff? 3: 4).sum(); }
If ASCII characters dominate the content, it might be slightly faster to use
public static int utf8Length(CharSequence cs) { return cs.length() + cs.codePoints().filter(cp -> cp>0x7f).map(cp -> cp<=0x7ff? 1: 2).sum(); }
instead.
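As a sanity check, both variants can be compared against the byte count that the JDK's own encoder produces. The following is a minimal sketch; the class name `Utf8LengthCheck`, the method name `utf8LengthAsciiBiased`, and the sample strings are assumptions made for illustration and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthCheck {
    // First variant: 1 to 4 bytes per code point.
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Second variant: start from the char count and add only the extra bytes
    // needed by non-ASCII code points (hypothetical name for illustration).
    static int utf8LengthAsciiBiased(CharSequence cs) {
        return cs.length()
             + cs.codePoints().filter(cp -> cp > 0x7f).map(cp -> cp <= 0x7ff ? 1 : 2).sum();
    }

    public static void main(String[] args) {
        // ASCII, 2-byte, 3-byte and 4-byte (surrogate pair) samples.
        String[] samples = { "", "hello", "na\u00efve", "\u20ac", "\ud83d\ude00", "a\u00e9\u20ac\ud83d\ude00" };
        for (String s : samples) {
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            if (utf8Length(s) != expected || utf8LengthAsciiBiased(s) != expected) {
                throw new AssertionError("mismatch for: " + s);
            }
        }
        System.out.println("all samples match");
    }
}
```

Note that the samples contain only well-formed text. For unpaired surrogates the results would differ from `getBytes`, which substitutes a replacement byte for them.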
But you may also consider the optimization potential of not recalculating the entire size, but only the size of the new fragment you are appending to the StringBuilder, something like
StringBuilder sb = new StringBuilder(); int length = 0; for(…; …; …) { String s = …
This assumes that if you append fragments containing surrogate pairs, the pairs are always complete and never split in half between two fragments. For ordinary applications, this should always be the case.
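The incremental bookkeeping described above could look roughly like the following. This is a sketch under assumptions, not the original loop: the class `IncrementalLengthSketch`, the method `chunk`, the `threshold` parameter, and the choice to "flush" by collecting the buffer contents into a list are all hypothetical stand-ins for whatever the real application does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalLengthSketch {
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Hypothetical helper: appends fragments to a StringBuilder and "flushes"
    // (here: stores the buffer in a list) once the UTF-8 length reaches threshold.
    static List<String> chunk(Iterable<String> fragments, int threshold) {
        List<String> flushed = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int length = 0; // running UTF-8 length of sb's content
        for (String s : fragments) {
            sb.append(s);
            length += utf8Length(s); // only the new fragment is measured
            if (length >= threshold) {
                flushed.add(sb.toString());
                sb.setLength(0);
                length = 0;
            }
        }
        if (sb.length() > 0) {
            flushed.add(sb.toString()); // remainder below the threshold
        }
        return flushed;
    }

    public static void main(String[] args) {
        // "abc" is 3 bytes, two euro signs are 6 bytes, so the first flush
        // happens after the second fragment when the threshold is 5.
        System.out.println(chunk(Arrays.asList("abc", "\u20ac\u20ac", "x", "yz"), 5).size());
    }
}
```

Each fragment is scanned exactly once, so the total cost stays linear in the amount of text appended, regardless of how large the buffer grows between flushes.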
An additional optimization, suggested by Didier-L, is to postpone the calculation until your StringBuilder reaches a length of the threshold divided by three, since before that point it is impossible for the UTF-8 length to exceed the threshold (each char encodes to at most three bytes). However, this only pays off if some executions never reach threshold / 3.
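One possible way to combine the postponement with the incremental counting is sketched below. As before, the class `DeferredLengthSketch`, the method `chunk`, the `threshold` parameter, and the list-based "flush" are assumptions for illustration. The sentinel value `-1` marks that the UTF-8 length has not been computed yet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeferredLengthSketch {
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
                 .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
                 .sum();
    }

    // Hypothetical helper: same chunking as plain incremental counting, but
    // the UTF-8 length is not computed while sb.length() < threshold / 3.
    static List<String> chunk(Iterable<String> fragments, int threshold) {
        List<String> flushed = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int length = -1; // -1: not computed yet for the current buffer
        for (String s : fragments) {
            sb.append(s);
            if (length < 0) {
                // Each char needs at most 3 bytes, so the threshold is out of reach.
                if (sb.length() < threshold / 3) continue;
                length = utf8Length(sb); // catch up once over the whole buffer
            } else {
                length += utf8Length(s); // from now on, measure new fragments only
            }
            if (length >= threshold) {
                flushed.add(sb.toString());
                sb.setLength(0);
                length = -1;
            }
        }
        if (sb.length() > 0) {
            flushed.add(sb.toString()); // remainder below the threshold
        }
        return flushed;
    }

    public static void main(String[] args) {
        // With threshold 30 the buffer never reaches 10 chars, so utf8Length
        // is never called and everything ends up in a single final chunk.
        System.out.println(chunk(Arrays.asList("abc", "\u20ac\u20ac", "x", "yz"), 30).size());
    }
}
```

The catch-up call scans the buffered text once, so no character is measured more than once per flush cycle. The saving only materializes for buffers that are flushed or discarded before reaching `threshold / 3` chars.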
Holger