To expand on August's answer, I thought I should explain exactly how this compiles and what takes up space at the binary level.
To simplify, I will use the following example with shorter strings. The important part is that several of the strings are duplicates.
    public class StringArray {
        private static final String[] stringArray = {
            "AAA", "AAB", "AAA", "AAC", "AAA"
        };
    }

    public class LongString {
        private static final String longString =
            "AAA" + "AAB" + "AAA" + "AAC" + "AAA";
    }
Now, when compiling this code, you need to understand three important things.
- Concatenation of constant strings is performed at compile time. This is actually a special case of compile-time folding of constant expressions. You can find the exact rules for what counts as a constant expression in the Java Language Specification.
- Array initializers are syntactic sugar. The code is equivalent to creating an array and assigning the elements one at a time. (Note that this is specific to Java bytecode. Dalvik (i.e. Android) has special compact instructions for initializing arrays.)
- Inline field initializers are syntactic sugar. The code is equivalent to initializing the fields manually in a static initializer.
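The first point can be observed from ordinary Java code. The sketch below is only an illustration; the class and method names are made up for this example. A concatenation of literals is folded into a single constant, so it is the same interned object as the equivalent literal, while a concatenation performed at run time produces a new String.

```java
public class ConstantFolding {
    // Folded by the compiler into the single constant "AAAAAB".
    static final String FOLDED = "AAA" + "AAB";

    // Compares references: true because both sides are the same interned constant.
    static boolean foldedIsSameObject() {
        return FOLDED == "AAAAAB";
    }

    // The parameters are not constant expressions, so this concatenates at run time.
    static String concatAtRuntime(String a, String b) {
        return a + b;
    }

    // Compares references: false because the run-time result is a fresh object.
    static boolean runtimeIsSameObject() {
        return concatAtRuntime("AAA", "AAB") == "AAAAAB";
    }
}
```

Calling intern() on the run-time result would return the shared constant again, which is why the contents are equal even though the references differ.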
Edit: Another detail is that in the special case of static final fields initialized inline with a constant expression, the field is initialized through the ConstantValue attribute rather than in the static initializer, and all uses of it are inlined. So in the case of LongString, the third point will actually result in different bytecode, but since the constant pool entries for the string are the same, the space the strings occupy in the class file does not change.
Putting these together, the two example classes are equivalent to the following.
    public class StringArray {
        private static final String[] stringArray;

        static {
            String[] temp = new String[5];
            temp[0] = "AAA";
            temp[1] = "AAB";
            temp[2] = "AAA";
            temp[3] = "AAC";
            temp[4] = "AAA";
            stringArray = temp;
        }
    }

    public class LongString {
        private static final String longString;

        static {
            longString = "AAAAABAAAAACAAA";
        }
    }
Now, this desugared code still shows the duplicate strings several times in the array example. To understand the size of a class, you have to look at what it compiles to.
When you access a constant string, the bytecode contains a load constant instruction (ldc or ldc_w), which consists of the opcode followed by an index into the class's constant pool. The constant pool is a separate section of the class file that stores a list of constants. Naturally, the compiler stores each distinct constant only once.
So the bytecode for StringArray looks something like this (omitting a few details that are not relevant here). Note that only 3 unique strings are stored in the constant pool. (There are actually many more constant pool entries related to other parts of the class file, but they are not important here.)
    .class super StringArray
    .field static final private stringArray [Ljava/lang/String;

    .const [1] = String 'AAA'
    .const [2] = String 'AAB'
    .const [3] = String 'AAC'

    .method static <clinit> : ()V
        iconst_5
        anewarray java/lang/String
        astore_0

        aload_0
        iconst_0
        ldc [1]
        aastore

        aload_0
        iconst_1
        ldc [2]
        aastore

        aload_0
        iconst_2
        ldc [1]
        aastore

        aload_0
        iconst_3
        ldc [3]
        aastore

        aload_0
        iconst_4
        ldc [1]
        aastore

        aload_0
        putstatic StringArray stringArray [Ljava/lang/String;
        return
    .end method
While LongString looks something like this:
    .class super LongString
    .field static final private longString Ljava/lang/String;

    .const [1] = String 'AAAAABAAAAACAAA'

    .method static <clinit> : ()V
        ldc [1]
        putstatic LongString longString Ljava/lang/String;
        return
    .end method
So with the first version, each duplicate string is stored only once, while the second version has to store the entire concatenated string. So is the first version always better? Not so fast. It has the advantage of not storing duplicate strings, but as you may have noticed, there is a large per-element overhead in the array initialization. Which one is better depends on how long your strings are and how many duplicates you have.
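To make the tradeoff concrete, here is a rough estimator. It is a simplified sketch under stated assumptions, not an exact model of javac output: the class and method names are invented for this example, strings are assumed to be ASCII, every load is assumed to fit in the 2-byte ldc form, array indexes are assumed small enough for the 1-byte iconst instructions, and the fixed setup and field-store instructions are ignored. Each distinct string is charged 6 bytes of constant pool overhead (a 3-byte string entry plus a 3-byte header on the UTF8 entry) plus its length, and each array element is charged 5 bytes of bytecode, matching the listing above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClassSizeEstimate {
    // String entry (tag + 2-byte index) plus UTF8 entry header (tag + 2-byte length).
    static final int POOL_OVERHEAD_PER_STRING = 3 + 3;
    // aload_0 (1) + iconst_n (1) + ldc (2) + aastore (1); assumes small indexes.
    static final int BYTECODE_PER_ELEMENT = 5;

    // Estimated bytes for the array version: unique strings plus per-element code.
    static int arrayCost(List<String> strings) {
        Set<String> unique = new HashSet<>(strings);
        int pool = 0;
        for (String s : unique) {
            pool += POOL_OVERHEAD_PER_STRING + s.length(); // ASCII assumed: 1 byte per char
        }
        return pool + BYTECODE_PER_ELEMENT * strings.size();
    }

    // Estimated bytes for the single concatenated constant.
    static int concatenatedCost(List<String> strings) {
        int total = 0;
        for (String s : strings) {
            total += s.length();
        }
        return POOL_OVERHEAD_PER_STRING + total;
    }
}
```

Under these assumptions the five short strings from the example come to 52 bytes as an array and 21 bytes as one long string, so the long string wins. With five copies of one 100-character string the estimate flips to 131 bytes for the array against 506 for the long string.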
PS: At the binary level, constant strings are encoded with a modified UTF-8 encoding. The upshot is that characters 1-127 take one byte each, but the null character takes two bytes. So if your strings contain many null characters, you may be able to save some space by shifting everything up by 1.
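The byte counts can be checked from Java itself, since DataOutputStream.writeUTF uses the same modified UTF-8 encoding. This is a small sketch; the class and method names are made up for the example. writeUTF prefixes the data with a 2-byte length, which is subtracted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Size {
    // Returns the number of payload bytes used to encode s in modified UTF-8.
    static int encodedLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // writeUTF rejects strings whose encoding exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2; // drop the 2-byte length prefix
    }
}
```

With this helper, encodedLength("A") is 1 while encodedLength("\0") is 2, so a string of four null characters costs 8 bytes but only 4 after shifting each character up by 1.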