Difference between compact strings and compressed strings in Java 9

Question

What are the advantages of compact strings over compressed strings in JDK 9?

+59

java java-9

anonymous May 25 '17 at 10:38

4 answers

-XX:+UseCompressedStrings and Compact Strings are two different things.

UseCompressedStrings meant that strings containing only ASCII characters could be stored as a byte[] , but it was disabled by default. In jdk-9 a comparable optimization is enabled by default, not through that flag but built into String itself.

Before java-9, strings were stored internally as a char[] in UTF-16 encoding. From java-9 onwards they are stored as a byte[] . Why?

Because in ISO_LATIN_1 each character can be encoded in a single byte (8 bits) instead of the 16 bits used until now, 8 of which are never used for such characters. This only works for ISO_LATIN_1 , but that covers the majority of strings in use.

So this is done to save space.

Here is a small example that should make everything clearer:

    class StringCharVsByte {
        public static void main(String[] args) {
            String first = "first";
            String russianFirst = "первый"; // Russian for "first"
            char[] c1 = first.toCharArray();
            char[] c2 = russianFirst.toCharArray();

            // Print the most significant 8 bits of every char.
            for (char c : c1) {
                System.out.println(c >>> 8);
            }
            for (char c : c2) {
                System.out.println(c >>> 8);
            }
        }
    }

In the first case we get only zeros, which means that the most significant 8 bits of every char are zero; in the second case we get nonzero values, which means that at least one of the most significant 8 bits is set.

This means that if strings are stored internally as an array of chars, there are string literals that waste half of every char. It turns out that many applications waste a lot of space because of this.

Do you have a 10-character Latin-1 string? You just lost 80 bits, or 10 bytes. Compact strings were introduced to avoid this, and such strings no longer waste that space.
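To make the arithmetic concrete, here is a minimal sketch that checks whether a string fits in Latin-1 and how many bytes one-byte storage would save compared with a char[] . The class and method names ( Latin1Savings , fitsLatin1 , bytesSaved ) are made up for this illustration and are not part of the JDK.

```java
final class Latin1Savings {
    /** True if every UTF-16 code unit of s fits in one byte (0x00..0xFF). */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >>> 8 != 0) {
                return false;
            }
        }
        return true;
    }

    /** Bytes saved by one-byte storage compared with a char[] (two bytes per char). */
    static int bytesSaved(String s) {
        return fitsLatin1(s) ? s.length() : 0;
    }

    public static void main(String[] args) {
        // Ten Latin-1 characters: one byte saved per character.
        System.out.println(bytesSaved("0123456789"));
        // Cyrillic text needs both bytes of every char, so nothing is saved.
        System.out.println(bytesSaved("\u043f\u0435\u0440\u0432\u044b\u0439"));
    }
}
```

This only counts the payload of the backing array; object headers and alignment are ignored.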

Internally this also leads to some very nice things. To distinguish between LATIN1 and UTF-16 strings, there is a coder field:

    /**
     * The identifier of the encoding used to encode the bytes in
     * {@code value}. The supported values in this implementation are
     *
     * LATIN1
     * UTF16
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     */
    private final byte coder;

Based on this, length is now calculated differently:

    public int length() {
        return value.length >> coder();
    }

If our string is Latin-1 only, the coder is zero, so the length of value (the byte array) equals the number of chars. For a non-Latin-1 string the coder is one, so the array length is divided by two.
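The shift can be tried in isolation with a small sketch. The constants below mirror the values described above (0 for LATIN1, 1 for UTF16), but the class CoderLength is a stand-alone illustration, not the JDK code.

```java
final class CoderLength {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    /** Number of chars represented by a backing array with the given coder. */
    static int length(byte[] value, byte coder) {
        // Shifting right by 0 leaves the size unchanged; by 1 halves it.
        return value.length >> coder;
    }

    public static void main(String[] args) {
        byte[] latin1 = new byte[5];  // e.g. "first" stored as Latin-1
        byte[] utf16 = new byte[12];  // e.g. six Cyrillic chars stored as UTF-16
        System.out.println(length(latin1, LATIN1));
        System.out.println(length(utf16, UTF16));
    }
}
```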

+22

Eugene May 25 '17 at 10:57

Compact strings will have the best of both worlds.

As you can see from the definition in the OpenJDK documentation:

The new String class will store characters encoded either as ISO-8859-1 / Latin-1 (one byte per character) or UTF-16 (two bytes per character) based on the contents of the string. The encoding flag will indicate which encoding is being used.

As @Eugene already mentioned, most strings contain only Latin-1 characters, which need one byte per character, so they do not need the full two bytes per character that the previous String implementation reserved.

The new implementation of the String class moves from a UTF-16 char array to a byte array plus an encoding flag field. The additional field indicates whether the characters are stored in UTF-16 or Latin-1 format.

This also means that strings can still be stored in UTF-16 format when necessary. That is the main difference between the Java 6 compressed String and the Java 9 Compact String: the compressed String used a byte[] only for strings that were pure ASCII, and a char[] for everything else.
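The content-based choice between the two encodings can be sketched roughly as follows. This is an illustration under assumptions, not the JDK implementation: CompactEncoder , Encoded and encode are invented names, and the big-endian byte order used for UTF-16 here is an arbitrary choice for the sketch.

```java
final class CompactEncoder {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    /** Hypothetical holder for the backing bytes plus the encoding flag. */
    static final class Encoded {
        final byte[] value;
        final byte coder;

        Encoded(byte[] value, byte coder) {
            this.value = value;
            this.coder = coder;
        }
    }

    static Encoded encode(char[] chars) {
        boolean latin1 = true;
        for (char c : chars) {
            if (c > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            // One byte per char.
            byte[] value = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                value[i] = (byte) chars[i];
            }
            return new Encoded(value, LATIN1);
        }
        // Two bytes per char, high byte first (assumption made for this sketch).
        byte[] value = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            value[2 * i] = (byte) (chars[i] >>> 8);
            value[2 * i + 1] = (byte) chars[i];
        }
        return new Encoded(value, UTF16);
    }
}
```

Note that a single non-Latin-1 character is enough to push the whole string into the two-byte form.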

+7

Dhaval Simaria May 25 '17 at 11:14

Compressed Strings (-XX:+UseCompressedStrings)

This was an optional feature introduced in Java 6 Update 21 to improve SPECjbb performance by encoding US-ASCII-only Strings with one byte per character.

The feature could be enabled with an -XX flag ( -XX:+UseCompressedStrings ). When it was turned on, String.value was changed to an Object reference that pointed either to a byte[] , for strings containing only 7-bit US-ASCII characters, or to a char[] .

It was later removed in Java 7 because it was costly to maintain and hard to test.
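The Object-typed String.value described above can be pictured with a toy class like the one below. It is only a sketch of the idea; ToyCompressedString and its methods are hypothetical and do not reproduce the Java 6 sources.

```java
final class ToyCompressedString {
    // Declared as Object so it can hold either a byte[] or a char[].
    private final Object value;

    ToyCompressedString(String s) {
        char[] chars = s.toCharArray();
        boolean ascii = true;
        for (char c : chars) {
            if (c > 0x7F) {
                ascii = false;
                break;
            }
        }
        if (ascii) {
            // 7-bit US-ASCII only: keep one byte per character.
            byte[] bytes = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                bytes[i] = (byte) chars[i];
            }
            value = bytes;
        } else {
            value = chars;
        }
    }

    boolean isCompressed() {
        return value instanceof byte[];
    }

    /** Unpacks to a char[], the extra step that char[]-based operations need. */
    char[] inflate() {
        if (value instanceof char[]) {
            return ((char[]) value).clone();
        }
        byte[] bytes = (byte[]) value;
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) bytes[i]; // ASCII bytes are never negative
        }
        return chars;
    }
}
```

In this sketch a string such as "café" stays uncompressed, because é is Latin-1 but not 7-bit ASCII.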

Compact Strings

This is a new feature introduced in Java 9 to make String more memory-efficient.

Before Java 9, the String class stored characters in a char array, using two bytes for each character. From Java 9 on, the new String class stores characters in a byte[] , using either one byte per character or two bytes per character depending on the contents of the string, plus an encoding flag field. If all characters of the string fit in Latin-1 , one byte per character is used; otherwise the characters are stored as UTF-16 with two bytes each. The encoding flag indicates which encoding is being used.

0

Mohit Tyagi 04 Oct '17 at 12:24

Nicolai · Accepted Answer · 2017-05-25 11:22

Compressed strings (Java 6) and compact strings (Java 9) have the same motivation (strings are often effectively Latin-1, so half the space is wasted) and the same goal (to make those strings small), but the implementations differ a lot.

Compressed strings

In an interview, Aleksey Shipilëv (who was in charge of implementing the Java 9 feature) said this about compressed strings:

The UseCompressedStrings feature was rather conservative: while it distinguished between the char[] and byte[] cases and tried to compress the char[] into a byte[] on String construction, it performed most String operations on char[] , which required unpacking the String. Therefore, it benefited only a special type of workload, where most strings are compressible (so compression does not go to waste) and only a limited number of known String operations are performed on them (so no unpacking is needed). In a great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.
[...] The UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar , which was loaded once the VM option was supplied. Optional features are harder to test, since they double the number of option combinations to try.

Compact strings

In Java 9, on the other hand, compact strings are fully integrated into the JDK source. String is always backed by a byte[] , where characters use one byte if they are Latin-1 and two otherwise. Most operations check which case applies, e.g. charAt :

    public char charAt(int index) {
        if (isLatin1()) {
            return StringLatin1.charAt(value, index);
        } else {
            return StringUTF16.charAt(value, index);
        }
    }

Compact strings are enabled by default and can be partially disabled. "Partially" because strings are still backed by a byte[] , and operations that return chars must still assemble them from two separate bytes (because of intrinsics, it is hard to say whether this has a performance impact).
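Assembling a char from two separate bytes, as mentioned above, can be sketched like this. It is a simplified stand-in, not the StringLatin1 / StringUTF16 code: the class ToyCharAt is invented, and the high-byte-first layout is an assumption made for the example.

```java
final class ToyCharAt {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    static char charAt(byte[] value, byte coder, int index) {
        if (coder == LATIN1) {
            // Mask with 0xFF so bytes above 0x7F do not turn negative.
            return (char) (value[index] & 0xFF);
        }
        // UTF-16 case: combine two neighbouring bytes into one 16-bit char.
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};               // 'é' in Latin-1
        byte[] utf16 = {0x04, 0x3F};                 // 'п' (U+043F), high byte first
        System.out.println((int) charAt(latin1, LATIN1, 0));
        System.out.println((int) charAt(utf16, UTF16, 0));
    }
}
```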

More details

If you are interested in learning more about compact strings, I recommend reading the interview I linked above and/or watching this great talk by the same Aleksey Shipilëv (which also explains the new string concatenation).
