I can reproduce the problem using JDK 1.7.
The cause of the problem is a bug in the XML parser that ships with the JDK (in this case Xerces, located in the com.sun.org.apache.xerces.internal.* packages in rt.jar).
Emoji characters are outside the Unicode BMP and are therefore represented as two chars (a high and a low surrogate). When the parser encounters these surrogates, it handles them specially and checks whether they form a valid XML character once combined into a supplementary character.
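To make the surrogate representation concrete, here is a small sketch using only the standard java.lang.Character and String methods. The class name SurrogateDemo and the helper codePoints are made up for illustration, and U+1F600 is just an example emoji:

```java
public class SurrogateDemo {
    // Counts Unicode code points rather than UTF-16 code units (hypothetical helper).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the BMP
        char high = emoji.charAt(0);
        char low = emoji.charAt(1);

        System.out.println(emoji.length());                  // 2 (UTF-16 code units)
        System.out.println(codePoints(emoji));               // 1 (code point)
        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true
        // Combine the pair into the supplementary code point.
        System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // 1f600
    }
}
```

So a single emoji in an attribute value reaches the scanner as two separate chars, which is why it needs a dedicated code path.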
The relevant code is in XMLScanner.scanAttributeValue, in the following section:
    } else if (c != -1 && XMLChar.isHighSurrogate(c)) {
        if (scanSurrogates(fStringBuffer3)) {
            stringBuffer.append(fStringBuffer3);
            if (entityDepth == fEntityDepth && fNeedNonNormalizedValue) {
                fStringBuffer2.append(fStringBuffer3);
            }
The two chars of the emoji are scanned into the buffer variable fStringBuffer3 and then appended to the buffer for the attribute value. The problem is that fStringBuffer3 is not cleared. When the second emoji is parsed, the buffer still contains the old content, so the earlier characters are added a second time.
If you try to use an attribute value containing three or more emojis, you can clearly see how they accumulate.
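The accumulation can be illustrated with a simplified model. This is not the Xerces code, only a sketch assuming that the surrogate pair is appended to a reused scratch buffer that is never reset. The class SurrogateBufferBugDemo, the method scan and the clearScratch flag are hypothetical names:

```java
public class SurrogateBufferBugDemo {
    // Simplified model of the scanning loop; 'scratch' plays the role of fStringBuffer3.
    static String scan(String input, boolean clearScratch) {
        StringBuilder value = new StringBuilder();   // attribute value being built
        StringBuilder scratch = new StringBuilder(); // reused across surrogate pairs
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1));
            if (pair) {
                if (clearScratch) {
                    scratch.setLength(0); // the reset that the buggy path lacks
                }
                scratch.append(c).append(input.charAt(i + 1));
                value.append(scratch);    // appends everything still in scratch
                i += 2;
            } else {
                value.append(c);
                i++;
            }
        }
        return value.toString();
    }

    public static void main(String[] args) {
        String input = "\uD83D\uDE00\uD83D\uDE01\uD83D\uDE02"; // three emoji
        String buggy = scan(input, false);
        String fixed = scan(input, true);
        System.out.println(buggy.codePointCount(0, buggy.length())); // 6
        System.out.println(fixed.codePointCount(0, fixed.length())); // 3
    }
}
```

With three emoji A, B, C the uncleared variant yields A, then AB, then ABC, which is six code points instead of three.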