Java XML parser with emoji characters

The following code parses an XML file. I noticed that emoji characters are not processed correctly. In this example the attribute value in the input ends with a single emoji ( http://www.iemoji.com/view/emoji/693/people/revolving-hearts ), but that emoji appears twice in the output. Is this a known bug?

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class XmlTest {
        public static void main(String[] args) {
            DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
            domFactory.setValidating(false);
            File file = new File("c:\\temp\\emoji.xml");
            try {
                DocumentBuilder builder = domFactory.newDocumentBuilder();
                Document doc = builder.parse(file);
                NodeList nodes = doc.getElementsByTagName("entry");
                Node node = nodes.item(0);
                NamedNodeMap map = ((Element) node).getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node n = map.item(i);
                    System.out.println();
                    System.out.println(n.getNodeValue());
                    // Print every UTF-16 code unit of the value with its numeric code
                    char[] chars = n.getNodeValue().toCharArray();
                    for (int j = 0; j < chars.length; j++) {
                        System.out.print(chars[j] + ", " + (int) chars[j] + " ");
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

This is the input file, emoji.xml:

    <Attributes>
      <Map>
        <entry key="name" value="💞test💞"/>
      </Map>
    </Attributes>

and this is the output:

    name
    n, 110 a, 97 m, 109 e, 101
    💞test💞💞
    ?, 55357 ?, 56478 t, 116 e, 101 s, 115 t, 116 ?, 55357 ?, 56478 ?, 55357 ?, 56478
+4
2 answers

I can reproduce the problem using JDK 1.7.

The cause of the problem is a bug in the XML parser that ships with the JDK (in this case Xerces, located in the com.sun.org.apache.xerces.internal.* packages in rt.jar).

Emoji characters are outside the Unicode BMP and are therefore represented in Java as two char values (a high and a low surrogate). When the parser encounters these surrogates, it handles them specially and checks whether the pair, combined into a supplementary character, is a valid XML character.
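As a small illustration of the surrogate representation, the sketch below uses only standard java.lang methods. The class name SurrogateDemo and the helper describe are made up for this example; the two code units are the values 55357 and 56478 printed in the question.

```java
public class SurrogateDemo {
    /** Reports how many UTF-16 code units and how many code points a string has. */
    static String describe(String s) {
        return "chars=" + s.length()
                + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 0xD83D = 55357 and 0xDC9E = 56478, the two values from the question's output
        String emoji = "\uD83D\uDC9E";
        char high = emoji.charAt(0);
        char low = emoji.charAt(1);

        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true
        // Combine the pair into the supplementary code point U+1F49E
        System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // 1f49e
        System.out.println(describe(emoji)); // chars=2 codePoints=1
    }
}
```

This is also why the question's loop over toCharArray() prints two entries per emoji: it walks code units, not code points.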

The relevant code is in XMLScanner.scanAttributeValue, in the following section:

    } else if (c != -1 && XMLChar.isHighSurrogate(c)) {
        if (scanSurrogates(fStringBuffer3)) {
            stringBuffer.append(fStringBuffer3);
            if (entityDepth == fEntityDepth && fNeedNonNormalizedValue) {
                fStringBuffer2.append(fStringBuffer3);
            }

The two chars of the emoji are read into the buffer fStringBuffer3 and then appended to the buffer for the attribute value. The problem is that fStringBuffer3 is never cleared. When the second emoji is parsed, the buffer still holds the old content, so the earlier characters are appended again.

If you try to use an attribute value containing three or more emojis, you can clearly see how they accumulate.
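The effect of the stale buffer can be reproduced with a simplified model. This is not the Xerces code, only a sketch under the assumption that the scanner behaves as described above: the class StaleBufferDemo, the method scan and the flag clearScratch are hypothetical, and the local variable scratch stands in for fStringBuffer3.

```java
public class StaleBufferDemo {
    /**
     * Copies input into a result buffer, routing surrogate pairs through a
     * reusable scratch buffer. With clearScratch == false the scratch buffer
     * keeps its previous content, which mimics the behaviour described above.
     */
    static String scan(String input, boolean clearScratch) {
        StringBuilder result = new StringBuilder();
        StringBuilder scratch = new StringBuilder(); // stand-in for fStringBuffer3
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1));
            if (pair) {
                if (clearScratch) {
                    scratch.setLength(0); // the reset that the buggy path lacks
                }
                scratch.append(c).append(input.charAt(++i));
                result.append(scratch); // appends everything still in scratch
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDC9E";
        String input = emoji + "a" + emoji + "b" + emoji;
        String buggy = scan(input, false);
        String fixed = scan(input, true);
        // Without the reset the emoji appear 1, 2 and 3 times: 6 emoji + 2 letters
        System.out.println(buggy.codePointCount(0, buggy.length())); // 8
        System.out.println(fixed.codePointCount(0, fixed.length())); // 5
    }
}
```

With the question's value (emoji, "test", emoji) the uncleared variant yields the emoji once at the start and twice at the end, which matches the output shown in the question.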

+4

A few updates: this problem is fixed in an early-access build of Java 9 (build 9-ea+103-2016-01-27-183833.javare.4341.nc). It still exists in the latest version of Java 8 (build 1.8.0_72-b15). For some reason, Oracle closed the bug that was opened from my service request against Java 6/7/8 for this issue (as not reproducible). I'm trying to get them to reopen it.

Here is the same issue filed against OpenJDK; it was fixed in OpenJDK 9: https://bugs.openjdk.java.net/browse/JDK-8062362
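To see whether a particular runtime is affected, a small self-check along these lines may help. It is only a sketch: the class EmojiAttributeCheck and the method parseValue are made up for this example, and it assumes the bug also shows up when parsing from an in-memory string rather than a file. If another Xerces implementation is on the classpath, DocumentBuilderFactory.newInstance() may pick that one instead of the JDK's built-in parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EmojiAttributeCheck {
    /** Parses a one-element document and returns its "value" attribute. */
    static String parseValue(String attributeValue) throws Exception {
        // The value is inserted unescaped, so it must not contain <, & or "
        String xml = "<entry value=\"" + attributeValue + "\"/>";
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("value");
    }

    public static void main(String[] args) throws Exception {
        String emoji = "\uD83D\uDC9E";
        String expected = emoji + "test" + emoji;
        String actual = parseValue(expected);
        if (expected.equals(actual)) {
            System.out.println("Attribute value round-trips; parser looks unaffected");
        } else {
            System.out.println("Parser looks affected: expected "
                    + expected.length() + " chars, got " + actual.length());
        }
    }
}
```

The check compares lengths instead of printing the emoji, because the console encoding may not be able to display them (as the question marks in the question's output suggest).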

+1
