UTF Encoding for Chinese Characters

I get a String through an object from an Axis web service. Since I do not get the string that I expected, I checked by converting the string to bytes, and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hex, when I expect E4BDA0 E5A5BD E59097, which is actually 你好吗 in UTF-8.

Any idea what might cause 你好吗 to become C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297? I did a Google search, but all I found was a Chinese site describing a similar problem in Python. Any ideas would be great, thanks!

1 answer

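You have what is known as double encoding.

You have the three-character sequence "你好吗", which, as you correctly point out, is encoded in UTF-8 as E4BDA0 E5A5BD E59097.

But now take each byte of THAT encoding, treat it as a character in its own right, and encode it in UTF-8. Start with E4. What is that code point in UTF-8? Try it! It's C3 A4!

You get the idea... :-)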
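To make that single step concrete, here is a minimal sketch that encodes the code point U+00E4 by hand using the two-byte UTF-8 layout (110xxxxx 10yyyyyy) and compares it with what the JDK produces. The class name SingleByteReencoded and the helpers hex and twoByteUtf8 are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteReencoded {
    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: encodes a code point in U+0080..U+07FF
    // as two UTF-8 bytes, 110xxxxx 10yyyyyy.
    static byte[] twoByteUtf8(int codePoint) {
        return new byte[] {
            (byte) (0xC0 | (codePoint >> 6)),   // lead byte: top 5 bits
            (byte) (0x80 | (codePoint & 0x3F))  // continuation byte: low 6 bits
        };
    }

    public static void main(String[] args) {
        // The byte E4, misread as the character U+00E4, encoded by hand.
        System.out.println(hex(twoByteUtf8(0xE4)));
        // The same character encoded by the JDK.
        System.out.println(hex("\u00E4".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Both lines print C3 A4, which is the first pair of bytes in the garbled output from the question.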

Here is a Java program that illustrates this:

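public class DoubleEncoding {
    public static void main(String[] args) throws Exception {
        // First, correct encoding: three characters become nine UTF-8 bytes.
        byte[] encoding1 = "你好吗".getBytes("UTF-8");
        // The mistake: those bytes are decoded as ISO-8859-1, one char per byte.
        String string1 = new String(encoding1, "ISO8859-1");
        for (byte b : encoding1) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // e4 bd a0 e5 a5 bd e5 90 97
        // Second encoding: each of the nine chars becomes two UTF-8 bytes.
        byte[] encoding2 = string1.getBytes("UTF-8");
        for (byte b : encoding2) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // c3 a4 c2 bd c2 a0 c3 a5 c2 a5 c2 bd c3 a5 c2 90 c2 97
    }
}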
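If the source of the double encoding cannot be fixed, the damage can usually be undone on the receiving side by running the steps backwards. The following is only a sketch, and it assumes the wrong charset in the middle was ISO-8859-1, which is what the bytes in the question suggest (90 and 97 became C2 90 and C2 97). The class name UndoDoubleEncoding and the method repair are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class UndoDoubleEncoding {
    // Hypothetical helper: reverses one round of
    // "UTF-8 bytes misread as ISO-8859-1".
    static String repair(String garbled) {
        // Each garbled char (U+0000..U+00FF) maps back to one original byte.
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes the way they should have been decoded.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expected = "\u4F60\u597D\u5417"; // 你好吗

        // Simulate what the service delivers: UTF-8 bytes decoded as ISO-8859-1.
        String garbled = new String(
                expected.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(garbled.length());                  // 9 chars, one per byte
        System.out.println(repair(garbled).equals(expected));  // true
    }
}
```

This only works while every character in the received string is at or below U+00FF. Anything above that is replaced with ? by getBytes, so the original bytes are lost. The cleaner fix is to find the place where the bytes are decoded with the wrong charset and correct it there.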
