Unicode Base64 encoding with Java

I am trying to Base64-encode and decode a UTF-8 string. In theory this should not be a problem, but the decoded output never contains the correct characters, only ?.

    String original = "خهعسيبنتا";
    B64encoder benco = new B64encoder();
    String enc = benco.encode(original);
    try {
        String dec = new String(benco.decode(enc.toCharArray()), "UTF-8");
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Original: " + original);
        prtHx("ara", original.getBytes());
        out.println("Encoded: " + enc);
        prtHx("enc", enc.getBytes());
        out.println("Decoded: " + dec);
        prtHx("dec", dec.getBytes());
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

The console output is as follows:

Original: خهعسيبنتا
ara = 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F
Encoded: Pz8/Pz8/Pz8/
enc = 50, 7A, 38, 2F, 50, 7A, 38, 2F, 50, 7A, 38, 2F
Decoded: ?????????
dec = 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F

prtHx simply writes the hexadecimal values of the bytes to the output. Am I doing something obviously wrong here?


Andreas pointed out the correct solution: getBytes() uses the platform default encoding (Cp1252), even though the source file itself is UTF-8. After switching to getBytes("UTF-8"), I could see that the encoded and decoded bytes really were different. Further investigation showed that the encode method itself called getBytes(). Changing that did the trick.

    try {
        String enc = benco.encode(original);
        String dec = new String(benco.decode(enc.toCharArray()), "UTF-8");
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Original: " + original);
        prtHx("ori", original.getBytes("UTF-8"));
        out.println("Encoded: " + enc);
        prtHx("enc", enc.getBytes("UTF-8"));
        out.println("Decoded: " + dec);
        prtHx("dec", dec.getBytes("UTF-8"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

System encoding: Cp1252
Original: خهعسيبنتا
ori = D8, AE, D9, 87, D8, B9, D8, B3, D9, 8A, D8, A8, D9, 86, D8, AA, D8, A7
Encoded: 2K7Zh9i52LPZitio2YbYqtin
enc = 32, 4B, 37, 5A, 68, 39, 69, 35, 32, 4C, 50, 5A, 69, 74, 69, 6F, 32, 59, 62, 59, 71, 74, 69, 6E
Decoded: خهعسيبنتا
dec = D8, AE, D9, 87, D8, B9, D8, B3, D9, 8A, D8, A8, D9, 86, D8, AA, D8, A7
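
For anyone without the B64encoder source (it is not shown here), the same round trip can be sketched with the JDK's own java.util.Base64, available since Java 8. This is only an illustration under that assumption, not the class used above; the class name Base64Utf8RoundTrip and its encode/decode helpers are made up for the example. The point is that both directions name UTF-8 explicitly instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Utf8RoundTrip {

    // Encode the string's UTF-8 bytes, never the platform-default bytes.
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Decode Base64 back to bytes, then interpret those bytes as UTF-8.
    static String decode(String base64) {
        byte[] utf8 = Base64.getDecoder().decode(base64);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the literal independent of the source file encoding.
        String original = "\u062E\u0647\u0639\u0633\u064A\u0628\u0646\u062A\u0627";
        String enc = encode(original);
        String dec = decode(enc);
        System.out.println("Encoded: " + enc);
        System.out.println("Round trip ok: " + original.equals(dec));
    }
}
```

Using StandardCharsets.UTF_8 rather than the string "UTF-8" also removes the need to catch UnsupportedEncodingException.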

Thanks.

1 answer

String#getBytes() encodes characters using the platform default encoding. The encoding of the string literal "خهعسيبنتا", on the other hand, is determined by the Java source file (you choose the character encoding when creating or saving the file).

This may be why ara comes out as 0x3F bytes: Cp1252 has no Arabic characters, so getBytes() replaces each one with ?.

Try:

    out.println("Original: " + original);
    prtHx("ara", original.getBytes("UTF-8"));
    out.println("Encoded: " + enc);
    prtHx("enc", enc.getBytes("UTF-8"));
    out.println("Decoded: " + dec);
    prtHx("dec", dec.getBytes("UTF-8"));
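
To see the difference in isolation, here is a small sketch that encodes the first three letters of the string once with Cp1252 and once with UTF-8. The class name DefaultCharsetPitfall and the toHex helper are invented for this example (toHex only imitates what prtHx appears to print), and it assumes the JVM provides the windows-1252 charset, which the usual desktop JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetPitfall {

    // Hypothetical stand-in for prtHx: formats bytes as "3F, 3F, ...".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First three letters of the question's string, written as escapes.
        String original = "\u062E\u0647\u0639";
        // Assumption: windows-1252 (Cp1252) is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        // Unmappable characters are replaced with the charset's replacement byte, '?'.
        System.out.println("cp1252 = " + toHex(original.getBytes(cp1252)));
        // UTF-8 can represent every character, two bytes each for these letters.
        System.out.println("utf-8  = " + toHex(original.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line should show only 3F values and the second the D8/D9 byte pairs seen in the corrected output, which suggests the information is lost at getBytes() before Base64 is involved at all.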