I have a UTF-16 string that contains some characters that cannot be represented directly in my local Windows-1252 code page:
6/23/2011 9:23:44 ᴀᴍ
I use WideCharToMultiByte to convert a string to a local code page (Windows-1252 on my North American computer):
    WideCharToMultiByte(
        CP_ACP,      // target code page
        0,           // flags
        Source,      // my string, e.g. "6/23/2011 9:23:44 ᴀᴍ"
        length,      // length in characters, e.g. 20
        buffer,      // destination where to put the string
        bufferSize,  // size of the destination buffer
        null,        // optional: character to use when a character cannot be represented
        null);       // optional: out boolean that indicates whether any character could not be represented
And the string comes out as:
6/23/2011 9:23:44 ??
That is, each non-representable character is replaced with a literal question mark (0x3F = "?").
The two characters in question in the original string are "ᴀᴍ":
ᴀ U+1D00: LATIN LETTER SMALL CAPITAL A
ᴍ U+1D0D: LATIN LETTER SMALL CAPITAL M
The Unicode standard says that these are phonetic extensions and that, for general text, the regular Latin letters should be used. To me, that means the text should be converted to:
6/23/2011 9:23:44 AM
or
6/23/2011 9:23:44 am
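To make the expected result concrete, here is a minimal sketch in portable C of the folding I have in mind. It is not what WideCharToMultiByte does, and the names `fold_small_capital` and `to_ascii_best_fit` are made up for this illustration. It only knows the two small capitals from my string; ASCII passes through and everything else still becomes a question mark.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the plain Latin letter for a small capital,
   or 0 when no replacement is known. Only the two characters from the
   example string are listed; a real table would be much larger. */
static char fold_small_capital(uint16_t ch)
{
    switch (ch) {
    case 0x1D00: return 'A';  /* LATIN LETTER SMALL CAPITAL A */
    case 0x1D0D: return 'M';  /* LATIN LETTER SMALL CAPITAL M */
    default:     return 0;
    }
}

/* Hypothetical converter: copies UTF-16 code units to single bytes.
   ASCII passes through, known small capitals are folded, and anything
   else becomes '?'. The output is always NUL-terminated; the return
   value is the number of bytes written, not counting the terminator. */
static size_t to_ascii_best_fit(const uint16_t *src, size_t len,
                                char *dst, size_t dst_size)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    for (size_t i = 0; i < len && n + 1 < dst_size; i++) {
        char folded = fold_small_capital(src[i]);
        if (src[i] < 0x80)
            dst[n++] = (char)src[i];
        else if (folded != 0)
            dst[n++] = folded;
        else
            dst[n++] = '?';
    }
    dst[n] = '\0';
    return n;
}
```

Feeding it the tail of my string (the code units for "9:23 " followed by U+1D00 and U+1D0D) gives "9:23 AM" instead of "9:23 ??".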
Another example is 6′2″:
′ U+2032: PRIME
″ U+2033: DOUBLE PRIME
When I convert this string to Windows-1252, it becomes 6'2? (the prime becomes an apostrophe, but the double prime becomes ?).
The Unicode entry for Prime lists the apostrophe as an alternative:
2032 ′ PRIME
    = minutes, feet
    → 0027 ' apostrophe
    → 00B4 ´ acute accent
    → 02B9 ʹ modifier letter prime
So even though prime does not exist in the target code page, WideCharToMultiByte converts it to one of the closest equivalents (in this case, the apostrophe).
Double prime, on the other hand:
2033 ″ DOUBLE PRIME
    = seconds, inches
    → 0022 " quotation mark
    → 02BA ʺ modifier letter double prime
    → 201D ” right double quotation mark
    ≈ 2032 ′ 2032 ′
is converted to nothing useful (?), even though some of the listed alternatives do exist in Windows-1252:
Character                          Unicode  Windows-1252
=================================  =======  ============
″ double prime                     U+2033   -
" quotation mark                   U+0022   0x22
ʺ modifier letter double prime     U+02BA   -
” right double quotation mark      U+201D   0x94
′ prime                            U+2032   -
' apostrophe                       U+0027   0x27
´ acute accent                     U+00B4   0xB4
ʹ modifier letter prime            U+02B9   -
Even in the worst case, if it decomposed the original double prime into prime prime, prime has an equivalent in the code page, since the function has already used it.
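The mapping I would expect can be written out as a small table-driven fallback. This is only a sketch of the desired behavior in portable C, not the API's actual logic; `cp1252_fallback` and `to_cp1252_with_fallback` are hypothetical names, and picking the quotation mark (0x22) for double prime is my own choice among the alternatives listed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fallback table: the closest Windows-1252 byte for a few
   code points that the code page cannot represent directly.
   Returns 0 when no fallback is known. */
static char cp1252_fallback(uint16_t ch)
{
    switch (ch) {
    case 0x2032: return 0x27;  /* prime -> apostrophe */
    case 0x02B9: return 0x27;  /* modifier letter prime -> apostrophe */
    case 0x2033: return 0x22;  /* double prime -> quotation mark */
    case 0x02BA: return 0x22;  /* modifier letter double prime -> quotation mark */
    default:     return 0;
    }
}

/* Hypothetical converter: ASCII passes through, code points in the
   fallback table get their closest byte, everything else becomes '?'.
   (A full converter would also handle the rest of Windows-1252.)
   The output is always NUL-terminated; the return value is the number
   of bytes written, not counting the terminator. */
static size_t to_cp1252_with_fallback(const uint16_t *src, size_t len,
                                      char *dst, size_t dst_size)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    for (size_t i = 0; i < len && n + 1 < dst_size; i++) {
        char fb = cp1252_fallback(src[i]);
        if (src[i] < 0x80)
            dst[n++] = (char)src[i];
        else if (fb != 0)
            dst[n++] = fb;
        else
            dst[n++] = '?';
    }
    dst[n] = '\0';
    return n;
}
```

With this table, the code units for 6′2″ convert to 6'2" rather than 6'2?.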
For other characters, there are also mappings:
Character Unicode Windows-1252
How can I make WideCharToMultiByte perform a best-fit mapping between encodings?