How to get the best Unicode character matches between encodings

Question

I have a UTF-16 line that contains some characters not represented directly on my local Windows-1252 code page:

 6/23/2011 9:23:44 ᴀᴍ

I use WideCharToMultiByte to convert a string to a local code page (Windows-1252 on my North American computer):

    WideCharToMultiByte(
        CP_ACP,      // target code page
        0,           // flags
        Source,      // my string, e.g. "6/23/2011 9:23:44 ᴀᴍ"
        length,      // length in characters, e.g. 20
        buffer,      // destination where to put the string
        bufferSize,  // size of the destination buffer
        NULL,        // optional: character to use when a character cannot be represented
        NULL);       // optional: out boolean that indicates whether any character could not be represented

And the line is output as:

 6/23/2011 9:23:44 ??

The unrepresentable characters are literally replaced with question marks (0x3F = "?").

The original string ends with "ᴀᴍ", which is two characters:

ᴀ U+1D00: Latin Letter Small Capital A
ᴍ U+1D0D: Latin Letter Small Capital M

The Unicode standard says that these are phonetic extensions and that general text should use plain Latin letters instead. To me that means the text should be converted to:

 6/23/2011 9:23:44 AM

or

 6/23/2011 9:23:44 am

Another example is 6′2″:

′ U+2032: Prime
″ U+2033: Double Prime

When I convert this string to Windows-1252, it becomes 6'2? (an apostrophe, then a question mark).

The Unicode entry for Prime lists the apostrophe as an alternative:

    2032 ′ PRIME
        = minutes, feet
        → 0027 ' apostrophe
        → 00B4 ´ acute accent
        → 02B9 ʹ modifier letter prime

So even though Prime does not exist in the target code page, WideCharToMultiByte converts it to one of its closest equivalents (here, the apostrophe).

Double Prime, on the other hand:

    2033 ″ DOUBLE PRIME
        = seconds, inches
        → 0022 " quotation mark
        → 02BA ʺ modifier letter double prime
        → 201D ” right double quotation mark
        ≈ 2032 ′ 2032 ′

is converted to nothing useful (?), even though several of its listed alternatives do exist in Windows-1252:

    Character                            Unicode  Windows-1252
    ===================================  =======  ============
    ″ double prime                       U+2033   -
    " quotation mark                     U+0022   0x22
    ʺ modifier letter double prime       U+02BA   -
    ” right double quotation mark        U+201D   0x94
    ′ prime                              U+2032   -
    ' apostrophe                         U+0027   0x27
    ´ acute accent                       U+00B4   0xb4
    ʹ modifier letter prime              U+02B9   -

Even in the worst case, if the conversion decomposed the original Double Prime into Prime Prime, each Prime has an equivalent, since the conversion already used it for the single Prime.

For other characters, there are also mappings:

    Character                            Unicode  Windows-1252
    ===================================  =======  ============
    ᴀ Latin Letter Small Capital A       U+1D00   -
    A Latin Capital Letter A             U+0041   0x41
    a Latin Small Letter A               U+0061   0x61
    ᴍ Latin Letter Small Capital M       U+1D0D   -
    M Latin Capital Letter M             U+004D   0x4d
    m Latin Small Letter M               U+006D   0x6d
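The "Windows-1252" column in the tables above can be checked mechanically. The following is only a minimal sketch in portable C, not how WideCharToMultiByte works internally: the names `cp1252_byte_for` and `cp1252_count_unmappable` are hypothetical, and the sketch assumes that only exact matches count, so U+0080..U+009F and anything outside the code page are reported as unrepresentable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points that Windows-1252 assigns to bytes 0x80..0x9F.
   A 0 entry marks a byte that the code page leaves undefined. */
const uint16_t CP1252_HIGH[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Returns the exact Windows-1252 byte for a UTF-16 code unit,
   or -1 if the code page has no exact match (assumption: no best fit). */
int cp1252_byte_for(uint16_t cu)
{
    if (cu < 0x80 || (cu >= 0xA0 && cu <= 0xFF))
        return (int)cu;              /* ASCII and Latin-1 map to themselves */
    for (int i = 0; i < 32; i++)
        if (CP1252_HIGH[i] != 0 && CP1252_HIGH[i] == cu)
            return 0x80 + i;         /* found in the 0x80..0x9F block */
    return -1;
}

/* Counts the code units in s[0..len) that have no exact Windows-1252 byte. */
size_t cp1252_count_unmappable(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (cp1252_byte_for(s[i]) < 0)
            n++;
    return n;
}
```

Running the count over a string before conversion shows how many characters will need a fallback of some kind; for 6′2″ that is both the Prime and the Double Prime.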

How can I make WideCharToMultiByte pick the best match between encodings?

+4

winapi unicode internationalization character-encoding globalization

Ian Boyd Jun 23 '11 at 14:33

3 answers

How well WideCharToMultiByte does may depend on the version of Windows you are using. I believe that newer versions use more complete tables. However, it will probably never cover all cases. Since Windows is natively Unicode, there is little incentive for Microsoft to implement every fallback case for the zillions of code pages.

Your choices are to use a library (for example ICU, as others have mentioned) or to write your own preprocessor to handle the fallbacks.
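A preprocessor of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not a complete solution: the function name `apply_fallbacks` and the table `FALLBACKS` are hypothetical, the four entries are simply the examples from the question, and each fallback is assumed to be a single ASCII character so the buffer can be rewritten in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hand-picked fallback: a UTF-16 code unit and its ASCII stand-in. */
struct fallback { uint16_t from; char to; };

/* Illustrative entries only, taken from the question's examples. */
const struct fallback FALLBACKS[] = {
    { 0x1D00, 'A'  },  /* LATIN LETTER SMALL CAPITAL A -> A */
    { 0x1D0D, 'M'  },  /* LATIN LETTER SMALL CAPITAL M -> M */
    { 0x2032, '\'' },  /* PRIME -> apostrophe */
    { 0x2033, '"'  },  /* DOUBLE PRIME -> quotation mark */
};

/* Rewrites text[0..len) in place before the code page conversion.
   Returns how many code units were replaced. */
size_t apply_fallbacks(uint16_t *text, size_t len)
{
    assert(text != NULL || len == 0);
    size_t replaced = 0;
    for (size_t i = 0; i < len; i++) {
        for (size_t j = 0; j < sizeof FALLBACKS / sizeof FALLBACKS[0]; j++) {
            if (text[i] == FALLBACKS[j].from) {
                text[i] = (uint16_t)FALLBACKS[j].to;
                replaced++;
                break;               /* at most one fallback per code unit */
            }
        }
    }
    return replaced;
}
```

The rewritten buffer would then be passed to WideCharToMultiByte as usual. A linear scan is fine for a handful of entries; a larger table would probably want to be sorted and searched with a binary search.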

+1

Adrian McCarthy Jun 23 '11 at 20:25

If you know the target encoding, you can use the standardized POSIX iconv library (which is also available for Windows) and convert from WCHAR_T or UTF-16 to the target encoding. iconv has a "transliteration" option that can turn characters the target encoding lacks into their closest ASCII transliterations. iconv is somewhat lighter than ICU and quite widely available.
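As a rough sketch of that approach in C: the `//TRANSLIT` suffix used here is an extension of GNU libiconv and glibc rather than part of POSIX, so other implementations may reject or ignore it, and which substitutions it makes depends on the implementation and sometimes on the locale. The function name `utf16le_to_cp1252_translit` is hypothetical, and the input is assumed to be little-endian UTF-16, as on Windows.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Converts UTF-16LE bytes to Windows-1252, asking iconv to transliterate
   characters the target lacks. Returns the number of bytes written to out,
   or -1 on any failure (unsupported encoding, invalid input, buffer full). */
long utf16le_to_cp1252_translit(const char *in, size_t in_bytes,
                                char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("WINDOWS-1252//TRANSLIT", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return -1;                   /* encoding or suffix not supported */

    char *src = (char *)in;          /* iconv takes non-const pointers */
    char *dst = out;
    size_t src_left = in_bytes;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush pending state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                   /* errno holds EILSEQ, EINVAL or E2BIG */
    return (long)(out_cap - dst_left);
}
```

Whether ᴀᴍ or ″ actually comes out as something readable depends on the transliteration tables shipped with the iconv implementation in use, so it is worth testing with the real input before relying on it.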

+1

Kerrek SB Jun 23 '11 at 10:55

Paweł Dyda · Accepted Answer · 2011-06-23T16:00:31+0000

I do not think you can change the results of WideCharToMultiByte(). If you care enough, you can try another solution, ICU, which may give different results.

I have not tried it personally, so I cannot guarantee the results (who needs to convert from Unicode, anyway?), but I believe you should use ICU Converters. The best part is the Unicode 6.0 support (not that you need it anyway).
