Why can iconv convert the precomposed form of “É” but not the decomposed form (from UTF-8 to CP1252)?

I use the iconv library to interface a modern input source that uses UTF-8 with a legacy system that uses Latin1, a.k.a. CP1252 (a superset of ISO-8859-1).

The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81. Note that the destination encoding does have the character “É”, encoded as C9.

Why does iconv fail to convert a decomposed "É"? I checked that the iconv command-line tool shipped with Mac OS X 10.7.3 reports that it cannot convert, and that the Perl iconv module fails as well.

This is all the more puzzling because the precomposed form of the character "É" (encoded as C3 89) is converted just fine.

Is this a bug with iconv or am I missing something?

Please note that I also have the same problem if I try to convert from UTF-16 (where “É” is encoded as 00 C9 composed or 00 45 03 01 decomposed).
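To make the byte sequences above concrete, here is a small Python sketch. Python is used purely for illustration; it is not part of the original setup, and the helper names `hex_bytes` and `try_cp1252` are made up for this example. It prints both spellings of "É" in UTF-8 and UTF-16 (big-endian, no BOM) and shows that Python's CP1252 codec, much like iconv here, has no mapping for the combining accent.

```python
# The two spellings of "É" as Unicode strings.
precomposed = "\u00c9"   # U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "E\u0301"   # U+0045 followed by U+0301 COMBINING ACUTE ACCENT


def hex_bytes(text, encoding):
    """Encode text and return the bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def try_cp1252(text):
    """Return the CP1252 bytes as hex, or None if a code point has no mapping."""
    try:
        return hex_bytes(text, "cp1252")
    except UnicodeEncodeError:
        return None


print(hex_bytes(precomposed, "utf-8"))      # C3 89
print(hex_bytes(decomposed, "utf-8"))       # 45 CC 81
print(hex_bytes(precomposed, "utf-16-be"))  # 00 C9
print(hex_bytes(decomposed, "utf-16-be"))   # 00 45 03 01
print(try_cp1252(precomposed))              # C9
print(try_cp1252(decomposed))               # None: U+0301 is not in CP1252
```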

unicode iconv
2 answers

Unfortunately, iconv really does not handle decomposed characters in UTF-8, except for the version installed on Mac OS X.

When working with Mac file names, you can use iconv with the utf8-mac character set option. It also accounts for several peculiarities of the Mac's decomposed form.

However, non-Mac versions of iconv or libiconv do not support this, and I could not find the sources used on the Mac that provide this support.

I agree with you that iconv should handle both the NFC and NFD forms of UTF-8, but until someone fixes the sources, we have to detect this manually and deal with it before passing the data to iconv.

Faced with this annoying problem, I used the Perl Unicode::Normalize module, as suggested by Jukka.

    #!/usr/bin/perl
    use Encode qw/decode_utf8 encode_utf8/;
    use Unicode::Normalize;

    # Filter: read UTF-8 lines, compose them to NFC, write UTF-8 back out.
    while (<>) {
        print encode_utf8( NFC(decode_utf8 $_) );
    }
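For readers who do not use Perl, the same normalize-then-convert idea can be sketched in Python with the standard `unicodedata` module. This is only an illustration under a few assumptions: the function name `utf8_to_cp1252` is hypothetical, and the final encoding step uses Python's own CP1252 codec as a stand-in for the iconv call.

```python
import unicodedata


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Decode UTF-8, compose to NFC, then encode as CP1252.

    Python's cp1252 codec stands in for iconv here (an assumption of this
    sketch); the normalization step is the part that matters.
    """
    text = raw.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)  # E + U+0301 becomes U+00C9
    return composed.encode("cp1252")


# Decomposed input: 45 CC 81 followed by "ducation".
print(utf8_to_cp1252(b"E\xcc\x81ducation"))  # b'\xc9ducation'
```

Note that NFC only helps when a precomposed code point exists and is present in the target character set; anything else still raises `UnicodeEncodeError` in this sketch.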

Use a normalizer (in this case, to Normalization Form C) before calling iconv.

A program that deals with character encodings (different representations of characters or, more precisely, code points, as sequences of bytes) and conversions between them is expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points and as such differs from the precomposed É, which is one code point.
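The code-point distinction can be inspected directly. The following Python sketch (the helper name `code_points` is made up for illustration) lists the code points of each form and shows that the two strings compare as different until one of them is normalized.

```python
import unicodedata

precomposed = "\u00c9"  # one code point
decomposed = "E\u0301"  # two code points


def code_points(text):
    """Describe each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(code_points(precomposed))
# ['U+00C9 LATIN CAPITAL LETTER E WITH ACUTE']
print(code_points(decomposed))
# ['U+0045 LATIN CAPITAL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']

# As raw code point sequences the two forms are different strings...
print(precomposed == decomposed)                                # False
# ...but they are canonically equivalent, so normalizing maps one to the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```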
