Precomposing Unicode character sequences in Python

Question

How can I convert decomposed Unicode character sequences such as "LATIN SMALL LETTER E" + "COMBINING ACUTE ACCENT" (U+0065 + U+0301) into their precomposed form, "LATIN SMALL LETTER E WITH ACUTE" (U+00E9), using native functions in Python 2.5+?

If it matters, I am on Mac OS X (10.6.4), and I have seen the question "Convert to a Unicode Precomposed String using Python-AppKit-ObjectiveC". Unfortunately, although the OS X CoreFoundation function CFStringNormalize described there neither fails nor halts the script, it simply does nothing. By that I do not mean that it returns nothing (its return type is void; it mutates the string in place), but that the string is left unchanged. I have also tried every possible value of the constant parameter that selects composition or decomposition in the canonical or non-canonical forms.

This is why I am looking for a native Python method to handle this case.

Thanks so much for reading!

Andre

python unicode macos

andreb Oct 2 '10 at 12:57

1 answer

unutbu · Accepted Answer · 2010-10-02T13:10:50+0000

 import unicodedata as ud

 astr = u"\N{LATIN SMALL LETTER E}" + u"\N{COMBINING ACUTE ACCENT}"
 combined_astr = ud.normalize('NFC', astr)

'NFC' tells ud.normalize to apply canonical decomposition ('NFD') and then recompose the result into precomposed characters where they exist, so combined_astr is a single code point:

 print(ud.name(combined_astr)) # LATIN SMALL LETTER E WITH ACUTE
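To make the decompose-then-compose description concrete, here is a small sketch that round-trips a string through NFC and NFD. It assumes Python 3, where plain string literals are already Unicode (on Python 2 the literals would need a u prefix), and the variable names are illustrative only.

```python
import unicodedata as ud

# 'e' followed by COMBINING ACUTE ACCENT (two code points)
decomposed = "e\u0301"

# NFC: canonical decomposition followed by canonical composition
composed = ud.normalize("NFC", decomposed)

# NFD: canonical decomposition only, which undoes the composition
round_trip = ud.normalize("NFD", composed)

print(len(decomposed), len(composed), len(round_trip))  # 2 1 2
print([ud.name(ch) for ch in round_trip])
```

The round trip returns to the original two-code-point sequence, which is why the two forms are called canonically equivalent.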

Both astr and combined_astr print the same thing:

 print(astr)           # é
 print(combined_astr)  # é

But their reprs are different:

 print(repr(astr))           # u'e\u0301'
 print(repr(combined_astr))  # u'\xe9'

And, not surprisingly, their encodings (UTF-8, say) also differ:

 print(repr(astr.encode('utf_8')))           # 'e\xcc\x81'
 print(repr(combined_astr.encode('utf_8')))  # '\xc3\xa9'
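Because the code points and the encoded bytes differ, a plain == comparison treats the two strings as unequal. One possible way around this, sketched here for Python 3, is to normalize both sides before comparing. The helper name nfc_equal is hypothetical and not part of unicodedata.

```python
import unicodedata as ud

def nfc_equal(a, b):
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return ud.normalize("NFC", a) == ud.normalize("NFC", b)

decomposed = "e\u0301"  # base letter + combining accent
composed = "\u00e9"     # precomposed letter

# The raw code point sequences differ, so direct comparison fails
print(decomposed == composed)            # False

# After normalization the two strings compare equal
print(nfc_equal(decomposed, composed))   # True

# The UTF-8 bytes also match once both sides are in NFC
same_bytes = ud.normalize("NFC", decomposed).encode("utf-8") == composed.encode("utf-8")
print(same_bytes)                        # True
```

Normalizing at the boundaries (when text is read in or before it is stored) is usually simpler than normalizing at every comparison, but which form to settle on depends on what the consuming system expects.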
