How to undo Unicode decomposition using Python?

Question

Using Python 2.5, I have some text stored in a unicode object:

Dinis e Isabel, uma difı'cil relac¸a ~ o marital politics

This looks like decomposed Unicode. Is there a general way in Python to undo the decomposition, so that I get:

Dinis e Isabel, uma difícil relação marital politics

+5

python unicode

msanders Jan 15 '09 at 10:08

3 answers

Unfortunately, it looks like I actually have (for example) \u00B8 (cedilla) instead of \u0327 (combining cedilla) in my text.

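Nasty! You can still do this automatically, though not losslessly, because it involves a compatibility decomposition (NFKD).

Normalize U+00B8 to NFKD and you get a space followed by U+0327. You can then scan the string for any space followed by a combining character and remove that space. Finally, compose to NFC, which attaches each combining character to the character before it.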

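import unicodedata
s = unicodedata.normalize('NFKD', s)
# Drop each space that is followed by a combining character; the length check avoids an IndexError on a trailing space.
s = ''.join(c for i, c in enumerate(s)
            if c != ' ' or i + 1 == len(s) or unicodedata.combining(s[i + 1]) == 0)
s = unicodedata.normalize('NFC', s)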
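
The three steps above can be wrapped in a function for reuse. This is only a sketch: the name `recompose` is made up for illustration, the sample string assumes the text contains U+00B8 (cedilla) and U+02DC (small tilde), and the code is written for Python 3 (under Python 2.5 the string literals would need a `u` prefix).

```python
import unicodedata

def recompose(text):
    """Turn spacing accents such as U+00B8 back into precomposed letters.

    Hypothetical helper. Lossy: NFKD also rewrites other compatibility
    characters (ligatures, superscripts and so on).
    """
    # Step 1: compatibility decomposition, e.g. U+00B8 -> U+0020 U+0327.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: drop every space that is directly followed by a combining mark.
    kept = []
    for i, ch in enumerate(decomposed):
        next_ch = decomposed[i + 1] if i + 1 < len(decomposed) else ""
        if ch == " " and next_ch and unicodedata.combining(next_ch) != 0:
            continue
        kept.append(ch)
    # Step 3: compose each base letter with the combining marks that follow it.
    return unicodedata.normalize("NFC", "".join(kept))

# Assumed input: c + CEDILLA (U+00B8), a + SMALL TILDE (U+02DC).
sample = "relac\u00b8a\u02dco"
print(recompose(sample) == "rela\u00e7\u00e3o")  # True
```

Ordinary spaces between words are kept, because the character after them is not a combining mark. A dotless i (U+0131) followed by an accent would not become í this way, since U+0131 has no decomposition to a plain i.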

+5

bobince Jan 15 '09 at 14:55

There is a unicodedata module in the standard library. Its decomposition() and normalize() functions might help here.

Edit: make sure it really is decomposed Unicode. Sometimes there are odd ways of writing characters that cannot be expressed directly in an encoding, such as "a, which is meant to be read as ä by a human or by some specialized program.
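
One way to check what the text really contains is to list each character's code point, name, combining class and decomposition. The sketch below is an illustration only: `describe` is a hypothetical helper name, and the sample string is made up to contrast a combining cedilla with a spacing one. It is written for Python 3.

```python
import unicodedata

def describe(text):
    """Return (code point, name, combining class, decomposition) per character."""
    rows = []
    for ch in text:
        rows.append((
            "U+%04X" % ord(ch),
            unicodedata.name(ch, "<unnamed>"),  # default for characters without a name
            unicodedata.combining(ch),          # 0 means "not a combining mark"
            unicodedata.decomposition(ch),      # empty string if there is none
        ))
    return rows

# c + COMBINING CEDILLA, then c + (spacing) CEDILLA
for row in describe("c\u0327c\u00b8"):
    print(row)
```

A truly decomposed string shows a non-zero combining class for the accent (as U+0327 does). A spacing accent such as U+00B8 shows class 0 and a `<compat>` decomposition instead.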

+1

unbeknown Jan 15 '09 at 10:18

Rafał dowgird · Accepted Answer · 2009-01-15T10:33:47+0000

I think you are looking for this:

>>> import unicodedata    
>>> print unicodedata.normalize("NFC",u"c\u0327")
ç
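
As a rough illustration of what NFC does and does not handle, the sketch below compares a combining cedilla (U+0327) with the spacing cedilla (U+00B8) mentioned in the other answer. The variable names are made up for the example, and it is written for Python 3 rather than the Python 2.5 of the question.

```python
import unicodedata

combining = "c\u0327"  # c + COMBINING CEDILLA: genuinely decomposed text
spacing = "c\u00b8"    # c + CEDILLA: a spacing character, not a combining mark

# NFC merges the base letter and the combining mark into one code point.
composed = unicodedata.normalize("NFC", combining)
print(len(combining), len(composed), composed == "\u00e7")  # 2 1 True

# NFD reverses that composition.
round_trip = unicodedata.normalize("NFD", composed)
print(round_trip == combining)  # True

# NFC leaves the spacing cedilla alone, so this string is unchanged.
unchanged = unicodedata.normalize("NFC", spacing)
print(unchanged == spacing)  # True
```

So NFC on its own is enough only when the accents are real combining characters.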
