How to undo Unicode decomposition using Python?

Using Python 2.5, I have some text stored in a unicode object:

Dinis e Isabel, uma difı´cil relac¸a˜o conjugal e polı´tica

It looks like decomposed Unicode. Is there a general way in Python to undo the decomposition, so that I get:

Dinis e Isabel, uma difícil relação conjugal e política

3 answers

I think you are looking for this:

>>> import unicodedata    
>>> print unicodedata.normalize("NFC",u"c\u0327")
ç

Unfortunately, it looks like my text actually contains (for example) \u00B8 (cedilla) instead of \u0327 (combining cedilla).
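The difference between the two characters can be checked with the unicodedata module. The following is a small sketch, written so that it should run on both Python 2 and Python 3; the variable names are only illustrative.

```python
import unicodedata

spacing = u'\u00b8'    # CEDILLA: a standalone, spacing character
combining = u'\u0327'  # COMBINING CEDILLA: attaches to the preceding letter

# combining() returns the canonical combining class; 0 means "not a combining mark".
spacing_class = unicodedata.combining(spacing)
combining_class = unicodedata.combining(combining)

# The spacing cedilla only has a compatibility decomposition (space + U+0327).
spacing_decomposition = unicodedata.decomposition(spacing)

# NFC merges the combining form into one character but leaves the spacing form alone.
composed = unicodedata.normalize('NFC', u'c' + combining)
untouched = unicodedata.normalize('NFC', u'c' + spacing)
```

This is why plain NFC normalization has no effect on text that uses the spacing accents.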

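You can still do this automatically, although the process is not entirely lossless because it relies on a compatibility decomposition (NFKD).

Normalizing U+00B8 to NFKD gives a space followed by U+0327. You can then scan the string for any space followed by a combining character and remove that space. Finally, recompose to NFC so that each combining character attaches to the preceding character.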

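import unicodedata
s = unicodedata.normalize('NFKD', s)  # U+00B8 becomes space + U+0327
# Drop each space that is directly followed by a combining character.
s = u''.join(c for i, c in enumerate(s) if c != u' ' or i + 1 == len(s) or unicodedata.combining(s[i + 1]) == 0)
s = unicodedata.normalize('NFC', s)  # reattach the marks to the preceding letter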
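The three steps above could be wrapped in a function along the following lines. This is only a sketch: the name recompose_spacing_marks and the fix_dotless_i flag are hypothetical, not part of any library. The flag covers one case that the steps alone may not handle: the sample text uses a dotless i (U+0131), and as far as I know Unicode has no precomposed character for dotless i plus acute, so NFC leaves that pair uncomposed. Treating it as a plain i is a lossy assumption and would be wrong for text where the dotless i is intentional.

```python
import unicodedata

def recompose_spacing_marks(text, fix_dotless_i=False):
    """Turn spacing accents (e.g. U+00B8) into composed letters where possible."""
    # Compatibility decomposition: each spacing accent becomes space + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop each space that is directly followed by a combining character.
    chars = [c for i, c in enumerate(decomposed)
             if not (c == u' ' and i + 1 < len(decomposed)
                     and unicodedata.combining(decomposed[i + 1]))]
    if fix_dotless_i:
        # Assumption: a dotless i directly before a mark was meant as a plain i.
        for i in range(len(chars) - 1):
            if chars[i] == u'\u0131' and unicodedata.combining(chars[i + 1]):
                chars[i] = u'i'
    # Recompose so each mark merges with the preceding letter.
    return unicodedata.normalize('NFC', u''.join(chars))

fixed = recompose_spacing_marks(u'dif\u0131\u00b4cil relac\u00b8a\u02dco',
                                fix_dotless_i=True)
```

Because NFKD is a compatibility decomposition, it also rewrites unrelated characters such as ligatures, so it is worth comparing the output against the input before trusting it on a whole corpus.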

Take a look at the unicodedata module in the standard library. Its decomposition() and normalize() functions may help here.

Edit: Make sure the text really is decomposed Unicode. Sometimes there are odd conventions for writing characters that cannot be expressed directly in a given encoding, such as "a, which is meant to be read as ä by a person or by some specialized program.
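One rough way to check is to test whether the string contains any combining characters and whether NFC normalization changes it. The helper name looks_decomposed below is hypothetical, and the check is a heuristic sketch rather than a definitive test.

```python
import unicodedata

def looks_decomposed(text):
    """Return True if text has combining marks that NFC would merge."""
    has_marks = any(unicodedata.combining(ch) for ch in text)
    return has_marks and unicodedata.normalize('NFC', text) != text

really_decomposed = looks_decomposed(u'c\u0327')  # c + COMBINING CEDILLA
spacing_accent = looks_decomposed(u'c\u00b8')     # c + spacing CEDILLA
ad_hoc_notation = looks_decomposed(u'"a')         # ASCII stand-in for a-umlaut
```

Only the first string is decomposed in the Unicode sense; the other two need a different kind of repair.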
