How to iterate over unicode characters, not bytes in python?

Question


Given an accented Unicode word, for example u'́', I need to strip the acute accent (u''), and also convert the accent to the format u'+', where '+' stands for an acute accent over the preceding letter.

Right now I use a dictionary that maps accented characters to their unaccented counterparts:

    accented_list = [u'́', u'́', u'́', u'́', u'́', u'́', u'́', u'́', u'́']
    regular_list = [u'', u'', u'', u'', u'', u'', u'', u'', u'']
    accent_dict = dict(zip(accented_list, regular_list))

I want to do something like this:

    def changeAccentFormat(word):
        for letter in accent_dict:
            if letter in word:
                its_index = word.index(letter)
                word = word[:its_index + 1] + u'+' + word[its_index + 1:]
        return word

But of course, this does not work as desired. I noticed that this code:

    >>> word = u'́'
    >>> for letter in word:
    ...     print letter

gives

´

(Well, I did not expect an empty character to appear, but never mind.) So I wonder: what is the easiest way to produce [u'', u'', u'́', u'', u'']? Or is there a way to solve my problem without it?

+7

python unicode python-unicode

Frauhahnhen Dec 26 '13 at 12:04


3 answers

The acute accent is represented by code point U+0301, COMBINING ACUTE ACCENT, so a simple string replacement should be enough:

    >>> print u'́'.replace(u'\u0301', "+")
    +

If you come across accented characters that are not encoded with a combining accent (that is, precomposed characters), unicodedata.normalize should do the trick.
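The two steps above (decompose, then replace) can be sketched as follows. This is only an illustration written for Python 3, whereas the thread uses Python 2; the helper names to_plus_format and strip_acute are hypothetical, and the final NFC pass is an added assumption so that other marks, such as a cedilla, are recomposed afterwards.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def to_plus_format(word):
    """Rewrite every acute accent as '+' placed after its base letter."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    replaced = decomposed.replace(COMBINING_ACUTE, "+")
    # NFC recomposes whatever other marks are still attached to a letter.
    return unicodedata.normalize("NFC", replaced)


def strip_acute(word):
    """Remove every acute accent and keep the base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    return unicodedata.normalize("NFC", stripped)


# Works for both the precomposed and the combining spelling of e-acute.
print(to_plus_format("caf\u00e9"))
print(to_plus_format("cafe\u0301"))
print(strip_acute("caf\u00e9"))
```

Because the input is normalized first, it does not matter whether the accent arrived as a separate combining code point or as part of a precomposed letter.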

+1

goncalopp Dec 26 '13 at 13:51


You can create [u'', u'', u'́', u'', u''] using the regex module.

The \X pattern matches one user-perceived character (grapheme cluster) at a time, so it splits your word like this:

    >>> import regex
    >>> word = u'́'
    >>> len(word)
    6
    >>> regex.findall(r'\X', word)
    ['', '', '́', '', '']
    >>> len(regex.findall(r'\X', word))
    5
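If installing the third-party regex module is not an option, a rough standard-library approximation is to attach every combining mark to the code point before it. The sketch below is for Python 3, the function name split_clusters is hypothetical, and it is not a full replacement for \X: it only looks at the canonical combining class, so it ignores cases such as emoji sequences or Hangul jamo.

```python
import unicodedata


def split_clusters(text):
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for marks
        # such as U+0301 COMBINING ACUTE ACCENT.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Latin stand-in for the word in the question: five letters, one acute.
word = "abc\u0301de"
print(len(word))                  # 6 code points
print(len(split_clusters(word)))  # 5 clusters
```

Each element of the result can then be compared or rewritten as a whole, accent included.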

+1

dawg May 07, '15 at 19:29


Lukas Graf · Accepted Answer · 2013-12-26T14:08:54+0000

First of all, with regard to iterating over characters instead of bytes: you are already doing it right. Your word is a unicode object, not an encoded byte string.
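The difference between the two kinds of iteration is easy to see in a small sketch. It is written for Python 3, where str plays the role of Python 2's unicode type and bytes the role of the encoded string; the sample text is an arbitrary choice.

```python
# "e" followed by COMBINING ACUTE ACCENT: one visible character.
word = "e\u0301"

# Iterating over the text object yields code points.
code_points = list(word)

# Iterating over the UTF-8 encoding yields individual bytes (as integers).
utf8_bytes = list(word.encode("utf-8"))

print(len(code_points))  # 2
print(len(utf8_bytes))   # 3
```

Neither count equals the single character a reader sees, which is why the combining accent shows up as a separate item in the loop.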

Now, about combining characters in Unicode:

For many characters that carry a combining mark there is both a composed form, consisting of a single code point, and a decomposed form, a sequence of two (or more) code points:

See U+00E7, U+0063 and U+0327

So in Python you can write either form; both are rendered as a single character when displayed:

    >>> combining_cedilla = u'\u0327'
    >>> c_with_cedilla = u'\u00e7'
    >>> letter_c = u'\u0063'
    >>>
    >>> print c_with_cedilla
    ç
    >>> print letter_c + combining_cedilla
    ç

To convert between the composed and decomposed forms, you can use unicodedata.normalize():

    >>> import unicodedata
    >>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
    >>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
    >>>
    >>> print comp
    ç
    >>> print decomp
    ç

(NFC stands for "normal form C" (composed) and NFD for "normal form D" (decomposed).)

They are still different forms: one consists of one code point, the other of two:

    >>> comp == decomp
    False
    >>> len(comp)
    1
    >>> len(decomp)
    2

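However, in your case there is simply no precomposed character for the lowercase letter in question with an acute accent (there is one for the same letter with a grave accent), so normalizing to NFC cannot merge the letter and its accent into a single code point.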
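This can be checked with unicodedata.normalize(). The sketch below is for Python 3 and uses CYRILLIC SMALL LETTER I (U+0438) as the base letter; that choice is an assumption made for illustration, since the letter itself is missing from the text above.

```python
import unicodedata

base = "\u0438"  # CYRILLIC SMALL LETTER I (assumed example letter)

# A precomposed form exists for the grave accent (U+045D),
# so NFC folds the pair into a single code point.
with_grave = unicodedata.normalize("NFC", base + "\u0300")

# No precomposed form exists for the acute accent,
# so NFC leaves the letter and the combining mark separate.
with_acute = unicodedata.normalize("NFC", base + "\u0301")

print(len(with_grave))  # 1
print(len(with_acute))  # 2
```

This is why iterating over the word yields the accent as its own item even after normalization.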
