Why does Python string slicing return 11 characters when 12 are requested?

I am using Python 2.7 on OS X 10.9 and would like to shorten a Unicode string ( 05. Чайка.mp3 ) to 12 characters, so I use mp3file[:12]. But the result is a string like 05. Чайка.m , which shows only 11 characters, even though len(mp3file[:12]) returns 12. The problem seems to be related to the Russian letter й .

What could be wrong here?

The main problem is that I cannot display the lines properly aligned with '{:<12}'.format(mp3file[:12]) .

2 answers

You have Unicode text with a combining character:

 u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m' 

U+0306 is the COMBINING BREVE codepoint, ̆ ; it combines with the preceding CYRILLIC SMALL LETTER I to form й:

    >>> print u'\u0438'
    и
    >>> print u'\u0438\u0306'
    й

You can normalize this to the composed form, U+0439 CYRILLIC SMALL LETTER SHORT I, instead:

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'\u0438\u0306')
    u'\u0439'

This uses the unicodedata.normalize() function to produce the composed normal form (NFC).
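Applied to the question, normalizing before slicing might look like the sketch below. The full file name is an assumption reconstructed from the codepoints shown above, and the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Assumed file name: the decomposed string from the question plus "p3".
mp3file = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'

# Slicing the decomposed string counts U+0306 as its own codepoint.
decomposed_slice = mp3file[:12]

# Normalizing to NFC first merges U+0438 + U+0306 into U+0439.
composed = unicodedata.normalize('NFC', mp3file)
composed_slice = composed[:12]

# Padding now lines up, because each codepoint is one visible character.
padded = u'{:<12}'.format(composed_slice)
```

This only helps when a precomposed codepoint exists for every combining sequence in the string.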


A user-perceived character ( grapheme cluster ), such as й, can be built from several Unicode codepoints, and each codepoint in turn can be encoded using several bytes, depending on the character encoding.

Therefore, the number of characters you see may be smaller than the length of the corresponding Unicode string or of the bytes that encode it. You can also cut inside a Unicode codepoint if you slice bytes, or inside a user-perceived character if you slice the Unicode string, even when it is in NFC Unicode Normalization Form . Neither is desirable.
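The three levels can be seen side by side in a small sketch using the same two codepoints as above (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
# One visible character, built from two codepoints.
text = u'\u0438\u0306'

# Each of these codepoints takes two bytes in UTF-8.
utf8 = text.encode('utf-8')

codepoint_count = len(text)
byte_count = len(utf8)

# Cutting after 3 bytes lands in the middle of U+0306; 'ignore' drops
# the incomplete trailing byte instead of raising UnicodeDecodeError.
cut_bytes = utf8[:3].decode('utf-8', 'ignore')

# Cutting after 1 codepoint splits the visible character itself.
cut_codepoints = text[:1]
```

Both cuts leave a bare U+0438 behind, so the breve is silently lost either way.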

To count characters correctly, you can use the \X regex, which matches an eXtended grapheme cluster (a language-independent "visual character"):

    import regex as re  # $ pip install regex

    characters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m')
    print(characters)
    # -> [u'0', u'5', u'.', u' ', u'\u0427', u'\u0430',
    #     u'\u0438\u0306', u'\u043a', u'\u0430', u'.', u'm']

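Note that even without normalization, u'\u0438\u0306' is matched as a single character, й. Normalization alone is not always enough, since NFC leaves some multi-codepoint clusters uncombined: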

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'\u0646\u200D ')  # 3 Unicode codepoints
    u'\u0646\u200d '  # still 3 codepoints, NFC hasn't combined them
    >>> import regex as re
    >>> re.findall(u'\\X', u'\u0646\u200D ')  # same 3 codepoints
    [u'\u0646\u200d', u' ']  # 2 grapheme clusters
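If installing the third-party regex module is not an option, a rough standard-library approximation is to attach each combining mark to the codepoint before it. This is only a sketch: it is not equivalent to \X, the helper names split_clusters and truncate_visible are hypothetical, and it does not handle joiners such as U+200D, whose combining class is 0.

```python
# -*- coding: utf-8 -*-
import unicodedata


def split_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Approximation only: relies on unicodedata.combining(), so it misses
    clusters formed by zero-width joiners and similar codepoints.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def truncate_visible(text, width):
    """Keep at most `width` clusters, never splitting one of them."""
    return u''.join(split_clusters(text)[:width])


# Assumed file name, reconstructed from the codepoints in the question.
name = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'
short = truncate_visible(name, 12)
```

Here short holds 12 visible characters but 13 codepoints, so padding with str.format() would still need to count clusters rather than rely on len().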

See also: In Python, how to most efficiently cut a UTF-8 string for REST delivery?
