A user-perceived character (grapheme cluster) can be built from several Unicode code points, and each code point can in turn be encoded using several bytes, depending on the character encoding.
Therefore, the number of characters you see may be smaller than the length of the Unicode string or of the byte string that encodes it. You can also cut in the middle of a code point if you slice the bytes, or in the middle of a user-perceived character if you slice the Unicode string, even if the string is in NFC (Unicode Normalization Form C). Neither is desirable.
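To make the three lengths concrete, here is a minimal sketch using only built-in string methods. The sample string (Cyrillic 'и' followed by a combining breve) and the variable names are chosen for illustration, and UTF-8 is assumed as the encoding.

```python
# One user-perceived character built from two code points:
# U+0438 (Cyrillic small letter i) + U+0306 (combining breve).
text = u'\u0438\u0306'
encoded = text.encode('utf-8')  # assumption: UTF-8 is the target encoding

print(len(text))     # 2 code points
print(len(encoded))  # 4 bytes (each of the two code points takes 2 bytes in UTF-8)

# Slicing the bytes can cut in the middle of a code point,
# leaving a byte sequence that no longer decodes.
cut_bytes = encoded[:3]
try:
    cut_bytes.decode('utf-8')
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False

# Slicing the Unicode string can cut in the middle of the visible character:
# the base letter survives, the combining breve is lost.
cut_text = text[:1]
```

Here the byte slice fails to decode and the code point slice silently changes the visible character, which are the two failure modes described above.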
To count characters correctly, you can use the \X regex, which matches an eXtended grapheme cluster (a language-independent "visual character"):
import regex as re  # third-party module: pip install regex
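A minimal sketch of counting and truncating by grapheme cluster could look like the following. It assumes the third-party `regex` module is installed; the helper names `graphemes` and `truncate_graphemes` are hypothetical and only serve the illustration.

```python
import regex  # third-party module (pip install regex); the stdlib re has no \X

def graphemes(text):
    """Return the extended grapheme clusters of text as a list (hypothetical helper)."""
    return regex.findall(r'\X', text)

def truncate_graphemes(text, limit):
    """Keep at most `limit` user-perceived characters (hypothetical helper)."""
    return u''.join(graphemes(text)[:limit])

# 'и' + combining breve, followed by 'a': three code points, two visible characters.
text = u'\u0438\u0306a'

print(len(text))             # 3
print(len(graphemes(text)))  # 2

# Truncating to one visible character keeps the base letter and its combining mark.
first = truncate_graphemes(text, 1)
```

Truncating the list of clusters rather than the string itself means a cut can never land between a base letter and its combining mark.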
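Note that without normalization, the combining breve in u'\u0438\u0306' is a separate code point ('̆', U+0306), so the string contains two code points but only one user-perceived character.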
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0646\u200D ')
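The following sketch, using only the standard `unicodedata` module, illustrates why NFC alone does not solve the problem: it composes some sequences into a single code point but leaves others as they are. The first sample is the 'и' plus combining breve pair mentioned above; the second is the sequence from the call above without the trailing space. Variable names are illustrative.

```python
import unicodedata

# 'и' + combining breve has a precomposed form, so NFC merges it into U+0439.
decomposed = u'\u0438\u0306'
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 2 1

# Arabic noon + zero width joiner has no precomposed form,
# so NFC leaves both code points in place.
joined = u'\u0646\u200D'
normalized = unicodedata.normalize('NFC', joined)
print(len(joined), len(normalized))  # 2 2
```

Since the second string still has more than one code point after normalization, slicing an NFC string by index can still cut inside a sequence that is meant to be kept together.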
See also: "In Python, how to most efficiently cut a UTF-8 string for REST delivery?"
jfs