How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align such strings with str.format()?
Motivating example: printing a table of strings to the console, where some of the strings contain non-ASCII characters.
    >>> for title in d.keys():
    ...     print("{:<20} | {}".format(title, d[title]))
    ...
    zootehni-            | zooteh.
    zootekni-            | zootek.
    zoothèque           | zooth.
    zooveterinar-        | zoovet.
    zoovetinstitut-      | zoovetinst.
    母                    | 母母

    >>> import unicodedata
    >>> s = 'e\u0300'  # 'è' written as 'e' + combining accent
    >>> len(s)
    2
    >>> [ord(c) for c in s]
    [101, 768]
    >>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
    >>> s2 = '母'
    >>> len(s2)
    1
As you can see, str.format() simply takes the number of code points in the string ( len(s) ) as its width, which skews the columns in the output. Searching through the unicodedata module, I did not find anything that suggests a solution.
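To make the mismatch reproducible, here is a minimal sketch. The dictionary `d` and the helper `padded` are stand-ins I made up for illustration (my real data is larger), and the strings use escapes so the code points are unambiguous.

```python
# Hypothetical stand-in for the real data; escapes make the code points explicit.
d = {
    "zootehni-": "zooteh.",
    "zoothe\u0300que": "zooth.",   # 'e' + COMBINING GRAVE ACCENT
    "\u6bcd": "\u6bcd\u6bcd",      # a CJK character that terminals usually draw two columns wide
}


def padded(title):
    # str.format() pads until len() of the field reaches the requested width.
    return "{:<20}".format(title)


for title, abbr in d.items():
    cell = padded(title)
    # Every cell has exactly 20 code points, however many columns it occupies on screen.
    print("{} | {}   (len={})".format(cell, abbr, len(cell)))
```

Every padded cell reports a length of 20, yet the separator lands in a different column for the second and third rows on a typical terminal.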
Unicode normalization can fix the problem for è, but not for Asian characters, which often have a wide display width. Similarly, there are zero-width Unicode characters (for example, the zero-width space, which allows line breaks within words). Normalization cannot work around these issues, so please don't suggest "normalize your strings".
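A short check of what NFC normalization does and does not change, as a sketch only. The helper `nfc_len` and the sample strings are my own illustrative names, not part of any library.

```python
import unicodedata

decomposed = "e\u0300"   # 'e' + COMBINING GRAVE ACCENT
wide = "\u6bcd"          # CJK ideograph, usually drawn two columns wide
zwsp = "ab\u200bcd"      # contains ZERO WIDTH SPACE


def nfc_len(s):
    # Number of code points after NFC normalization.
    return len(unicodedata.normalize("NFC", s))


print(nfc_len(decomposed))  # the accent composes into a single code point
print(nfc_len(wide))        # still one code point, although it occupies two columns
print(nfc_len(zwsp))        # the zero-width space is kept and still counted
```

Only the first case gets closer to the displayed width. The wide character and the zero-width space are counted the same before and after normalization.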
Edit: Added normalization information.
Edit 2: My original dataset also contains some European combining characters that do not collapse into a single code point even after normalization:
    zwemwater            | zwemw.
    zwia̢z-              | zw.

    >>> s3 = 'a\u0322'  # the 'a' + COMBINING RETROFLEX HOOK BELOW from zwia̢z-
    >>> len(unicodedata.normalize('NFC', s3))
    2
python string width unicode python-unicode
Christian Aichinger