Unicode string display width in Python

How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align such strings with str.format()?

A motivating example: printing a table of rows to the console, where some rows contain non-ASCII characters.

    >>> for title in d.keys():
    ...     print("{:<20} | {}".format(title, d[title]))
    ...
    zootehni-            | zooteh.
    zootekni-            | zootek.
    zoothèque           | zooth.
    zooveterinar-        | zoovet.
    zoovetinstitut-      | zoovetinst.
    母                    | 母母

    >>> import unicodedata
    >>> s = 'è'
    >>> len(s)
    2
    >>> [ord(c) for c in s]
    [101, 768]
    >>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
    >>> s2 = '母'
    >>> len(s2)
    1

As you can see, str.format() simply takes the number of code points in the string ( len(s) ) as its width, which skews the columns in the output. I searched through the unicodedata module but found nothing that suggested a solution.

Unicode normalization can solve the problem for è, but not for Asian characters, which often have a wide display width. Similarly, there are zero-width Unicode characters (for example, the zero-width space, which allows line breaks within words). Normalization cannot work around these problems, so please do not suggest "normalize your strings".

Edit: Added normalization information.

Edit 2: My original dataset also contains some European combining characters that do not compose into a single code point even after normalization:

    zwemwater            | zwemw.
    zwia̢z-              | zw.

    >>> s3 = 'a\u0322'  # The 'a + combining retroflex hook below' from zwiaz
    >>> len(unicodedata.normalize('NFC', s3))
    2
1 answer

You have several options:

  • Some consoles support escape sequences for precise cursor positioning. This may cause some overprinting.

    Historical note: this approach was used in the Amiga terminal to display images in the console window by printing a line of text and then moving the cursor down one pixel. The leftover pixel rows of each text line slowly built up the image.

  • Create a table in your code that contains the real (pixel) widths of all Unicode characters in the font used by the console / terminal window. To build this table, use a UI framework and a small Python script.

    Then add code that calculates the real width of the text using this table. The result may not be a multiple of the character cell width in the console, however. Combined with pixel-precise cursor movement, this can solve your problem.

    Note: you will need to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load the UI framework without opening a window and use its graphics primitives to calculate the string widths.

  • Use the tab character ( \t ) to indent. But this will only help if your shell actually uses the real width of the text to position the cursor. Many terminals will simply count characters.

  • Create an HTML file with a table and view it in a browser.
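
The HTML option leaves all width calculations to the browser. A minimal sketch, assuming the data is a mapping of titles to abbreviations like the d in the question; the function names rows_to_html and write_table are hypothetical, chosen only for illustration:

```python
import html


def rows_to_html(rows):
    """Return a minimal HTML document with one table row per (title, value) pair."""
    body = "\n".join(
        # html.escape() protects against '<', '>' and '&' in the data.
        "    <tr><td>{}</td><td>{}</td></tr>".format(html.escape(t), html.escape(v))
        for t, v in rows
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head><meta charset=\"utf-8\"></head>\n<body>\n"
        "  <table>\n" + body + "\n  </table>\n"
        "</body>\n</html>\n"
    )


def write_table(path, rows):
    """Write the table to `path` as UTF-8 so non-ASCII text survives intact."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(rows_to_html(rows))
```

Calling write_table("table.html", d.items()) and opening the file in a browser should show aligned columns, since the browser measures the rendered glyphs rather than counting code points.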

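The cursor-positioning option can be sketched as follows. This assumes a terminal that honors the ANSI/ECMA-48 "cursor horizontal absolute" sequence (ESC [ n G); the names format_row and print_table, and the default column of 22 (matching the "{:<20} | {}" layout in the question), are illustrative assumptions:

```python
import sys


def format_row(title, value, column=22):
    """Build one row; the terminal, not Python, decides where `column` is.

    ESC [ n G moves the cursor to column n (1-based) of the current line,
    so the separator lands in the same place whatever width the title
    actually occupied on screen.
    """
    return "{}\x1b[{}G| {}".format(title, column, value)


def print_table(rows, column=22, out=sys.stdout):
    """Write each (title, value) pair as one row of the table."""
    for title, value in rows:
        out.write(format_row(title, value, column) + "\n")
```

With the question's data this would be called as print_table(d.items()). A title wider than the chosen column gets partly overprinted by the separator, and the escape sequences show up as garbage if the output is redirected to a file or to a terminal that does not support them.
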