How to control padding of a Unicode string containing East Asian characters

I have three UTF-8 strings:

    hello, world
    hello, 世界
    hello, 世rld

I need only the first 10 ASCII-character widths of each, so that the brackets line up in one column:

    [hello, wor]
    [hello, 世 ]
    [hello, 世r]

In the console:

    width('世界') == width('worl')
    width('世 ') == width('wor')  # a white space behind '世'

One Chinese character is three bytes in UTF-8, but when displayed in the console it occupies only two ASCII column widths:

    >>> bytes("hello, 世界", encoding='utf-8')
    b'hello, \xe4\xb8\x96\xe7\x95\x8c'

Python's format() doesn't help when East Asian characters are mixed in:

    >>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
    ...     print(s)
    ...
    [hello, wor]
    [hello, 世界 ]
    [hello, 世rl]

It's not beautiful:

     -----------Songs-----------
    | 1: 蝴蝶 |
    | 2: 心之城 |
    | 3: 支持你的爱人 |
    | 4: 根生的种子 |
    | 5: 鸽子歌(CUCURRUCUCU PALO|
    | 6: 林地之间 |
    | 7: 蓝光 |
    | 8: 在你眼里 |
    | 9: 肖邦离别曲 |
    | 10: 西行( 魔戒王者再临主题曲)(INTO |
    | X 11: 深陷爱河 |
    | X 12: 钟爱大地(THE MO RUN AIR |
    | X 13: 时光流逝 |
    | X 14: 卡农 |
    | X 15: 舒伯特小夜曲(SERENADE) |
    | X 16: 甜蜜的摇篮曲(Sweet Lullaby|
     ---------------------------

So, I wonder if there is a standard way to pad UTF-8 text like this?

+4
python unicode string-formatting
4 answers

When trying to line up ASCII text with Chinese in a fixed-width font, note that there is a set of full-width versions of the printable ASCII characters. Below I build a translation table from ASCII to the full-width versions:

    # coding: utf8

    # full-width versions (SPACE is non-contiguous with ! through ~)
    SPACE = '\N{IDEOGRAPHIC SPACE}'
    EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
    TILDE = '\N{FULLWIDTH TILDE}'

    # strings of ASCII and full-width characters (same order);
    # the "+ 1" makes each range include '~' itself
    west = ''.join(chr(i) for i in range(ord(' '), ord('~') + 1))
    east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA), ord(TILDE) + 1))

    # build the translation table
    full = str.maketrans(west, east)

    data = '''\
    蝴蝶(A song)
    心之城(Another song)
    支持你的爱人(Yet another song)
    根生的种子
    鸽子歌(Cucurrucucu palo whatever)
    林地之间
    蓝光
    在你眼里
    肖邦离别曲
    西行(魔戒王者再临主题曲)(Into something)
    深陷爱河
    钟爱大地
    时光流逝
    卡农
    舒伯特小夜曲(SERENADE)
    甜蜜的摇篮曲(Sweet Lullaby)
    '''

    # Replace the ASCII characters with full width, and create a song list.
    data = data.translate(full).rstrip().split('\n')

    # translate each printable line.
    print(' ----------Songs-----------'.translate(full))
    for i, song in enumerate(data):
        line = '|{:4}: {:20.20}|'.format(i + 1, song)
        print(line.translate(full))
    print(' --------------------------'.translate(full))

Output:

     ----------Songs-----------
    |   1: 蝴蝶(A song)          |
    |   2: 心之城(Another song)   |
    |   3: 支持你的爱人(Yet another s|
    |   4: 根生的种子               |
    |   5: 鸽子歌(Cucurrucucu palo|
    |   6: 林地之间                |
    |   7: 蓝光                  |
    |   8: 在你眼里                |
    |   9: 肖邦离别曲               |
    |  10: 西行(魔戒王者再临主题曲)(Into s|
    |  11: 深陷爱河                |
    |  12: 钟爱大地                |
    |  13: 时光流逝                |
    |  14: 卡农                  |
    |  15: 舒伯特小夜曲(SERENADE)    |
    |  16: 甜蜜的摇篮曲(Sweet Lullaby|
     --------------------------

It's not especially pretty, but it lines up.

+9

Firstly, it looks like you are using Python 3, so I will respond accordingly.

I may not understand your question, but it looks like you get what you want, except that the Chinese characters are wider in your font.

So, UTF-8 is a red herring, because we are not talking about bytes, we are talking about characters. You are in Python 3, so all strings are Unicode. The byte representation (where each of these Chinese characters is represented by three bytes) does not matter.

You want to truncate or pad each string to exactly 10 characters, and that works correctly:

    >>> len('hello, wor')
    10
    >>> len('hello, 世界 ')
    10
    >>> len('hello, 世rl')
    10

The only problem is that you are viewing it in a font that appears to be monospaced but in fact is not. Most monospace fonts have this problem. All regular Latin letters have exactly the same width in this font, but Chinese characters are slightly wider. Therefore, the three characters "世界 " occupy more horizontal space than the three characters "wor". There is not much you can do about this, other than: a) getting a font that is truly monospaced, or b) calculating exactly how wide each character is in your font and adding a number of spaces that takes you to approximately the same horizontal position (this will never be exact).
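As a small illustration of the bytes-versus-characters point, the sketch below compares the two counts for one of the strings from the question. The variable name `s` is only for illustration.

```python
s = 'hello, 世界'

# len() on a str counts code points, not bytes.
print(len(s))                  # 9

# Encoding to UTF-8 turns each of the two Chinese characters into three bytes.
print(len(s.encode('utf-8')))  # 13
```

Neither number is the display width, which is what matters for column alignment.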

+3

There seems to be no official support for this, but the built-in unicodedata module may help:

    >>> import unicodedata
    >>> print(unicodedata.east_asian_width('中'))
    W

The return value is the East Asian Width category of the code point. In particular:

  • W - East Asian Wide
  • F - East Asian Fullwidth (wide)
  • Na - East Asian Narrow
  • H - East Asian Halfwidth (narrow)
  • A - East Asian Ambiguous
  • N - Neutral (not East Asian)

An answer to a similar question provides a quick solution. However, note that the displayed result depends on the monospace font in use. The default fonts used by ipython and pydev do not work well, while the Windows console is fine.
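One way such a width-aware truncate-and-pad could look is sketched below. It assumes that W and F characters take two terminal columns and that everything else takes one. That assumption ignores combining marks and treats Ambiguous (A) characters as narrow, so it is an approximation. The helper names `display_width` and `fit` are hypothetical and do not come from any library.

```python
import unicodedata


def char_width(ch):
    # Assumption: Wide (W) and Fullwidth (F) occupy two columns, all others one.
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


def display_width(text):
    # Approximate number of terminal columns the text occupies.
    return sum(char_width(ch) for ch in text)


def fit(text, width):
    # Keep characters while they still fit, then pad with spaces up to `width`.
    out = []
    used = 0
    for ch in text:
        w = char_width(ch)
        if used + w > width:
            break
        out.append(ch)
        used += w
    return ''.join(out) + ' ' * (width - used)


for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    print('[{}]'.format(fit(s, 10)))
```

With these assumptions the loop prints `[hello, wor]`, `[hello, 世 ]` and `[hello, 世r]`, which is the layout asked for in the question. Whether the brackets actually line up on screen still depends on the font.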

+3

Take a look at the kitchen package. I think it might have what you want.

+2