Do Unicode strings in Python 3 still depend on narrow / wide builds?

Question


Since Python 2.2 and PEP 261, Python can be built in "narrow" or "wide" mode, which affects the definition of a "character", i.e. "the addressable unit of a Python Unicode string".

Characters in narrow builds look like UTF-16 code units:

    >>> a = u'\N{MAHJONG TILE GREEN DRAGON}'
    >>> a
    u'\U0001f005'
    >>> len(a)
    2
    >>> a[0], a[1]
    (u'\ud83c', u'\udc05')
    >>> [hex(ord(c)) for c in a.encode('utf-16be')]
    ['0xd8', '0x3c', '0xdc', '0x5']

(The above seems to disagree with some sources that insist narrow builds use UCS-2 rather than UTF-16. Very intriguing.)
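The two code units shown above are what the standard UTF-16 surrogate-pair arithmetic yields for U+1F005. A minimal sketch of that arithmetic follows; the helper name `to_surrogate_pair` is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F005)
print(hex(high), hex(low))  # 0xd83c 0xdc05
```

The result matches `a[0]` and `a[1]` in the narrow-build session.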

Does Python 3.0 keep this distinction? Or are all Python 3 builds wide?

(I have heard of PEP 393, which changes the internal representation of strings in 3.3, but that does not apply to 3.0 through 3.2.)


python python-3.x unicode

Kos Feb 09 '13 at 19:34


1 answer

Jbernardo · Accepted Answer · 2013-02-09T20:02:51+0000

Yes, they do, from 3.0 through 3.2. Windows uses narrow builds, while (most) Unix systems use wide builds.
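One way to check which kind of interpreter is running is `sys.maxunicode`, which is 0xFFFF on a narrow build and 0x10FFFF on a wide build (and always 0x10FFFF from 3.3 onward). A small sketch, where the function name `is_narrow_build` is only illustrative:

```python
import sys


def is_narrow_build():
    """Return True when the interpreter stores strings as 16-bit code units.

    Hypothetical helper: relies only on sys.maxunicode, which is 0xFFFF
    on narrow builds and 0x10FFFF on wide builds and on Python 3.3+.
    """
    return sys.maxunicode == 0xFFFF


print("narrow" if is_narrow_build() else "wide (or 3.3+)")
```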

Using Python 3.2 on Windows:

    >>> a = '\N{MAHJONG TILE GREEN DRAGON}'
    >>> len(a)
    2
    >>> a
    '🀅'

Whereas on 3.3+, still on Windows, this is the expected behavior:

    >>> a = '\N{MAHJONG TILE GREEN DRAGON}'
    >>> len(a)
    1
    >>> a
    '\U0001f005'
    >>> print(a)
    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        print(a)
    UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f005' in position 0: Non-BMP character not supported in Tk

The UCS-2 codec is used by Tk (I am using IDLE; a terminal may show a different error).
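If the goal is just to print such a string in a Tk-based shell without the error, one possible workaround is to replace characters outside the BMP with their escape sequences first. This is only a sketch: `escape_non_bmp` is a made-up helper, and it assumes Python 3.3+ (on a narrow build each surrogate is at most 0xFFFF and would pass through unchanged).

```python
def escape_non_bmp(text):
    """Replace characters above U+FFFF with a \\UXXXXXXXX escape.

    Hypothetical helper; assumes Python 3.3+, where each element of a
    str is a full code point.
    """
    return ''.join(
        ch if ord(ch) <= 0xFFFF else '\\U{:08x}'.format(ord(ch))
        for ch in text
    )


print(escape_non_bmp('tile: \U0001f005'))  # tile: \U0001f005
```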
