Do Unicode strings in Python 3 still depend on narrow / wide builds?

Since Python 2.2 and PEP 261, Python can be built in "narrow" or "wide" mode, which affects the definition of a "character", i.e. the addressable unit of a Python Unicode string.

Characters in narrow builds look like UTF-16 code units:

    >>> a = u'\N{MAHJONG TILE GREEN DRAGON}'
    >>> a
    u'\U0001f005'
    >>> len(a)
    2
    >>> a[0], a[1]
    (u'\ud83c', u'\udc05')
    >>> [hex(ord(c)) for c in a.encode('utf-16be')]
    ['0xd8', '0x3c', '0xdc', '0x5']

(The above does not seem to agree with some sources that insist that narrow builds use UCS-2 rather than UTF-16. Very intriguing.)
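The two values seen in the narrow-build session match what the standard UTF-16 surrogate arithmetic yields for U+1F005. A minimal sketch of that arithmetic follows; the function name `to_surrogate_pair` is made up for illustration and is not part of Python:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F005)
print(hex(high), hex(low))  # 0xd83c 0xdc05
```

These are the same code units that `a[0]` and `a[1]` return on a narrow build.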

Does Python 3.0 keep this distinction? Or are all Python 3 builds wide?

(I have heard of PEP 393, which changes the internal representation of strings in 3.3, but that does not apply to 3.0 through 3.2.)

1 answer

Yes, they do, from 3.0 through 3.2. Windows uses narrow builds, while (most) Unix systems use wide builds.
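To check which kind of interpreter is running, one option is to look at `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on every build from 3.3 onward). A small sketch, with `is_narrow_build` as a hypothetical helper name:

```python
import sys


def is_narrow_build():
    """Return True if this interpreter indexes strings by UTF-16 code units."""
    # Narrow builds report 0xFFFF; wide builds and all 3.3+ interpreters
    # report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


print("narrow build" if is_narrow_build() else "wide build or Python 3.3+")
```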

Using Python 3.2 on Windows:

    >>> a = '\N{MAHJONG TILE GREEN DRAGON}'
    >>> len(a)
    2
    >>> a
    '🀅'

Whereas on 3.3+ on Windows this behaves as expected:

    >>> a = '\N{MAHJONG TILE GREEN DRAGON}'
    >>> len(a)
    1
    >>> a
    '\U0001f005'
    >>> print(a)
    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        print(a)
    UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f005' in position 0: Non-BMP character not supported in Tk

The UCS-2 codec is used by Tk (I am using IDLE; a terminal may show a different error).
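If the goal is only to display such a string in an environment that rejects non-BMP characters, one possible workaround is to escape those characters before printing. This is a sketch, not something IDLE or Tk provides: `escape_non_bmp` is a hypothetical helper, and it assumes Python 3.3+, where each element of a string is a whole code point.

```python
def escape_non_bmp(text):
    """Replace every character above U+FFFF with a \\UXXXXXXXX escape."""
    return ''.join(
        ch if ord(ch) <= 0xFFFF else '\\U{:08x}'.format(ord(ch))
        for ch in text
    )


a = '\N{MAHJONG TILE GREEN DRAGON}'
print(escape_non_bmp(a))  # \U0001f005
```

Characters inside the BMP pass through unchanged, so ordinary text is not affected.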
