How is Unicode represented inside Python?

How is a Unicode string literally represented in Python memory?

For example, I can visualize 'abc' as its equivalent ASCII bytes in memory. An integer can be thought of as its two's complement representation. However, although u'\u2049' is represented in UTF-8 as '\xe2\x81\x89' (3 bytes), how do I visualize the code point u'\u2049' itself in memory?

Is there a specific way it is stored in memory? Do Python 2 and Python 3 handle it differently?

A few related questions for anyone curious:

1) How are these strings represented internally in the Python interpreter? I don't understand

2) What is the internal representation of a string in Python 3.x?

python string unicode python-internals
1 answer

Python 2 and Python 3.0-3.2 use either UCS-2* or UCS-4 for Unicode characters, meaning each character takes either 2 bytes or 4 bytes. Which one is used is a compile-time option.

\u2049 is then stored as \x49\x20 (UCS-2, little-endian), \x20\x49 (UCS-2, big-endian), \x49\x20\x00\x00 (UCS-4, little-endian) or \x00\x00\x20\x49 (UCS-4, big-endian), depending on the native byte order of your system and on whether UCS-2 or UCS-4 was selected. ASCII characters in a unicode string still use 2 or 4 bytes per character.
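As a rough way to see those four layouts, the fixed-width UTF-16 and UTF-32 codecs can stand in for the old in-memory forms. This is only a sketch run on Python 3: it encodes the string with codecs rather than inspecting interpreter memory, and the `layouts` dictionary and `as_hex` helper are names made up for the illustration.

```python
# Emulate the pre-3.3 in-memory layouts of U+2049 with fixed-width codecs.
# Assumption: a BMP character in a UCS-2 build matches its UTF-16 encoding,
# and a character in a UCS-4 build matches its UTF-32 encoding.
ch = "\u2049"

layouts = {
    "UCS-2, little-endian": ch.encode("utf-16-le"),
    "UCS-2, big-endian": ch.encode("utf-16-be"),
    "UCS-4, little-endian": ch.encode("utf-32-le"),
    "UCS-4, big-endian": ch.encode("utf-32-be"),
}


def as_hex(raw):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join("{:02x}".format(b) for b in raw)


for name, raw in layouts.items():
    print("{:22} {}".format(name, as_hex(raw)))

# An ASCII character is padded to the same width in these layouts.
print(as_hex("a".encode("utf-16-le")))  # 61 00
```

The last line shows why ASCII text was comparatively expensive in those builds: 'a' still occupies a full 2-byte (or 4-byte) slot.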

Python 3.3 switched to a new internal representation that uses the most compact form able to represent all characters in a given string: 1 byte, 2 bytes, or 4 bytes per character. ASCII and Latin-1 text uses only 1 byte per character, text containing other BMP characters requires 2 bytes per character, and text containing anything beyond the BMP uses 4 bytes per character.
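One way to observe this on CPython 3.3 or later is to compare `sys.getsizeof()` for two strings that differ by one character, so the fixed object header cancels out. This is a sketch that relies on a CPython implementation detail and may behave differently on other interpreters; `bytes_per_char` is a helper name invented here.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate per-character storage by growing a string by one character.

    Assumption: CPython 3.3+ (PEP 393), where the size difference between
    two same-kind strings equals the width of one character slot.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u2049"),
    ("non-BMP", "\U0001F600"),
]

for label, ch in samples:
    print("{:8} {} byte(s) per character".format(label, bytes_per_char(ch)))
```

On a PEP 393 build this should report 1, 1, 2 and 4 bytes respectively. Note that the width is chosen per string, so a single non-BMP character widens every character in that string to 4 bytes.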

See PEP 393 (Flexible String Representation) for the full details on these representations.


* Technically, the UCS-2 build uses UTF-16, since non-BMP characters are encoded with UTF-16 surrogates and take 4 bytes (2 UTF-16 code units) each. However, the Python documentation still refers to this as UCS2.

This leads to unexpected behavior: for example, len() on a unicode string containing non-BMP characters returns a value larger than the number of characters the string contains.
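The length mismatch can be reproduced on a modern Python 3 by counting UTF-16 code units, which is what len() effectively counted on a narrow build. This is a sketch of the arithmetic only, not a reproduction of the old interpreter; `surrogate_pair` and `utf16_code_units` are illustrative names.

```python
def surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_code_units(text):
    """Count 2-byte UTF-16 code units, as a narrow build's len() would."""
    return len(text.encode("utf-16-le")) // 2


s = "\U0001F600"
high, low = surrogate_pair(ord(s))

print(len(s))                                # 1 on Python 3.3+
print(utf16_code_units(s))                   # 2, the narrow-build length
print("{:04X} {:04X}".format(high, low))     # D83D DE00
```

Since Python 3.3 stores non-BMP characters in 4-byte slots instead of surrogate pairs, len() there reports 1 for this string.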
