The code that converts strings to int in CPython 3.6 requests the UTF-8 form of the string to work with:
buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);
and the UTF-8 representation is created on the first request and cached in the string object:
if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
           PyBytes_AS_STRING(bytes),
           _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}
The extra 3 bytes are for this cached UTF-8 representation.
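To make this concrete, here's a minimal sketch to run on CPython 3.6 (the absolute sizes are build-dependent; the +3 difference is the point):

import sys

s = 'ñ'                   # not ASCII, so the compact-ASCII shortcut below doesn't apply
before = sys.getsizeof(s)
try:
    int(s)                # parsing fails, but the UTF-8 form was already requested
except ValueError:
    pass
print(sys.getsizeof(s) - before)   # 3: two UTF-8 bytes for 'ñ' plus a NUL terminator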
You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string is in the "compact ASCII" representation, Python doesn't create a separate UTF-8 representation; it returns the ASCII representation directly, which is already valid UTF-8:
#define PyUnicode_UTF8(op)                              \
    (assert(_PyUnicode_CHECK(op)),                      \
     assert(PyUnicode_IS_READY(op)),                    \
     PyUnicode_IS_COMPACT_ASCII(op) ?                   \
         ((char*)((PyASCIIObject*)(op) + 1)) :          \
         _PyUnicode_UTF8(op))
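A quick sketch of that shortcut in action: converting a pure-ASCII string leaves its reported size alone.

import sys

s = '40'                  # pure ASCII, stored in the compact-ASCII layout
before = sys.getsizeof(s)
int(s)                    # the UTF-8 data is the ASCII data itself
print(sys.getsizeof(s) - before)   # 0: nothing extra is cached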
You may also wonder why the size doesn't change for something like '１'. That's U+FF11, FULLWIDTH DIGIT ONE, which int treats as equivalent to '1'. The size doesn't change here because one of the earlier steps in the int process is
asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);
which converts all whitespace characters to ' ' and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if nothing changed, but when it does make changes, it creates a new string, and the new string is the one that gets the cached UTF-8 representation.
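A sketch of that case (again, only the 0 difference matters, not the absolute sizes):

import sys

s = '１'                  # '\uff11', FULLWIDTH DIGIT ONE
before = sys.getsizeof(s)
print(int(s))             # 1: the digit is first translated to ASCII
print(sys.getsizeof(s) - before)   # 0: int worked on the new transformed
                                   # string; s was never asked for its UTF-8 form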
As for the cases where calling int on one string looks as if it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because it's a single-character string in the Latin-1 range ('\x00'-'\xff'), and the implementation stores and reuses those.
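You can observe the sharing directly; in this sketch, chr(0xf1) is just another way to produce the same one-character string, and the identity check relies on exactly the implementation detail described above:

import sys

a = 'ñ'
b = chr(0xf1)             # same code point, U+00F1
print(a is b)             # True on CPython: 1-char Latin-1 strings are shared
before = sys.getsizeof(b)
try:
    int(a)                # caches the UTF-8 form on the shared object...
except ValueError:
    pass
print(sys.getsizeof(b) - before)   # 3: ...so b looks like it grew too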