Why does the size of this Python string change when an int conversion fails

From this tweet:

    import sys
    x = 'ñ'
    print(sys.getsizeof(x))
    int(x)  # throws an error
    print(sys.getsizeof(x))

We get 74, then 77 bytes from the two getsizeof calls.

It looks like we are adding 3 bytes to the object from the failed int call.

A few more examples from twitter (you may need to restart python to reset the size to 74):

    x = 'ñ'
    y = 'ñ'
    int(x)
    print(sys.getsizeof(y))

77

    print(sys.getsizeof('ñ'))
    int('ñ')
    print(sys.getsizeof('ñ'))

74, then 77.

+68
python string unicode python-internals
Nov 01 '17 at 19:21
2 answers

The code that converts strings to int in CPython 3.6 requests a UTF-8 form of the string to work with:

 buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen); 

and the string creates the UTF-8 representation on first request and caches it on the string object:

    if (PyUnicode_UTF8(unicode) == NULL) {
        assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
        bytes = _PyUnicode_AsUTF8String(unicode, NULL);
        if (bytes == NULL)
            return NULL;

        _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
        if (_PyUnicode_UTF8(unicode) == NULL) {
            PyErr_NoMemory();
            Py_DECREF(bytes);
            return NULL;
        }
        _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
        memcpy(_PyUnicode_UTF8(unicode),
               PyBytes_AS_STRING(bytes),
               _PyUnicode_UTF8_LENGTH(unicode) + 1);
        Py_DECREF(bytes);
    }

That accounts for the extra 3 bytes: 'ñ' encodes to 2 bytes in UTF-8, plus 1 byte for the NUL terminator.
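That arithmetic can be checked from Python. A sketch, CPython-specific: the absolute getsizeof numbers vary by version, and on newer interpreters the UTF-8 form of 'ñ' may already be cached, in which case the delta is 0 instead of 3.

```python
import sys

x = 'ñ'
before = sys.getsizeof(x)
try:
    int(x)  # fails, but caching the UTF-8 form is a side effect of the attempt
except ValueError:
    pass
after = sys.getsizeof(x)

# 'ñ' is 2 bytes in UTF-8, plus a NUL terminator: 3 extra bytes once cached
print(after - before)
print(len(x.encode('utf-8')) + 1)
```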




You might be wondering why the size does not change when the string is something like '40' or 'plain ascii text'. That's because if the string is in the "compact ASCII" representation, Python does not create a separate UTF-8 representation; it returns the ASCII representation directly, which is already valid UTF-8:

    #define PyUnicode_UTF8(op)                          \
        (assert(_PyUnicode_CHECK(op)),                  \
         assert(PyUnicode_IS_READY(op)),                \
         PyUnicode_IS_COMPACT_ASCII(op) ?               \
             ((char*)((PyASCIIObject*)(op) + 1)) :      \
             _PyUnicode_UTF8(op))

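The observable consequence of that shortcut, as a quick check (CPython-specific):

```python
import sys

s = '40'                # pure ASCII: stored in the compact-ASCII layout
before = sys.getsizeof(s)
int(s)                  # the UTF-8 request is served from the ASCII data itself
after = sys.getsizeof(s)
print(before == after)  # True: no separate UTF-8 copy is allocated
```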


You may also wonder why the size does not change for something like '１' (U+FF11, FULLWIDTH DIGIT ONE), which int treats as equivalent to '1'. That is because of one of the earlier steps in the string-to-int process:

 asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u); 

which converts all whitespace characters to ' ' and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it changes nothing, but when it does make changes, it creates a new string, and the new string is the one that gets the cached UTF-8 representation.
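A sketch of that effect (CPython-specific): the conversion succeeds, but the cached UTF-8 form lands on the intermediate ASCII-ified string, not on the original:

```python
import sys

s = '\uff11'            # '１', U+FF11 FULLWIDTH DIGIT ONE
before = sys.getsizeof(s)
print(int(s))           # 1: int accepts any Unicode decimal digit
after = sys.getsizeof(s)
print(before == after)  # True: the transformed copy, not s, received the cache
```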




As for the cases where calling int on one string looks as if it affects another, those are actually the same string object. There are many conditions under which Python will reuse string objects, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because it is a single-character string in the Latin-1 range ( '\x00' - '\xff' ), and the implementation stores and reuses those.
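That single-character reuse is easy to observe (a CPython implementation detail, not a language guarantee; chr and indexing are just two ways of producing the same cached object):

```python
a = 'ñ'
b = chr(0xF1)          # built at runtime, yet CPython returns the cached object
c = 'xñ'[1]            # indexing a string also yields the cached single-char object
print(a is b, a is c)  # True True on CPython: Latin-1 singletons are shared
```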

+69
Nov 01 '17 at 19:51

According to the documentation here :

getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

But this actually has nothing to do with getsizeof() or the sys module at all; the cause is in the string's __sizeof__ method. I can reproduce the behavior without getsizeof():

    x = 'ñ'
    print(x.__sizeof__())  # 74
    int('ñ')               # raises ValueError; run interactively
    print(x.__sizeof__())  # 77

The explanation of why this happens was provided by @user2357112 in the accepted answer.

-1
Nov 01 '17 at 19:35


