Non-uniform memory consumption in Python2 dictionaries

Can anyone explain this non-monotonic dictionary memory usage in CPython 2.7?

    >>> import sys
    >>> sys.getsizeof({})
    280
    >>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
    280
    >>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6})
    1048
    >>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7})
    1048
    >>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8})
    664
    >>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9})
    664

Python 3 is reasonable here: it reports the size of {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7} as 480.

I tried this on Ubuntu 15.10 and OS X 10.11.

2 answers

TL;DR: 6- and 7-entry dict literals presize the hash table poorly, and the resize that follows quadruples its size.


When CPython 2.7 evaluates a dict literal, the opcode it uses to create the dict, before it starts filling in entries, is BUILD_MAP. It takes one argument, a hint of how many entries the dict will contain, which it uses to presize the dict:

    TARGET(BUILD_MAP)
    {
        x = _PyDict_NewPresized((Py_ssize_t)oparg);
        PUSH(x);
        if (x != NULL) DISPATCH();
        break;
    }

This is intended to minimize the number of times the dict is resized during creation, but since it doesn't account for the load factor, it doesn't quite eliminate resizes.

As the source code comments indicate, _PyDict_NewPresized is intended to "Create a new dictionary pre-sized to hold an estimated number of elements." The exact size of the hash table in the created dict depends on a number of implementation details, such as the minimum size ( #define PyDict_MINSIZE 8 ) and the requirement that the size be a power of 2 (to avoid needing division in the implementation).

For dict literals with up to 7 entries, _PyDict_NewPresized initializes an 8-entry hash table; for 8 entries, it initializes a 16-entry hash table, since the resize routine it uses always picks a capacity greater than its argument.
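That capacity choice can be sketched in a few lines of Python (a model of the 2.7 dictresize() sizing loop, not the actual C code):

```python
PyDict_MINSIZE = 8  # from dictobject.h

def presized_table_size(minused):
    """Model of CPython 2.7's dictresize() capacity selection:
    the smallest power of two strictly greater than minused,
    never below PyDict_MINSIZE."""
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize

for n in (5, 6, 7, 8):
    print(n, presized_table_size(n))  # 5 -> 8, 6 -> 8, 7 -> 8, 8 -> 16
```

Note that 8 is the first hint that bumps the table to 16 slots, because the loop stops only when the size strictly exceeds the argument.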


Dicts resize on insertion when they become at least 2/3 full. For the 6- and 7-entry dict literals, the dict starts with an 8-entry hash table, so a resize happens on the 6th insertion. The dict is small enough that the resize quadruples the size of the hash table:

 return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used); 

mp->ma_used is the number of entries used in the hash table, 6 at this point. 6 is less than 50,000, so we call dictresize(mp, 4 * 6) , which resizes the hash table to 32 entries, the smallest power of 2 greater than 24.

In contrast, for the 8-entry dict literal, the hash table started with 16 entries. The dict never becomes 2/3 full during creation, so the initial 16-entry hash table survives dict creation, and the resulting dict is smaller than with the 6- and 7-entry literals.
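Putting the pieces together, the whole mechanism can be simulated in pure Python (a sketch assuming unique keys and no deletions, not CPython's actual C code):

```python
PyDict_MINSIZE = 8

def round_up_size(minused):
    # dictresize() picks the smallest power of two strictly
    # greater than minused, but never below PyDict_MINSIZE
    size = PyDict_MINSIZE
    while size <= minused:
        size <<= 1
    return size

def literal_table_size(n):
    """Model the final hash-table size of an n-entry dict literal
    in CPython 2.7, assuming unique keys and no deletions."""
    # BUILD_MAP calls _PyDict_NewPresized(n), which only resizes
    # the initial 8-slot table when the hint is greater than 5
    size = round_up_size(n) if n > 5 else PyDict_MINSIZE
    used = 0
    for _ in range(n):
        used += 1
        # after each insert: resize once the table is >= 2/3 full
        if used * 3 >= size * 2:
            size = round_up_size(4 * used)  # used <= 50000 here
    return size

for n in range(1, 10):
    print(n, literal_table_size(n))  # 1-5 -> 8, 6-7 -> 32, 8-9 -> 16
```

On a 64-bit build with 24-byte table slots this reproduces the pattern in the question: the 8-slot table is embedded in the dict object itself (280 bytes), while the 6- and 7-entry literals end up with 32 slots (1048 bytes) and the 8- and 9-entry literals with 16 slots (664 bytes).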


Python 3 uses a different growth policy , among other changes to the dict implementation, which is why you saw different results in Python 3.


Well, I experimented a bit and noticed the following:

    dct = {'four': 3, 'three': 2, 'two': 1, 'one': 0}
    print(sys.getsizeof(dct))                             # = 272
    print(sys.getsizeof(dict(dct)))                       # = 272
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 272

    dct = {'four': 3, 'three': 2, 'five': 4, 'two': 1, 'one': 0}
    print(sys.getsizeof(dct))                             # = 272
    print(sys.getsizeof(dict(dct)))                       # = 272
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 272

    dct = {'six': 5, 'three': 2, 'two': 1, 'four': 3, 'five': 4, 'one': 0}
    print(sys.getsizeof(dct))                             # = 1040
    print(sys.getsizeof(dict(dct)))                       # = 656
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 1040

    dct = {'seven': 6, 'six': 5, 'three': 2, 'two': 1, 'four': 3, 'five': 4, 'one': 0}
    print(sys.getsizeof(dct))                             # = 1040
    print(sys.getsizeof(dict(dct)))                       # = 656
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 1040

    dct = {'seven': 6, 'six': 5, 'three': 2, 'two': 1, 'four': 3, 'five': 4, 'eight': 7, 'one': 0}
    print(sys.getsizeof(dct))                             # = 656
    print(sys.getsizeof(dict(dct)))                       # = 1040
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 1040

I'm not sure what kind of optimization is going on here, but I assume these construction paths use different heuristics for when to allocate how much memory for the hash table. For example, with eleven or more elements you get another strange mismatch:

    dct = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11}
    print(sys.getsizeof(dct))                             # = 1808
    print(sys.getsizeof(dict(dct)))                       # = 1040
    print(sys.getsizeof({k: v for k, v in dct.items()}))  # = 1040
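For what it's worth, the printed sizes do fit a simple slots-to-bytes relationship (the 272-byte base and 24 bytes per slot here are inferred from the output above for this 64-bit build, not taken from the CPython source):

```python
BASE, SLOT = 272, 24  # inferred from the printed sizes, not from the source

def modeled_getsizeof(slots):
    # an 8-slot table is embedded in the dict object itself;
    # larger tables are allocated separately at SLOT bytes per slot
    return BASE if slots == 8 else BASE + slots * SLOT

print(modeled_getsizeof(8))   # 272
print(modeled_getsizeof(16))  # 656
print(modeled_getsizeof(32))  # 1040
print(modeled_getsizeof(64))  # 1808
```

So 272, 656, 1040 and 1808 bytes correspond to 8-, 16-, 32- and 64-slot hash tables; what differs between the construction paths is which table size each one ends up with.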

So this is probably just some memory-consumption optimization that differs between the ways a dictionary can be created. Why there is this non-monotonic outlier for literal syntax with 6 or 7 elements, I don't know. Maybe some memory optimization went wrong, and it's a bug that it allocates too much memory? I haven't read the source code yet.

