Python dict memory usage and big data set variables

So, I am making a game in Python 3.4. In the game I need to keep track of a map. It is a map of connected rooms, starting from (0,0) and extending in every direction, generated by filtered random selection (only room types that correctly match the next position are kept in the list that is randomly chosen from).

I have several types of rooms that have a name and a list of doors:

    from collections import namedtuple

    RoomType = namedtuple('Room', 'Type,EntranceLst')
    typeA = RoomType("A", ["Bottom"])
    ...

For the display I currently keep the current position and the type of the room at that position:

    currentRoomType = typeA
    currentRoomPos = (0, 0)
    navMap = {currentRoomPos: currentRoomType}

I have a loop that generates 9,000,000 rooms to check memory usage. When I run it, it uses between 600 and 800 MB. I was wondering if there is a way to optimize this.
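The loop is roughly this shape (simplified: the real generator filters the candidate room types so their doors match the neighbouring rooms, and typeB here is just a stand-in for the other types):

    import random
    from collections import namedtuple

    RoomType = namedtuple('Room', 'Type,EntranceLst')
    typeA = RoomType("A", ["Bottom"])
    typeB = RoomType("B", ["Top", "Left"])

    navMap = {}
    for x in range(3000):
        for y in range(3000):            # 3000 * 3000 = 9,000,000 rooms
            # The real code picks randomly from the room types that fit here
            navMap[(x, y)] = random.choice([typeA, typeB])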

Instead of doing

 navMap = {currentRoomPos: currentRoomType} 

I tried doing

 navMap = {currentRoomPos: "A"} 

but it made no real difference in memory usage.

Now I am wondering whether I can - and should - store a list of all types and, for each type, the positions at which it occurs. I don't know, however, whether that makes a difference with how Python manages its variables.

This is largely a mental experiment, but if something useful comes from it, I will probably implement it.

1 answer

You can use sys.getsizeof(object) to get the size of a Python object. However, be careful when calling sys.getsizeof on containers: it only reports the size of the container itself, not of its contents - see this recipe for an explanation of how to get the total size of a container including its contents. In this case we do not need to go that deep: we can simply add up the size of the container and the sizes of its contents by hand.
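For completeness, a minimal recursive sizer in the spirit of that recipe (not the recipe itself) might look like this; it counts each distinct object once:

    import sys

    def total_size(obj, seen=None):
        """Rough total size of obj in bytes, recursing into common containers."""
        if seen is None:
            seen = set()
        if id(obj) in seen:              # count each object only once
            return 0
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(total_size(k, seen) + total_size(v, seen)
                        for k, v in obj.items())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            size += sum(total_size(item, seen) for item in obj)
        return size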

The sizes of the types involved:

    >>> import sys
    # Room type size
    >>> sys.getsizeof(RoomType("A", ["Bottom"])) + sys.getsizeof("A") + sys.getsizeof(["Bottom"]) + sys.getsizeof("Bottom")
    233
    # Position size
    >>> sys.getsizeof((0, 0)) + 2 * sys.getsizeof(0)
    120
    # Single-character string size
    >>> sys.getsizeof("A")
    38

Consider the different options if you have N rooms:

  • Dictionary from position -> room_type. This involves storing N*(size(position) + size(room_type)) = 353*N bytes in memory.
  • Dictionary from position -> 1-character string. This involves storing N*158 bytes in memory.
  • Dictionary from type -> set of positions. This involves storing N*120 bytes, plus a tiny overhead for the dictionary keys (one per type). All three layouts are sketched after this list.
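In code, the three layouts would look something like this (the positions and type codes are purely illustrative, with typeA and typeB as in the question):

    # Option 1: position -> RoomType namedtuple
    navMap = {(0, 0): typeA, (0, 1): typeB}

    # Option 2: position -> one-character type code
    navMap = {(0, 0): "A", (0, 1): "B"}

    # Option 3: type code -> set of positions
    navMap = {"A": {(0, 0)}, "B": {(0, 1)}}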

In terms of memory usage, the third option is clearly the best. However, as often happens, there is a trade-off between memory and CPU time, so it is worth dwelling briefly on the computational complexity of the queries you are likely to make. To find the type of a room given its position, with each of the three options above you would have to:

  • Look up the position in the dictionary. This is an O(1) lookup, so the time taken is (roughly) constant regardless of the number of rooms.
  • Same as above.
  • Go through each type and, for each type, ask whether the position is in that type's set of positions. This is an O(n_types) lookup, i.e. the time it takes is proportional to the number of types you have; a sketch of this lookup follows the list. Note that if you used a list instead of a set to store the positions of each type, this would grow to O(n_rooms * n_types), which would kill your performance.
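A lookup for the third layout could be something like this (navMapByType being the type -> set-of-positions dictionary from above):

    def room_type_at(navMapByType, pos):
        """Return the type of the room at pos, or None; O(number of types)."""
        for room_type, positions in navMapByType.items():
            if pos in positions:         # O(1) because positions is a set
                return room_type
        return None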

As always when optimizing, it is important to consider the effect on both memory usage and CPU time; the two often pull in opposite directions.

Alternatively, if your map is fairly rectangular, you might consider storing the types in a two-dimensional numpy character array. I believe that would be considerably more efficient: each character in the numpy array is a single byte, so the memory usage would be much lower, and the lookup from room position to type would still be O(1):

    # Generate a 20 x 10 rectangular map of single-byte characters
    >>> import numpy as np
    >>> map = np.repeat(b'a', 200).reshape(20, 10)
    >>> map.nbytes
    200                                  # i.e. 1 byte per character.
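Since the map grows from (0,0) in every direction, you would need to offset the coordinates to get array indices; something like the following, where the array shape and offsets are hypothetical and depend on how far your map actually extends:

    import numpy as np

    X_OFFSET, Y_OFFSET = 10, 5                        # hypothetical extent into negative coordinates
    room_types = np.full((20, 10), b'?', dtype='S1')  # b'?' marks unvisited rooms

    def set_room(x, y, type_code):
        room_types[x + X_OFFSET, y + Y_OFFSET] = type_code

    def get_room(x, y):
        return room_types[x + X_OFFSET, y + Y_OFFSET]

    set_room(-3, 2, b'A')
    print(get_room(-3, 2))                            # b'A'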

Some additional small-scale optimizations:

Encode the room type as an int rather than a string: small ints are 24 bytes each, while single-character strings are 38 bytes.
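For example (the particular codes are arbitrary):

    # Room type codes as small ints instead of one-character strings
    TYPE_A, TYPE_B = 0, 1
    navMap = {(0, 0): TYPE_A, (0, 1): TYPE_B}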

Encode each position as a single int rather than a tuple. For instance:

    # Random position
    xpos = 5
    ypos = 92
    # Encode the position as a single int, using the high-order digits for x
    # and the low-order digits for y
    pos = xpos * 1000 + ypos
    # Recover the x and y values of the position.
    xpos = pos // 1000
    ypos = pos % 1000

Note that this hurts readability, so it is only worth doing if you want to squeeze out the last bit of performance. In practice you would probably use a power of 2 rather than a power of 10 as the multiplier (though a power of 10 helps with debugging and readability). This reduces the size of each position from 120 bytes to 24. If you go down this route, consider defining a Position class with __slots__ (to tell Python how to allocate its memory) and adding xpos and ypos properties to it, so you do not clutter your code with pos // 1000 and pos % 1000.
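A minimal sketch of such a class, assuming the factor-of-1000 encoding above and non-negative coordinates below 1000:

    class Position:
        __slots__ = ('_pos',)            # no per-instance __dict__, so instances stay small

        def __init__(self, xpos, ypos):
            self._pos = xpos * 1000 + ypos

        @property
        def xpos(self):
            return self._pos // 1000

        @property
        def ypos(self):
            return self._pos % 1000

        # Make two Positions for the same room interchangeable as dict keys
        def __eq__(self, other):
            return isinstance(other, Position) and self._pos == other._pos

        def __hash__(self):
            return hash(self._pos)

Whether you then store the wrapped object or the raw encoded int as the dictionary key is a readability versus memory trade-off of its own.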
