Python Lists / Dictionaries vs. Numpy Arrays: Performance and Memory Management

I need to read data files iteratively and store the data in (numpy) arrays. I decided to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}.

Case 1 (lists):

Using lists (or collections.deque()) to append the new arrays of data, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I cannot free it again. Example:

    import numpy as np

    filename = 'test'   # data file with a matrix of shape (98, 56)
    nFields = 56

    # Initialize data dictionary and list of fields
    dataDict = {}       # data dictionary: each entry contains a list
    field_names = []
    for i in xrange(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = []

    # Read a data file N times (it represents reading N files);
    # the file contains 56 fields of arbitrary length in this example.
    # Each time, append the data fields to the lists (in the data dictionary).
    N = 10000
    for j in xrange(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field].append(xy[:, i])

    # Concatenate the list members (arrays) into one numpy array per field
    for key, value in dataDict.iteritems():
        dataDict[key] = np.concatenate(value, axis=0)

Calculation time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python

Case 2 (numpy arrays):

Concatenating the numpy arrays every time data is read is inefficient (each concatenation copies the whole accumulated array), but the memory stays under control. Example:

    import numpy as np

    filename = 'test'   # same data file as in Case 1
    nFields = 56

    # Initialize data dictionary and list of fields
    dataDict = {}       # data dictionary: each entry contains a numpy array
    field_names = []
    for i in xrange(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = np.array([])

    # Read a data file N times (it represents reading N files).
    # Concatenate the data fields onto the numpy arrays (in the data dictionary).
    N = 10000
    for j in xrange(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field] = np.concatenate((dataDict[field], xy[:, i]))

Calculation time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python

Question(s):

  • Is there a way to get the performance of Case 1 while keeping the memory under control as in Case 2?

  • In Case 1, the memory seems to grow when the list members are combined (np.concatenate(value, axis=0)). Any better ideas for doing this?

+8
performance python memory-management
3 answers

Here is what is going on. There is actually no memory leak. Instead, Python's memory management (possibly in conjunction with the memory management of whatever OS you are on) decides to keep the space used by the original dictionary (the one without the concatenated arrays) allocated to the program. However, that space will be reused. I verified this by doing the following:

  • Wrapping the code you gave into a function that returns dataDict.
  • Calling that function twice and assigning the results to two different variables.

When I did this, the amount of memory used only increased from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my reckoning, so this adds up. The second pristine, unconcatenated dictionary created by our function simply reused the already allocated memory.
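A minimal sketch of that experiment, with a synthetic 100-element array standing in for the file reads (the function name, field names, and sizes here are illustrative, not the asker's real data):

    import numpy as np

    def build_data_dict(nFields=56, nReads=10000):
        # Same structure as the code in the question, but with np.arange
        # standing in for np.loadtxt so the sketch is self-contained.
        dataDict = {}
        field_names = []
        for i in xrange(nFields):
            field_names.append(repr(i))
            dataDict[repr(i)] = []
        for j in xrange(nReads):
            for field in field_names:
                dataDict[field].append(np.arange(100.))
        for key, value in dataDict.iteritems():
            dataDict[key] = np.concatenate(value, axis=0)
        return dataDict

    d1 = build_data_dict()   # resident memory climbs roughly as reported above
    d2 = build_data_dict()   # climbs far less: the space freed internally is reused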

If you really can't use more than ~600 MB of memory, then I would recommend handling your Numpy arrays somewhat the way Python lists work internally: allocate an array of a certain size, and when you have used it up, create an enlarged array and copy the data over (a sketch follows below). This will reduce the number of concatenations, so it will be faster (though still not as fast as lists), while keeping the memory usage down. Of course, it is also more of a pain to implement.
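A rough sketch of that idea; the GrowableArray name and the growth factor are mine, purely for illustration:

    import numpy as np

    class GrowableArray(object):
        """Illustrative pre-allocated 1-D buffer that grows like a Python list."""
        def __init__(self, capacity=1024):
            self._data = np.empty(capacity)
            self._size = 0

        def append(self, chunk):
            needed = self._size + len(chunk)
            if needed > len(self._data):
                # Grow geometrically so copies stay rare (the factor 2 is arbitrary).
                new_data = np.empty(max(needed, 2 * len(self._data)))
                new_data[:self._size] = self._data[:self._size]
                self._data = new_data
            self._data[self._size:needed] = chunk
            self._size = needed

        def values(self):
            # View of the filled part; copy() it if the buffer will keep growing.
            return self._data[:self._size]

    # One buffer per field instead of a list of arrays:
    buf = GrowableArray()
    for j in xrange(1000):
        buf.append(np.arange(100.))
    field = buf.values()

Growing geometrically keeps the total amount of copying proportional to the final size, which is the same over-allocation trick Python lists rely on internally.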

+2

A simpler example that reproduces the increase in memory usage:

    import numpy as np

    # Initialize data dictionary and list of fields
    dataDict = {}
    nFields = 56
    field_names = []
    for i in xrange(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = []

    N = 10000
    for j in xrange(N):
        print j+1, 'of', N
        for i, field in enumerate(field_names):
            a = np.arange(100.)
            dataDict[field].append(a)

    for key, value in dataDict.iteritems():
        dataDict[key] = np.concatenate(value, axis=0)

Memory usage (top): 24753 gime_se 20 0 1056m 948m 4104 S 0 5.9 0:02.72 python

After deleting the created objects:

    del a, dataDict

Nevertheless, memory is still held:

Memory usage (top): 24753 gime_se 20 0 628m 520m 4128 S 0 3.2 0:02.76 python

Can this memory be freed?

0

I have simplified the code even further, using just Python lists and numpy arrays:

    import numpy as np

    b = []
    N = 100000
    for j in xrange(N):
        a = np.arange(1000.)
        b.append(a)
    b = np.concatenate(b, axis=0)

which consumes about twice as much memory as b = np.arange(100000000.) (~800 MB). After del a and del b, the memory used drops to ~800 MB. As Justin Peel noted, Python keeps the memory of the deleted list, but this memory gets reused: running the same code a second time takes about the same amount of memory as running it once. Whatever it is, it is caused by the concatenation of the list members. Any idea how to free the space allocated for the deleted list? Is it possible?
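One way to check that reuse claim is to watch the peak resident size across two runs; if the freed list space really is reused, the second run should not push the peak much higher. A sketch for Linux/Unix only (ru_maxrss is reported in kilobytes on Linux, bytes on macOS; the exact numbers will differ by machine):

    import resource
    import numpy as np

    def peak_mb():
        # Peak resident set size of this process so far (kB on Linux).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    def build():
        b = []
        for j in xrange(100000):
            b.append(np.arange(1000.))
        return np.concatenate(b, axis=0)

    b = build()
    print 'peak after 1st run: %.0f MB' % peak_mb()
    del b
    b = build()
    print 'peak after 2nd run: %.0f MB' % peak_mb()  # roughly unchanged if the freed space is reused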

0
