I have a Python 3.x program that processes a number of large text files whose contents are big enough to bump against the memory limits of my modest workstation. From some basic memory profiling, it seems that when I use a generator, the memory footprint of my script balloons to hold consecutive elements, using about twice as much memory as I expect.
I made a simple, standalone example to test the generator, and I get similar results in Python 2.7, 3.3, and 3.4. My test code follows; memory_usage() is a modified version of the function from an SO answer, which uses /proc/self/status and agrees with top as I watch it. resource is probably a more cross-platform alternative:
import sys, resource, gc, time

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

def consumer():
    for data in biggen():
        rusage = resource.getrusage(resource.RUSAGE_SELF)
        peak_mb = rusage.ru_maxrss/1024.0
        print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
            peak_mb, len(data)/1e6))
        #print(memory_usage())

        data = None  # go
        del data     # away
        gc.collect() # please.

# def memory_usage():
#     """Memory usage of the current process, requires /proc/self/status"""
#     # /questions/45609/python-equivalent-of-phps-memorygetusage/325630#325630
#     result = {'peak': 0, 'rss': 0}
#     for line in open('/proc/self/status'):
#         parts = line.split()
#         key = parts[0][2:-1].lower()
#         if key in result:
#             result[key] = int(parts[1])/1024.0
#     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

print(sys.version)
consumer()
In practice, I will process the data coming from such a generator loop, saving only what I need from each chunk and then discarding it.
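For context, my real consumer looks roughly like the sketch below, reusing biggen() from the test script; the per-chunk summary (here just len(data)) is a placeholder for whatever small result I actually keep:

def process_all():
    """Stand-in for my real loop: keep only a tiny summary per chunk."""
    summaries = []
    for data in biggen():
        summaries.append(len(data))  # placeholder for the real per-chunk work
        # after this point nothing but 'data' should reference the big list,
        # so I expect it to be freed before the next chunk is built
    return summaries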
When I run the above script and two large elements come up back to back (the data sizes can vary quite a bit), it looks as if Python builds the next element before freeing the previous one, roughly doubling the memory usage:
$ python genmem.py
2.7.3 (default, Sep 26 2013, 20:08:41)
[GCC 4.6.3]
Peak:    7.9 MB, Data Len:    1.0 M
Peak:   11.5 MB, Data Len:    1.0 M
Peak:   45.8 MB, Data Len:   10.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:   10.0 M
              # ^^ not much different versus previous 10M-list
Peak:   80.2 MB, Data Len:   10.0 M
              # ^^ same list size, but new memory peak at roughly twice the usage
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:  118.3 MB, Data Len:   20.0 M
              # ^^ and again... (20+10)*x
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:   20.0 M
Peak:  156.5 MB, Data Len:   20.0 M
              # ^^ and again. (20+20)*x
Peak:  156.5 MB, Data Len:    1.0 M
Peak:  156.5 MB, Data Len:    1.0 M
The crazy belt-and-suspenders approach of data = None, del data, and gc.collect() does nothing.
I am fairly sure the generator itself is not doubling the memory on every yield: otherwise any single large value it produced would raise the peak usage, and the jump would appear in the same iteration as that large object. The jumps only happen for large consecutive objects.
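To illustrate the "consecutive objects" point: even after the consumer drops its reference, the suspended generator frame still references the list it just yielded, which would explain why the next chunk is built while the old one is still alive. A quick probe I put together, separate from the script above, peeking at the generator's frame purely for inspection:

def gen():
    for size in (1, 10, 10):
        data = [1] * int(size * 1e6)
        yield data  # suspended here; the frame's local 'data' stays alive

g = gen()
chunk = next(g)
del chunk  # drop the consumer's reference, as the script above does
# the suspended frame still references the list it just yielded:
print('generator frame still holds {0} items'.format(
    len(g.gi_frame.f_locals['data'])))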
How can I avoid this doubled memory usage?
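One idea I have not tried at scale: have the generator drop its own reference right after resuming, before it builds the next chunk. A sketch of what I mean (biggen_tidy is just a hypothetical variant of biggen() above); would this avoid the doubling, or am I missing something?

def biggen_tidy():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        yield data
        # on resume, release this frame's reference *before* the next
        # chunk is built, so the old and new lists never coexist here
        del data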