Generator using memory for element [n-1] + element [n]

I have a Python 3.x program that processes several large text files containing sizeable arrays of data, which can occasionally brush up against the memory limits of my small workstation. From some basic memory profiling, it seems that when using a generator, the memory usage of my script balloons to hold consecutive elements, using up to twice as much memory as I expect.

I made a simple, standalone example to test the generator, and I get similar results in Python 2.7, 3.3, and 3.4. My test code follows; memory_usage() is a modified version of a function from an SO question, which reads /proc/self/status and agrees with top as I watch it. resource is probably a more cross-platform method:

    import sys, resource, gc, time

    def biggen():
        sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
        for size in sizes:
            data = [1] * int(size * 1e6)
            #time.sleep(1)
            yield data

    def consumer():
        for data in biggen():
            rusage = resource.getrusage(resource.RUSAGE_SELF)
            peak_mb = rusage.ru_maxrss/1024.0
            print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
                peak_mb, len(data)/1e6))
            #print(memory_usage())

            data = None  # go
            del data     # away
            gc.collect() # please.

    # def memory_usage():
    #     """Memory usage of the current process, requires /proc/self/status"""
    #     # /questions/45609/python-equivalent-of-phps-memorygetusage/325630#325630
    #     result = {'peak': 0, 'rss': 0}
    #     for line in open('/proc/self/status'):
    #         parts = line.split()
    #         key = parts[0][2:-1].lower()
    #         if key in result:
    #             result[key] = int(parts[1])/1024.0
    #     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

    print(sys.version)
    consumer()

In practice, I will process the data coming from such a generator loop, storing only what I need and then discarding it.
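For concreteness, a minimal sketch of that pattern might look something like this (the summarize name and the per-chunk sum are just placeholders for whatever processing I actually do):

    def summarize(chunks):
        """Keep only a small derived value per chunk and drop the big list."""
        totals = []
        for data in chunks:
            totals.append(sum(data))  # store only what is needed
            del data                  # discard the large list right away
        return totals

    print(summarize(biggen()))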

When I run the above script with two large elements in series (the data size can vary quite a bit), it looks like Python computes the next element before freeing the previous one, which nearly doubles the memory usage.

    $ python genmem.py
    2.7.3 (default, Sep 26 2013, 20:08:41)
    [GCC 4.6.3]
    Peak:    7.9 MB, Data Len:    1.0 M
    Peak:   11.5 MB, Data Len:    1.0 M
    Peak:   45.8 MB, Data Len:   10.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:   10.0 M
             # ^^ not much different versus previous 10M-list
    Peak:   80.2 MB, Data Len:   10.0 M
             # ^^ same list size, but new memory peak at roughly twice the usage
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
             # ^^ and again... (20+10)*x
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
    Peak:  156.5 MB, Data Len:   20.0 M
             # ^^ and again. (20+20)*x
    Peak:  156.5 MB, Data Len:    1.0 M
    Peak:  156.5 MB, Data Len:    1.0 M

The crazy belt-and-suspenders-and-duct-tape approach of data = None, del data, and gc.collect() does nothing.

I am fairly sure the generator itself is not doubling the memory, because otherwise any single large value it yields would raise the peak usage, and the peak would appear in the same iteration as that large object; it is only large consecutive objects that double the usage.
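One rough way to double-check that it really is the previous element plus the new one coexisting (a sketch using only the standard gc module; count_big_lists and biggen_instrumented are throwaway names) is to count large lists that are alive right after a new element is built, before the generator's old reference is dropped:

    import gc

    def count_big_lists(threshold=int(1e6)):
        """Number of lists with at least `threshold` items tracked by gc."""
        return sum(1 for obj in gc.get_objects()
                   if isinstance(obj, list) and len(obj) >= threshold)

    def biggen_instrumented():
        sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
        for size in sizes:
            new_data = [1] * int(size * 1e6)  # previous `data` is still alive here
            print('big lists alive:', count_big_lists())  # expect 2 after the first pass
            data = new_data
            del new_data
            yield data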

How can I save memory?

+6
3 answers

The problem is in the generator function, specifically in the statement:

  data = [1] * int(size * 1e6) 

Suppose you still have the old content in the data variable. When this statement runs, it first computes the result, so you have 2 of these lists in memory at the same time, the old and the new. Only then is the data variable rebound to point at the new structure, and the old one can be released. Try changing the generator function to:

    def biggen():
        sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
        for size in sizes:
            data = None
            data = [1] * int(size * 1e6)
            yield data
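To see why the extra data = None line matters, here is a tiny illustration (the Tracker class is a throwaway helper, not part of the fix) showing that the right-hand side of an assignment is built before the name is rebound:

    class Tracker(object):
        """Prints when it is created and when it is garbage collected."""
        def __init__(self, label):
            self.label = label
            print('allocated', label)
        def __del__(self):
            print('freed', self.label)

    def rebind_only():
        data = Tracker('old')
        data = Tracker('new')   # 'freed old' prints only after 'allocated new'

    def clear_first():
        data = Tracker('old')
        data = None             # 'freed old' prints here, before the new allocation
        data = Tracker('new')

    rebind_only()
    clear_first()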
+1

Have you tried using the gc module? There you can get the list of objects that still reference your big data between loop iterations, check whether there are unreachable but uncollectable objects in gc.garbage, or enable some debug flags.

With luck, a simple gc.collect() call after each loop iteration might solve your problem in one line.
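For example, a rough sketch of that idea (not a drop-in fix; gc.set_debug, gc.get_referrers and gc.garbage are the standard-library pieces I mean) could look like this in your consumer loop:

    import gc

    gc.set_debug(gc.DEBUG_STATS)   # log statistics for every collection pass

    for data in biggen():
        # ... process data ...
        print('referrers:', [type(r).__name__ for r in gc.get_referrers(data)])
        del data
        gc.collect()
        print('uncollectable objects:', len(gc.garbage))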

0

Instead of this:

    data = [1] * int(size * 1e6)
    #time.sleep(1)
    yield data

Try:

  yield [1] * int(size * 1e6) 

The problem is that the generator's local variable data keeps a reference to the yielded list, preventing it from being garbage collected until the generator resumes and rebinds that reference.

In other words, running del data outside the generator has no effect on garbage collection unless that was the only remaining reference to the data. Avoiding the reference inside the generator makes that the case.
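A quick way to see the difference (just an illustrative sketch; the Tracked list subclass is made up for the demonstration) is to watch when the yielded object is actually destroyed:

    import gc

    class Tracked(list):
        """A list that reports when it is garbage collected."""
        def __init__(self, label, n):
            list.__init__(self, [1] * n)
            self.label = label
        def __del__(self):
            print('freed', self.label)

    def holds_reference():
        for label in ('a', 'b'):
            data = Tracked(label, 10)
            yield data                  # `data` stays alive inside the generator

    def drops_reference():
        for label in ('a', 'b'):
            yield Tracked(label, 10)    # no local name survives the yield

    for gen in (holds_reference, drops_reference):
        print(gen.__name__)
        for item in gen():
            del item      # the consumer drops its reference right away
            gc.collect()
            # With drops_reference, 'freed ...' has already printed by now.
            # With holds_reference, it prints only when the generator resumes
            # and rebinds (or discards) its local `data`.
            print('end of iteration')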

Addendum:

If you need to manipulate the data first, you can use a hack like this to remove the reference before yielding it:

    data = [1] * int(size * 1e6)
    # ... do stuff with data ...

    # Yield data without keeping a reference to it:
    hack = [data]
    del data
    yield hack.pop()
0
