Python Multiprocessing Memory Usage

I wrote a program that can be summarized as follows:

import multiprocessing

def loadHugeData():
    # load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        # process it
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data = loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()

The real code (especially writeOutput()) is much more complicated. writeOutput() only uses the values it receives as arguments (that is, it never refers to data).

The program basically loads a huge dataset into memory and processes it. Writing the output is delegated to a subprocess (it actually writes to several files, and this takes a lot of time). So each time a data item is processed, the result is sent to the subprocess through res_queue, and the subprocess in turn writes it out to the files as needed.

The subprocess does not need any way to access, read or modify the data loaded by loadHugeData(); it should only use what the main process sends it through res_queue. And that leads me to my problem and question.

It looks to me like the subprocess gets its own copy of the huge dataset when it is started (judging by memory usage in top). Is that true? And if so, how can I avoid it (essentially using double the memory)?

I am using Python 2.6 and the program runs on Linux.

python memory-management linux multiprocessing
1 answer

On Linux, the multiprocessing module is effectively based on the fork() system call, which creates a copy of the current process. Since you load the huge data before forking (i.e. before creating the multiprocessing.Process), the child process inherits a copy of the data.

However, if the operating system you are running implements COW (copy-on-write) — and Linux does — there will actually be only one copy of the data in physical memory unless you modify it in either the parent or the child process: both will share the same physical memory pages, albeit through different virtual address spaces. Even then, additional memory is allocated only for the changes, in page-size increments.
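As a side note, top's RES column counts copy-on-write pages in both the parent and the child, so it can look as if the data had been duplicated even while only one physical copy exists. A minimal sketch of a more accurate check, assuming a Linux kernel that exposes Pss ("proportional set size") in /proc/<pid>/smaps, where shared pages are divided between the processes that map them:

def pss_megabytes(pid):
    # Sum the Pss lines of /proc/<pid>/smaps; the kernel reports them in kB.
    total_kb = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Pss:'):
                total_kb += int(line.split()[1])
    return total_kb / 1024.0

If the parent's and child's Pss totals add up to roughly one copy of the dataset, the pages are being shared.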

You can avoid this situation entirely by starting the multiprocessing.Process before loading your huge data. Then the memory allocated when the data is loaded in the parent is never mapped into the child process at all (see the sketch below).
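A minimal sketch of that rearrangement, reusing the placeholder names from the question (loadHugeData(), processHugeData(), writeOutput(), outFile):

import multiprocessing

if __name__ == '__main__':
    res_queue = multiprocessing.Queue()

    # Fork the writer first, while the parent is still small: the child is
    # created before loadHugeData() runs, so its address space never
    # contains the huge dataset.
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()

    data = loadHugeData()
    processHugeData(data, res_queue)
    p.join()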

