I wrote a program that can be summarized as follows:
import multiprocessing

def loadHugeData():
    # load the data (details omitted in this summary)
    return data

def processHugeData(data, res_queue):
    for item in data:
        # process the item (details omitted)
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data = loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()
The real code (especially writeOutput()) is much more complicated. writeOutput() uses only the values it takes as arguments (that is, it does not refer to data).
Basically, the program loads a huge dataset into memory and processes it. Writing the output is delegated to a subprocess (it actually writes to several files, and this takes a lot of time). So each time one data item is processed, the result is sent to the subprocess through res_queue, which in turn writes it out to the files as needed.
The subprocess does not need to access, read, or modify the data loaded by loadHugeData() in any way. It should only use what the main process sends it through res_queue. And that leads me to my problem and question.
It seems to me that the subprocess gets its own copy of the huge dataset (that is what I see when checking memory usage with top). Is that true? And if so, how can I avoid it (essentially using double the memory)?
I am using Python 2.6 and the program runs on Linux.
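For reference, instead of eyeballing top, the resident set size can also be printed from inside each process. The sketch below is only an illustration of that measurement (it assumes Linux's /proc filesystem; the helper name rss_kb is mine); it could be called right after loadHugeData() in the parent and at the start of writeOutput() in the child:

import os

def rss_kb():
    """Resident set size (VmRSS) of the current process in kB,
    read from /proc/self/status (Linux-specific)."""
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                # Line looks like: "VmRSS:     123456 kB"
                return int(line.split()[1])
    return -1

# Tag the reading with the pid so parent and child can be compared.
print('pid %d VmRSS: %d kB' % (os.getpid(), rss_kb()))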
python memory-management linux multiprocessing
Fableblaze