Python top-N word count: why is the multiprocessing version slower than a single process?

I am doing word-frequency counting in Python. Here is the single-process version:

#coding=utf-8
import string
import time
from collections import Counter

starttime = time.clock()
origin = open("document.txt", 'r').read().lower()
for_split = [',','\n','\t','\'','.','\"','!','?','-', '~']
# the words below will be ignored when counting
ignored = ['the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
           'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on']
i = 0
for ch in for_split:
    origin = string.replace(origin, ch, ' ')
words = string.split(origin)
result = Counter(words).most_common(40)
for word, frequency in result:
    if not word in ignored and i < 10:
        print "%s : %d" % (word, frequency)
        i = i + 1
print time.clock() - starttime

The multiprocessing version is as follows:

#coding=utf-8
import time
import multiprocessing
from collections import Counter

for_split = [',','\n','\t','\'','.','\"','!','?','-', '~']
ignored = ['the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
           'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on']
result_list = []

def worker(substr):
    result = Counter(substr)
    return result

def log_result(result):
    result_list.append(result)

def main():
    pool = multiprocessing.Pool(processes=5)
    origin = open("document.txt", 'r').read().lower()
    for ch in for_split:
        origin = origin.replace(ch, ' ')
    words = origin.split()
    step = len(words) / 4
    substrs = [words[pos : pos + step] for pos in range(0, len(words), step)]
    result = Counter()
    for substr in substrs:
        pool.apply_async(worker, args=(substr,), callback=log_result)
    pool.close()
    pool.join()
    result = Counter()
    for item in result_list:
        result = result + item
    result = result.most_common(40)
    i = 0
    for word, frequency in result:
        if not word in ignored and i < 10:
            print "%s : %d" % (word, frequency)
            i = i + 1

if __name__ == "__main__":
    starttime = time.clock()
    main()
    print time.clock() - starttime

"document.txt" is about 22 M, my laptop has cores, 2G memory, the result of the first version is 3.27, and the second is 8.15 s, I changed several processes ( pool = multiprocessing.Pool (processes = 5) ) , from 2 to 10, the results remain almost the same, why is this the way I can get this program to run faser than one version of the process?

Tags: python, multiprocessing
1 answer

I think this is the overhead of distributing the individual word lists to the workers and collecting the results. If I run your parallel code as posted above on an example document (Dostoevsky's "Crime and Punishment"), it takes about 0.32 s to execute, while the single-process version takes only 0.09 s. If I modify the worker function to just process the string "test" instead of the actual document (still passing the real string as an argument), the execution time drops to 0.22 s. However, if I pass "test" as the argument to the map_async function, the execution time drops to 0.06 s. Therefore, I would say that in your case the execution time is limited by the inter-process communication overhead.
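A minimal sketch of that overhead experiment, assuming the same document.txt and pool setup as in the question; dummy_worker is a hypothetical helper name, and the printed timings will of course vary by machine:

#coding=utf-8
# Illustrative benchmark only (assumed code, not the answer's original): the
# worker ignores its input and counts the tiny string "test", so any
# remaining runtime is dominated by pickling and transferring the arguments.
import time
import multiprocessing
from collections import Counter

def dummy_worker(substr):
    # ignore the real data and count a constant string instead
    return Counter("test".split())

def main():
    pool = multiprocessing.Pool(processes=5)
    origin = open("document.txt", 'r').read().lower()
    words = origin.split()
    step = len(words) / 4
    substrs = [words[pos : pos + step] for pos in range(0, len(words), step)]

    # variant 1: still ship the real word lists to the workers
    start = time.clock()
    jobs = [pool.apply_async(dummy_worker, args=(substr,)) for substr in substrs]
    for job in jobs:
        job.get()
    print "real arguments, dummy work: %.2f s" % (time.clock() - start)

    # variant 2: ship only the tiny string "test", so almost nothing is transferred
    start = time.clock()
    pool.map_async(dummy_worker, ["test"] * len(substrs)).get()
    print "dummy arguments, dummy work: %.2f s" % (time.clock() - start)

    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

The gap between the two printed times is roughly the cost of serializing and sending the word lists to the worker processes, which is what dominates here.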

With the following code I get the parallel version's execution time down to 0.08 s: first, I split the file into several chunks of (almost) equal length, making sure the boundary between chunks falls on a newline. Then I pass only the offset and length of each chunk to the worker processes, and each worker opens the file, reads its chunk, processes it and returns its result. This apparently incurs significantly less overhead than distributing the strings directly through the map_async function. For larger file sizes you should see an improvement in runtime with this code. In addition, if you can tolerate small counting errors, you can skip the step that determines correct chunk boundaries and simply split the file into equally sized pieces (a sketch of that variant follows after the code below). In my example this reduces the runtime to 0.04 s, making the multiprocessing code faster than the single-process version.

#coding=utf-8
import time
import multiprocessing
import string
from collections import Counter
import os

for_split = [',','\n','\t','\'','.','\"','!','?','-', '~']
ignored = ['the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
           'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on']
result_list = []

def worker(offset, length, filename):
    # each worker opens the file itself and reads only its own chunk
    origin = open(filename, 'r')
    origin.seek(offset)
    content = origin.read(length).lower()
    for ch in for_split:
        content = content.replace(ch, ' ')
    words = string.split(content)
    result = Counter(words)
    origin.close()
    return result

def log_result(result):
    result_list.append(result)

def main():
    processes = 5
    pool = multiprocessing.Pool(processes=processes)
    filename = "document.txt"
    file_size = os.stat(filename)[6]   # st_size
    chunks = []
    origin = open(filename, 'r')
    while True:
        # read roughly file_size/processes bytes, rounded up to whole lines
        lines = origin.readlines(file_size / processes)
        if not lines:
            break
        # readlines keeps the trailing newlines, so plain concatenation
        # preserves the exact byte length of each chunk
        chunks.append("".join(lines))
    lengths = [len(chunk) for chunk in chunks]
    offset = 0
    for length in lengths:
        pool.apply_async(worker, args=(offset, length, filename,), callback=log_result)
        offset += length
    pool.close()
    pool.join()
    result = Counter()
    for item in result_list:
        result = result + item
    result = result.most_common(40)
    i = 0
    for word, frequency in result:
        if not word in ignored and i < 10:
            print "%s : %d" % (word, frequency)
            i = i + 1

if __name__ == "__main__":
    starttime = time.clock()
    main()
    print time.clock() - starttime
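For completeness, here is a rough sketch of the variant mentioned above that tolerates small counting errors: it splits the file into equally sized byte ranges without aligning chunk boundaries to newlines, so a word straddling a boundary may be miscounted. The chunk arithmetic is my own illustration, not the answer's exact code:

#coding=utf-8
# Equal-size byte chunks without newline alignment (illustrative sketch).
# Words that happen to straddle a chunk boundary can be split in two and
# counted incorrectly, which is the small error mentioned above.
import os
import multiprocessing
from collections import Counter

for_split = [',','\n','\t','\'','.','\"','!','?','-', '~']

def worker(offset, length, filename):
    f = open(filename, 'r')
    f.seek(offset)
    content = f.read(length).lower()
    f.close()
    for ch in for_split:
        content = content.replace(ch, ' ')
    return Counter(content.split())

def main():
    filename = "document.txt"
    processes = 5
    file_size = os.stat(filename).st_size
    chunk = file_size // processes
    pool = multiprocessing.Pool(processes=processes)
    jobs = []
    for i in range(processes):
        offset = i * chunk
        # the last chunk also takes any bytes left over by the integer division
        length = file_size - offset if i == processes - 1 else chunk
        jobs.append(pool.apply_async(worker, args=(offset, length, filename)))
    pool.close()
    pool.join()
    result = Counter()
    for job in jobs:
        result = result + job.get()
    for word, frequency in result.most_common(10):
        print "%s : %d" % (word, frequency)

if __name__ == "__main__":
    main()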
