Multi-CPU bzip2 using a Python script

I want to use bzip2 to quickly compress a few hundred gigabytes of data on an 8-core workstation with 16 GB of RAM. I am currently using a simple Python script that compresses an entire directory tree with bzip2, pairing an os.walk call with os.system calls.

I see that bzip2 uses only one CPU, while the other CPUs remain mostly idle.

I am new to queues, threads, and processes, but I am wondering how I can implement this so that I have four bzip2 streams running (actually, I suppose, four os.system streams), each presumably on its own CPU, pulling files off a queue as they bzip2 them.

Here is my current single-threaded script:

import os
import sys

for roots, dirlist, filelist in os.walk(os.curdir):
    for file in [os.path.join(roots, filegot) for filegot in filelist]:
        if "bz2" not in file:
            print "Compressing %s" % (file)
            os.system("bzip2 %s" % file)
print ":DONE"
2 answers

Try this code from MRAB on comp.lang.python:

import os
import sys
from threading import Thread, Lock
from Queue import Queue

def report(message):
    # Serialize printing from multiple threads.
    mutex.acquire()
    print message
    sys.stdout.flush()
    mutex.release()

class Compressor(Thread):
    def __init__(self, in_queue, out_queue):
        Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            path = self.in_queue.get()
            sys.stdout.flush()
            if path is None:
                # A None sentinel tells this worker to stop.
                break
            report("Compressing %s" % path)
            os.system("bzip2 %s" % path)
            report("Done %s" % path)
            self.out_queue.put(path)

in_queue = Queue()
out_queue = Queue()
mutex = Lock()

THREAD_COUNT = 4
worker_list = []
for i in range(THREAD_COUNT):
    worker = Compressor(in_queue, out_queue)
    worker.start()
    worker_list.append(worker)

# Feed every not-yet-compressed file to the workers.
for roots, dirlist, filelist in os.walk(os.curdir):
    for file in [os.path.join(roots, filegot) for filegot in filelist]:
        if "bz2" not in file:
            in_queue.put(file)

# One sentinel per worker, then wait for them all to finish.
for i in range(THREAD_COUNT):
    in_queue.put(None)
for worker in worker_list:
    worker.join()
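This works with plain threads because the heavy lifting happens in an external bzip2 process, so Python's GIL is not a bottleneck while each thread waits on os.system. If you prefer not to manage the queue and sentinels yourself, a more compact variant of the same idea is possible with multiprocessing.Pool from the standard library. This is only a sketch, not part of the original answer; the compress helper and the pool size of 4 are illustrative:

import os
from multiprocessing import Pool

def compress(path):
    # Each worker shells out to bzip2 for one file.
    print("Compressing %s" % path)
    os.system("bzip2 %s" % path)
    return path

if __name__ == "__main__":
    # Collect every file that does not already look compressed
    # (same crude substring check as in the question).
    targets = []
    for roots, dirlist, filelist in os.walk(os.curdir):
        for path in (os.path.join(roots, name) for name in filelist):
            if "bz2" not in path:
                targets.append(path)

    pool = Pool(processes=4)   # four simultaneous bzip2 runs, as in the question
    for path in pool.imap_unordered(compress, targets):
        print("Done %s" % path)
    pool.close()
    pool.join()

Pool(processes=4) caps the number of simultaneous bzip2 runs at four, mirroring the four worker threads above.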

Use the subprocess module to start several processes simultaneously. Once N of them are running (N should be slightly larger than the number of processors you have, say 3 for 2 cores or 10 for 8), wait for one to finish and then start another.
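A minimal sketch of that approach, assuming subprocess.Popen and a simple polling loop (MAX_PROCS, the wait_for_slot helper, and the 0.1-second poll interval are illustrative choices, not part of the answer):

import os
import subprocess
import time

MAX_PROCS = 10   # a bit more than the core count, e.g. 10 for 8 cores
running = []     # Popen objects for the compressions currently in flight

def wait_for_slot():
    # Block until fewer than MAX_PROCS compressions are running.
    while True:
        for proc in running[:]:
            if proc.poll() is not None:   # this one has finished
                running.remove(proc)
        if len(running) < MAX_PROCS:
            return
        time.sleep(0.1)

for roots, dirlist, filelist in os.walk(os.curdir):
    for path in (os.path.join(roots, name) for name in filelist):
        if "bz2" in path:   # same crude "already compressed" check as the question
            continue
        wait_for_slot()
        print("Compressing %s" % path)
        running.append(subprocess.Popen(["bzip2", path]))

# Wait for the last few to finish.
for proc in running:
    proc.wait()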

Note that this may not help much, since much of the time will go to disk activity, which you cannot parallelize. Plenty of free memory for the filesystem cache helps.

