How to unzip multiple gz files in python using multithreading?

I have several gz files with a total size of about 120 GB. I want to decompress (gunzip) these files into the same directory and delete each gz file afterwards. We are currently doing this manually, and it takes a long time to decompress them one by one with gzip -d <filename> .
Is there a way to decompress these files in parallel with a Python script or some other technique? The files are located on a Linux machine.

python multithreading linux gzip
2 answers

You can do this very easily with a multiprocessing pool:

 import gzip
 import multiprocessing
 import shutil

 filenames = [
     'a.gz',
     'b.gz',
     'c.gz',
     ...
 ]

 def uncompress(path):
     # Note: str.rstrip('.gz') strips characters, not a suffix, and can mangle
     # names like 'log.gz', so slice off the last three characters instead.
     with gzip.open(path, 'rb') as src, open(path[:-3], 'wb') as dest:
         shutil.copyfileobj(src, dest)

 with multiprocessing.Pool() as pool:
     for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
         pass

This code will spawn several worker processes, and each process will decompress one file at a time.

Here I chose chunksize=1 to avoid stalling worker processes if some files are larger than average.
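The question also asks to delete each gz file once it has been unpacked. A minimal sketch of how that could be added (the directory path and the helper name are my own assumptions, not part of the answer above): build the file list with glob and remove the original only after the copy finished without raising.

 import glob
 import gzip
 import multiprocessing
 import os
 import shutil

 def uncompress_and_delete(path):
     # Hypothetical helper: same copy as above, then delete the source .gz.
     dest_path = path[:-3]  # strip the trailing ".gz"
     with gzip.open(path, 'rb') as src, open(dest_path, 'wb') as dest:
         shutil.copyfileobj(src, dest)
     os.remove(path)  # only reached if decompression succeeded

 if __name__ == '__main__':
     # Assumed location of the .gz files; adjust to your directory.
     filenames = glob.glob('/data/archives/*.gz')
     with multiprocessing.Pool() as pool:
         for _ in pool.imap_unordered(uncompress_and_delete, filenames, chunksize=1):
             pass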


Most of the wall-clock time spent unpacking a file with gunzip or gzip -d goes to I/O (reading from and writing to disk). This can be even more than the time spent actually decompressing the data. You can take advantage of this by running several gzip jobs in the background: while one job is blocked on I/O, another can keep running instead of waiting in a queue.

You can speed up decompression of the whole collection by running several gunzip processes in the background, each working on its own subset of the files.

You can hack something simple together in BASH: split the file list across several gzip -d commands, start each one as a background job with & , and then wait for all of the jobs to finish.

I would recommend running between 2 and 2*N jobs at the same time, where N is the number of cores or logical processors on your machine. Experiment to find the right number.

For example:

 #!/bin/bash

 argarray=( "$@" )
 len=${#argarray[@]}

 # declare 4 empty array sets
 set1=()
 set2=()
 set3=()
 set4=()

 # enumerate over each argument passed to the script
 # and round-robin add it to one of the above arrays
 i=0
 while [ $i -lt $len ]
 do
     if [ $i -lt $len ]; then
         set1+=( "${argarray[$i]}" )
         ((i++))
     fi
     if [ $i -lt $len ]; then
         set2+=( "${argarray[$i]}" )
         ((i++))
     fi
     if [ $i -lt $len ]; then
         set3+=( "${argarray[$i]}" )
         ((i++))
     fi
     if [ $i -lt $len ]; then
         set4+=( "${argarray[$i]}" )
         ((i++))
     fi
 done

 # for each array, start a background job
 gzip -d "${set1[@]}" &
 gzip -d "${set2[@]}" &
 gzip -d "${set3[@]}" &
 gzip -d "${set4[@]}" &

 # wait for all jobs to finish
 wait

In the above example, the file names passed on the command line are split round-robin across four sets, and four separate background jobs are started. You can easily extend the script to run more jobs or to divide the files among them differently.
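If you would rather drive the same round-robin idea from Python instead of BASH, a rough sketch could look like the following (this assumes gzip is on the PATH; the job count of 2*N is just the rule of thumb above, so adjust it to your machine):

 import os
 import subprocess
 import sys

 # Rule of thumb from above: between 2 and 2*N jobs, N = number of logical processors.
 num_jobs = 2 * (os.cpu_count() or 1)

 # Pass the .gz files as command-line arguments to this script.
 filenames = sys.argv[1:]

 # Round-robin the files into num_jobs groups.
 groups = [filenames[i::num_jobs] for i in range(num_jobs)]

 # Start one background "gzip -d" process per non-empty group, then wait for all of them.
 procs = [subprocess.Popen(['gzip', '-d'] + group) for group in groups if group]
 for proc in procs:
     proc.wait()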

