Python equivalent to piping zcat result to filehandle in Perl

I have a large pipeline written in Python that works with very large .gz files (~14 GB compressed), and certain lines need to be fed to external software (formatdb from blast-legacy/2.2.26). I have a Perl script someone wrote for me a long time ago that does this very quickly, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and it has to stay that way. The Perl script uses two filehandles: one reads the zcat output of the .gz file, and the other writes the lines the software needs (2 out of every 4) to that software's stdin. This involves bioinformatics, but no domain experience is required. The file is in FASTQ format, and the software needs it in FASTA format. Every 4 lines is a FASTQ record: take the 1st and 2nd lines of each record and add a '>' to the beginning of the 1st, and that is the FASTA equivalent that formatdb will use for each record.

The Perl script looks like this:

    #!/usr/bin/perl

    my $SRG = $ARGV[0]; # reads.fastq.gz

    open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";
    # -i: input  -n: db name  -p: program
    open($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

    # Fastq => Fasta
    my $localcounter = 0;
    while (my $line = <$fh>){
        if ($. % 4 == 1){
            print $fh2 "\>" . substr($line, 1);
            $localcounter++;
        }
        elsif ($localcounter == 1){
            print $fh2 "$line";
            $localcounter = 0;
        }
        else{
        }
    }
    close $fh;
    close $fh2;
    exit;

It works very well. How can I do the same thing in Python? I like how Perl can use those filehandles, but I'm not sure how to do it in Python without creating an actual file. All I can think of is gzip.open-ing the file, writing the two lines of each record that I need to a new file, and using that with formatdb, but it is much too slow. Any ideas? I need this to run inside the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I need some form of the subprocess module.

Here is my Python code, but again it is way too slow, and speed is critical here (HUGE files):

    #!/usr/bin/env python
    import gzip
    from Bio import SeqIO  # can parse fasta/fastq records
    import subprocess as sp
    import os, sys

    filename = sys.argv[1]  # reads.fastq.gz
    tempFile = filename + ".temp.fasta"
    outFile = open(tempFile, "w")
    handle = gzip.open(filename, "r")

    # parse each fastq record;
    # r.id and r.seq come from the 1st and 2nd lines of each record
    for r in SeqIO.parse(handle, "fastq"):
        outFile.write(">" + str(r.id) + "\n")
        outFile.write(str(r.seq) + "\n")

    handle.close()
    outFile.close()

    cmd = 'formatdb -i ' + tempFile + ' -n ' + filename + ' -p F'
    sp.call(cmd, shell=True)
    os.remove(tempFile)  # instead of shelling out to 'rm'
2 answers

First, there is a much better solution for both Perl and Python: just use a gzip library. In Python, it's in the stdlib; in Perl, you can find one on CPAN. For instance:

    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            do_stuff(line)

Much simpler, more efficient, and more portable than shelling out to zcat.
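Combining the gzip library with a pipe into formatdb gives a streaming equivalent of the Perl script with no temporary file. The following is a sketch, not a tested run against formatdb itself: the formatdb flags are copied from the question, and `fastq_to_fasta_lines` and `formatdb_from_gz` are helper names invented here:

```python
import gzip
import subprocess

def fastq_to_fasta_lines(lines):
    """Turn an iterable of FASTQ lines into FASTA lines: keep lines
    1 (header, '@' -> '>') and 2 (sequence) of each 4-line record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield '>' + line[1:]   # '@id ...' -> '>id ...'
        elif i % 4 == 1:
            yield line             # sequence line, unchanged

def formatdb_from_gz(path):
    """Stream reads.fastq.gz straight into formatdb's stdin."""
    proc = subprocess.Popen(['formatdb', '-i', 'stdin', '-n', path, '-p', 'F'],
                            stdin=subprocess.PIPE)
    with gzip.open(path, 'rt') as f:
        for out in fastq_to_fasta_lines(f):
            proc.stdin.write(out.encode())
    proc.stdin.close()
    return proc.wait()
```

Only one 4-line record is ever in memory at a time, which is the same property that makes the Perl version fast.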


But if you really want to launch the subprocess and manage its pipes in Python, you can do that with the subprocess module. And, unlike Perl, Python can do this without having to embed a shell in the middle. There's even a good section in the docs, "Replacing Older Functions with the subprocess Module", which gives you recipes.

So:

 zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE) 

Now zcat.stdout is a file-like object with the usual read methods, etc., wrapping the pipe to the zcat subprocess.

So, for example, to read the binary 8K at a time in Python 3.x:

    import functools
    import subprocess

    zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
    for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
        do_stuff(chunk)
    zcat.wait()

(If you want to do this in Python 2.x, or read a text file one line at a time instead of 8K of binary at a time, or whatever, the changes are the same as for any other file-handling code.)
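For the one-line-at-a-time text case in Python 3, one option is to wrap the binary pipe in io.TextIOWrapper so it decodes on the fly; this is an illustrative sketch, and `zcat_lines` is a made-up helper name:

```python
import io
import subprocess

def zcat_lines(path, encoding='utf-8'):
    """Yield decoded text lines from `zcat path`, reading the pipe lazily."""
    proc = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
    try:
        # TextIOWrapper turns the bytes pipe into a text stream,
        # so iteration produces str lines instead of bytes.
        for line in io.TextIOWrapper(proc.stdout, encoding=encoding):
            yield line
    finally:
        proc.wait()
```

Each iteration pulls one decoded line from the decompressed stream, so the whole file is never materialized in memory.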


You can parse the entire file and load it as a list of lines using this function:

    import gzip

    def convert_gz_to_list_of_lines(filepath):
        """Parse a gz file and convert it into a list of lines."""
        file_as_list = []
        with gzip.open(filepath, 'rt', encoding='utf-8') as f:
            try:
                for line in f:
                    file_as_list.append(line)
            except EOFError:
                # truncated archive: keep whatever was read so far
                pass
        return file_as_list
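A quick usage sketch (with an inline copy of the function so the snippet runs on its own; the sample filename is a placeholder). Note that this returns the entire decompressed file as one list, which may be impractical for the ~14 GB inputs from the question:

```python
import gzip

# Inline copy of the answer's function so this snippet is self-contained.
def convert_gz_to_list_of_lines(filepath):
    file_as_list = []
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:  # truncated archive: keep what was read
            pass
    return file_as_list

# Write a tiny two-line FASTQ fragment so the call below is reproducible.
with gzip.open('example.fastq.gz', 'wt', encoding='utf-8') as f:
    f.write('@r1\nACGT\n')

print(convert_gz_to_list_of_lines('example.fastq.gz'))  # ['@r1\n', 'ACGT\n']
```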
