I have a huge pipeline written in Python that works with very large gzipped files (~14 GB compressed), and at certain points I need to send some of the lines to an external program (formatdb from blast-legacy 2.2.26). I have a Perl script someone wrote for me a long time ago that does this very quickly, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and it has to stay that way. The Perl script uses two file handles: one reads the output of zcat on the .gz file, and the other writes the lines the software needs (2 out of every 4) to the software's stdin. This comes from bioinformatics, but no domain experience is needed: the file is in FASTQ format and the software needs it in FASTA format. Every 4 lines is one FASTQ record; take the 1st and 2nd lines, add a '>' to the beginning of the 1st line, and that is the FASTA equivalent that formatdb will use for each record.
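To make the conversion rule concrete, here is a minimal sketch of just the FASTQ-to-FASTA transform in plain Python (the function name is mine, not part of any library):

```python
def fastq_to_fasta_lines(fastq_lines):
    """Convert a list of FASTQ lines to FASTA lines.

    Keeps the header (1st line of every 4-line record, with the
    leading '@' replaced by '>') and the sequence (2nd line); drops
    the '+' separator and the quality string.
    """
    fasta = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:        # header line: '@id ...' -> '>id ...'
            fasta.append(">" + line[1:])
        elif i % 4 == 1:      # sequence line, kept as-is
            fasta.append(line)
        # lines 3 and 4 of each record are dropped
    return fasta
```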
The Perl script looks like this:
#!/usr/bin/perl

my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";
# -i: input  -n: db name  -p: program
open($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

# Fastq => Fasta
my $localcounter = 0;
while (my $line = <$fh>) {
    if ($. % 4 == 1) {
        print $fh2 "\>" . substr($line, 1);
        $localcounter++;
    }
    elsif ($localcounter == 1) {
        print $fh2 "$line";
        $localcounter = 0;
    }
}
close $fh;
close $fh2;
exit;
It works very well. How can I do the same in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual intermediate file. All I can think of is to gzip.open the file and write the two lines of each record that I need to a new file, then use that with formatdb, but it is far too slow. Any ideas? It has to run inside the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this kind of thing in general. I assume I need some form of the subprocess module.
Here is my Python code, but again it is way too slow, and speed is critical here (HUGE files):
#!/usr/bin/env python
import gzip
import subprocess as sp
import os, sys

from Bio import SeqIO  # can parse fasta/fastq records

filename = sys.argv[1]  # reads.fastq.gz
tempFile = filename + ".temp.fasta"
outFile = open(tempFile, "w")
handle = gzip.open(filename, "r")

# Parse each fastq record; r.id and r.seq come from the
# 1st and 2nd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")
handle.close()
outFile.close()

cmd = 'formatdb -i ' + str(tempFile) + ' -n ' + filename + ' -p F '
sp.call(cmd, shell=True)
cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)
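For reference, the temp file can be avoided entirely by streaming the converted records straight into the external program's stdin with subprocess, much like the Perl script's second file handle. This is only a sketch: the function name and the generic cmd parameter are my own, and it assumes whatever command you pass (e.g. formatdb) is on PATH:

```python
import gzip
import subprocess

def fastq_gz_to_command(fastq_gz, cmd):
    """Stream a gzipped FASTQ file, converted to FASTA, into cmd's stdin.

    No intermediate file: gzip.open replaces the zcat pipe, and
    subprocess.Popen with stdin=PIPE replaces Perl's
    open($fh2, "| formatdb ...") handle. Returns the exit status.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    with gzip.open(fastq_gz, "rt") as fh:      # "rt": decompress to text
        for i, line in enumerate(fh):
            if i % 4 == 0:                     # header: '@id' -> '>id'
                proc.stdin.write((">" + line[1:]).encode())
            elif i % 4 == 1:                   # sequence line, kept as-is
                proc.stdin.write(line.encode())
            # the '+' separator and quality lines are dropped
    proc.stdin.close()                         # send EOF so the command can finish
    return proc.wait()

# For this question's use case the call would look roughly like:
# fastq_gz_to_command("reads.fastq.gz",
#                     ["formatdb", "-i", "stdin", "-n", "reads.fastq.gz", "-p", "F"])
```

Passing the command as an argument list rather than a shell string also avoids `shell=True` and any quoting issues with the filename.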