Perform operations with a large dataset

Question

Perform operations with a large dataset

I need to do some analysis of a PSL record that contains information about fragments of a DNA sequence. Basically I have to find records that are from the same read in the same contig (these are both values in the PSL record). The problem is that PSL records are large (text documents 10-30 MB). I wrote a program that works with short recordings and long recordings, enough time, but it took longer than indicated. I was told that the program should not take more than ~ 15 seconds. It took me more than 15 minutes.

PSL entries are as follows:

275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 286, 0, 408, 171 5 0 0 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:13497:2001/2 294 0 176 NODE_500869_length_34598_cov_30.643419 34656 34334 34510 1 176, 0, 34334, 188 14 0 10 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:18225:2002/1 257 45 257 NODE_455027_length_12018_cov_13.759444 12076 11322 11534 1 212, 45, 11322,

My code is as follows:

 import sys class PSLreader : ''' Class to provide reading of a file containing psl alignments formatted sequences: object instantiation: myPSLreader = PSLreader(<file name>): object attributes: fname: the initial file name methods: readPSL() : reads psl file, yielding those alignments that are within the first or last 1000 nt readPSLpairs() : yields psl pairs that support a circular hypothesis Author: David Bernick Date: May 12, 2013 ''' def __init__ (self, fname=''): '''contructor: saves attribute fname ''' self.fname = fname def doOpen (self): if self.fname is '': return sys.stdin else: return open(self.fname) def readPSL (self): ''' using filename given in init, returns each filtered psl records that contain alignments that are within the terminal 1000nt of the target. Incomplete psl records are discarded. If filename was not provided, stdin is used. This method selects for alignments that could may be part of a circle. Illumina pairs aligned to the top strand would have read1(+) and read2(-). For the bottoms trand, read1(-) and read2(+). For potential circularity, these are the conditions that can support circularity: read1(+) near the 3' terminus read1(-) near the 5' terminus read2(-) near the 5' terminus read2(+) near the 3' terminus so... any read(+) near the 3', or any read(-) near the 5' ''' nearEnd = 1000 # this constant determines "near the end" with self.doOpen() as fileH: for line in fileH: pslList = line.split() if len(pslList) < 17: continue tSize = int(pslList[14]) tStart = int(pslList[15]) strand = str(pslList[8]) if strand.startswith('+') and (tSize - tStart > nearEnd): continue elif strand.startswith('-') and (tStart > nearEnd): continue yield line def readPSLpairs (self): read1 = [] read2 = [] for psl in self.readPSL(): parsed_psl = psl.split() strand = parsed_psl[9][-1] if strand == '1': read1.append(parsed_psl) elif strand == '2': read2.append(parsed_psl) output = {} for psl1 in read1: name1 = psl1[9][:-1] contig1 = psl1[13] for psl2 in read2: name2 = psl2[9][:-1] contig2 = psl2[13] if name1 == name2 and contig1 == contig2: try: output[contig1] += 1 break except: output[contig1] = 1 break print(output) PSL_obj = PSLreader('EEV14-Vf.filtered.psl') PSL_obj.readPSLpairs()

I was provided with sample code that looks like this:

 def doSomethingPairwise (a): for leftItem in a[1]: for rightItem in a[2]: if leftItem[1] is rightItem[1]: print (a) thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2], ['John', 'violin', 1], ['John', 'oboe', 2], ['Patrick', 'theremin', 1], ['Patrick', 'lute',2] ] thisGroup = None thisGroupList = [ [], [], [] ] for name, instrument, num in thisStream: if name != thisGroup: doSomethingPairwise(thisGroupList) thisGroup = name thisGroupList = [ [], [], [] ] thisGroupList[num].append([name, instrument, num]) doSomethingPairwise(thisGroupList)

But when I tried to implement it, my program still took a lot of time. Am I thinking about it wrong? I understand that the nested loop is slow, but I see no alternative.

Edit: I realized the data was pre-edited, which made my brute force solution very impractical and unnecessary.

+5

python python-3.x bioinformatics

Sam May 26 '15 at 2:21

source share

1 answer

Jose Ricardo Bustos M. · Answer 1 · 2015-05-26T20:58:08+0000

Hope this helps, as this question needs a better sample input file

 #is better create PSLRecord class class PSLRecord: def __init__(self, line): pslList = line.split() properties = ("matches", "misMatches", "repMatches", "nCount", "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert", "strand", "qName", "qSize", "qStart", "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount", "blockSizes", "qStarts", "tStarts") self.__dict__.update(dict(zip(properties, pslList))) class PSLreader : def __init__ (self, fname=''): self.fname = fname def doOpen (self): if self.fname is '': return sys.stdin else: return open(self.fname) def readPSL (self): with self.doOpen() as fileH: for line in fileH: pslrc = PSLRecord(line) yield pslrc #return a dictionary with all psl records group by qName and tName def readPSLpairs (self): dictpsl = {} for pslrc in self.readPSL(): #OP requirement, remove '1' or '2' char, in pslrc.qName[:-1] key = (pslrc.qName[:-1], pslrc.tName) if not key in dictpsl: dictpsl[key] = [] dictpsl[key].append(pslrc) return dictpsl #Function filter .... is better out and self-contained def f_filter(pslrec, nearEnd = 1000): if (pslrec.strand.startswith('+') and (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)): return False if (pslrec.strand.startswith('-') and (int(pslrec.tStart) > nearEnd)): return False return True PSL_obj = PSLreader('EEV14-Vf.filtered.psl') #read dictionary of pairs dictpsl = PSL_obj.readPSLpairs() from itertools import product #product from itertools #(1) x (2,3) = (1,2),(1,3) output = {} for key, v in dictpsl.items(): name, contig = key #i get filters aligns in principal strand strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and pslrec.qName[-1] == '1'] #i get filters aligns in secondary strand strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and pslrec.qName[-1] == '2'] for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec): #This For has fewer comparisons, since I was grouped before if not contig in output: output[contig] = 1 output[contig] += 1

Note: 10-30 Mb is not a big file if you ask me

Perform operations with a large dataset

More articles: