Improved DNA alignment degassing code design

This is a question regarding more efficient code design:

Suppose that three aligned DNA sequences (seq1, seq2 and seq3, each line) represent two genes (gene1 and gene2). The starting and ending positions of these genes are known relative to aligned DNA sequences.

# Input align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length "seq2":"AT----GC", "seq3":"A--CA--C"} annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, "seq2":{"gene1":[0,3], "gene2":[4,7]}, "seq3":{"gene1":[0,3], "gene2":[4,7]}} 

I want to remove spaces (i.e. dashes) from alignment and maintain a relative relationship between the start and end positions of the genes.

 # Desired output align = {"seq1":"ATGCATGC", "seq2":"ATGC", "seq3":"ACAC"} annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, "seq2":{"gene1":[0,1], "gene2":[2,3]}, "seq3":{"gene1":[0,1], "gene2":[2,3]}} 

Getting the desired result is less trivial than it might seem. Below I wrote some (numbered lines) pseudo-code for this problem, but there is certainly a more elegant design.

 1 measure length of any aligned gene # take any seq, since all seqs aligned 2 list_lengths = list of gene lengths # order is important 3 for seq in alignment 4 outseq = "" 5 for each num in range(0, length(seq)) # weird for-loop is intentional 6 if seq[num] == "-" 7 current_gene = gene whose start/stop positions include num 8 subtract 1 from length of current_gene 9 subtract 1 from lengths of all genes following current_gene in list_lengths 10 else 11 append seq[num] to outseq 12 append outseq to new variable 13 convert gene lengths into start/stop positions and append ordered to new variable 

Can someone give me suggestions / examples for shorter and more direct code design?

+7
optimization python design algorithm dna-sequence
source share
4 answers

This answer processes the updated annos dictionary from the comment on the cdlanes answer. This answer leaves the annos dictionary with the wrong index [2,1] for gene2 . My proposed solution will remove the gene entry from the dictionary if the sequence contains ALL spaces in this region. It should also be noted that if a gene contains only one letter in the align final, then anno[geneX] will have the same indices for start and stop → See seq3 gene1 from your annos comment.

 align = {"seq1":"ATGCATGC", "seq2":"AT----GC", "seq3":"A--CA--C"} annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, "seq2":{"gene1":[0,3], "gene2":[4,7]}, "seq3":{"gene1":[0,3], "gene2":[4,7]}} annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}} import re for name,anno in annos.items(): # indices of gaps removed usinig re removed = [(m.start(0)) for m in re.finditer(r'-', align[name])] # removes gaps from align dictionary align[name] = re.sub(r'-', '', align[name]) build_dna = '' for gene,inds in anno.items(): start_ind = len(build_dna)+1 #generator to sum the num '-' removed from gene num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1]) # build the de-gapped string build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "") end_ind = len(build_dna) if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps del annos[name][gene] #remove the gene entry continue #update the values in the annos dictionary annos[name][gene][0] = start_ind-1 annos[name][gene][1] = end_ind-1 

Results:

 In [3]: annos Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}} 

The results of the 3rd annos gene are higher. Just replace the annos variable:

 In [5]: annos3 Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]}, 'seq2': {'gene1': [0, 1], 'gene3': [2, 3]}, 'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}} 
+2
source share

The following corresponds to the output of an example program for both test cases:

 align = {"seq1":"ATGCATGC", "seq2":"AT----GC", "seq3":"A--CA--C"} annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, "seq2":{"gene1":[0,3], "gene2":[4,7]}, "seq3":{"gene1":[0,3], "gene2":[4,7]}} (START, STOP) = (0, 1) for alignment, sequence in align.items(): new_sequence = '' gap = 0 for position, codon in enumerate(sequence): if '-' == codon: for gene in annos[alignment].values(): if gene[START] > (position - gap): gene[START] -= 1 if gene[STOP] >= (position - gap): gene[STOP] -= 1 gap += 1 else: new_sequence += codon align[alignment] = new_sequence 

The result of doing this:

 python3 -i test.py >>> align {'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'} >>> >>> annos {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}} >>> 

I hope this is still elegant, straightforward, short and Pythonic enough!

+1
source share

My own solution to the above question is neither elegant nor Pythonic, but comes to the desired result. Any recommendations for improvement are welcome!

 import collections import operator # measure length of any aligned gene # take any seq, since all seqs aligned align_len = len(align.itervalues().next()) # initialize output align_out, annos_out = {}, {} # loop through annos for seqname, anno in annos.items(): # operate on ordered sequence lengths instead on ranges ordseqlens = collections.OrderedDict() # generate ordered sequence lengths for k,v in sorted(anno.items(), key=operator.itemgetter(1)): ordseqlens[k] = v[1]-v[0]+1 # start (and later append to) sequence output align_out[seqname] = "" # generate R-style for-loop for pos in range(0, len(align[seqname])): if align[seqname][pos] == "-": try: current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1]) except StopIteration: print("No annotation provided for position", pos, "in sequence", seqname) # subtract 1 from lengths of current_gene ordseqlens[current_gene] = ordseqlens[current_gene]-1 # append nucleotide unless a gap else: align_out[seqname] += align[seqname][pos] # convert modified ordered sequence lengths back into start/stop positions summ = 0 tmp_dict = {} for k,v in ordseqlens.items(): tmp_dict[k] = [summ, v+summ-1] summ = v+summ # save start/stop positions to correct annos annos_out[seqname] = tmp_dict 

The output of this code is:

 >>> align_out {'seq3': 'ACAC', 'seq2': 'ATGC', 'seq1': 'ATGCATGC'} >>> annos_out {'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}} 
0
source share

So, I think that the approach of trying to break down each sequence into genes and then remove the dash leads to a lot of unnecessary accounting. Instead, it would be easier to just look at the dash and then update all indexes based on their relative positions. Here I wrote a function that works correctly:

 from copy import copy def rewriteGenes(align, annos): alignments = copy(align) annotations = copy(annos) for sequence, alignment in alignments.items(): while alignment.find('-') > -1: index = alignment.find('-') for gene, (start, end) in annotations[sequence].items(): if index < start: annotations[sequence][gene][0] -= 1 if index <= end: annotations[sequence][gene][1] -= 1 alignment = alignment[:index] + alignment[index+1:] alignments[sequence] = alignment return (alignments, annotations) 

This is repeated with a dash in each alignment and updates the gene indices as they are removed.

Note that this gives the gene with indices [2,1] for the following test case:

 align = {"seq1":"ATGCATGC", "seq2":"AT----GC", "seq3":"A--CA--C"} annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}} 

This is necessary because the way your indexes are set does not allow empty genes otherwise. For example, indices [2,2] would be a sequence of length 1, starting at index 2.

0
source share

All Articles