This answer handles the updated annos dictionary from the comment on the cdlanes answer. That answer leaves the annos dictionary with an invalid index pair, [2,1], for gene2. My proposed solution removes the gene entry from the dictionary if the sequence consists ENTIRELY of gaps in that region. Note also that if a gene contains only one letter in the final alignment, then anno[geneX] will have the same index for start and stop; see seq3 gene1 for the annos from your comment.
    import re

    align = {"seq1": "ATGCATGC",
             "seq2": "AT----GC",
             "seq3": "A--CA--C"}

    annos = {"seq1": {"gene1": [0, 3], "gene2": [4, 7]},
             "seq2": {"gene1": [0, 3], "gene2": [4, 7]},
             "seq3": {"gene1": [0, 3], "gene2": [4, 7]}}

    annos3 = {"seq1": {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]},
              "seq2": {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]},
              "seq3": {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]}}

    for name, anno in annos.items():
        gapped = align[name]  # keep the gapped sequence for slicing below

        # indices of the gaps that will be removed, found using re
        removed = [m.start(0) for m in re.finditer(r'-', gapped)]

        # remove the gaps from the align dictionary
        align[name] = re.sub(r'-', '', gapped)

        build_dna = ''
        # iterate over a copy so entries can be deleted inside the loop
        for gene, inds in list(anno.items()):
            start_ind = len(build_dna) + 1

            # count the '-' characters removed from this gene
            num_gaps = sum(1 for i in removed if inds[0] <= i <= inds[1])

            # build the de-gapped string from the gapped sequence
            segment = gapped[inds[0]:inds[1] + 1]
            build_dna += segment.replace("-", "")
            end_ind = len(build_dna)

            if num_gaps == len(segment):  # gene is all gaps
                del annos[name][gene]     # remove the gene entry
                continue

            # update the values in the annos dictionary
            annos[name][gene][0] = start_ind - 1
            annos[name][gene][1] = end_ind - 1
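As a variation on the loop above, here is a minimal sketch that leaves the input dictionaries untouched and remaps each gene by position instead of accumulating a string. The function name `degap_annotations` and its `gap` parameter are my own, not from the question. It assumes the same inclusive [start, stop] convention. It should also cope with genes that do not tile the whole sequence, although I have only checked it against the data in this post.

```python
def degap_annotations(align, annos, gap="-"):
    """Return (degapped_align, new_annos) without modifying the inputs."""
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # prefix[i] = number of non-gap characters in seq[:i]
        prefix = [0]
        for ch in seq:
            prefix.append(prefix[-1] + (ch != gap))

        new_align[name] = seq.replace(gap, "")
        new_annos[name] = {}
        for gene, (start, stop) in annos[name].items():
            new_start = prefix[start]        # letters that precede the gene
            new_stop = prefix[stop + 1] - 1  # last letter inside the gene
            if new_stop < new_start:         # gene is all gaps: drop it
                continue
            new_annos[name][gene] = [new_start, new_stop]
    return new_align, new_annos


align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos = {name: {"gene1": [0, 3], "gene2": [4, 7]} for name in align}
annos3 = {name: {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]}
          for name in align}

new_align, new_annos = degap_annotations(align, annos)
_, new_annos3 = degap_annotations(align, annos3)
print(new_annos)
print(new_annos3)
```

Because nothing is modified in place, the same align can be reused for both annos and annos3 without resetting it in between.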
Results:
    In [3]: annos
    Out[3]:
    {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
     'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
     'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
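The results for the three-gene annos (annos3) are below. Just replace the annos variable with annos3 in the loop, starting again from the original gapped align, because the loop overwrites align with the de-gapped sequences: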
    In [5]: annos3
    Out[5]:
    {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
     'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
     'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
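If you want to convince yourself that the updated indices are right, one option is to compare the letters each new index pair selects from the de-gapped sequence with the letters the old pair selected from the gapped one. The helper `check_annotations` below is a hypothetical name of mine, and the `new` dictionary is copied by hand from the output above. Treat it as a rough sanity check only: with repeated letters, a wrong index pair could still select matching text.

```python
def check_annotations(gapped, old_annos, new_annos, gap="-"):
    """Raise AssertionError if a remapped gene selects the wrong letters."""
    for name, genes in new_annos.items():
        degapped = gapped[name].replace(gap, "")
        for gene, (start, stop) in genes.items():
            old_start, old_stop = old_annos[name][gene]
            # letters the gene covered in the gapped alignment
            expected = gapped[name][old_start:old_stop + 1].replace(gap, "")
            # letters the updated indices select after gap removal
            found = degapped[start:stop + 1]
            assert found == expected, (name, gene, found, expected)
    return True


gapped = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
old = {name: {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]}
       for name in gapped}
new = {"seq1": {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]},
       "seq2": {"gene1": [0, 1], "gene3": [2, 3]},
       "seq3": {"gene1": [0, 0], "gene2": [1, 2], "gene3": [3, 3]}}

print(check_annotations(gapped, old, new))
```

The single-letter case shows up here too: seq3 gene1 maps to [0, 0], which selects just the "A" that remains once its two gaps are removed.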