re.finditer() returns the same value for the start and end methods

I am having problems with the re.finditer() function in Python. For instance:

    >>> import re
    >>> sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
    >>> [[m.start(), m.end()] for m in re.finditer(r'(?=gatttaacg)', sequence)]
    [[22, 22]]

As you can see, the start() and end() methods give the same value. I noticed this before and ended up using m.start() + len(query_sequence) instead of m.end(), but I am very confused about why this is happening.

+6
4 answers

The third-party regex module supports overlapping matches with finditer:

    import regex  # third-party package, not the standard-library re

    sequence = 'acaca'
    print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
    # [[0, 3], [2, 5]]
+4
    import re

    sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
    print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])

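Remove the lookahead. A lookahead is a zero-width assertion: it only checks that the pattern follows and consumes no characters, so the match is empty and start() equals end().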

Output: [[22, 31]]
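
To make the difference concrete, here is a minimal sketch that runs the same query with and without the lookahead. The variable names plain and ahead are only illustrative.

```python
import re

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'

# Plain pattern: the match consumes the nine characters of the query.
plain = re.search(r'gatttaacg', sequence)

# Lookahead: the match is the empty string at the position where the query starts.
ahead = re.search(r'(?=gatttaacg)', sequence)

print(plain.span(), repr(plain.group()))  # (22, 31) 'gatttaacg'
print(ahead.span(), repr(ahead.group()))  # (22, 22) ''
```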

If you need the lookahead, for example to find overlapping matches, compute the end position yourself:

    import re

    sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
    print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])
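
Why the lookahead matters for overlaps can be seen on a short string. This is a small sketch using the 'acaca' example from the other answer; the names query, non_overlapping and overlapping are made up for illustration.

```python
import re

sequence = 'acaca'
query = 'aca'

# Without a lookahead each match consumes its characters,
# so the second 'aca' (which shares the middle 'a') is skipped.
non_overlapping = [m.span() for m in re.finditer(query, sequence)]

# With a lookahead nothing is consumed, so every start position is tested.
overlapping = [(m.start(), m.start() + len(query))
               for m in re.finditer('(?=' + query + ')', sequence)]

print(non_overlapping)  # [(0, 3)]
print(overlapping)      # [(0, 3), (2, 5)]
```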
+2

As already pointed out, you need to find overlapping matches, which requires a lookahead. However, you seem to know the exact string you are looking for. How about this?

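    import re

    def find_overlapping(sequence, matchstr):
        # re.escape keeps matchstr literal, so len(matchstr) is the real match length
        pattern = '(?={})'.format(re.escape(matchstr))
        for m in re.finditer(pattern, sequence):
            yield (m.start(), m.start() + len(matchstr))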

Alternatively, you can use the third-party Python regex module, as described here.
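
Since the query is a fixed string, the same spans could also be collected without any regex. The following is only a sketch of that alternative, based on str.find rather than re.finditer; the helper name find_overlapping_literal is hypothetical.

```python
def find_overlapping_literal(sequence, matchstr):
    """Yield (start, end) for every occurrence of matchstr, overlaps included."""
    start = sequence.find(matchstr)
    while start != -1:
        yield (start, start + len(matchstr))
        # Resume one character later so overlapping occurrences are found too.
        start = sequence.find(matchstr, start + 1)


print(list(find_overlapping_literal('acaca', 'aca')))  # [(0, 3), (2, 5)]
```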

+1

If the length of the subsequence is not known in advance, you can put a capturing group inside the lookahead and take its span:

    [m.span(1) for m in re.finditer(r'(?=(gatttaacg))', sequence)] == [(22, 31)]

For example, to find all runs of repeated characters:

    [m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))', sequence)]
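
One thing to be aware of with the repeated-character pattern: because the lookahead is tried at every position, the tail of a longer run is reported too. A small sketch on a made-up string ('aaccct' is chosen here only for illustration), compared against the same pattern without a lookahead:

```python
import re

sequence = 'aaccct'

# Capturing group inside the lookahead: every start position is tried,
# so 'cc' inside 'ccc' is reported in addition to 'ccc' itself.
with_lookahead = [m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))', sequence)]

# No lookahead: each run is consumed once, so only maximal runs are returned.
maximal_runs = [m.span() for m in re.finditer(r'([acgt])\1+', sequence)]

print(with_lookahead)  # [(0, 2), (2, 5), (3, 5)]
print(maximal_runs)    # [(0, 2), (2, 5)]
```

If only the maximal runs are wanted, the version without the lookahead is probably the simpler choice.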
+1
