Find overlapping regular expression matches

I am trying to find all overlapping matches in a string for a motif with the following structure:

  • N
  • then any character except P
  • followed by either S or T
  • then any character except P

I tried the following code:

    import re

    protein = ('MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM'
               'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS')
    motif = 'N[^P][ST][^P]'
    pattern = re.compile(motif)
    # finditer returns only non-overlapping matches
    print([match.group(0) for match in pattern.finditer(protein)])

which results in ['NASQ', 'NNTS']. This is incomplete because it skips NTSY at the end of the second line of protein, which overlaps with NNTS.

Then I tried to use a negative lookahead as follows:

 motif = 'N(?!P)[ST](?!P)' 

but that produced ['NT'] because (I think) a lookahead does not consume any characters.

So, how do I get all overlapping matches, i.e. the desired result ['NASQ', 'NNTS', 'NTSY']?

1 answer

Wrap the whole pattern in a capturing group inside a lookahead, so that each match consumes no characters and matches can overlap:

    >>> matches = re.findall('(?=(N[^P][ST][^P]))', protein)
    >>> [match for match in matches]
    ['NASQ', 'NNTS', 'NTSY']
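To see why this works, here is a minimal sketch that also reports where each motif starts. It relies on the lookahead matching the empty string at a position, so the engine moves on by only one character before trying again. The variable names (`pattern`, `hits`) are just for illustration.

```python
import re

protein = ('MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM'
           'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS')

# The lookahead itself is zero-width; the inner capturing group records
# the four characters that would have matched at that position.
pattern = re.compile(r'(?=(N[^P][ST][^P]))')

# Collect (start index, motif) pairs; group(1) is the captured text.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(protein)]

for start, found in hits:
    print(start, found)
```

Since group(0) is always the empty string here, the motif has to be read from group(1).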

You can also use the third-party regex module, which supports overlapping matches directly:

    import regex

    matches = regex.findall(r'N[^P][ST][^P]', protein, overlapped=True)
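If installing an extra module is not an option, one alternative (not part of the answer above, just a sketch) is to try the plain pattern at every start index with the standard re module. The helper name `find_overlapping` is hypothetical.

```python
import re


def find_overlapping(pattern, text):
    """Hypothetical helper: return every match of pattern, overlaps included."""
    compiled = re.compile(pattern)
    results = []
    for pos in range(len(text)):
        # Pattern.match(text, pos) only succeeds if a match begins exactly at pos.
        m = compiled.match(text, pos)
        if m:
            results.append(m.group(0))
    return results


print(find_overlapping('N[^P][ST][^P]', 'DNNTSYS'))
```

This attempts a match at every position, so it is likely slower than the lookahead version on long sequences, but it keeps the pattern itself unchanged.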