I am trying to find all overlapping matches in a string for the following pattern:
N - and then
not P - followed by either
S or T - and then
not P
I tried the following code:
    import re

    protein = 'MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM' +\
              'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS'
    motif = 'N[^P][ST][^P]'
    pattern = re.compile(motif)
    print([match.group(0) for match in re.finditer(pattern, protein)])
which results in ['NASQ', 'NNTS']. This is incomplete because it misses NTSY at the end of the second line of protein, which overlaps with NNTS.
Then I tried to use a negative lookahead as follows:
    motif = 'N(?!P)[ST](?!P)'

but that produced ['NT'] because (I think) a lookahead does not consume any characters.
So, how do I get all overlapping matches, to obtain the desired result ['NASQ', 'NNTS', 'NTSY']?
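
For reference, one approach that seems to fit is to wrap the whole motif in a lookahead that contains a capturing group. The match itself is then zero-width, so the scan advances one character at a time, while the group still records the four characters. This is only a sketch: the helper name find_overlapping is made up for illustration, and it assumes the motif has no capturing groups of its own.

```python
import re

protein = ('MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM'
           'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS')


def find_overlapping(motif, sequence):
    """Return every match of motif in sequence, including overlapping ones.

    Hypothetical helper: the motif is placed inside a lookahead, so each
    match consumes no characters and the next search starts one position
    later. Group 1 holds the text the motif would have matched.
    """
    pattern = re.compile('(?=(' + motif + '))')
    return [match.group(1) for match in pattern.finditer(sequence)]


print(find_overlapping('N[^P][ST][^P]', protein))
# ['NASQ', 'NNTS', 'NTSY']
```

The original character classes [^P] are kept here rather than (?!P), because a negative lookahead only checks the next character without matching it, which is why the second attempt returned two-character results.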