Find overlapping regular expression matches

Question

I am trying to find all overlapping matches of a motif in a string. The motif has the following structure:

N
then anything but P
followed by either S or T
and then anything but P

I tried the following code:

 import re

 protein = ('MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM'
            'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS')
 motif = 'N[^P][ST][^P]'
 pattern = re.compile(motif)
 print([match.group(0) for match in re.finditer(pattern, protein)])

which results in ['NASQ', 'NNTS']. This is not what I want, because it skips NTSY near the end of the second line of protein, which overlaps with NNTS.

Then I tried to use a negative lookahead as follows:

 motif = 'N(?!P)[ST](?!P)'

but this produced ['NT'], because (I think) a lookahead does not consume any characters.

So, how do I find all overlapping matches to get the desired result ['NASQ', 'NNTS', 'NTSY']?

python regex regex-lookarounds

Biogeek Aug 19 '15 at 11:25

1 answer

Maroun · Accepted Answer · 2015-08-19T11:30:37+0000

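Wrap the whole pattern in a capturing group inside a positive lookahead. The lookahead consumes no characters, so the search resumes at the very next position and overlapping matches are found: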

 >>> matches = re.findall('(?=(N[^P][ST][^P]))', protein)
 >>> [match for match in matches]
 ['NASQ', 'NNTS', 'NTSY']
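If you also need to know where each motif starts, the same lookahead works with re.finditer. This is only a minimal sketch using the protein string from the question; the helper name find_overlapping is made up for illustration and is not part of the re module.

```python
import re

protein = (
    'MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLM'
    'TVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYS'
)


def find_overlapping(motif, text):
    """Return (start, matched_text) pairs, including overlapping matches.

    Hypothetical helper: it wraps the motif in a capturing group inside
    a lookahead, as in the answer above.
    """
    # The lookahead itself matches the empty string, so the scanner moves
    # forward one character at a time; group 1 holds the motif text.
    pattern = re.compile('(?=(' + motif + '))')
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]


hits = find_overlapping('N[^P][ST][^P]', protein)
print([text for _, text in hits])  # ['NASQ', 'NNTS', 'NTSY']
```

Each start index can be sliced back out of protein to double-check a hit, e.g. protein[start:start + 4].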

You can also use the third-party regex module, which supports overlapping matches directly:

 import regex
 matches = regex.findall(r'N[^P][ST][^P]', protein, overlapped=True)
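If installing a third-party package is not an option, a plain loop over start positions gives the same kind of result with the standard library only. This is a different technique from the two above, shown as a rough sketch; the function name overlapping_matches is an assumption for illustration.

```python
import re


def overlapping_matches(pattern, text):
    """Collect the match found at each start index, using only stdlib re."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        # Pattern.match(text, pos) succeeds only if the pattern matches
        # beginning exactly at index pos, so every position is tried once.
        m = compiled.match(text, start)
        if m:
            found.append(m.group(0))
    return found


tail = 'GDNNTSYS'  # the end of the protein string from the question
print(overlapping_matches(r'N[^P][ST][^P]', tail))  # ['NNTS', 'NTSY']
```

Like the lookahead approach, this reports at most one match per start position.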
