Python Regular Expression Two Python Mismatches

How can I extend the code below to allow me to examine all instances where I have 2 inconsistencies or less between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example that includes only one possible inconsistency:

 >>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ' >>> re.findall(r'(?=(SSQP|[AZ]SQP|S[AZ]QP|SS[AZ]P|SSQ[AZ]))', s) ['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP'] 

Obviously, including the possibility of two inconsistencies in the above code would require a large set of brute force of all possible combinations.

How can I extend this code (or reorganize this code) to explore the possibility of two inconsistencies?

In addition, I want to change my output to get the returned numeric index (not SSQQ or SSQP ) of the exact position whose substring corresponds to the string.

+5
source share
2 answers

You do not need to use re here, you can use itertools and save a lot of memory.

You can first extract all substrings of length 4, and then compare them with your substring and simply select those that have less than 2 differences with your substring :

 from itertools import izip,islice,tee def sub_findre(s,substring,diffnumber): sublen=len(substring) zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s))) for z in zip_gen: l,z=tee(z) if sum(1 for i,j in l if i==j)>=sublen-diffnumber: new=izip(*z) next(new) yield ''.join(next(new)) 

Demo:

 s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ' substring='SSQP' print list(sub_findre(s,substring,2)) ['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ'] 

If you want to return the indexes, you need to put the indexes in izip , which you can use itertools.repeat() to repeat the index with the substring length:

 from itertools import izip,islice,tee,repeat def sub_findre(s,substring,diffnumber): sublen=len(substring) zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s))) for z in zip_gen: l,z=tee(z) if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber: new=izip(*z) next(new) next(new) yield next(new)[0] 

Demo:

 print list(sub_findre(s,substring,2)) [0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67] 
+5
source

A combinatorial explosion is not so bad for two out of four inconsistencies.

First, note that you can omit SSQP itself, as it covers all the milder cases.

 re.findall(r'(?=([AZ]SQP|S[AZ]QP|SS[AZ]P|SSQ[AZ]))', s) 

So the number of cases

  4! C(4, 1) = ––––––––––––– = 4 1! * (4 - 1)! 

For two inconsistencies, the number of cases

  4! C(4, 2) = ––––––––––––– = 6 2! * (4 - 2)! 

Namely,

 re.findall('(?=(SS..|SQ|S..P|.SQ.|.SP|..QP))', s) 

(To simplify the illustration, I took the responsibility of writing . Instead of [AZ] .)


To get match positions instead of match text:

 [match.start(0) for match in re.finditer('(?=(SS..|SQ|S..P|.SQ.|.SP|..QP))', s)] 
+2
source

All Articles