I am looking for a regular expression that matches a sequence in which an integer in the text specifies the number of trailing letters at the end of the match. This specific example is for identifying insertions and deletions in genetic data in pileup format.
For instance:
If the text I'm searching is the following:
AtT+3ACGTTT-1AaTTa
I need to match the insertion and the deletion, which in this case are +3ACG and -1A. The integer (n) part can be any integer of 1 or greater, and I have to capture the n trailing characters.
I can match a single-base insertion or deletion with [+-]?[0-9]+[ACGTNacgtn] , but I cannot figure out how to capture the exact number of trailing ACGTN characters given by the integer.
Sorry if there is an obvious answer here; I have been searching for a while. Thanks!
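As far as I know, Python's re module cannot take a repeat count from a captured group, so a single pattern cannot express "n bases, where n was just matched". One possible way around this, as a rough sketch only, is to match just the sign and the integer, then slice the n following characters by hand. The names `INDEL_HEADER` and `find_indels` are made up for this illustration, and the sketch assumes the bases are limited to ACGTN in either case:

```python
import re

# Matches only the sign and the length; the bases are sliced out separately.
INDEL_HEADER = re.compile(r'([+-])(\d+)')

def find_indels(text):
    """Return (start, end, indel) tuples for each indel found in text."""
    indels = []
    pos = 0
    while True:
        m = INDEL_HEADER.search(text, pos)
        if m is None:
            break
        n = int(m.group(2))
        end = m.end() + n
        bases = text[m.end():end]
        # Accept only a complete run of exactly n valid bases.
        if len(bases) == n and re.fullmatch(r'[ACGTNacgtn]+', bases):
            indels.append((m.start(), end, text[m.start():end]))
            pos = end        # resume scanning after the whole indel
        else:
            pos = m.end()    # not a valid indel; keep scanning

    return indels

print(find_indels('AtT+3ACGTTT-1AaTTa'))
# [(3, 8, '+3ACG'), (11, 14, '-1A')]
```

Resuming the scan at the end of each indel keeps the bases of one indel from being re-examined as the start of another.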
(UPDATE)
I usually work in Python. The only workaround I could figure out with the re module in Python is to get both the integers and the span of each indel, and combine the two to extract the appropriate length of text.
For instance:
>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a)  # returns a list of the integers (as strings)
>>> spans = [i.span() for i in re.finditer(expr, a)]
>>> newspans = [(spans[i][0], spans[i][1] + (int(ints[i]) - 1)) for i in range(len(spans))]
>>> newspans
[(14, 17), (17, 20), (20, 26)]
The resulting tuples let me slice out the indels. This is probably not the most elegant approach, but it works!
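For completeness, slicing with those tuples might look like the following. This is only a small illustration; the spans are copied from the output shown earlier rather than recomputed, and the variable name `indels` is my own:

```python
a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
newspans = [(14, 17), (17, 20), (20, 26)]  # spans copied from the session above

# Each (start, end) pair slices one full indel out of the string.
indels = [a[start:end] for start, end in newspans]
print(indels)  # ['-1A', '+1G', '+4ATCG']
```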