A regular expression that refers to a match from the previous part of the expression

I am looking for a regular expression that identifies a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.

For instance:

If the text I'm looking for is the following:

AtT+3ACGTTT-1AaTTa 

I need to match the insertion and the deletion, which in this case are +3ACG and -1A. The integer (n) part can be any integer of 1 or more, and I have to capture the n trailing characters.

I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn] , but I cannot figure out how to get the exact number of trailing ACGTNs given by an integer.

Sorry if there is an obvious answer here; I have been searching for a while. Thanks!

(UPDATE)

I usually work in Python. The only workaround I could figure out with the re module in Python is to get both the integers and the spans of each indel, and combine the two to extract the appropriate length of text.

For instance:

    >>> import re
    >>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
    >>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
    >>> ints = re.findall(expr, a)  # returns a list of the integers (as strings)
    >>> spans = [i.span() for i in re.finditer(expr, a)]
    >>> newspans = [(spans[i][0], spans[i][1] + (int(ints[i]) - 1)) for i in range(len(spans))]
    >>> newspans
    [(14, 17), (17, 20), (20, 26)]

The resulting tuples let me slice out the indels by index. This is probably not the best syntax, but it works!

3 answers

You can use regular expression substitution, passing a function as the replacement, for example:

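    import re

    s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"

    def dump(match):
        start, end = match.span()
        # s[start+1:end] holds the digits; extend the slice by that many characters
        print(s[start:end + int(s[start+1:end])])
        return match.group(0)  # leave the text unchanged

    re.sub(r'[-+]\d+', dump, s)

    # output
    # +3fgh
    # -1m
    # +12abcdefghijkl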
+2
source

This is not possible; regular expressions cannot "count".

But if you use a programming language that allows callbacks as a regular expression match evaluator (like C# or PHP), then what you could do is use a regular expression like [+-]?([0-9]+)([ACGTNacgtn]+) , and in the callback truncate the trailing string to the desired length.

E.g., for C#:

    // requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
    var regexMatches = new List<string>();
    Regex theRegex = new Regex(@"[+-]?([0-9]+)([ACGTNacgtn]+)");
    text = theRegex.Replace(text, delegate(Match thisMatch)
    {
        int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
        string trailingString = thisMatch.Groups[2].Value;
        // The greedy group may capture more letters than the count; keep only the first n.
        if (numberOfInsertsOrDeletes < trailingString.Length)
        {
            trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes);
        }
        regexMatches.Add(trailingString);
        return thisMatch.Groups[0].Value; // leave the text unchanged
    });
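Since the question mentions Python, the same capture-then-truncate idea might look roughly like the sketch below. It is not part of the original answer: the name `find_indels` is hypothetical, the sign is made mandatory (the pattern above leaves it optional), and it assumes each count is followed by at least that many base letters. It uses `re.finditer` instead of a replace callback because nothing needs to be substituted.

```python
import re

# Sign, count, then a greedy run of base letters (may overshoot the indel).
INDEL = re.compile(r'([+-])([0-9]+)([ACGTNacgtn]+)')

def find_indels(text):
    """Return indels such as '+3ACG', trimming each run to its stated length."""
    indels = []
    for m in INDEL.finditer(text):
        sign, digits, bases = m.group(1), m.group(2), m.group(3)
        # Keep only as many letters as the integer asks for.
        indels.append(sign + digits + bases[:int(digits)])
    return indels

print(find_indels('AtT+3ACGTTT-1AaTTa'))  # ['+3ACG', '-1A']
```

The greedy letter group cannot swallow a later indel, because the sign and digits that start the next one are not in the character class.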
0
source

The Perl pattern for matching an integer followed by that number of any character is simply:

  (\d+)(??{"." x $1}) 

which is pretty straightforward, I think you will agree. For example, this snippet:

    my $string = "AtT+3ACGTTT-1AaTTa";
    print "Matched $&\n" while $string =~ m{
        ( \d+ )            # capture an integer into $1
        (??{ "." x $1 })   # interpolate that many dots back into pattern
    }xg;

happily prints the expected output:

    Matched 3ACG
    Matched 1A

EDIT

Oh, I see you added the Python tag after I started editing. Oops. Well, maybe this will be useful to you anyway.
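Python's standard `re` module has no counterpart to Perl's `(??{ })`, but the two-step idea can be approximated by building a second pattern from the captured integer. The following is only a sketch under that assumption: the name `scan_indels` is hypothetical, and the second pattern is restricted to the base letters from the question rather than any character.

```python
import re

HEADER = re.compile(r'[+-](\d+)')  # sign and count only

def scan_indels(text):
    """Return (start, end) spans covering sign, count, and exactly n bases."""
    spans = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        n = int(m.group(1))
        # Build the follow-up pattern from the count, like "." x $1 in Perl.
        body = re.compile(r'[ACGTNacgtn]{%d}' % n).match(text, m.end())
        if body:
            spans.append((m.start(), body.end()))
            pos = body.end()
        else:
            pos = m.end()  # too few bases follow; skip this header
    return spans

a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
print([a[s:e] for s, e in scan_indels(a)])  # ['-1A', '+1G', '+4ATCG']
```

Compiling a small pattern per match is not the cheapest approach, but it keeps the count check inside the regex engine, which is the closest analogue to what the Perl pattern does.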

However, if what you are actually looking for is a fuzzy match that allows a certain number of insertions and deletions (edit distance), then Matthew Barnett's regex library for Python will do that. That does not seem to be what you are doing, though, since the insertions and deletions are actually represented in your strings.

But Matthew's library is really very nice and very interesting, and even does a lot of things that Perl can't do. :) It's a replacement for Python's standard re library.

0
source
