Splitting nucleotide sequences in JS with a regular expression

I am trying to split a nucleotide sequence into amino acid strings using a regular expression. I need to start a new string at each occurrence of “ATG”, but I do not want the first match to stop at the next “ATG”. Valid input is any ordering of As, Cs, Gs, and Ts.

For example, given the input string ATGAACATAGGACATGAGGAGTCA, I should get two strings: ATGAACATAGGACATGAGGAGTCA (the whole thing) and ATGAGGAGTCA (starting at the next “ATG”). A string containing "ATG" n times should produce n results.

I thought the expression /(?:[ACGT]*)(ATG)[ACGT]*/g would work, but it does not. If this cannot be done with a regex, it is easy enough to write the code by hand, but I always prefer an elegant solution if one is available.

5 answers

If you really want to use regular expressions, try the following:

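    var str = "ATGAACATAGGACATGAGGAGTCA",
        re = /ATG.*/g,
        match,
        matches = [];
    while ((match = re.exec(str)) !== null) {
        matches.push(match[0]);          // keep the matched text, not the whole match array
        re.lastIndex = match.index + 3;  // resume the search just after this "ATG"
    }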

But be careful when using exec and changing lastIndex by hand: it is easy to end up with an endless loop.
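
One way to avoid the manual lastIndex bookkeeping is to wrap the pattern in a lookahead with a capturing group. This is a minimal sketch, not part of the original answer, and it assumes String.prototype.matchAll is available (ES2020). The helper name findAtgSuffixes is hypothetical. Because [ACGT]* is used, a captured suffix stops at the first character that is not A, C, G or T.

```javascript
// Hypothetical helper: returns every suffix of `seq` that starts at an "ATG".
function findAtgSuffixes(seq) {
  // The lookahead consumes no characters, so overlapping suffixes are all found;
  // matchAll steps past zero-width matches on its own, so there is no lastIndex to manage.
  var re = /(?=(ATG[ACGT]*))/g;
  return Array.from(seq.matchAll(re), function (m) {
    return m[1]; // group 1 holds the text seen by the lookahead
  });
}

console.log(findAtgSuffixes("ATGAACATAGGACATGAGGAGTCA"));
```

For the example input from the question this should log the whole string followed by ATGAGGAGTCA.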

Otherwise, you can use indexOf to find indices and substr to get substrings:

    var str = "ATGAACATAGGACATGAGGAGTCA",
        offset = 0,
        matches = [];
    // Always search the original string so that offset stays valid.
    while ((offset = str.indexOf("ATG", offset)) > -1) {
        matches.push(str.substr(offset)); // everything from this "ATG" to the end
        offset += 3;
    }

I think you want

 var subStrings = inputString.split('ATG'); 

KISS :)


Splitting a string before each occurrence of ATG is simple; just use

 result = subject.split(/(?=ATG)/i); 

(?=ATG) is a positive lookahead assertion meaning "Assert that ATG can be matched starting at the current position in the string."

This will divide GGGATGTTTATGGGGATGCCC into GGG, ATGTTT, ATGGGG and ATGCCC.

So now you have an array of strings (in this case four). I would then take them, discard the first one if it does not start with ATG (in that case it cannot contain ATG either), and then join strings no. 2 + ... + n, then 3 + ... + n, etc., until you have exhausted the list.
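
The joining step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name suffixesFromSplit is hypothetical, and the sketch discards the first piece only when it lacks the ATG prefix, because split returns a piece that starts with ATG first when the input itself begins with ATG.

```javascript
// Hypothetical helper built on the lookahead split described above.
function suffixesFromSplit(subject) {
  var parts = subject.split(/(?=ATG)/i);
  // Drop the leading piece only when it does not itself start with ATG.
  if (parts.length > 0 && !/^ATG/i.test(parts[0])) {
    parts.shift();
  }
  var result = [];
  for (var i = 0; i < parts.length; i++) {
    // Join piece i with every piece after it to rebuild the suffix.
    result.push(parts.slice(i).join(""));
  }
  return result;
}

console.log(suffixesFromSplit("GGGATGTTTATGGGGATGCCC"));
```

Rebuilding each suffix with slice and join is quadratic in the number of pieces, which is unlikely to matter for short sequences.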

Of course, this regular expression does not check whether the string contains only ACGT characters, since it matches only positions between characters. That check should be done beforehand, i.e. by verifying that the input string matches /^[ACGT]*$/i.
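
A minimal sketch of that preliminary check, using the pattern given above. The function name isNucleotideSequence is hypothetical. Because of the * quantifier, the empty string also passes.

```javascript
// Hypothetical guard: true only for strings made up solely of A, C, G and T
// (case-insensitive, and the empty string is accepted).
function isNucleotideSequence(input) {
  return /^[ACGT]*$/i.test(input);
}

console.log(isNucleotideSequence("ATGAACATAGGACATGAGGAGTCA"));
console.log(isNucleotideSequence("ATGXYZ"));
```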


Since you want to capture from each “ATG” to the end of the string, split is not a good fit. However, you can use replace and abuse its callback function:

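    var seq = "ATGAACATAGGACATGAGGAGTCA";
    var matches = [];
    // The callback receives the matched text and its position;
    // the string returned by replace is simply discarded.
    seq.replace(/atg/gi, function (m, pos) {
        matches.push(seq.substr(pos));
    });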

This does not use a regular expression, and I don’t know whether it is what you would consider “elegant”, but ...

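    var sequence = 'ATGAACATAGGACATGAGGAGTCA';
    var matches = [];
    var index;
    // Testing before each step handles input without any ATG and ATGs that directly follow one another.
    while ((index = sequence.indexOf('ATG')) > -1) {
        sequence = sequence.slice(index + 3); // drop everything up to and including this ATG
        matches.push('ATG' + sequence);
    }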

I'm not quite sure that this is what you are looking for. For example, with the input string ATGabcdefghijATGklmnoATGpqrs, this returns ATGabcdefghijATGklmnoATGpqrs, ATGklmnoATGpqrs and ATGpqrs.


Source: https://habr.com/ru/post/1312913/

