Java Regex for Genome Puzzle

I was given the problem of finding genes in a string made up of the letters A, C, G, and T, such as ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA. A gene starts with ATG and ends with TAA, TAG, or TGA (the gene itself excludes both endpoints). A gene consists of triplets of letters, so its length is a multiple of three, and none of its triplets may be one of the start or end triplets listed above. The string above therefore contains the genes CTCTCT and CACACACACACA, and my regex does work for this particular string. Here is what I have so far (and I'm very pleased with myself for getting this far):

(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA) 

However, it fails when another gene's ATG and end triplet fall inside a match without being aligned to that match's triplets. For example:

    Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG :
    TTGCTTATTGTTTTGAATGGGGTAGGA
    ACCTGC

It should also find GGG, and it should not return TTGCTTATTGTTTTGA (ATG | GGG | TAG) GA, which swallows that gene.

I'm new to regex in general and am a bit stuck ... just a little hint would be awesome!

+7
java regex
4 answers

A regular expression is possible here:

 (?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA))) 

Small test setup:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            String source = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG";
            // The lookahead is zero-width, so overlapping matches are found; group 1 holds the text.
            Matcher m = Pattern.compile("(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))").matcher(source);
            System.out.println("source : " + source + "\nmatches:");
            while (m.find()) {
                // Indent each match according to its start index.
                System.out.print(" ");
                for (int i = 0; i < m.start(); i++) {
                    System.out.print(" ");
                }
                System.out.println(m.group(1));
            }
        }
    }

which produces:

    source : TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG
    matches:
         ATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGA
                            ATGGGGTAG
                                      ATGACCTGCTAA
                                                                 ATGTAG
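The pattern above reports the start and end triplets as part of each match, and its greedy body can run past an earlier end triplet. As a possible refinement, the sketch below captures only the gene body and closes each candidate at the first in-frame end triplet, following the rules stated in the question. It assumes genes must be non-empty, and the names `GeneBodies` and `findGenes` are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeneBodies {
    // Zero-width lookahead so candidates may overlap.
    // Group 1 is the gene body: one or more triplets, none of them ATG, TAA, TAG or TGA,
    // immediately followed by an end triplet.
    private static final Pattern GENE = Pattern.compile(
            "(?=ATG((?:(?!ATG|TAA|TAG|TGA)[ACGT]{3})+)(?:TAA|TAG|TGA))");

    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        Matcher m = GENE.matcher(dna);
        while (m.find()) {
            genes.add(m.group(1)); // body only, without the start and end triplets
        }
        return genes;
    }

    public static void main(String[] args) {
        // prints [CTCTCT, CACACACACACA]
        System.out.println(findGenes("ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA"));
    }
}
```

On the question's second string this still reports TTGCTTATTGTTTTGAATGGGGTAGGA alongside GGG and ACCTGC, because the ATG inside that candidate is not aligned with its triplets. Rejecting such candidates would need an additional rule.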
+1

The problem is that the regular expression consumes the characters it matches, so they are no longer available to later matches.

You can solve this with a zero-width match (in which case the match itself is empty and gives you only an index, unless you put a capturing group inside the lookahead).
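To make the difference concrete, here is a small sketch using a deliberately simplified pattern (ATG followed by any three letters, not the full gene rule). The class and method names are invented for this illustration. The consuming form skips the occurrence that overlaps the first match, while the lookahead form finds both and still yields the text through a capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Returns "startIndex:text" for every match of regex in text.
    static List<String> scan(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // A zero-width lookahead matches the empty string, so read group 1 when it exists.
            String matched = m.groupCount() >= 1 ? m.group(1) : m.group();
            hits.add(m.start() + ":" + matched);
        }
        return hits;
    }

    public static void main(String[] args) {
        String dna = "ATGATGCCC";
        // Consuming pattern: the second ATG is eaten by the first match.
        System.out.println(scan("ATG[ACGT]{3}", dna));       // prints [0:ATGATG]
        // Lookahead pattern: nothing is consumed, so both candidates are seen.
        System.out.println(scan("(?=(ATG[ACGT]{3}))", dna)); // prints [0:ATGATG, 3:ATGCCC]
    }
}
```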

Alternatively, you can use three similar regular expressions, but each one uses a different offset:

    (?=(.{3})+$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
    (?=(.{3})+.$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
    (?=(.{3})+..$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)

You may also want to consider an approach that does not involve regular expressions, since the expressions above will be slow.

+2

The problem with this kind of task is that you slowly build up a regular expression, rule by rule, until you have something that works.

Then your requirements change and you have to start all over again, because for ordinary mortals it is almost impossible to rework a complex regular expression.

Personally, I would prefer to do this the "old-fashioned" way, with string manipulation. Each stage can easily be commented, and if the requirements change slightly, you only need to adjust the stage concerned.
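One way such a staged approach might look is sketched below, with one commented stage per rule from the question. It assumes genes must be non-empty and that only in-frame triplets count. The names `GeneScanner` and `findGenes` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GeneScanner {
    private static final String START = "ATG";
    private static final List<String> STOPS = Arrays.asList("TAA", "TAG", "TGA");

    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        // Stage 1: visit every occurrence of the start triplet.
        for (int start = dna.indexOf(START); start != -1; start = dna.indexOf(START, start + 1)) {
            int bodyStart = start + START.length();
            // Stage 2: read whole triplets that follow the start triplet.
            for (int i = bodyStart; i + 3 <= dna.length(); i += 3) {
                String triplet = dna.substring(i, i + 3);
                // Stage 3: another in-frame start triplet disqualifies this candidate.
                if (triplet.equals(START)) {
                    break;
                }
                // Stage 4: the first in-frame end triplet closes the gene.
                if (STOPS.contains(triplet)) {
                    if (i > bodyStart) { // keep non-empty bodies only
                        genes.add(dna.substring(bodyStart, i));
                    }
                    break;
                }
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        // prints [CTCTCT, CACACACACACA]
        System.out.println(findGenes("ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA"));
    }
}
```

If a rule changes, for example if empty genes should be reported, only the corresponding stage needs editing.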

+2

Perhaps you should try other methods, such as working with indexes. Something like:

    import java.util.ArrayList;
    import java.util.List;

    public class Main {
        public static final String genome = "ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA";
        public static final String start_codon = "ATG";
        public static final String[] end_codons = {"TAA", "TAG", "TGA"};

        public static void main(String[] args) {
            // Collect the index of every start codon in the genome.
            List<Integer> start_indexes = new ArrayList<Integer>();
            int curIndex = genome.indexOf(start_codon);
            while (curIndex != -1) {
                start_indexes.add(curIndex);
                // Continue one character later so that no occurrence is skipped.
                curIndex = genome.indexOf(start_codon, curIndex + 1);
            }
        }
    }

Do the same for the end codons and check whether the indices satisfy the triplet rule. By the way, are you sure that the gene excludes the start codon? (Some ATGs can be found inside a gene.)
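A rough sketch of how the index comparison could be completed follows. It pairs each start index with the first end-codon index that lies a whole number of triplets after it. In line with the remark above, it does not reject ATGs inside the body, and it assumes empty genes should be skipped. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CodonIndexes {
    static final String START_CODON = "ATG";
    static final String[] END_CODONS = {"TAA", "TAG", "TGA"};

    // Every index at which codon occurs in genome, overlapping occurrences included.
    static List<Integer> indexesOf(String genome, String codon) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = genome.indexOf(codon); i != -1; i = genome.indexOf(codon, i + 1)) {
            indexes.add(i);
        }
        return indexes;
    }

    static List<String> findGenes(String genome) {
        // Gather the indexes of all end codons in ascending order.
        List<Integer> ends = new ArrayList<>();
        for (String codon : END_CODONS) {
            ends.addAll(indexesOf(genome, codon));
        }
        Collections.sort(ends);

        List<String> genes = new ArrayList<>();
        for (int start : indexesOf(genome, START_CODON)) {
            for (int end : ends) {
                // Triplet rule: the end codon must be a multiple of three past the start.
                if (end > start && (end - start) % 3 == 0) {
                    if (end > start + 3) { // skip empty genes
                        genes.add(genome.substring(start + 3, end));
                    }
                    break; // the first in-frame end codon closes this candidate
                }
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        // prints [CTCTCT, CACACACACACA]
        System.out.println(findGenes("ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA"));
    }
}
```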

0
