I was given the task of finding genes in a string made up of the letters A, C, G and T, such as ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA. A gene starts right after ATG and ends right before TAA, TAG, or TGA (the gene itself excludes both markers). A gene consists of letter triplets, so its length is a multiple of three, and none of its triplets may be one of the start/end triplets listed above. So the string above contains the genes CTCTCT and CACACACACACA, and my regex does work for this particular string. Here is what I have so far (and I'm quite pleased with myself for getting this far):
(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
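For reference, here is a minimal sketch of how the pattern above can be run in Java. The class name GeneFinder and the method findGenes are made up for illustration; only the pattern itself comes from my attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to show how the pattern is applied.
class GeneFinder {
    // The pattern from the question, unchanged.
    private static final Pattern GENE =
            Pattern.compile("(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)");

    // Collects group 1 (the text between the start and end triplets) of every match.
    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        Matcher m = GENE.matcher(dna);
        while (m.find()) {
            genes.add(m.group(1));
        }
        return genes;
    }

    public static void main(String[] args) {
        String dna = "ATGCTCTCTTGATTTTTTTGGGGGTAGCCATGCACACACACACATAAGA";
        System.out.println(findGenes(dna));
    }
}
```

On the sample string this should list CTCTCT and CACACACACACA, the two genes named above.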
However, it fails when another candidate has its own ATG and end triplet that are not aligned with the triplets of the current match. For example:
Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG : TTGCTTATTGTTTTGAATGGGGTAGGA ACCTGC
It should also find GGG, but not: TTGCTTATTGTTTTGA (ATG | GGG | TAG) GA
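To make the problem easier to see, here is a small diagnostic sketch that prints where each match starts and ends. The class name GeneOffsets and the method matchSpans are hypothetical. As far as I understand it, Matcher.find() resumes searching at the end of the previous match, so it never reports a match that begins inside an earlier one, and the ATG GGG TAG stretch sits inside the long first match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic class: reports the position of every match.
class GeneOffsets {
    // Same pattern as in the question, unchanged.
    private static final Pattern GENE =
            Pattern.compile("(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)");

    // Returns {start, end} (end exclusive) for each match, in the order find() reports them.
    static List<int[]> matchSpans(String dna) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = GENE.matcher(dna);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String dna = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG";
        for (int[] s : matchSpans(dna)) {
            System.out.println(s[0] + ".." + s[1] + " " + dna.substring(s[0], s[1]));
        }
        // Position of the inner candidate that I expected to be reported as GGG.
        System.out.println("ATGGGGTAG starts at " + dna.indexOf("ATGGGGTAG"));
    }
}
```

Comparing the printed ranges with the position of ATGGGGTAG shows whether the missing GGG lies inside a region that an earlier match has already consumed.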
I'm new to regex in general and am a bit stuck ... just a little hint would be awesome!
java regex
Swordbeard