How to combine dna sequence pattern

I have a problem finding an approach to solving this problem.

I / O sequences are as follows:

**input1 :** aaagctgctagag **output1 :** a3gct2ag2 **input2 :** aaaaaaagctaagctaag **output2 :** a6agcta2ag 

The opening sequence can be 10 ^ 6 characters and the largest continuous patterns will be considered.

For example, for input 2 "agctaagcta" the output will not be "agcta2gcta", but it will be "agcta2".

Any help was appreciated.

+15
algorithm matching sequence dna-sequence
Jun 01 '13 at 9:16 on
source share
3 answers

Algorithm Explanation:

  • Having a sequence S with symbols s (1), s (2), ..., s (N).
  • Let B (i) be the best compressed sequence with elements s (1), s (2), ..., s (i).
  • So, for example, B (3) will be the best compressed sequence for s (1), s (2), s (3).
  • We want to know what is B (N).

To find this, we continue induction. We want to calculate B (i + 1), knowing B (i), B (i-1), B (i-2), ..., B (1), B (0), where B (0) is empty sequences and B (1) = s (1). At the same time, this is a proof of the optimality of the solution .;)

To calculate B (i + 1), we will choose the best sequence among the candidates:

  • Sequences of candidates in which the last block has one element:

    B (i) s (i + 1) 1 B (i-1) s (i + 1) 2; only if s (i) = s (i + 1) B (i-2) s (i + 1) 3; only if s (i-1) = s (i) and s (i) = s (i + 1) ... B (1) s (i + 1) [i-1]; only if s (2) = s (3) and s (3) = s (4) and ... and s (i) = s (i + 1) B (0) s (i + 1) i = s (i + 1) i; only if s (1) = s (2) and s (2) = s (3) and ... and s (i) = s (i + 1)

  • Candidate sequences in which the last block has 2 elements:

    B (I-1) s (i) s (g + 1) 1 B (i-3) s (i) s (i + 1) 2; only if s (i-2) s (i-1) = s (i) s (i + 1) B (i-5) s (i) s (i + 1) 3; only if s (i-4) s (i-3) = s (i-2) s (i-1) and s (i-2) s (i-1) = s (i) s (i + 1 ) ...

  • Sequences of candidates in which the last block has 3 elements:

    ...

  • Sequences of candidates in which the last block has 4 elements:

    ...

    ...

  • Candidate sequences where the last block contains n + 1 elements:

    s (1) s (2) s (3) ......... s (g + 1)

For each opportunity, the algorithm stops when the sequence block is no longer repeated. And here it is.

The algorithm will be the same as in psude-c code:

 B(0) = "" for (i=1; i<=N; i++) { // Calculate all the candidates for B(i) BestCandidate=null for (j=1; j<=i; j++) { Calculate all the candidates of length (i) r=1; do { Candidadte = B([ij]*r-1) s(i-j+1)…s(i-1)s(i) r If ( (BestCandidate==null) || (Candidate is shorter that BestCandidate)) { BestCandidate=Candidate. } r++; } while ( ([ij]*r <= i) &&(s(ij*r+1) s(ij*r+2)…s(ij*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j)) } B(i)=BestCandidate } 

Hope this can help a little more.

The following is a complete C program that performs the required task. It works in O (n ^ 2). The central part is only 30 lines of code.

EDIT I changed the code a bit, changed the variable names and added a comment to be more readable.

 #include <stdio.h> #include <stdlib.h> #include <string.h> #include <limits.h> // This struct represents a compressed segment like atg4, g3, agc1 struct Segment { char *elements; int nElements; int count; }; // As an example, for the segment agagt3 elements would be: // { // elements: "agagt", // nElements: 5, // count: 3 // } struct Sequence { struct Segment lastSegment; struct Sequence *prev; // Points to a sequence without the last segment or NULL if it is the first segment int totalLen; // Total length of the compressed sequence. }; // as an example, for the sequence agt32ta5, the representation will be: // { // lastSegment:{"ta" , 2 , 5}, // prev: @A, // totalLen: 8 // } // and A will be // { // lastSegment{ "agt", 3, 32}, // prev: NULL, // totalLen: 5 // } // This function converts a sequence to a string. // You have to free the string after using it. // The strategy is to construct the string from right to left. char *sequence2string(struct Sequence *S) { char *Res=malloc(S->totalLen + 1); char *digits="0123456789"; int p= S->totalLen; Res[p]=0; while (S!=NULL) { // first we insert the count of the last element. // We do digit by digit starting with the units. int C = S->lastSegment.count; while (C) { p--; Res[p] = digits[ C % 10 ]; C /= 10; } p -= S->lastSegment.nElements; strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements); S = S ->prev; } return Res; } // Compresses a dna sequence. // Returns a string with the in sequence compressed. // The returned string must be freed after using it. char *dnaCompress(char *in) { int i,j; int N = strlen(in);; // Number of elements of a in sequence. // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters. // What we want to return is B[N]; struct Sequence *B; B = malloc((N+1) * sizeof (struct Sequence)); // We first do an initialization for i=0 B[0].lastSegment.elements=""; B[0].lastSegment.nElements=0; B[0].lastSegment.count=0; B[0].prev = NULL; B[0].totalLen=0; // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth, We will try different sequences and keep the minimum one. for (i=1; i<=N; i++) B[i].totalLen = INT_MAX; // A very high value for (i=1; i<=N; i++) { // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0] for (j=1; j<=i; j++) { // Here we will check all the candidates where the last segment has j elements int r=1; // number of times the last segment is repeated int rNDigits=1; // Number of digits of r int rNDigitsBound=10; // We will increment r, so this value is when r will have an extra digit. // when r = 0,1,...,9 => rNDigitsBound = 10 // when r = 10,11,...,99 => rNDigitsBound = 100 // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on. do { // Here we analitze a candidate B(i). // where the las segment has j elements repeated r times. int CandidateLen = B[ij*r].totalLen + j + rNDigits; if (CandidateLen < B[i].totalLen) { B[i].lastSegment.elements = in + i - j*r; B[i].lastSegment.nElements = j; B[i].lastSegment.count = r; B[i].prev = &(B[ij*r]); B[i].totalLen = CandidateLen; } r++; if (r == rNDigitsBound ) { rNDigits++; rNDigitsBound *= 10; } } while ( (i - j*r >= 0) && (strncmp(in + i -j, in + i - j*r, j)==0)); } } char *Res=sequence2string(&(B[N])); free(B); return Res; } int main(int argc, char** argv) { char *compressedDNA=dnaCompress(argv[1]); puts(compressedDNA); free(compressedDNA); return 0; } 
+10
Jun 03 '13 at 18:11
source share

Forget Done. Dynamic programming. With a three-dimensional table:

  • sequence position
  • subsequence size
  • number of segments

TERMINOLOGY: For example, having a = "aaagctgctagag" , the position coordinate of the sequence will be from 1 to 13. At the position of sequence 3 (letter "g"), having a size of 4 subsequences, the subsequence will be "gctg". Got it? As for the number of segments, the expression "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2", consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" will consist of 4 segments. Got it? Now you are starting to fill the 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it in millions of sequences? I think a greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And I would advise you to expand your task to approximate coincidences, and you can publish it directly in the journal.

In any case, you will begin to fill out the subsequence compression table, starting from a certain position (size 1) with a length equal to the coordinate of size 2 with no more than 3 measurement segments. Thus, you first fill in the first line, which is a compression of subsequences of length 1, consisting of no more than 1 segment:

 aaagctgctagag 1(a1) 1(a1) 1(a1) 1(g1) 1(c1) 1(t1) 1(g1) 1(c1) 1(t1) 1(a1) 1(g1) 1(a1) 1(g1) 

Number is the cost of the character (always 1 for these trivial sequences is 1 char, number 1 is not taken into account in the cost of the character), and in brackets you have compression (also trivial for this simple case). The second line will be even simple:

 2(a2) 2(a2) 2(ag1) 2(gc1) 2(ct1) 2(tg1) 2(gc1) 2(ct1) 2(ta1) 2(ag1) 2(ga1) 2(ag1) 

There is only one way to decompose a 2-character sequence into 2 subsequences - 1 character + 1 character. If they are identical, the result looks like a + a = a2 . If they are different, for example a + g , then, since only 1-segment sequences are permissible, the result cannot be a1g1 , but must be ag1 . The third line will finally be more interesting:

 2(a3) 2(aag1) 3(agc1) 3(gct1) 3(ctg1) 3(tgc1) 3(gct1) 3(cta1) 3(tag1) 3(aga1) 3(gag1) 

Here you can always choose between two ways to compose a compressed line. For example, aag can be composed either as aa + g or a + ag . But again, we cannot have 2 segments, as in aa1g1 or a1ag1 , so we must be satisfied with aag1 if both components do not consist of the same symbol, as in aa + a => a3 , and the cost of the symbol is 2. We can continue on the 4th line:

 4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2) 

Here, in the first position, we cannot use a3g1 , because only 1 segment is allowed on this layer. But in the last position, compression to the cost of symbol 3 is performed using ag1 + ag1 = ag2 . Thus, it is possible to fill out the entire table of the first level up to a single subsequence of 13 characters, and each subsequence will have its optimal symbol cost and its compression under the restriction of the first level having no more than 1 segment.

Then you go to the second level, where 2 segments are allowed. And again, from bottom to top, you determine the optimal cost and compression of each coordinate of the table under the given limit of the level counter, comparing all possible methods of compiling a subsequence using the already calculated positions, until you completely fill out the table and, therefore, do not calculate the global optimum. There are some details that need to be resolved, but unfortunately I'm not going to write this for you.

+2
Jun 02 '13 at 5:26
source share

An attempt of my own path for a while, my devotion to jbaylina for its beautiful algorithm and implementation of C. Here is my attempt at the jbaylina algorithm in Haskell, and below is the further development of my attempt at a linear time algorithm that tries to compress segments that include repeating patterns in sequence:

 import Data.Map (fromList, insert, size, (!)) compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n where n = length s fbi = insert (size b) bestCandidate b where add (sequence, sLength) (sequence', sLength') = (sequence ++ sequence', sLength + sLength') j' = [1..min 100 i] bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j' combCandidates j candidate' = let nextCandidate' = comb 2 (b!(i - j + 1) `add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1)) in if snd nextCandidate' <= snd candidate' then nextCandidate' else candidate' where comb r candidate | r > uBound = candidate | not (strcmp r True) = candidate | snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate | otherwise = comb (r + 1) candidate where uBound = div (i + 1) j prev = b!(i - r * j + 1) nextCandidate = prev `add` ((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r)) strcmp 1 _ = True strcmp num bool | (take j . drop (i - num * j + 1) $ s) == (take j . drop (i - (num - 1) * j + 1) $ s) = strcmp (num - 1) True | otherwise = False 

Output:

 *Main> compress "aaagctgctagag" ("a3gct2ag2",9) *Main> compress "aaabbbaaabbbaaabbbaaabbb" ("aaabbb4",7) 


Linear Attempt:

 import Data.List (sortBy) group' xxs sAccum (chr, count) | null xxs = if null chr then singles else if count <= 2 then reverse sAccum ++ multiples ++ "1" else singles ++ if null chr then [] else chr ++ show count | [x] == chr = group' xs sAccum (chr,count + 1) | otherwise = if null chr then group' xs (sAccum) ([x],1) else if count <= 2 then group' xs (multiples ++ sAccum) ([x],1) else singles ++ chr ++ show count ++ group' xs [] ([x],1) where x:xs = xxs singles = reverse sAccum ++ (if null sAccum then [] else "1") multiples = concat (replicate count chr) sequences ws strIndex maxSeqLen = repeated' where half = if null . drop (2 * maxSeqLen - 1) $ ws then div (length ws) 2 else maxSeqLen repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated in (sequence,(sequenceStart, sequenceEnd')) repeated = foldr divide ([],(strIndex,strIndex),False) [1..half] equalChunksOf ta = takeWhile(==t) . map (take a) . iterate (drop a) divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = let t = take (2*chunkSize) ws t' = take chunkSize t in if t' == drop chunkSize t then let ts = equalChunksOf t' chunkSize ws lenTs = length ts sequenceEnd = strIndex + lenTs * chunkSize newEnd = if sequenceEnd > sequenceEnd' then sequenceEnd else sequenceEnd' in if chunkSize > 1 then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs) then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True) else b else if notSinglesFlag then b else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False) else b addOne ab | null (fst b) = a | null (fst a) = b | otherwise = let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a (((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b in if sStart' < sEnd && sEnd < sEnd' then let c = ((start,end,patLen,lenS),sequence):rest d = ((start',end',patLen',lenS'),sequence'):rest' in (c ++ d, (sStart, sEnd')) else a segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where segment' zzs@(z:zs) strIndex farthest | null zs = initial | strIndex >= farthest && strIndex > 0 = ([],(0,0)) | otherwise = addOne initial next where next@(s',(start',end')) = segment' zs (strIndex + 1) farthest' farthest' | null s = farthest | otherwise = if start /= end && end > farthest then end else farthest initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a) combs [] r = [r] combs (x:xs) r | null r = combs xs (x:r) ++ if null xs then [] else combs xs r | otherwise = if areExclusive (head r) x then combs xs (x:r) ++ combs xs r else if l' > lowerBound then combs xs (x: reduced : drop 1 r) ++ combs xs r else combs xs r where lowerBound = l + 2 * patLen ((l,u,patLen,lenS),s) = head r ((l',u',patLen',lenS'),s') = x reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u lenReduced = length reduce reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s) buildString origStr sequences = buildString' origStr sequences 0 (0,"",0) where buildString' origStr sequences index accum@(lenC,cStr,lenOrig) | null sequences = accum | l /= index = buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l') | otherwise = buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u') where l' = l - index u' = u - ls' = s ++ show lenS (((l,u,patLen,lenS),s):rest) = sequences compress [] _ accum = reverse accum ++ (if null accum then [] else "1") compress zzs@(z:zs) maxSeqLen accum | null (fst segment') = compress zs maxSeqLen (z:accum) | (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum) | otherwise = reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1") ++ compressedStr ++ compress (drop lengthOriginal zzs) maxSeqLen [] where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen combinations = combs (fst $ segment') [] takeWhile' xxs count | null xxs = 0 | x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count | not (null (reads [x]::[(Int,String)])) = 0 | otherwise = takeWhile' xs (count + 1) where x:xs = xxs f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') = let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig) ((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig') in if g == EQ then compare (takeWhile' cStr' 0) (takeWhile' cStr 0) else g (lenCompressed,compressedStr,lengthOriginal) = head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations)) 

Output:

 *Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 [] "a9b9a9b9" *Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 [] "aaabbb4" 
+2
Jun 02 '13 at 5:59
source share



All Articles