General subsequence of a given length

What are some good ways to find all common subsequences of length k two lines?

Example:

s1 = AAGACC

s2 = AGATAACCAGGAGCTGC

all common subsequences of length 5: AAGAC AAACC AGACC AAGCC

+7
c string algorithm subsequence
source share
3 answers

One relatively simple way would be to recover sequences from the LCS matrix. The following is the O (n ^ 2 * k + x * n) algorithm, where x is the size of the output (i.e., the number of common subsequences of length k). This is in C ++, but it is quite easy to translate it to C:

 const int N = 100; int lcs[N][N]; set<tuple<string,int,int,int>> vis; string s1 = "AAGACC"; string s2 = "AGATAACCAGGAGCTGC"; void reconstruct(const string& res, int i, int j, int k) { tuple<string,int,int,int> st(res, i, j, k); if (vis.count(st)) return; vis.insert(st); if (lcs[i][j] < k) return; if (i == 0 && j == 0 && k == 0) { cout << res << endl; return; } if (i > 0) reconstruct(res, i-1, j, k); if (j > 0) reconstruct(res, i, j-1, k); if (i>0 && j>0 && s1[i-1] == s2[j-1]) reconstruct(string(1,s1[i-1]) + res, i-1, j-1, k-1); } int main() { lcs[0][0] = 0; for (int i = 0; i <= s1.size(); ++i) lcs[i][0] = 0; for (int j = 0; j <= s1.size(); ++j) lcs[0][j] = 0; for (int i = 0; i <= s1.size(); ++i) { for (int j = 0; j <= s2.size(); ++j) { if (i > 0) lcs[i][j] = max(lcs[i][j], lcs[i-1][j]); if (j > 0) lcs[i][j] = max(lcs[i][j], lcs[i][j-1]); if (i > 0 && j > 0 && s1[i-1] == s2[j-1]) lcs[i][j] = max(lcs[i][j], lcs[i-1][j-1] + 1); } } reconstruct("", s1.size(), s2.size(), 5); } 

There should also be an O (n * (k + x)) method based on a slightly different DP approach: let f (i, k) be the minimal index j such that lcs (i, j)> = k. We have recurrence

 f(i, 0) = 0 for all i f(i, k) = min{f(i-1, k), minimum j > f(i-1, k-1) such that s2[j] = s1[i]} 

We can also reconstruct sequences of length k from the matrix f.

+4
source share

Create a trie of all subsequences of a given length k from s1 , then go to s2 and check for each sequence of length k if it is in trie.

+1
source share

Here's a version of an algorithm that uses recursion, a stack of size k and includes two optimizations to skip characters that have already been noticed, and to skip subsequences that don't exist. Strings are not unique (there may be duplicates), so run the output through uniq .

 #include <stdio.h> #include <string.h> /* s1 is the full first string, s2 is the suffix of the second string * (starting after the subsequence at depth r), * pos is the positions of chars in s1, r is the recursion depth, * and k is the length of subsequences that we are trying to match */ void recur(char *s1, char *s2, size_t pos[], size_t r, size_t k) { char seen[256] = {0}; /* have we seen a character in s1 before? */ size_t p0 = (r == 0) ? 0 : pos[r-1] + 1; /* start at 0 or pos[r-1]+1 */ size_t p1 = strlen(s1) - k + r; /* stop at end of string - k + r */ size_t p; if (r == k) /* we made it, print the matching string */ { for (p=0; p<k; p++) putchar(s1[pos[p]]); putchar('\n'); return; } for (p=p0; p<=p1; p++) /* at this pos[], loop through chars to end of string */ { char c = s1[p]; /* get the next char in s1 */ if (seen[c]) continue; /* don't go any further if we've seen this char already */ seen[c] = 1; pos[r] = p; char *s = strchr(s2, c); /* is this char in s2 */ if (s != NULL) recur(s1, s+1, pos, r+1, k); /* recursively proceed with next char */ } } int main() { char *s1 = "AAGACC"; char *s2 = "AGATAACCAGGAGCTGC"; size_t k = 5; size_t pos[k]; if (strlen(s1) < k || strlen(s2) < k) /* make sure we have at least k chars in each string */ return 1; /* exit with error */ recur(s1, s2, pos, 0, k); return 0; } 

Output:

 AAGAC AAGCC AAACC AGACC 
0
source share

All Articles