Efficient algorithm for finding matching substrings longer than 14 characters of text inside other text

I have long text (about 5 MB in size), and another text is called a template (about 2000 characters).

The challenge is to find the corresponding parts from the genome pattern, the length of which is 15 characters or more.

Example:

long text: ACGTACGTGTCA AAAACCCCGGGGTTTTA GTACCCGTAGGCGTAT AND MUCH LONG TERM

template: ACGGTATTGAC AAAACCCCGGGGTTTTA TGTTCCCAG

I am looking for an efficient (and easy to understand and implementable) algorithm.

A bonus would be a way to implement this with only char -arrays in C ++, if at all possible.

+8
c ++ c string algorithm
source share
6 answers

Put it back , I'm going to live code:

void match_substring(const char *a, const char *b, int n) // n=15 in your case { int alen = strlen(a); // I'll leave all the null-checking and buffer-overrun business as an exercise to the reader int blen = strlen(b); for (int i=0; i<alen; i++) { for (int j=0; j<blen; j++) { for (int k; (i+k<alen) && (j+k<blen) && a[i+k]==b[i+k]; k++); if (k >= n) printf("match from (%d:%d) for %d bytes\n", i, j, k); } } } 
+2
source share

Here's one algorithm - I'm not sure if it has a name. This requires a "hash hash" - a (non-cryptographic) hash function that has the property that sets the hash of the sequence AB...C , to efficiently calculate the hash of the sequence B...CD .

  • Calculate the rolling hashes of the sequence pattern[0..14] , pattern[1..15] , pattern[2..16] ... and save each index in pattern in the hash table.

  • Calibrate the swinging haystack[0..14] hash haystack[0..14] and see if it is in the hash table. If so, compare haystack[0..14] with pattern[pos..pos+14] , where pos was extracted from the hash table.

  • From the rolling hash of haystack[0..14] , it is efficient to calculate the rolling hash of haystack[1..15] and see if it is in the hash table. Repeat until you reach the end of the haystack .

Please note that your 15-character strings have only 2,330 possible values, so your “hash function” can be a simple match to the value of a string processed as a 15-digit number-4, which is quickly calculated, has the property of a rolling hash and is unique.

+7
source share

One way would be to get an Aho-Corasick implementation and use it to create something that recognizes any of the 15-character chunks in the template, and then use this to search for text. With Aho-Korasik, the costs of the collector and the cost of the search are linear, so this should be practical.

+4
source share

If you use a good implementation of the C library (or even a mediocre one such as glibc, which has a good implementation of this function), strstr will work just fine. I heard there a new algorithm that is especially good for DNA (small alphabet), but I can’t find the link right now. In addition, 2way (which uses glibc) is optimal.

+1
source share

I would love to go to your library and check out Robert Sedgwick and Kevin Wayne's 4th Edition Algorithms. They have a whole chapter devoted to finding a substring. Also, it is probably worth checking out the algs4.cs.princeton.edu book website .

TL DR. If you decide, you can hack the substring search using char arrays in guaranteed time, linear to the input length. There are code samples in the book and on the Internet. Not much easier than that.

+1
source share

I think a "suffix tree" can solve this problem better with doing O (log n)

-one
source share

All Articles