Search for a substring from a string

Input: string S = AAGATATGATAGGAT.

Output: Maximum repetitions, such as GATA (as in positions 3 and 8), GAT (as in positions 3, 8 and 13), etc.

  • The maximum repetition is the substring t occurs k> 1 time in S, and if t propagates left or right, it will be less than k times.

  • The inner descendants of the leaf node are suffixes, each of which has a left character.

  • If the left characters of all descendants of the leaves are not all the same, it is called the "left variety" node.

  • Maximum repetitions are left various internal nodes.

General idea:

  • Create a suffix tree, and then do DFS (depth first search) on the tree

  • For each sheet, mark it with your left symbol.

  • For each internal node:

  • If at least one child is marked with *, then name him *

  • Otherwise, if his baby labels are diverse, stick *.

  • Then all the children have the same label, copy it to the current node

Is the above idea correct? What should be the pseudo code? Then I can try to write programming myself.

+8
language-agnostic algorithm suffix-tree
source share
1 answer

Your idea is good, but with the suffix tree you can make something even simpler.

Let T be the suffix tree of your sequence.

Let x be a node in T, T_x is a subtree of T with root x.

Let N_x be the number of leaves in T_x

Let the word (x) be the word created by crossing T from the root to node x

Now, using the definition of the suffix tree, we get:

The number of repetitions of the word (x) = N_x, and the position of these words is the label of each sheet

The algorithm for this would be a basic traversal of the tree, for each node in the tree, calculate N_x, if N_x> 2 add this to your result (if you want the position also you could add a label for each sheet)

Pseudocode:

:

mySequence

exit:

Result (list of words that repeat with score and position)

Algorithm:

T = suffixTree (mySequence)

For each internal node X in T:

T_X = subTree(T) N_X = Number of lead (T_X) if N_X >=2 : Result .add ( [word(X), N_X , list(label of leafs)] ) 

return result

Example:

take the wikipedia example for suffix trees: "BANANA" :

enter image description here

we get:

 N_A = 3 so "A" repeats 3 times in position {1,3,5} N_N=2 so "N" repeats 2 times in position {2,4} N_NA=2 so "NA" repeats 2 times in position {2,4} 

I found this article that seems to apply to your problem the same way you do, so yes, I think your method writes:

Repeating spelling repeating or general motifs using a suffix tree

Eject

This paper presents two algorithms. The first excerpt is the repeating motifs from the sequence defined over the Sigma alphabet. For an instance, Sigma can be equal to {A, C, G, T} and the sequences are coding for a DNA macromolecule. The search for motives corresponds to words in the same alphabet that have a minimum number q times in a sequence with no more than e mismatches each time (q is called a quorum restriction). [...]

You can download it and look at it, the author gives pseudo-code for your algorithm.

Hope this helps

+3
source share

All Articles