Search for a set of repeating non-overlapping substrings of two input strings using suffix arrays

Input: two lines A and B.

Output: a set of repeating non-overlapping substrings

I need to find all duplicate lines, each of which should occur in both (!) Lines at least once. So, for example, let

A = "xyabcxeeeyabczeee" and B = "yxabcxabee".

Then the valid result would be {"abcx", "ab", "ee"}, but not "eee", since it occurs only on line A.

I think this problem is very related to the problem of "super-maximum repetition." Here is the definition:

Maximum repeating pair: A pair of identical substrings alpha and beta in S, so that the expansion of alpha and beta in any direction will destroy the equality of two lines. It is presented as a triplet (position1, position2, length)

Maximum repetition: "Substring S, which occurs in the maximum pair in S". Example: abc in S = xabcyiiizabcqabcyrxar. Note: there can be many maximum repeating pairs, but there can only be a limited number of maximum repeats.

Supermaximum Repeat "Maximum repeat that never occurs as a substring of any other maximum repeat" Example: abcy in S = xabcyiiizabcqabcyrxar.

The search algorithm for all supermaximal repeats is described in "Algorithms for strings, trees and sequences", but only for suffix trees.

Works: 1.) find all left-sized nodes using DFS

For each position, i in SS (i-1) is called the left character i. The left leaf symbol in T (S) is the left suffix position symbol represented by this leaf. An inner node v in T (S) is called a left manifold if at least two leaves in vs subtrees have different left characters.

2.) applying Theorem 7.12.4 on these nodes:

The diverse left inner node v is a supermaximal repeat of a if and only if all of v are leaves, each of which has a separate left character

Both lines A and B should probably be combined, and when we check v leaves in the second step, we must also impose an additional restriction that there must be at least one separate left character from lines A and B. This can be done by comparing their position against length A. If position (left character)> length (A), then the left character is in A, otherwise in B.

Can you help me solve this problem with the suffix + lcp arrays?

+4
source share
1 answer

It looks like you're looking for a lot of intersections of all the substrings of your two input lines. In this case, substrings with a single letter must also be returned. Let s1 and s2 be your lines, s1 be the shorter of the two. After thinking about this for a while, I don’t think you can become much better than the intuitive algorithm O (n ^ 3m) or O (n ^ 3), where n is the length s1 and m is the length s2. I don't think suffixes can help you here.

for(int i=0 to n-1){ for(int j=1 to ni){ if(contains(s2,substring(s1,i,j))) emit the substring } } 

Runtime comes from iterations of the loop (n ^ 2) / 2, each of which performs the worst case O (nm), contains an operation (possibly O (n) depending on the implementation). But actually this is not so bad, as it will be constant, much smaller than one of the parties, since the length of the substring will actually be in the range from 1 to n.
If you don't need single character matches, you can simply initialize j as 2 or something more.

BTW: Don't really create new lines with a substring, find / create a contains function that will take pointers and the source string, and just look at the characters between, including i, except j.

0
source

All Articles