Input: two lines A and B.
Output: a set of repeating non-overlapping substrings
I need to find all duplicate lines, each of which should occur in both (!) Lines at least once. So, for example, let
A = "xyabcxeeeyabczeee" and B = "yxabcxabee".
Then the valid result would be {"abcx", "ab", "ee"}, but not "eee", since it occurs only on line A.
I think this problem is very related to the problem of "super-maximum repetition." Here is the definition:
Maximum repeating pair: A pair of identical substrings alpha and beta in S, so that the expansion of alpha and beta in any direction will destroy the equality of two lines. It is presented as a triplet (position1, position2, length)
Maximum repetition: "Substring S, which occurs in the maximum pair in S". Example: abc in S = xabcyiiizabcqabcyrxar. Note: there can be many maximum repeating pairs, but there can only be a limited number of maximum repeats.
Supermaximum Repeat "Maximum repeat that never occurs as a substring of any other maximum repeat" Example: abcy in S = xabcyiiizabcqabcyrxar.
The search algorithm for all supermaximal repeats is described in "Algorithms for strings, trees and sequences", but only for suffix trees.
Works: 1.) find all left-sized nodes using DFS
For each position, i in SS (i-1) is called the left character i. The left leaf symbol in T (S) is the left suffix position symbol represented by this leaf. An inner node v in T (S) is called a left manifold if at least two leaves in vs subtrees have different left characters.
2.) applying Theorem 7.12.4 on these nodes:
The diverse left inner node v is a supermaximal repeat of a if and only if all of v are leaves, each of which has a separate left character
Both lines A and B should probably be combined, and when we check v leaves in the second step, we must also impose an additional restriction that there must be at least one separate left character from lines A and B. This can be done by comparing their position against length A. If position (left character)> length (A), then the left character is in A, otherwise in B.
Can you help me solve this problem with the suffix + lcp arrays?