WARNING: As the commenter גלעד ברקן points out, this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I will try to find out what the problem is, but in the meantime THIS ALGORITHM MUST NOT BE TRUSTED!
I thought of a solution that takes O(n^3) time and O(n^2) space, which should be usable for strings of up to length 1000 or so. It is based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity, I will describe how to find a minimum-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting from the next position in the input string.
"Self-avoiding" common subsequences
If an input string S of length 2n has the "tight twin" (TT) property, then it has a common subsequence with itself (or, equivalently, two copies of S have a common subsequence) that:
- has length n, and
- obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact, we can safely tighten the latter constraint to: no character position in the first copy of S is ever matched to an equal or lower character position in the second copy. This is because we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimum-length TT substring it is always possible to assign characters to the two subsequences A and B so that, for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since, clearly, you cannot cram more than 2 sets of n characters into a length-2n string), so if a length-n SACS exists, it must be of maximum possible length. So, to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself and check whether it in fact has length n.
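Given the warning at the top, it can help to have an independent check of the TT property to compare against. The following is a brute-force sketch, not the method described in this answer. It assumes that a string of length 2n is TT when its positions can be split into two disjoint subsequences A and B of n characters each that spell the same string; that definition is my reading of the text above, and the name `is_tight_twin` is made up for illustration. It is exponential in the worst case, so it is only suitable for short strings.

```python
from functools import lru_cache


def is_tight_twin(t: str) -> bool:
    """Brute-force check (assumed definition): can the positions of t be split
    into two disjoint subsequences that spell the same string?"""
    if len(t) == 0 or len(t) % 2 != 0:
        return False

    @lru_cache(maxsize=None)
    def search(k: int, pending: str) -> bool:
        # pending holds characters taken by the leading subsequence that the
        # trailing subsequence still has to match, in order.
        if k == len(t):
            return pending == ""
        if len(pending) > len(t) - k:
            return False  # not enough characters left to match everything
        c = t[k]
        # Option 1: the trailing subsequence matches the oldest pending character.
        if pending and pending[0] == c and search(k + 1, pending[1:]):
            return True
        # Option 2: the leading subsequence takes this character.
        return search(k + 1, pending + c)

    return search(0, "")


print(is_tight_twin("abab"))  # True
print(is_tight_twin("abba"))  # False
```

Whenever `pending` becomes empty, the two subsequences are level, so it does not matter which of them takes the next character; that is why the search only needs to track one leader.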
Dynamic Programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence (SACS) of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
    f(0, _) = 0
    f(_, 0) = 0
    f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
    m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
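Here is a minimal Python sketch of the recurrence above, mainly to make the indexing concrete. The function name `sacs_table` is made up for illustration, and positions i and j are 1-based as in the formula, so `s[i - 1]` is the character at position i.

```python
from typing import List


def sacs_table(s: str) -> List[List[int]]:
    """Fill the table f of the recurrence; f[i][j] refers to the length-i and
    length-j prefixes of s."""
    size = len(s)
    # Row 0 and column 0 stay 0, which covers the two base cases.
    f = [[0] * (size + 1) for _ in range(size + 1)]
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            # The only change from plain LCS: a match counts only when i < j.
            match = 1 if (s[i - 1] == s[j - 1] and i < j) else 0
            f[i][j] = max(f[i - 1][j], f[i][j - 1], f[i - 1][j - 1] + match)
    return f


print(sacs_table("abab")[4][4])          # 2
print(sacs_table("1213213515")[10][10])  # 6
```

For "abab" (n = 2) the bottom-right cell is 2, as hoped. For the string from the warning, which has length 10 and so n = 5, the same cell comes out as 6, which exceeds n and matches the value quoted in the warning at the top.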
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and 2n and the computation required outside of the recursive steps is O(1). (Actually, we only need to compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal is dominated by the corresponding cell (j, i) above it, although this does not change the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n, that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating it as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j = 1 to j = n (so that the last column examined is 2n) lets us fill in the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (because if one were, we would have found it earlier and already terminated).
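The column-by-column scan, together with the outer loop over start positions mentioned at the beginning, could be sketched roughly as follows. The function names are made up for illustration, and as a simplification this sketch keeps only the previous and current columns rather than the full matrix, since only the column maximum is needed for the check. It implements the procedure as described, so it inherits whatever is wrong with it.

```python
from typing import Optional


def shortest_tt_prefix_length(s: str) -> Optional[int]:
    """Return the first even prefix length 2j whose column maximum equals j,
    or None if no column passes the test."""
    size = len(s)
    prev = [0] * (size + 1)  # prev[i] = f(i, j-1)
    for j in range(1, size + 1):
        cur = [0] * (size + 1)  # cur[i] = f(i, j)
        col_max = 0
        for i in range(1, size + 1):
            match = 1 if (i < j and s[i - 1] == s[j - 1]) else 0
            cur[i] = max(cur[i - 1], prev[i], prev[i - 1] + match)
            col_max = max(col_max, cur[i])  # running maximum of column j
        if j % 2 == 0 and col_max == j // 2:
            return j
        prev = cur
    return None


def shortest_tt_substring_length(s: str) -> Optional[int]:
    """Run the prefix scan from every start position and keep the minimum."""
    best = None
    for start in range(len(s)):
        length = shortest_tt_prefix_length(s[start:])
        if length is not None and (best is None or length < best):
            best = length
    return best


print(shortest_tt_prefix_length("abcabc"))    # 6
print(shortest_tt_substring_length("xabab"))  # 4
```

One concrete symptom of the problem flagged in the warning: `shortest_tt_prefix_length("abaa")` returns 4, although "abaa" has three a's and one b and so cannot be split into two equal subsequences.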