Given a string, find two identical subsequences with consecutive C ++ indices

Question

Given a string, find two identical subsequences with consecutive C ++ indices

I need to build an algorithm (not necessarily efficient) that sets the string, finds and prints two identical subsequences (for example, by printing, I mean color). Moreover, the union of the sets of indices of these two subsequences should be a set of consecutive natural numbers (a complete segment of integers).

In math, what I'm looking for is called the "tight twins" if it helps anything. (See document here (PDF) )

Let me give you some examples:

1) consider the line 231213231

It has two subsequences that I am looking for in the form of "123". To see it better, look at this image:

The first subsequence is marked by underscores, and the second by overlay. As you can see, they have all the necessary properties.

2) consider line 12341234

3) consider line 12132344.

Now it gets complicated:

4) consider the line: 13412342

It is also not so simple:

I think these examples explain quite well what I had in mind.

I thought for a long time about an algorithm that could do this, but without success.

For coloring, I wanted to use this piece of code:

using namespace std; HANDLE hConsole; hConsole = GetStdHandle(STD_OUTPUT_HANDLE); SetConsoleTextAttribute(hConsole, k);

where k is the color.

Any help, even clues, would be greatly appreciated.

+7

c ++ algorithm

Mateusz May 19 '16 at 22:50

source share

4 answers

WARNING: the commenter גלעד ברקן points out that this algorithm gives the wrong answer 6 (higher than possible!) For line 1213213515. My implementation gets the same incorrect answer, so there seems to be a serious problem with this algorithm. I will try to find out what the problem is, but THIS ALGORITHM MUST NOT!

I thought of a solution that would take O (n ^ 3) time and O (n ^ 2) space that should be used for strings up to 1000 or so long. This is based on customizing the usual notion of longest common subsequences (LCS). For simplicity, I will describe how to find a substring of minimum length with the “hard double” property that starts at position 1 in the input line, which I assume is 2n long; just run this algorithm 2 times, each time, starting from the next position in the input line.

"Avoiding Avoidance" of Common Subsequences

If an input string S of length 2n has the "tight twin" (TT) property, then it has a common subsequence with itself (or, which is the same thing, two copies of S have a common subsequence):

has length n and
subject to an additional restriction that no character position in the first copy of S never coincides with the same character position in the second copy.

In fact, we can safely compress the last constraint to , no character position in the first copy of S will ever match the equal or lower position of the character in the second copy , due to the fact that we will search for substrings TT in increasing order of length and (as shows the lower part) in any substring TT of minimum length, you can always assign characters to two subsequences A and B, so for any matched pair (i, j) of positions in the substring with i <j, the character at position I is assigned A. Let me call such a common subsequence is a self-hazardous general subsequence (SACS).

The key point in making efficient calculations is that no SACS string of length-2n can contain more than n characters (since, obviously, you cannot insert more than 2 sets of n characters in a string of length-2n) therefore, if Since SACS of length-n exists, it should be as long as possible. So, to determine if S is TT or not, it is enough to find SACS of maximum length between S and itself and check whether it really has length n.

Dynamic Programming

Define f (i, j) as the length of the longest self-contained common subsequence of the prefix length-i S with the prefix length-j S. To actually calculate f (i, j), we can use a small modification of the usual dynamic programming formula LCS:

 f(0, _) = 0 f(_, 0) = 0 f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j)) m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)

As you can see, the only difference is the additional condition && i < j . As with regular LCS DP, it takes O (n ^ 2) to calculate the time, since 2 arguments each range between 0 and n and the calculation required outside the recursive steps is O (1). (In fact, we only need to calculate the “upper triangle” of this DP matrix, since each cell (i, j) below the diagonal will dominate the corresponding cell (j, i) over it - although this does not change the asymptotic complexity.)

To determine if the prefix of length-2j is a string equal to TT, we need the maximum value f (i, 2j) for all 0 <= i <= 2n - that is, the largest value in column 2j of the matrix DP. This maximum can be calculated O (1) times per DP cell, writing down the maximum value seen so far and updating if necessary, as each DP cell in the column is calculated. Continuing in ascending order of j from j = 1 to j = 2n, we can fill in the matrix DP one column at a time, always processing shorter prefixes S before longer ones, so when processing column 2j we can safely assume that the prefix is not shorter TT (because if it were, we would have found it earlier and already finished).

+2

j_random_hacker May 20, '16 at 17:30

source share

Let the string length be N.

There are two approaches.

Approach 1. This approach is always exponential.

For each possible subsequence of length 1..N / 2, we list all occurrences of this subsequence. For each input, indicate the positions of all characters.

For example, for 123123 this should be:

 (1, ((1), (4))) (2, ((2), (5))) (3, ((3), (6))) (12, ((1,2), (4,5))) (13, ((1,3), (4,6))) (23, ((2,3), (5,6))) (123, ((1,2,3),(4,5,6))) (231, ((2,3,4))) (312, ((3,4,5)))

The last two are not needed because they appear only once.

One way to do this is to start with subsequences of length 1 (i.e. characters), then go to subsequences of length 2, etc. At each step, discard all subsequences that appear only once, since you do not need them.
Another way to do this is to check all 2 ** N binary strings of length N. Whenever a binary string has at most N / 2 "1" digits, add it to the table. At the end, we discard all subsequences that appear only once.

You now have a list of subsequences that appear more than once. For each subsequence, check all the pairs and see if such a pair forms a dense twin.

Approach 2. Look for tighter twins. For each substring N * (N-1) / 2, check that the length of the substring is equal and each character appears in it exactly as many times, and then, being its length L, check whether it contains two dense doubles of length L / 2. There are 2 ** L ways to split it, the simplest thing you can do is check everything. However, there are more interesting ways to search for tt

+1

user31264 May 19 '16 at 23:44

source share

I would like to approach this as a problem of dynamic programming / pattern matching. We deal with characters one at a time, from left to right, and we support a herd of Nondefinitive finite state machines / NDFAs that correspond to partial matches. We start with a single null match, and each character we expand each NDFA each time, and each NDFA, possibly, leads to the appearance of many children, and then deduplicates the result - so we need to minimize the state contained in the NDFA to set the size of the herd.

I think NDFA needs to remember the following:

1) To skip the run of k characters in front of the match area, he skipped.

2) Suffix, which is a p-character string representing characters not yet matched, which must be matched using overlines.

I think you can always assume that the p-character string should be matched with overlones, because you can always exchange surface and underscores in the answer if you change throughout the answer.

When you see a new character, you can expand NDFA in the following ways:

a) NDFA can add omissions without exception.

b) NDFA can always add a new character to its suffix, which may be null

c) NDFA with the character string p, the first character of which corresponds to the new character, can turn into NDFA with the character string p-1, which consists of the last characters p-1 of the old suffix. If the string now has zero length, you find a match, and you can decide what it was if you keep links from each NDFA to its parent.

I thought I could use a more accurate encoding, which would guarantee only the size of the polynomial herd, but I could not do this work, and I can not prove polynomial behavior here, but I notice that some cases of degenerate behavior are handled reasonably because they lead to several ways to get the same suffix.

+1

mcdowella May 20, '16 at 18:17

source share

גלעד ברקן · Accepted Answer · 2016-05-22T01:51:52+0000

Here's a simple recursion that checks for tight twins. When there is a duplicate, it splits the decision tree if the duplicate is still part of the first twin. You will have to run it on each substring of even length. Other optimizations for longer substrings may include hash tests for char counts, as well as matching non-duplicated parts of twin candidates (characters that appear only twice in the entire substring).

Function Explanation:

First, a hash is created with each character as a key and the index that it displays as values. Then we traverse the hash: if the number of characters is odd, the function returns false; and character indices with a number greater than 2 are added to the list of duplicates - characters, half of which belong to the same twin, but we do not know which one.

The basic rule of recursion is only an increase in i , when a match for it will be found later in the line, while maintaining a record of the selected matches ( js ) that i should skip without looking for a match. This works because if we find n/2 , the coincidence, in order, by the time j reaches the end, is basically just another way of saying that the string consists of dense twins.

JavaScript Code:

 function isTightTwins(s){ var n = s.length, char_idxs = {}; for (var i=0; i<n; i++){ if (char_idxs[s[i]] == undefined){ char_idxs[s[i]] = [i]; } else { char_idxs[s[i]].push(i); } } var duplicates = new Set(); for (var i in char_idxs){ // character with odd count if (char_idxs[i].length & 1){ return false; } if (char_idxs[i].length > 2){ for (let j of char_idxs[i]){ duplicates.add(j); } } } function f(i,j,js){ // base case positive if (js.size == n/2 && j == n){ return true; } // base case negative if (j > n || (n - j < n/2 - js.size)){ return false; } // i is not less than j if (i >= j) { return f(i,j + 1,js); } // this i is in the list of js if (js.has(i)){ return f(i + 1,j,js); // yet to find twin, no match } else if (s[i] != s[j]){ return f(i,j + 1,js); } else { // maybe it a twin and maybe it a duplicate if (duplicates.has(j)) { var _js = new Set(js); _js.add(j); return f(i,j + 1,js) | f(i + 1,j + 1,_js); // it a twin } else { js.add(j); return f(i + 1,j + 1,js); } } } return f(0,1,new Set()); } console.log(isTightTwins("1213213515")); // true console.log(isTightTwins("11222332")); // false

Given a string, find two identical subsequences with consecutive C ++ indices

"Avoiding Avoidance" of Common Subsequences

Dynamic Programming

More articles: