Simple line compression: removing consecutive substrings

Question

Simple line compression: removing consecutive substrings

I was recently asked this question in an interview with Amazon.

Given a string, remove consecutive repeating substrings from it. If there are several consecutive intersecting substrings, delete the largest of them.

To be clear, here are a few examples:

input: aabcccddeaaa
output: abcdea (Compression of consecutive repeated characters)
input: abababcdeee
output: abcde (Compress successively repeating substrings)
input: ababcdabcd
output: ababcd

(You can compress ' ab ' or ' abcd ', but since the length of ' abcd ' is longer, you prefer to compress more.)

I could not come up with an effective implementation, does anyone know a good approach to this?

Since this was an interview question, please refrain from using complex library functions.

+7

string algorithm

Siddhanjay godre Jul 13 '15 at 6:02

source share

4 answers

For string X we can find the largest consecutive repeating substring in O (n ^ 2) using the Z-algorithm , which Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from pat[i] which is also a prefix of pat ( Source )

For each suffix X start with i using the Z algorithm for this substring:

 int result = 0; for(int i = 0; i < X.length(); i++) int[]z = Z-algo(X.substring(i)); //this take O(n) for(int j = i + result + 1; j < X.length(); j++) if(z[j] >= j - i + 1) result = (j - i + 1);

~~Repeating the above procedure, until we can find any duplicate substring, we can get the algorithm O (n ^ 3).~~

Note : after re-reading the question, especially in the last example, I will find out that the actual repeating substrings are limited only to the original substrings. Thus, the time complexity can be reduced to O (n ^ 2 log n) using the maximum heap.

+2

Pham trung Jul 13 '15 at 6:23

source share

 pos=[] dstr={} final=[] x="ababcdabcdcde" for k in re.finditer(r"(?=(.+?)\1+)",x): #Find start of all overlapping strings pos.append(k.start()) i=0 for k in pos: #Find end of overlapping strings s=re.findall(r"^((.*)\2+)",x[k:]) dstr[i]=(k,len(s[0][0])) i=i+1 #print dstr.values() k=0 while k< len(dstr.values())-1: #remove smaller length overlapping result if dstr.values()[k+1][0]<dstr.values()[k][1]<dstr.values()[k+1][1]: pass else: final.append(dstr.values()[k][0]) k=k+1 if dstr.values()[k-1][0] in final: pass else: final.append(dstr.values()[k][0]) #print final for k in final: #remove strings s=re.sub(r"(.*)\1+",r"\1",x[k:]) x=x[:k]+s print x

This in python.Works is fine with the given input.

+1

vks Jul 13 '15 at 8:19

source share

There is a direct non-recursive O (n ^ 3). The main observation is this: suppose that there is a string "aabcbcabbc" if we only want to remove consecutive repetitions, if we first shorten the lines of length = 1, length = 2 seconds, etc., we can reduce it and the reduction will be optimal . Consequently

'aabcabbc' => 'abcbcabc' => 'abcabc' => 'abc'

Python Code:

 def strcompress(str): strlen = len(str) for size in range (1, strlen // 2): for i in range (0, strlen - 2 * size + 1): str1 = str[i:i+size] str2 = str[i+size:i+2*size] while str1 == str2: str = str[:i+size] + str[i+2*size:] strlen = len(str) if i + 2*size > strlen: break str2 = str[i+size:i+2*size] print("The compressed string is:" + str) return

Example:

 >>> strcompress("ababcdabcd") The compressed string is:abcd

EDIT: Fixed several bugs in the code. This should work for existing samples and the example I provided.

0

user1952500 Jul 13 '15 at 7:07

source share

Amit · Accepted Answer · 2015-07-13T06:49:08+0000

Without regex ... This recursive method works:

 var cases = ['aabcccddeaaa', 'abababcdeee', 'ababcdabcd']; function compress(str) { var len, sub, i, n; // if str is shorter than 2, can't be any repeating substrings if(str.length < 2) return str; // max meaningful length is str.length/2 (truncated to integer) for(len = (str.length / 2) | 0; len > 0; len--) { // search for a repeating substring of "len" length starting at index i for(i = 0; i + (len * 2) <= str.length; i++) { sub = str.substr(i, len); // if such a substring exists... if(str.indexOf(sub, i + len) == i + len) { // "count" how many occurences (advance index till beyond repeats) for(n = i + len * 2; str.indexOf(sub, n) == n; n += len); // return a string composed of the compressed part before the match + // the substring + the compressed part beyond the match return compress(str.substr(0, i)) + sub + compress(str.substr(n)); } } } // if nothing found, return original string return str; } alert(JSON.stringify(cases.map(compress)));

After a long discussion in the comments regarding the complexity of the algorithm, I decided to reorganize a bit and use the self- startsWith function and use it to count internal operations (complexity ...).

I took the opportunity of further optimization by minimizing the distribution of rows to a minimum, so that now recursion works with the whole row + start / end indices.

The code below generates output that includes the input string, the result, n ^ 2 (for comparison, O (n ^ 2)) and the actual internal account of the operation. I added a few edge cases to show how this works. I could not find the input that leads to the n ^ 2 score, all of them were lower.

 var cases = ['aabcccddeaaa', 'abababcdeee', 'ababcdabcd', 'aabaaabaab', '1', '111222', '123456789', '1234567899']; var innerCount; function startsWith(str, start, subStart, subLen) { var subEnd = subStart + subLen - 1; while(subStart <= subEnd) { innerCount++; if(str[start++] != str[subStart++]) return false; } return true; } function doCompress(str, maxLen, minIndex, maxIndex) { var len, sub, i, n; // if str is shorter than 2, can't be any repeating substrings if(maxIndex - minIndex + 1 < 2) return str.substring(minIndex, maxIndex + 1); for(len = maxLen; len > 0; len--) { // search for a repeating substring of "len" length starting at index i for(i = minIndex; i + (len * 2) <= maxIndex + 1; i++) { // if such a substring exists... if(startsWith(str, i + len, i, len)) { // "count" how many occurences (advance index till beyond repeats) for(n = i + len * 2; (n + len <= maxIndex + 1) && startsWith(str, n, i, len); n += len); // return a string composed of the compressed part before the match + // the substring + the compressed part beyond the match return (i > minIndex ? doCompress(str, len - 1, minIndex, i - 1) : '') + str.substr(i, len) + (n < maxIndex ? doCompress(str, len, n, maxIndex) : ''); } } } // if nothing found, return original string return str.substring(minIndex, maxIndex + 1); } function compress(str) { innerCount = 0; // max meaningful length is str.length/2 (truncated to integer) return { source: str, result: doCompress(str, (str.length / 2) | 0, 0, str.length - 1), 'n^2': str.length*str.length, innerCount: innerCount}; } alert(JSON.stringify(cases.map(compress), null, '\t'));

This solution has a time complexity of O (n ^ 2).

Simple line compression: removing consecutive substrings

More articles: