The length of the longest repeating line in a longer line

I work with a series of long strings (e.g. 'ABAABBFGGBHHSFAFDAFDAFDBB' ) that have different lengths. For each line, I would like to find the length of the longest consecutive occurrences of a particular substring (for example, 'AFD' , for which the answer in the above example is 3 ). Is there any elegant way to achieve this with MATLAB?

+5
source share
3 answers

Use regex:

 str = 'AFDABAABBFGGBHHSFAFDAFDAFDBB'; %// note: two occurrences of repeated 'AFD' substr = 'AFD'; %// sought substring r = regexp(str, ['(' substr ')+'], 'match'); lengths = cellfun(@numel, r)/numel(substr); result = max(lengths); 

You can increase speed by using 'length' instead of @numel , as suggested by Divakar :

 lengths = cellfun('length', r)/numel(substr); 

In this example

 lengths = 1 3 result = 3 
+4
source

Let the input and search strings be

 in_str = 'ABAAAFDAFDBBFGGBHHSFAFDAFDAFDBB' search_str = 'AFD' 

We can use strfind to get the starting indexes for the search string in the input string and from those that find consecutive search string groups -

 idx = strfind(in_str,search_str) grp_matches = diff(idx)==numel(search_str) 

So, now we have the β€œislands” of zeros and ones, where the islands of them represent the presence of consecutive grouped search bites. Then we must find the length of the islands, and the maximum length of the island - the desired result -

 df1 = diff([0 grp_matches 0]) %// Perform differentiation of matches 

The end of the islands will be designated β€œ-1” as a result of differentiation, and β€œ1” will mean the beginning of these islands. Thus, (find(df1==-1) - find(df1==1))+1 will be the length of the island. The end result will be max this -

 out = max(find(df1==-1) - find(df1==1))+1 

Summing up the discussion , all the code could be turned into a compact version, for example:

 df1 = diff([0 diff(strfind(in_str,search_str))==numel(search_str) 0]) out = max(find(df1==-1) - find(df1==1))+1 
+4
source

If you still do not know what the contents of the string you are looking for are:

  • Compare the first half of the line with the second half ( length: N/2 ), if it is equal, done. The longest string found.
  • Else: Perform bitwise autocorrelation of a subsample . Steps:
  • compare the window version (substring) of the entire string of length M < N/2 , performing bitwise autocorrelation and running AND on the result of each frame. If the selected subset is there, AND will result in 1 for the parts of the line where it is, and you are done, you have found the longest repeating substring.
  • If # 3 did not return a positive result ( no occurrence ), reduce the length M of the substring by 1: M = M - 1 and repeat # 3 , while no occurrence was found and M > 0 .
  • If there is absolutely no repetition of any substring , you will find out M = 0 times.
0
source

All Articles