The longest common substring in two string sequences

Question

The longest common substring in two string sequences

Having studied the longest general substring algorithm, I was interested to learn about a specific version of the problem. It is described as follows:

Given two non-empty string sequences, X = (x1, x2, x3, .... x (n)) and Y = (y1, y2, y3, ..., y (m)), where x (i) and y (i) are character strings, find the longest string in X, which is a substring of all strings of Y.

I have a substring(x, y) function that returns a boolean depicting whether x is a substring in y or not. Obviously, I have to concatenate all the lines in Y in order to form one large line, let's say B. is indicated. I thought about the following approaches:

Naive . Start by combining all the lines in X to form the string A (n). Apply the substring (A (n), B) - this includes iterating backward in the string A (n). If true, the algorithm ends here and returns A (n) - or any part of it is included in the specified substring. If not, start using (A (n - 1), B) and so on. If such a string does not exist in X, I return an empty string.

Obviously, this approach will take some time depending on the implementation. Assuming that I am using an iterative approach, at each iteration I would have to iterate back through String at that level / index, and then apply substring (). This will take at least two loops, and O(size(B) * maxlength(x1, x2,...)) worst time or more depending on the substring () (correct me if not).

I thought of a second approach based on suffix trees / arrays.

Generic suffix tree . I will build the GST of the Y sequence using the Ukkonen algorithm in O(maxlength(y1, y2,...) (?). Lack of knowledge about the bites of suffix trees. I believe that the suffix tree approach will significantly reduce the execution time (due to space) for searching for a substring, but I don’t know how to implement this operation.

If there is a better approach, I would love to know.

EDIT: Sorry if it seemed to me that I had abandoned this topic.

What should I do if I should not use GST, but some standard data structure, such as a stack, queue, set, heap, priority queue, etc.? Naturally, the sequence X must be sorted, the largest row. If I store it in a string array, I will have to use a sorting algorithm such as mergesort / quicksort. The goal is the most efficient working time.

Can I store X in a structure that automatically sorts its elements as they are assembled? How about maximum heap?

It would seem that the suffix tree is the best way to find substrings this way. Is there any other data structure that I could use?

+7

string algorithm data-structures

Pritishc Oct 3 '13 at 11:27

source share

3 answers

Vishkey · Answer 1 · 2013-10-03T12:00:03+0000

First arrange the X array for the longest string. So the first line in X, which is a substring of all Y lines, is the solution.

A multiprocessor algorithm would be the best way to solve the problem of testing each line of X with all Y-lines quickly.

Bentoy13 · Answer 2 · 2013-10-03T12:07:43+0000

Here is my idea for solving your problem; I'm not sure about everything, so comments can improve it if you think it is worth the effort.

Start by calculating all the common substrings of all the strings in Y. First, take two lines and build a tree of all common substrings. Then, for each other line in Y, remove each substring from the map that does not appear on that line. The complexity is linear with the number of lines in Y, but I cannot figure out how many elements there can be in the tree, so I cannot make an estimate of the final complexity.

Then find the longest string in X, which is a substring of one in the tree.

Some improvements must be made to keep the tree as small as possible, for example, to contain only substrings that are not substrings of others.

Gareth rees · Answer 3 · 2013-10-03T12:16:15+0000

Writing | Y | for the number of lines in the set Y and len (Y) for their total length:

Process strings in Y in a generalized suffix tree (for example, using the Ukkonen algorithm ). It takes time O (len (Y)), assuming a constant size alphabet.
Mark each node in the suffix tree depending on whether the line identified by this node belongs to all lines in Y. It takes O (| Y | len (Y)) time.
For each line in X, find it in the suffix tree and see if the node is marked as belonging to all lines in Y. Output the longest marked line. It takes time O (len (X)).

Total time: O (| Y | len (Y)) + O (len (X)).

The longest common substring in two string sequences

More articles: