Having studied the longest general substring algorithm, I was interested to learn about a specific version of the problem. It is described as follows:
Given two non-empty string sequences, X = (x1, x2, x3, .... x (n)) and Y = (y1, y2, y3, ..., y (m)), where x (i) and y (i) are character strings, find the longest string in X, which is a substring of all strings of Y.
I have a substring(x, y) function that returns a boolean depicting whether x is a substring in y or not. Obviously, I have to concatenate all the lines in Y in order to form one large line, let's say B. is indicated. I thought about the following approaches:
- Naive . Start by combining all the lines in X to form the string A (n). Apply the substring (A (n), B) - this includes iterating backward in the string A (n). If true, the algorithm ends here and returns A (n) - or any part of it is included in the specified substring. If not, start using (A (n - 1), B) and so on. If such a string does not exist in X, I return an empty string.
Obviously, this approach will take some time depending on the implementation. Assuming that I am using an iterative approach, at each iteration I would have to iterate back through String at that level / index, and then apply substring (). This will take at least two loops, and O(size(B) * maxlength(x1, x2,...)) worst time or more depending on the substring () (correct me if not).
I thought of a second approach based on suffix trees / arrays.
- Generic suffix tree . I will build the GST of the Y sequence using the Ukkonen algorithm in
O(maxlength(y1, y2,...) (?). Lack of knowledge about the bites of suffix trees. I believe that the suffix tree approach will significantly reduce the execution time (due to space) for searching for a substring, but I donβt know how to implement this operation.
If there is a better approach, I would love to know.
EDIT: Sorry if it seemed to me that I had abandoned this topic.
What should I do if I should not use GST, but some standard data structure, such as a stack, queue, set, heap, priority queue, etc.? Naturally, the sequence X must be sorted, the largest row. If I store it in a string array, I will have to use a sorting algorithm such as mergesort / quicksort. The goal is the most efficient working time.
Can I store X in a structure that automatically sorts its elements as they are assembled? How about maximum heap?
It would seem that the suffix tree is the best way to find substrings this way. Is there any other data structure that I could use?