Generate all unique substrings for a given string

Given the string s , what is the fastest method for generating a set of all its unique substrings?

Example: for str = "aba" we get substrs={"a", "b", "ab", "ba", "aba"} .

A naive algorithm would be to traverse the entire line producing substrings of length 1..n in each iteration, getting the upper bound O(n^2) .

Is a better border possible?

(this is technically homework, so only pointers are welcome)

+59
language-agnostic algorithm
Apr 01 2018-10-01T00:
source share
14 answers

According to other posters, there are potential substrings O (n ^ 2) for a given string, so printing them cannot be completed faster than this. However, there is an effective representation of a set that can be constructed in linear time: a suffix tree .

+42
Apr 01 '10 at
source share

It is impossible to do this faster than O (n 2 ), because the line contains the total number of substrings O (n 2 ), so if you have to generate all of them, their number will be n(n + 1) / 2 in the worst case, therefore, the lower bound is <(→ 2 ) Ω (n 2 ).

+13
Apr 01 '10 at 12:40
source share

The first is brute force, which has O (N ^ 3) complexity, which can be reduced to O (N ^ 2 log (N)) The second using a HashSet, which has O (N ^ 2) complexity, and the Third, using LCP, initially I find the whole suffix of this line, which has the worst case O (N ^ 2) and the best case O (N Log (N)).

First decision: -

 import java.util.Scanner; public class DistinctSubString { public static void main(String[] args) { Scanner in = new Scanner(System.in); System.out.print("Enter The string"); String s = in.nextLine(); long startTime = System.currentTimeMillis(); int L = s.length(); int N = L * (L + 1) / 2; String[] Comb = new String[N]; for (int i = 0, p = 0; i < L; ++i) { for (int j = 0; j < (L - i); ++j) { Comb[p++] = s.substring(j, i + j + 1); } } /* * for(int j=0;j<N;++j) { System.out.println(Comb[j]); } */ boolean[] val = new boolean[N]; for (int i = 0; i < N; ++i) val[i] = true; int counter = N; int p = 0, start = 0; for (int i = 0, j; i < L; ++i) { p = L - i; for (j = start; j < (start + p); ++j) { if (val[j]) { //System.out.println(Comb[j]); for (int k = j + 1; k < start + p; ++k) { if (Comb[j].equals(Comb[k])) { counter--; val[k] = false; } } } } start = j; } System.out.println("Substrings are " + N + " of which unique substrings are " + counter); long endTime = System.currentTimeMillis(); System.out.println("It took " + (endTime - startTime) + " milliseconds"); } } 



Second solution: -

 import java.util.*; public class DistictSubstrings_usingHashTable { public static void main(String args[]) { // create a hash set Scanner in = new Scanner(System.in); System.out.print("Enter The string"); String s = in.nextLine(); int L = s.length(); long startTime = System.currentTimeMillis(); Set<String> hs = new HashSet<String>(); // add elements to the hash set for (int i = 0; i < L; ++i) { for (int j = 0; j < (L - i); ++j) { hs.add(s.substring(j, i + j + 1)); } } System.out.println(hs.size()); long endTime = System.currentTimeMillis(); System.out.println("It took " + (endTime - startTime) + " milliseconds"); } } 



Third solution: -

 import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.util.Arrays; public class LCPsolnFroDistinctSubString { public static void main(String[] args) throws IOException { BufferedReader br = new BufferedReader(new InputStreamReader(System.in)); System.out.println("Enter Desired String "); String string = br.readLine(); int length = string.length(); String[] arrayString = new String[length]; for (int i = 0; i < length; ++i) { arrayString[i] = string.substring(length - 1 - i, length); } Arrays.sort(arrayString); for (int i = 0; i < length; ++i) System.out.println(arrayString[i]); long num_substring = arrayString[0].length(); for (int i = 0; i < length - 1; ++i) { int j = 0; for (; j < arrayString[i].length(); ++j) { if (!((arrayString[i].substring(0, j + 1)).equals((arrayString)[i + 1] .substring(0, j + 1)))) { break; } } num_substring += arrayString[i + 1].length() - j; } System.out.println("unique substrings = " + num_substring); } } 

Fourth solution: -

  public static void printAllCombinations(String soFar, String rest) { if(rest.isEmpty()) { System.out.println(soFar); } else { printAllCombinations(soFar + rest.substring(0,1), rest.substring(1)); printAllCombinations(soFar , rest.substring(1)); } } Test case:- printAllCombinations("", "abcd"); 
+7
Sep 04 '13 at 20:46
source share

For big oh ... The best you could do would be O (n ^ 2)

There is no need to reinvent the wheel, it is not based on lines, but on sets, so you have to accept these concepts and apply them to your own situation.

Algorithms

Really nice white paper from MS

In depth PowerPoint

Blog in perms lines

+3
Apr 01 '10 at
source share

since there are potentially n*(n+1)/2 different substrings (+1 for an empty substring), I doubt that you can be better than O(n*2) (worst case). The easiest way is to generate them and use the beautiful O(1) lookup table (for example, a hash map) to eliminate duplicates immediately when you find them.

+2
Apr 01 '10 at
source share
 class SubstringsOfAString { public static void main(String args[]) { String string = "Hello", sub = null; System.out.println("Substrings of \"" + string + "\" are :-"); for (int i = 0; i < string.length(); i++) { for (int j = 1; j <= string.length() - i; j++) { sub = string.substring(i, j + i); System.out.println(sub); } } } } 
+2
Sep 08 '16 at 6:29
source share
 class program { List<String> lst = new List<String>(); String str = "abc"; public void func() { subset(0, ""); lst.Sort(); lst = lst.Distinct().ToList(); foreach (String item in lst) { Console.WriteLine(item); } } void subset(int n, String s) { for (int i = n; i < str.Length; i++) { lst.Add(s + str[i].ToString()); subset(i + 1, s + str[i].ToString()); } } } 
+1
Nov 28 '15 at 6:49
source share

This can only be done in o (n ^ 2), since the total number of unique substrings of the string will be n (n + 1) / 2.

Example:

string s = "abcd"

pass 0: (all lines are 1 in length)

a, b, c, d = 4 lines

pass 1: (all lines are 2 in length)

ab, bc, cd = 3 lines

pass 2: (all lines have a length of 3)

abc, bcd = 2 lines

pass 3: (all lines are 4 in length)

abcd = 1 line

Using this analogy, we can write a solution with o (n ^ 2) time complexity and constant spatial complexity.

The source code is as follows:

 #include<stdio.h> void print(char arr[], int start, int end) { int i; for(i=start;i<=end;i++) { printf("%c",arr[i]); } printf("\n"); } void substrings(char arr[], int n) { int pass,j,start,end; int no_of_strings = n-1; for(pass=0;pass<n;pass++) { start = 0; end = start+pass; for(j=no_of_strings;j>=0;j--) { print(arr,start, end); start++; end = start+pass; } no_of_strings--; } } int main() { char str[] = "abcd"; substrings(str,4); return 0; } 
0
Mar 10 '14 at 23:09
source share

Here is my Python code. It generates all possible substrings of any string.

 def find_substring(str_in): substrs = [] if len(str_in) <= 1: return [str_in] s1 = find_substring(str_in[:1]) s2 = find_substring(str_in[1:]) substrs.append(s1) substrs.append(s2) for s11 in s1: substrs.append(s11) for s21 in s2: substrs.append("%s%s" %(s11, s21)) for s21 in s2: substrs.append(s21) return set(substrs) 

If you pass the str_ = "abcdef" function, it generates the following results:

a, ab, abc, abcd, abcde, abcdef, abcdf, abce, abcef, abcf, abd, abde, abdef, abdf, abe, abef, abf, ac, acd, acde, acdef, acdf, ace, acef, acf, ad, ade, adef, adf, ae, aef, af, b, bc, bcd, bcde, bcdef, bcdf, bce, bcef, bcf, bd, bde, bdef, bdf, be, bef, bf, c, cd, cde, cdef, cdf, ce, cef, cf, d, de, def, df, e, ef, f

0
Jul 25 '15 at 19:35
source share

The naive algorithm accepts O (n ^ 3) time instead of O (n ^ 2) time. There is an O (n ^ 2) number of substrings. And if you put O (n ^ 2) the number of substrings, for example, set, then it compares O (lgn) comparisons for each line to check if it is alrady in the set or not. In addition, O (n) time is required to compare strings. Therefore, if you use set, it will take O (n ^ 3 lgn) time. and you can reduce O (n ^ 3) time if you use a hash table instead of the set one.

The fact is that this is a string comparison, not a number comparison.

So, one of the best algorithms allows us to say that if you use the suffix algorithm and the longest common prefix (LCP), it reduces O (n ^ 2) time for this problem. Building a suffix array using the O (n) time algorithm. Time for LCP = O (n). Since for each pair of lines in the suffix array, LCP is so the total time is O (n ^ 2) time to find the length of the various adjustments.

Also, if you want to print all the different substrings, it will take O (n ^ 2) time.

0
Oct 11 '15 at 2:58
source share

Prints unique substrings. https://ideone.com/QVWOh0

 def uniq_substring(test): lista=[] [lista.append(test[i:i+k+1]) for i in range(len(test)) for k in range(len(test)-i) if test[i:i+k+1] not in lista and test[i:i+k+1][::-1] not in lista] print lista uniq_substring('rohit') uniq_substring('abab') ['r', 'ro', 'roh', 'rohi', 'rohit', 'o', 'oh', 'ohi', 'ohit', 'h', 'hi', 'hit', 'i', 'it', 't'] ['a', 'ab', 'aba', 'abab', 'b', 'bab'] 
0
May 23 '16 at
source share

Try this code using an suffix array and the longest common prefix. It can also give you the total number of unique substrings. The code may lead to a stack overflow in visual studio, but works fine in Eclipse C ++. This is because it returns vectors for functions. Did not check it on very long lines. Do it and report back.

  // C++ program for building LCP array for given text #include <bits/stdc++.h> #include <vector> #include <string> using namespace std; #define MAX 100000 int cum[MAX]; // Structure to store information of a suffix struct suffix { int index; // To store original index int rank[2]; // To store ranks and next rank pair }; // A comparison function used by sort() to compare two suffixes // Compares two pairs, returns 1 if first pair is smaller int cmp(struct suffix a, struct suffix b) { return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0): (a.rank[0] < b.rank[0] ?1: 0); } // This is the main function that takes a string 'txt' of size n as an // argument, builds and return the suffix array for the given string vector<int> buildSuffixArray(string txt, int n) { // A structure to store suffixes and their indexes struct suffix suffixes[n]; // Store suffixes and their indexes in an array of structures. // The structure is needed to sort the suffixes alphabatically // and maintain their old indexes while sorting for (int i = 0; i < n; i++) { suffixes[i].index = i; suffixes[i].rank[0] = txt[i] - 'a'; suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1; } // Sort the suffixes using the comparison function // defined above. sort(suffixes, suffixes+n, cmp); // At his point, all suffixes are sorted according to first // 2 characters. Let us sort suffixes according to first 4 // characters, then first 8 and so on int ind[n]; // This array is needed to get the index in suffixes[] // from original index. This mapping is needed to get // next suffix. for (int k = 4; k < 2*n; k = k*2) { // Assigning rank and index values to first suffix int rank = 0; int prev_rank = suffixes[0].rank[0]; suffixes[0].rank[0] = rank; ind[suffixes[0].index] = 0; // Assigning rank to suffixes for (int i = 1; i < n; i++) { // If first rank and next ranks are same as that of previous // suffix in array, assign the same new rank to this suffix if (suffixes[i].rank[0] == prev_rank && suffixes[i].rank[1] == suffixes[i-1].rank[1]) { prev_rank = suffixes[i].rank[0]; suffixes[i].rank[0] = rank; } else // Otherwise increment rank and assign { prev_rank = suffixes[i].rank[0]; suffixes[i].rank[0] = ++rank; } ind[suffixes[i].index] = i; } // Assign next rank to every suffix for (int i = 0; i < n; i++) { int nextindex = suffixes[i].index + k/2; suffixes[i].rank[1] = (nextindex < n)? suffixes[ind[nextindex]].rank[0]: -1; } // Sort the suffixes according to first k characters sort(suffixes, suffixes+n, cmp); } // Store indexes of all sorted suffixes in the suffix array vector<int>suffixArr; for (int i = 0; i < n; i++) suffixArr.push_back(suffixes[i].index); // Return the suffix array return suffixArr; } /* To construct and return LCP */ vector<int> kasai(string txt, vector<int> suffixArr) { int n = suffixArr.size(); // To store LCP array vector<int> lcp(n, 0); // An auxiliary array to store inverse of suffix array // elements. For example if suffixArr[0] is 5, the // invSuff[5] would store 0. This is used to get next // suffix string from suffix array. vector<int> invSuff(n, 0); // Fill values in invSuff[] for (int i=0; i < n; i++) invSuff[suffixArr[i]] = i; // Initialize length of previous LCP int k = 0; // Process all suffixes one by one starting from // first suffix in txt[] for (int i=0; i<n; i++) { /* If the current suffix is at n-1, then we don't have next substring to consider. So lcp is not defined for this substring, we put zero. */ if (invSuff[i] == n-1) { k = 0; continue; } /* j contains index of the next substring to be considered to compare with the present substring, ie, next string in suffix array */ int j = suffixArr[invSuff[i]+1]; // Directly start matching from k'th index as // at-least k-1 characters will match while (i+k<n && j+k<n && txt[i+k]==txt[j+k]) k++; lcp[invSuff[i]] = k; // lcp for the present suffix. // Deleting the starting character from the string. if (k>0) k--; } // return the constructed lcp array return lcp; } // Utility function to print an array void printArr(vector<int>arr, int n) { for (int i = 0; i < n; i++) cout << arr[i] << " "; cout << endl; } // Driver program int main() { int t; cin >> t; //t = 1; while (t > 0) { //string str = "banana"; string str; cin >> str; // >> k; vector<int>suffixArr = buildSuffixArray(str, str.length()); int n = suffixArr.size(); cout << "Suffix Array : \n"; printArr(suffixArr, n); vector<int>lcp = kasai(str, suffixArr); cout << "\nLCP Array : \n"; printArr(lcp, n); // cum will hold number of substrings if that'a what you want (total = cum[n-1] cum[0] = n - suffixArr[0]; // vector <pair<int,int>> substrs[n]; int count = 1; for (int i = 1; i <= n-suffixArr[0]; i++) { //substrs[0].push_back({suffixArr[0],i}); string sub_str = str.substr(suffixArr[0],i); cout << count << " " << sub_str << endl; count++; } for(int i = 1;i < n;i++) { cum[i] = cum[i-1] + (n - suffixArr[i] - lcp[i - 1]); int end = n - suffixArr[i]; int begin = lcp[i-1] + 1; int begin_suffix = suffixArr[i]; for (int j = begin, k = 1; j <= end; j++, k++) { //substrs[i].push_back({begin_suffix, lcp[i-1] + k}); // cout << "i push " << i << " " << begin_suffix << " " << k << endl; string sub_str = str.substr(begin_suffix, lcp[i-1] +k); cout << count << " " << sub_str << endl; count++; } } /*int count = 1; cout << endl; for(int i = 0; i < n; i++){ for (auto it = substrs[i].begin(); it != substrs[i].end(); ++it ) { string sub_str = str.substr(it->first, it->second); cout << count << " " << sub_str << endl; count++; } }*/ t--; } return 0; } 

And here is a simpler algorithm:

 #include <iostream> #include <string.h> #include <vector> #include <string> #include <algorithm> #include <time.h> using namespace std; char txt[100000], *p[100000]; int m, n; int cmp(const void *p, const void *q) { int rc = memcmp(*(char **)p, *(char **)q, m); return rc; } int main() { std::cin >> txt; int start_s = clock(); n = strlen(txt); int k; int i; int count = 1; for (m = 1; m <= n; m++) { for (k = 0; k+m <= n; k++) p[k] = txt+k; qsort(p, k, sizeof(p[0]), &cmp); for (i = 0; i < k; i++) { if (i != 0 && cmp(&p[i-1], &p[i]) == 0){ continue; } char cur_txt[100000]; memcpy(cur_txt, p[i],m); cur_txt[m] = '\0'; std::cout << count << " " << cur_txt << std::endl; count++; } } cout << --count << endl; int stop_s = clock(); float run_time = (stop_s - start_s) / double(CLOCKS_PER_SEC); cout << endl << "distinct substrings \t\tExecution time = " << run_time << " seconds" << endl; return 0; } 

Both algorithms are simply too slow for extremely long strings. I tested the algorithms against a string longer than 47,000, and the algorithms took more than 20 minutes, the first one taking 1200 seconds and the second 1360 seconds, and this is just counting unique substrings without output to the terminal. Therefore, perhaps for strings up to 1000 in length, you can get a working solution. However, both solutions computed the same total number of unique substrings. I tested both algorithms with respect to the string lengths of 2000 and 10000. The time was for the first algorithm: 0.33 s and 12 s; for the second algorithm, it was 0.535 s and 20 s. So, as a rule, the first algorithm is faster.

0
Oct 22 '16 at 13:53 on
source share

Many answers, including 2 for loops and a .substring () call, require O (N ^ 2) time complexity. However, it is important to note that the worst case for calling .substring () in Java (after updating 6 in Java 7) is O (N). Therefore, adding a .substring () call to your code, the order of N is increased by one.

Therefore, 2 for loops and a .substring () call inside these loops is O (N ^ 3) of time complexity.

0
Jul 03 '19 at 5:05
source share

Your programs do not give unique sbstrins.

Test using abab input, and the output should be aba,ba,bab,abab .

-2
Apr 08 '15 at 8:41
source share



All Articles