Optimization of building trie over all substrings

Question

Optimization of building trie over all substrings

I am solving a trie related problem. There are many strings S. I need to create a trie over all the substrings for each string in S. I use the following procedure:

String strings[] = { ... }; // array containing all strings for(int i = 0; i < strings.length; i++) { String w = strings[i]; for (int j = 0; j < w.length(); j++) { for (int k = j + 1; k <= w.length(); k++) { trie.insert(w.substring(j, k)); } } }

I am using the trie implementation provided here . However, I am wondering if there are certain optimizations that can be done to reduce the complexity of creating trie across all substrings?

Why do I need it? Because I'm trying to solve this problem .

+7

optimization algorithm data-structures suffix-tree trie

Bhoot Jul 01 '15 at 16:11

source share

4 answers

kajacx · Answer 1 · 2015-07-01T16:50:07+0000

If we have N words, each with a maximum length of L , your algorithm will take O(N*L^3) (suppose the addition to trie is linear with the length of the adding word). However, the size of the resulting trie (the number of nodes) does not exceed O(N*L^2) , so it looks like you are wasting time and you can do better.

Indeed, you can, but you need to pull a few tricks from your sleeve. In addition, you will no longer need trie.

.substring() in constant time

In Java 7, each String had a backup char[] array, as well as an initial position and length. This allowed us to use the .substring() method in constant time since String is an immutable class. A new String object with the same char[] background array was created only with a different starting position and length.

You will need to slightly increase this number in order to support the addition at the end of the line, increasing the length. Always create a new string object, but leave an array of support.

Reassign the hash in constant time after adding a single character

Again, let me use the Java hashCode() function for String :

 int hash = 0; for (int i = 0; i < data.length; i++) { hash = 31 * hash + data[i]; } // data is the backing array

Now, how will the hash change after adding one character at the end of a word? Just add its value (ASCII code) multiplied by 31^length . You can save credentials 31 in a separate table, other prime numbers can also be used.

Save all substrings in one HashMap

Using tricks 1 and 2, you can generate all the substrings in time O(N*L^2) , which is the total number of substrings. Always start with a line of length one and add one character at a time. Put all your lines in one HashMap to reduce duplicity.

(You can skip 2 and 3 and discard duplicates when sorting / after sorting, maybe it will be even faster.)

Sort your substrings and you're good to go.

Well, when I got to point 4, I realized that my plan would not work, because when sorting you need to compare strings, and this can take O(L) . I came up with several attempts to solve this problem, including sorting the bucket, but no one will be faster than the original O(N*L^3)

I will just answer here if he inspires someone.

If you do not know the Aho-Corasic algorithm , pay attention to this, this may be useful for your problem.

newSolar · Answer 2 · 2015-07-03T13:02:44+0000

You may need a suffix machine . It costs only O (n) time and can recognize all substrings.

Suffix array can also solve these problems.

These two algorithms can solve most string problems and are really hard to learn. After you learn those, you will allow it.

n00bc0d3r · Answer 3 · 2015-07-01T16:35:35+0000

You may consider the following optimization:

Maintain a list of processed substrings. When inserting a substring, check to see if the processed set contains this particular substring, and if so, skip pasting that substring into the trie.

However, the complexity of the worst case for inserting all substrings in trie will be of order n ^ 2, where n is the size of the string array. On the problem page, this works on the order of 10 ^ 8 input operations in three. Therefore, even if each insertion takes 10 operations on average, you will have only 10 ^ 9 operations, which will allow you to exceed the time limit.

The problem page refers to the LCP array as the corresponding section of the problem. You should consider changing the approach.

dened · Answer 4 · 2015-07-05T15:22:11+0000

First, note that just adding the suffix to the trie is enough, and nodes for each substring will be added along the way.

Secondly, you have to compress the trigger , otherwise it will not fit into the memory limit imposed by HackerRank. It will also make your decision faster.

I just presented my decision implementing these proposals, and it was accepted. (The maximum execution time was 0.08 seconds.)

But you can make your decision even faster by running a linear time algorithm to build a tree of suffixes . You can read about linear time suffix tree algorithms here and here . There is also an explanation of Ukkonen's algorithm on StackOverflow here .

Optimization of building trie over all substrings

More articles: