Trie & sequenceences

We have two sets: A and B. Each of these sets includes strings. for example: A - {"abwcd", "dwas", "www"} and B - {"opqr", "tops", "ibmd"} How to count the subsequences that appear on all lines from set A, but not in one of the lines in set B? For the example above, the answer is 1 (subsequence "w").

All this in an optimal way. I thought about using two attempts, the first time I put all the subsequences of all the rows in B in trie t_B, and then, I started putting all the subsequences of all rows in in trie t_A, without updating trie if the same subsequence was found earlier in that same line (for example: if I have the line "aba", I do not count the subsequence "a" twice). Thus, if I find a subsequence that has n (size A) occurrences in t_A, I will check if it is in t_B, and if not, I consider it. But it is very slow, if A and B are 15 and the lines are about 100 characters long, my programs run for more than 1 second.

EDIT: Since any subsqeunce ends in the last character of the string or character before it, we do not need to generate all subsequences, but those that end with the last character of the string. When I insert them into trie, I mark each node with 1. So, if I have the line "abcd", I just press "abcd", "bcd", "cd" and "d", as it should be " skeleton "three. But this is not a very big optimization, I'm still looking for something better.

+7
source share
1 answer

You do not need to put all the subsequences of all the lines in in trie. Insert only valid ones. Check if the sequence is valid before adding. I assume the membership test is faster than adding a new item. Less trie fails membership tests faster, so this strategy is designed to cut trie as quickly as possible.

In particular: Put all the subsequences from the first line in to in the trie. (For efficiency, use the shortest string as the first). Keep a set of links to all leaf nodes. Then, for all lines in B, check each subsequence to see if it exists in A. If so, delete that sequence and link. (Start with the longest line in B to make the streamer as fast as possible).

You now have a minimal set of testing options. For all other lines in, check each subsequence to see if it exists in the trie. If so, mark node as valid, otherwise skip to the next subsequence. After each line, remove all invalid nodes from trie and reset the rest of the flags to invalid ones.

+3
source

All Articles