We have two sets, A and B, each containing strings. For example: A = {"abwcd", "dwas", "www"} and B = {"opqr", "tops", "ibmd"}. How can I count the subsequences that appear in every string of A but in none of the strings of B? For the example above, the answer is 1 (the subsequence "w").
All this should be done in an optimal way. I thought about using two tries: first I insert all the subsequences of every string in B into a trie t_B, and then I insert all the subsequences of every string in A into a trie t_A, without updating the trie when the same subsequence has already been found in the same string (for example, for the string "aba" I do not count the subsequence "a" twice). Then, if a subsequence has n occurrences in t_A (where n is the size of A), I check whether it is in t_B, and if it is not, I count it. But this is very slow: if A and B each have 15 strings and the strings are about 100 characters long, my program runs for more than 1 second.
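For reference, here is a minimal sketch of the counting rule described above, useful as a slow baseline to check faster versions against. It makes two assumptions that the question does not state outright: "subsequence" is taken to mean a contiguous run of characters, and "not in B" means "not contained in any string of B". It uses Python sets in place of the two tries, and the function names are made up for illustration.

```python
def substrings(s: str) -> set:
    """All distinct non-empty contiguous substrings of s.

    Using a set means a substring that occurs several times in the
    same string (like "a" in "aba") is only recorded once.
    """
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count_common_not_in_b(a: list, b: list) -> int:
    """Count substrings present in every string of a and in no string of b."""
    if not a:
        return 0
    # Substrings shared by all strings of A.
    common = set.intersection(*(substrings(s) for s in a))
    # Testing each candidate with `in` avoids enumerating B's substrings.
    return sum(1 for sub in common if not any(sub in t for t in b))


print(count_common_not_in_b(["abwcd", "dwas", "www"], ["opqr", "tops", "ibmd"]))
```

For the example from the question this prints 1, since "w" is the only substring shared by all three strings of A and it does not occur in any string of B.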
EDIT: Since any subsequence ends at the last character of the string or at some character before it, we do not need to generate all subsequences, only those that end at the last character of the string. When I insert them into the trie, I mark each node with 1. So, for the string "abcd", I insert only "abcd", "bcd", "cd" and "d", since these form the "skeleton" of the trie. But this is not a very big optimization; I'm still looking for something better.
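The suffix idea from the EDIT can be sketched roughly as follows. This is only one possible reading, not a tuned solution: it assumes "subsequence" means a contiguous substring (which is what inserting suffixes into a trie captures, since every substring is a prefix of some suffix), and it uses a single trie with a per-node counter plus a flag for B instead of separate t_A and t_B. The names `Node`, `count`, `last_seen`, `in_b` and `count_with_suffix_trie` are hypothetical.

```python
class Node:
    __slots__ = ("children", "count", "last_seen", "in_b")

    def __init__(self):
        self.children = {}    # next character -> Node
        self.count = 0        # number of A strings containing this substring
        self.last_seen = -1   # index of the last A string that touched this node
        self.in_b = False     # True if some B string contains this substring


def count_with_suffix_trie(a: list, b: list) -> int:
    if not a:
        return 0
    root = Node()
    for idx, s in enumerate(a):
        for start in range(len(s)):          # walk the suffix s[start:]
            node = root
            for ch in s[start:]:
                child = node.children.get(ch)
                if child is None:
                    if idx > 0:
                        break                # absent from a[0], so it cannot be common
                    child = node.children[ch] = Node()
                if child.last_seen != idx:   # count each string at most once per node
                    child.last_seen = idx
                    child.count += 1
                node = child
    for t in b:
        for start in range(len(t)):
            node = root
            for ch in t[start:]:
                node = node.children.get(ch)
                if node is None:
                    break                    # not a candidate, nothing to mark
                node.in_b = True
    # Each trie node corresponds to exactly one distinct substring.
    total = 0
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        if node.count == len(a):             # otherwise no extension can be common either
            if not node.in_b:
                total += 1
            stack.extend(node.children.values())
    return total


print(count_with_suffix_trie(["abwcd", "dwas", "www"], ["opqr", "tops", "ibmd"]))
```

Only the first string of A creates nodes; the later strings and all strings of B just walk the existing trie, so the trie stays at no more than about L²/2 nodes for a first string of length L. Whether that is fast enough for 15 strings of 100 characters would need to be measured.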
Robert Badea