Aho-Korasik and the correct substrings

I am trying to understand aho-corasick string matching algorithm. Suppose our patterns are abcd and bc . We get a tree like this

  [] /\ [a]..[b] / : | [b].: [c] | : [c]..... | [d] 

The dashed line indicates the failure function.

Now suppose we are rooted in the string abcd . This will follow the tree and determine the match "abcd", however, as far as I can tell, the match bc not reported. Do I not understand the algorithm?

+6
substring string-matching algorithm aho-corasick
source share
3 answers

Artem answers correctly, but perhaps not very clearly. Basically, you need to do the following: every time you come to a new node, examine the entire path starting with this node and including links to errors and find matches on this path. (This does not change your current position). In your example, you will consider the paths b-> b (no matches found) and c-> c (match bc found).

Note that for performance reasons, you need to cache the β€œnext match” value on each node. That is, if you explore the path starting with node u , and after a few steps find the corresponding node v , remember the value next_match[u] = v , so that the next time you need to explore this path, you can make one jump directly to v .

+3
source share

You should mark the nodes as real_final if there is a line in this node. In your example, such nodes are "abcd" and "bc". After that, you should output the final states for the nodes: node is final if true_final or node according to the failure function is final. So, "abcd", "bc" and "abc" will be final.

In other words, a node is final if any template ends here, or some final node is accessible with the current node following the failure links.

+3
source share

The AhoCorasick tree setup part sets up pointers indicating where to go in the tree if the next character does not match. For example. if you follow the abcq sequence in the tree as you drew it, you need to go from position abc to position bc to see if q is below bc. You can use this pass to set up a different set of pointers to tell it to signal a match on bcd after signaling a match on abcd, etc.

In writing, I found the sgrep source at http://www.cs.helsinki.fi/~jjaakkol/sgrep.html very useful. As far as I remember, sgrep does LOT more than just AhoCorasick, but has an AhoCorsick implementation as part of this.

0
source share

All Articles