Understanding the Knut-Morris-Pratt Algorithm

Question

Understanding the Knut-Morris-Pratt Algorithm

Can someone explain this to me? I read about it, and it's still hard for me to follow.

text: ababdbaababa
sample: ababa

table for ababa - -1 0 0 1 2.

I think I understand how the table is built, but I do not understand how to move after the discrepancy has occurred. Looks like we don’t even use the table when moving?

when do we use the table?

+6

string algorithm pattern-matching

antz Nov 07 '12 at 14:34

source share

4 answers

Here I briefly described the calculation of a prefix function and moving around the text here.

For More Information: Knuth-Morris-Pratt String Search Algorithm

Text Navigation:

 Text: ABC ABCDAB ABCDABCDABDE Pattern : ABCDABD

Scenario 1 . The template and the text have / some matching characters / s.
e.g. 1: characters 3 are present here.

Get the value from the table for characters 3 . (index 2, ABC), i.e. 0 Therefore, the shift = 3 - 0, i.e. 3

for example, 2: there are 6 matching characters.

Get value from table for 6 . (index 5, ABCDAB), i.e. 2 Therefore shift = 6 - 2 ie 4

Scenario 2 . If there are no matching characters, slide one.

+25

Rukmal dias May 11 '14 at 16:20

source share

Well, this is an old topic, but hopefully someone who is looking for it in the future will see it. The answer above is good, but I myself worked through an example to understand exactly what was going on.

The first part of the presentation is taken from the wiki, the part that I really wanted to talk about is how this backtracking array is built.

Here:

we work through the (relatively artificial) launch of the algorithm, where

 W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE".

At any time, the algorithm is in a state defined by two integers:

m , which denotes the position in S, which is the beginning of the proposed match for W

i index in W denoting the symbol in question.

At each step, we compare S[m+i] with W[i] and move forward if they are equal. This is shown at the start of the run, for example

  1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456

We proceed from the comparison of consecutive W symbols with "parallel" S symbols, passing from one to another if they coincide. However, in the fourth stage, we get S [3] - space and W [3] = 'D', a mismatch. Instead of starting the search again on S [1], we note that between positions 0 and 3 in S there is no “A” other than 0; therefore, having previously checked all these symbols, we know that there is no way to find the beginning of a match if we check them again. Therefore, we move on to the next character, set m = 4 and i = 0.

  1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456

We quickly get the almost complete correspondence "ABCDAB" when a mismatch occurs again in W [6] (S [10]). However, only until the end of the current partial did we pass “AB”, which could be the beginning of a new match, so we must take this into account. Since we already know that these characters correspond to two characters before the current position, we do not need to check them again; we just reset m = 8, i = 2 and continue to match the current character. Thus, we not only omit previously matched S characters, but also previously matched W characters.

  1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456

This search is not performed immediately, however, since the template still does not contain a space, since in the first test we return to the beginning of W and start the search on the next character S: m = 11, reset i = 0.

  1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456

Again, we immediately fall into the match “ABCDAB”, but the next character “C” does not correspond to the final character “D” of the word W. Reasoning as before, we set m = 15 to start with the two-character string “AB” leading to the current position, set i = 2 and continue matching with the current position.

  1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456

This time we can finish the match, the first character of which is S [15].

The above example contains all the elements of the algorithm. At the moment, we assume the existence of a “partial correspondence” table T, described below, which indicates where we need to look for the beginning of a new match if the current one ends with a non-match. The entries of T are constructed so that if the match begins with S [m], which does not work when comparing S [m + i] with W [i], then the next possible match begins with the index m + i - T [i] in S (ie T [i] is the amount of “backtracking” that we must perform after the non-compliance). This has two meanings: first, T [0] = -1, which indicates that if W [0] is a mismatch, we cannot backtrack and should just check the next character; and secondly, although the next possible match begins with the index m + iT [i], as in the example above, we do not need to check any of the characters T [i] after that, to continue the search with W [T [i] ].

CONSTRUCTION OF MACHINERY:

so this backtrack array T [] we will call lps [], let's see how we calculate this guy

 lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].

Examples: For the template "AABAACAABAA",

 lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

// so just move on to this quick

  lps[0] is just 0 by default lps[1] is 1 because it looking at AA and A is both a prefix and suffix lps[2] is 0 because it looking at AAB and suffix is B but there is no prefix equal to B unless you count B itself which I guess is against the rules lps[3] is 1 because it looking at AABA and first A matches last A lps[4] is 2 becuase it looking at AABAA and first 2 A matches last 2 A lps[5] is 0 becuase it looking at AABAAC and nothing matches C ... For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0] For the pattern "AAAAA", lps[] is [0, 1, 2, 3, 4] For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3] For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]

And that makes sense if you think about it ... if you don’t agree, you want to go back as far as possible, how far you go (the suffix part) is essentially a prefix, because by definition you have to start matching with the first character again . so if your line looks like

aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatche on the last char c, then you want to reuse aaaaaaaaaaaaaaa as a new head, just think about it through

+9

tweaking Nov 25 '13 at 16:39

source share

Complete solution using Java:

 package src.com.recursion; /* * This Expains the Search of pattern in text in O(n) */ public class FindPatternInText { public int checkIfExists(char[] text, char[] pattern) { int index = 0; int[] lps = new int[pattern.length]; createPrefixSuffixArray(pattern, lps); int i = 0; int j = 0; int textLength = text.length; while (i < textLength) { if (pattern[j] == text[i]) { j++; i++; } if (j == pattern.length) return i - j; else if (i < textLength && pattern[j] != text[i]) { if (j != 0) { j = lps[j - 1]; } else { i++; } } } return index; } private void createPrefixSuffixArray(char[] pattern, int[] lps) { lps[0] = 0; int index = 0; int i = 1; while (i < pattern.length) { if (pattern[i] == pattern[index]) { lps[i] = index; i++; index++; } else { if (index != 0) { index = lps[index - 1]; } else { lps[i] = 0; i++; } } } } public static void main(String args[]) { String text = "ABABDABACDABABCABAB"; String pattern = "ABABCABAB"; System.out.println("Point where the pattern match starts is " + new FindPatternInText().checkIfExists(text.toCharArray(), pattern.toCharArray())); } }

0

Akhil gupta Sep 28 '16 at 12:00

source share

Origin · Accepted Answer · 2012-11-07T15:09:08+0000

The table is used when a mismatch occurs. Let apply the template to your text:

You start matching the text with the template and check if your template can be in the text starting from the first position. You are comparing text[1] with pattern[1] , and that turns out to be a coincidence. You do the same for text[2] , text[3] and text[4] .

when you want to combine text[5] with pattern[5] , you have no match ( d <> a ). Then you know that your template will not start from the first position. Then you can start the re-matching for position 2, but this is not effective. You can use the table now .

The error occurred when pattern[5] , so you go to table[5] , which is 2. This means that you can start matching again at the current position with 2 characters already matched . Instead of starting to compare position 2, you can start work in the previous position (1) + table[5] (2) = 3. Indeed, if we look at text[3] and text[4] , we will see that it is equal to pattern[1] and pattern[2] . Really.

The numbers in the table show how many positions have already been agreed upon when an error occurs. In this case, 2 characters of the following pattern have already been matched. Then you can immediately start matching for position 3 and skip position 2 (since the pattern cannot be found starting at position [2]).

Understanding the Knut-Morris-Pratt Algorithm

More articles: