Well, this is an old topic, but hopefully someone who is looking for it in the future will see it. The answer above is good, but I myself worked through an example to understand exactly what was going on.
The first part of the presentation is taken from the wiki, the part that I really wanted to talk about is how this backtracking array is built.
Here:
we work through the (relatively artificial) launch of the algorithm, where
W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE".
At any time, the algorithm is in a state defined by two integers:
m , which denotes the position in S, which is the beginning of the proposed match for W
i index in W denoting the symbol in question.
At each step, we compare S[m+i] with W[i] and move forward if they are equal. This is shown at the start of the run, for example
1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
We proceed from the comparison of consecutive W symbols with "parallel" S symbols, passing from one to another if they coincide. However, in the fourth stage, we get S [3] - space and W [3] = 'D', a mismatch. Instead of starting the search again on S [1], we note that between positions 0 and 3 in S there is no “A” other than 0; therefore, having previously checked all these symbols, we know that there is no way to find the beginning of a match if we check them again. Therefore, we move on to the next character, set m = 4 and i = 0.
1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
We quickly get the almost complete correspondence "ABCDAB" when a mismatch occurs again in W [6] (S [10]). However, only until the end of the current partial did we pass “AB”, which could be the beginning of a new match, so we must take this into account. Since we already know that these characters correspond to two characters before the current position, we do not need to check them again; we just reset m = 8, i = 2 and continue to match the current character. Thus, we not only omit previously matched S characters, but also previously matched W characters.
1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
This search is not performed immediately, however, since the template still does not contain a space, since in the first test we return to the beginning of W and start the search on the next character S: m = 11, reset i = 0.
1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
Again, we immediately fall into the match “ABCDAB”, but the next character “C” does not correspond to the final character “D” of the word W. Reasoning as before, we set m = 15 to start with the two-character string “AB” leading to the current position, set i = 2 and continue matching with the current position.
1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
This time we can finish the match, the first character of which is S [15].
The above example contains all the elements of the algorithm. At the moment, we assume the existence of a “partial correspondence” table T, described below, which indicates where we need to look for the beginning of a new match if the current one ends with a non-match. The entries of T are constructed so that if the match begins with S [m], which does not work when comparing S [m + i] with W [i], then the next possible match begins with the index m + i - T [i] in S (ie T [i] is the amount of “backtracking” that we must perform after the non-compliance). This has two meanings: first, T [0] = -1, which indicates that if W [0] is a mismatch, we cannot backtrack and should just check the next character; and secondly, although the next possible match begins with the index m + iT [i], as in the example above, we do not need to check any of the characters T [i] after that, to continue the search with W [T [i] ].
CONSTRUCTION OF MACHINERY:
so this backtrack array T [] we will call lps [], let's see how we calculate this guy
lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].
Examples: For the template "AABAACAABAA",
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
// so just move on to this quick
lps[0] is just 0 by default lps[1] is 1 because it looking at AA and A is both a prefix and suffix lps[2] is 0 because it looking at AAB and suffix is B but there is no prefix equal to B unless you count B itself which I guess is against the rules lps[3] is 1 because it looking at AABA and first A matches last A lps[4] is 2 becuase it looking at AABAA and first 2 A matches last 2 A lps[5] is 0 becuase it looking at AABAAC and nothing matches C ... For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0] For the pattern "AAAAA", lps[] is [0, 1, 2, 3, 4] For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3] For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
And that makes sense if you think about it ... if you don’t agree, you want to go back as far as possible, how far you go (the suffix part) is essentially a prefix, because by definition you have to start matching with the first character again . so if your line looks like
aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatche on the last char c, then you want to reuse aaaaaaaaaaaaaaa as a new head, just think about it through