What are the rules for changing the Boyer-Moore string algorithm?

I tried to understand the shift rules in the Boyer-Moore string search algorithm, but did not understand them. I read wikipedia here, but it's too complicated!

This will be very useful if someone lists the rule in a simple way.

+8
algorithm string-search boyer-moore
source share
3 answers

In the Boyer-Moore algorithm, you begin to compare pattern characters with text characters from the end of the picture. If you find a mismatch, you have a type configuration

....xyzabc.... <-text ....uabc <- pattern ^ mismatch 

Now shifting a bad symbol means shifting the template so that the text mismatch symbol is aligned with the last occurrence of this symbol in the initial part of the template (template minus the last symbol of the template) if there is such an occurrence or one position before the template if the inconsistent character is not displayed at all in the initial part of the template.

It could be a left shift if the situation

  v ...xyzazc... ....uazc ..uazc 

so one does not guarantee progress.

Another shift, a good suffix shift, aligns the matched part of the text m with the smallest occurrence of this sequence of characters in the pattern, preceded by a different character (including none if the matched suffix is ​​also the pattern prefix) than the matching suffix m pattern - if there is such event.

So for example

  v ....abcdabceabcfabc... ...xabcfabcfabc ...xabcfabcfabc 

will lead to a good shift suffix for four positions, since the corresponding part m = abcfabc is found in the template in four places to the left of its appearance suffix and is preceded by a different character ( x instead of f ) than in the suffix position.

If there is no complete match in the pattern preceding a character other than the suffix, a good shift of the suffix aligns the suffix of the agreed part of the text with the pattern prefix, selects the maximum overlap, for example,

  v ...robocab.... abacab abacab 

A good suffix shift always shifts the pattern to the right, so it guarantees progress.

Then, at each mismatch, the achievement of the bad character shift and the good suffix shift are compared, and the more chosen. This is explained in more detail by Christian Charras and Thierry Lecrock here , as well as many other string search algorithms.


For the example mentioned in the comments,

 SSIMPLE EXAMPLE EXAMPLE ^ 

the matched suffix is MPLE , and the inconsistent text character is I Thus, a bad character shift searches for the last occurrence of I in the initial part of the pattern. This is not so, so a bad character shift would shift the pattern so that the irrelevant I was one before the start of the pattern

 SSIMPLE EXAMPLE EXAMPLE 

and a good suffix shift looks for the right-most appearance of MPLE in a template that is not preceded by A , or the longest suffix of MPLE , which is the template prefix. There is no complete appearance of the matched part in the pattern before the suffix, therefore the longest suffix of the matched part, which is also the template prefix, defines a good shift of the suffix. In this case, the two suffixes of the matched part, which are the template prefixes, are a one-character string E and an empty string. The longest, obviously non-empty line, so a good shift of the suffix aligns the one-character suffix E in the agreed part of the text with the one-character prefix of the template

 SSIMPLE EXAMPLE EXAMPLE 

A good suffix shift moves the pattern further to the right, so this is the selected shift.

Then there is an immediate mismatch at the last position of the pattern, and then an incorrect character shift aligns P in the text with P in the template (and a good suffix shift should not be considered at all if the mismatch occurs with the last character of the template, since in this case it will never produce a larger shift than shift of bad character).

Then we have full compliance.

In the TXAMPLE pattern TXAMPLE good suffix shift reveals that the non-empty suffix of the matched part is the pattern prefix (and there is no occurrence of the full matched part in the pattern not preceding A ), so a good suffix shift aligns the empty suffix of the matched part of the text (border between E and space ) with an empty template prefix (an empty string preceding T ), resulting in

 SSIMPLE EXAMPLE TXAMPLE 

(then, in the next step, the shift of bad characters aligns two L s, and the next mismatch at this stage occurs in the initial T pattern).

+16
source share

There is a good visualization here .

(EDIT: There is also a very good explanation with both examples and an example of how to implement the preprocessing steps here .)

General rules:

  • We are looking for how to align the template with text so that the matching parts match. If such alignment does not exist, the template is not found in the text.
  • Check each alignment from right to left, that is, start by checking whether the last character matches the current alignment pattern.
  • When you click on a character that does not align, increase the offset (slide the pattern) so that the last occurrence of the text letter in the pattern aligns with that occurrence of the text letter that is currently looking. This requires a preliminary construction (or search every time, but less effective) of the index of where each letter exists in the template.
  • If the character considered in the text does not appear in the template, skip ahead along the entire length of the template.
  • If the end of the picture sticks out beyond the end of the text (offset + length (pattern)> length (text)), the pattern is not displayed in the text.

What I just described is a "bad character" rule. The “good suffix” rule provides another option for switching; depending on what kind of shifts are next, this is the one you should take. It is possible to implement an algorithm without the correct suffix rule, but it will be less efficient after creating indexes.

The good suffix rule requires that you also know where to find each multi-character substring of the pattern. When you encounter a mismatch (checking, as always, from right to left), a shift with a good suffix moves the pattern to the point where already matching letters will do it again. Alternatively, if the selected part is unique in the template, you know that you can skip it all the way, because if it did not match when it was lined with a single occurrence, it could not match if it was lined with any other part of the picture.

For example, consider the following situation:

  • My model ends with a "dog."
  • Currently, I have compared it to the part of the text that ends with the “dog”.
  • Therefore, the bad letter is 's' (where they stop matching), and the good suffix is ​​“dog” (the part that matched).

I have two options:

  • Shift so that the first (from right to left) in the template is aligned with the text. If the pattern does not have 's', move the beginning of the pattern only to 's'.
  • Shift so that the next “dog” is aligned with the “dog” in the text. If there is no other “dog” in the template, drag the beginning of the picture to the end of the “dog”.

and I have to accept what allows me to move on.

If you are still confused, try asking a more specific question; it's hard to understand when we don’t know where you are stuck.

+5
source share

There are two heuristics: a bat heuristic and a good sketch heuristic.

First, you know, needle comparison begins at the end. So, if the characters do not match the needle, such at least the compared character in the haystack will correspond to the needle. E. g. the needle is “ABRACADABRA”, and the current character in the haystack is “B”, it does not correspond to the last “A”, and also does not correspond to the previous “R”, so a shift by one is meaningless, there will be no match. But "B" corresponds to the 2nd from the needle end symbol. Thus, we would move the needle by at least 2 positions. If the current character in the haystack does not match either in the needle, the needle should be offset outside the current character. In other words, we change the pattern until the current character in the gland is comparable to the needle, or the whole needle moves further.

The shift amount is calculated and stored in the array, so for "ABRACADABRA" it will be: ['R'] = 1, ['B'] = 2, ['D'] = 4, etc.

 haystack: XYABRACADABRA... | needle: ABRACADABRA ABRACADABRA <-- pointless shift: R will not match B ABRACADABRA 

Secondly, if a match is found at least for “ABRA” in the haystack (but without full matching), the needle can be moved so that the next “ABRA” matches.

The shift amount for the agreed part is also pre-calculated: e. d. ['A'] = 3, ['RA'] = 11, ['BRA'] = 11, ['ABRA'] = 7, ['DABRA'] = 7 ...

 haystack: XYZYXADABRACADABRA... needle: ABRACADABRA (shift to ABRA from matched ADABRA) ~~~~ ~~~~ ABRACADABRA 

This is not a complete expansion of all angular cases, but the main idea of ​​the algorithm.

+1
source share

All Articles