How to solve exact match with convolution

I am trying to solve the problem of exact pattern matching when the alphabet consists of 5 characters {a, b, c, d, #}, where the special character # matches any character (including itself).

For example, if T = ab # aca # ab # a and P = da # ac, then P starts at position 3 in T. I'm trying to find an O (nlogn) time algorithm to determine if a pattern P of length n is found in the text T of length 2n, assuming that the symbol # occurs (possibly O (n) times) in T and P.

Any suggestions on how to solve it using convolution?

+4
source share
3 answers

I think you can handle the problem, as indicated, for arbitrary alphabet sizes, given floating point precision. For an alphabet of n characters, map the text characters to the (complex) nth roots of one. For pattern characters, map # to 0 and match regular characters with the multiplicative inverse of the corresponding text, as well as the nth roots of one.

Then you use the convolution theorem to determine at each point a point product of the text from that point with the pattern. If the text and the pattern match, each component of the product will be either 0 (on the wild card) or 1, from r * r ^ -1, so if you have a match, the result will be m, where m is the number of asymmetric characters in the pattern. If there is no coincidence, then not all components of the point product will be 1. Considering these complex numbers as two-dimensional vectors, the point product of these vectors with a vector representing 1 will be less than m, therefore, the mismatch cannot cause the result m and look like a match.

I note that if you divide the text into buffers several times by the length of the template, you can use the FFT of this length quite efficiently, so the complexity is not n log n, where n is the length of the text to be searched, but n log m, where m - the length of the template. Despite this, I will need to see the control points before I believe that this is a competitive method, compared even with a naive search.

+5
source

The BNDM algorithm can handle wildcards, as this implementation shows , and matches your speed requirement in the middle case. However, it has the worst complexity O (nm).

Why exactly do you need a convolution here?

0
source

I have an idea how to do this.

Calculate the cross-correlation function between T and P. This is done by folding T and P. For long signals, convolution is most efficiently done by fast Fourier transform and it takes time proportional to N * log (N):

Cross correlation
Convolution
Fast Fourier Transform

Then find the maximum cross-correlation function. It will indicate the position in which T and P are aligned.

Convolution will treat "#" as a special case, and if P is not found in T, this should be explicitly checked at the end.

0
source

All Articles