Fuzzy Fast String Search and Indexing Algorithm

I need to find a set of substrings (each about 32 characters) in a very large string (about 100k characters) as quickly as possible. The search needs to be fuzzy.

What is the best algorithm? I tried scanning the entire large string for each short string and computing the Levenshtein distance at every position, but this takes a lot of time.

+4
2 answers

Take a look at the BLAST algorithm ( http://en.wikipedia.org/wiki/BLAST ). It is used for sequence search (for example, searching DNA databases), and the underlying problem is very similar to yours.

Essentially, you index short pieces of the strings, find the regions where index hits are abundant, and run the more expensive search only in those regions.
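To make the idea concrete, here is a minimal Python sketch of that seed-then-verify approach. It is not BLAST itself, only a simplified illustration: the function names, the seed length `k`, and the `min_seeds` threshold are my own assumptions. It also compares a fixed-length window, so it handles substitutions well but only approximates matches that contain insertions or deletions.

```python
from collections import defaultdict

def build_kmer_index(text, k):
    """Map every length-k substring of text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def fuzzy_find(text, index, pattern, k, max_dist, min_seeds=2):
    """Return (start, distance) pairs for candidate regions that pass verification."""
    votes = defaultdict(int)
    # Each exact k-mer hit votes for the text position where the pattern would start.
    for j in range(len(pattern) - k + 1):
        for i in index.get(pattern[j:j + k], ()):
            votes[i - j] += 1
    hits = []
    for start, count in sorted(votes.items()):
        if count < min_seeds or start < 0:
            continue  # too few seeds here, or the implied start is before the text
        # The expensive check runs only on regions with enough seed hits.
        window = text[start:start + len(pattern)]
        dist = edit_distance(window, pattern)
        if dist <= max_dist:
            hits.append((start, dist))
    return hits
```

The index over the large string is built once and reused for every short string, so the Levenshtein computation only runs on the few windows that collected enough seed hits. A smaller `k` tolerates more errors but produces more candidate regions to verify.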

+2

If I understand correctly what you want (you want to find subsequences of a large string that are equal to a given set of strings of length 32), and your alphabet has a reasonable size (letters, digits and punctuation marks, for instance), you can do the following:

  • Find the first occurrence of each letter.

  • ( O(l * n), l - , n - )

  • , ..

O(l * n) , O(m), m - .

+1
