Search for all matches from a large set of strings in a smaller string

I have a large set of words and phrases (vocabulary or dictionary) that includes wildcards. I need to find all instances of these words and phrases within a much smaller string (~ 150 characters at the moment).

At first I wanted to do the operation in reverse: check whether every word in my smaller string exists in the lexicon, which could be implemented as a hash table. The problem is that some of the entries in my lexicon are not single words, and many of them are wildcards (e.g. substri*).

I am thinking of using the Rabin-Karp algorithm, but I'm not sure it is the best choice.

What is an efficient algorithm or method to perform this operation?

Sample data:

The dictionary contains hundreds of words and can potentially expand. These words can end with wildcards (asterisks). Here are some random examples:

  • good
  • bad
  • freed*
  • careless*
  • great loss

The text we are analyzing (at this stage) is a short, informal (but grammatical) English statement. A good example of such text (again, at the moment) would be a tweet, which is limited to approximately 140 characters. For instance:

Just got the Google nexus without a contract. Hands down its the best phone I've ever had and the only thing that could've followed my N900. 

It may be worth noting that we are doing very simple sentiment analysis on this text, but the sentiment analysis technique itself is not my problem. I am simply porting an existing solution to a real-time processing system and need to perform some optimizations.

+4
7 answers

I think this is a great use case for the Aho-Corasick string matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases: a first phase in which the matching automaton is built (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which also takes linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searches.
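For illustration, here is a minimal Aho-Corasick sketch in C, not production code: it is ASCII-only, case-folded, uses a fixed node pool, and leaves out the wildcard and word-boundary handling you would layer on top (for example by post-filtering matches). The pattern list and buffer sizes are placeholders.

/* Minimal Aho-Corasick sketch: phase 1 builds the automaton, phase 2 scans. */
#include <ctype.h>
#include <stdio.h>

#define MAXNODES 10000
#define ALPHA    128

static int  next_state[MAXNODES][ALPHA]; /* goto function (trie edges)      */
static int  fail[MAXNODES];              /* failure links                   */
static long out[MAXNODES];               /* bitmask of patterns ending here */
static int  nodes = 1;                   /* node 0 is the root              */

static void add_pattern(const char *p, int index)
{
    int s = 0;
    for (; *p; ++p) {
        unsigned char c = (unsigned char)*p;
        if (next_state[s][c] == 0)
            next_state[s][c] = nodes++;
        s = next_state[s][c];
    }
    out[s] |= 1L << index;
}

static void build_failure_links(void)        /* phase 1: build the automaton */
{
    int queue[MAXNODES], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; ++c)
        if (next_state[0][c])
            queue[tail++] = next_state[0][c];     /* depth-1 nodes fail to root */
    while (head < tail) {
        int s = queue[head++];
        for (int c = 0; c < ALPHA; ++c) {
            int t = next_state[s][c];
            if (!t) {
                next_state[s][c] = next_state[fail[s]][c]; /* collapse misses */
            } else {
                fail[t] = next_state[fail[s]][c];
                out[t] |= out[fail[t]];            /* inherit shorter matches */
                queue[tail++] = t;
            }
        }
    }
}

int main(void)
{
    const char *patterns[] = { "good", "bad", "freed", "careless", "great loss" };
    const int npat = 5;
    const char *text = "Good men can only lose freedom to bad ends "
                       "through carelessness or apathy.";

    for (int i = 0; i < npat; ++i)
        add_pattern(patterns[i], i);
    build_failure_links();

    int s = 0;                                   /* phase 2: scan the text */
    for (int i = 0; text[i]; ++i) {
        s = next_state[s][tolower((unsigned char)text[i])];
        for (int p = 0; p < npat; ++p)
            if (out[s] & (1L << p))
                printf("pattern '%s' ends at position %d\n", patterns[p], i);
    }
    return 0;
}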

Hope this helps!

+4

One answer I wanted to throw out there is the Boyer-Moore search algorithm. It is the algorithm that grep uses, and grep is probably one of the fastest search tools available. Alternatively, you can use something like GNU Parallel to run grep in parallel and really speed things up.
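Boyer-Moore matches one pattern at a time, so you would loop it over the dictionary (or hand the whole dictionary to grep via its -f option). As a rough sketch, here is the simplified Horspool variant of Boyer-Moore in C; the sample text and entries are placeholders, and wildcards are not handled.

/* Boyer-Moore-Horspool sketch: one pattern at a time, looped over the dictionary. */
#include <stdio.h>
#include <string.h>

static void horspool_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n)
        return;

    size_t shift[256];                  /* bad-character shift table */
    for (size_t c = 0; c < 256; ++c)
        shift[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos + m <= n) {
        size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1])
            --j;                        /* compare the window right to left */
        if (j == 0)
            printf("'%s' found at offset %zu\n", pat, pos);
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
}

int main(void)
{
    const char *text = "no great loss, freedom gained, nothing bad here";
    const char *dict[] = { "good", "bad", "freed", "great loss" };
    for (size_t i = 0; i < sizeof dict / sizeof *dict; ++i)
        horspool_search(text, dict[i]);
    return 0;
}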

Also, here is a good article that you may find interesting.

+3

You can still use your original idea of checking every word in the text against the dictionary. However, to do this efficiently you need to index the dictionary so that lookups are very fast. A trick used in information retrieval systems is to store a so-called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).

Basically, what you want to do is store all rotations of each word in the dictionary (for example, for house):

 house$ ouse$h use$ho ... e$hous 

This index can then be used to quickly answer wildcard queries. If, for example, you have the query ho*e, you can look in the permuterm index for a term starting with e$ho, and you will quickly find the match house.

The lookup itself is usually done with some kind of logarithmic search strategy (binary search or B-trees) and is therefore usually very fast.
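To make the idea concrete, here is a small permuterm-index sketch in C; the dictionary, the single-'*' restriction, and the fixed buffer sizes are simplifications for illustration, and mapping a matched rotation back to its original word is left out.

/* Permuterm sketch: store every rotation of "word$" in a sorted array and
   answer a wildcard query by rotating it and doing a prefix binary search. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXROT 4096
#define MAXLEN 64

static char rotations[MAXROT][MAXLEN];
static int  nrot = 0;

static void index_word(const char *w)
{
    char term[MAXLEN];
    snprintf(term, sizeof term, "%s$", w);
    size_t len = strlen(term);
    for (size_t r = 0; r < len; ++r) {            /* all rotations of "word$" */
        for (size_t i = 0; i < len; ++i)
            rotations[nrot][i] = term[(r + i) % len];
        rotations[nrot][len] = '\0';
        ++nrot;
    }
}

static int cmp(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

static void query(const char *q)        /* at most one '*', e.g. "ho*e" */
{
    char key[MAXLEN];
    const char *star = strchr(q, '*');
    if (star)                           /* rotate the wildcard to the end: ho*e -> e$ho */
        snprintf(key, sizeof key, "%s$%.*s", star + 1, (int)(star - q), q);
    else
        snprintf(key, sizeof key, "%s$", q);
    size_t klen = strlen(key);

    int lo = 0, hi = nrot;              /* lower bound by binary search */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(rotations[mid], key, klen) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    printf("query '%s' (prefix '%s'):\n", q, key);
    for (int i = lo; i < nrot && strncmp(rotations[i], key, klen) == 0; ++i)
        printf("  matched rotation '%s'\n", rotations[i]);
}

int main(void)
{
    const char *dict[] = { "house", "good", "bad", "careless" };
    for (size_t i = 0; i < sizeof dict / sizeof *dict; ++i)
        index_word(dict[i]);
    qsort(rotations, nrot, MAXLEN, cmp);

    query("ho*e");                      /* finds the rotation e$hous of house$ */
    query("careless*");
    return 0;
}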

+2

As long as the patterns are for full words - you do not want "or" to match "storage"; spaces and punctuation are match anchors - then an easy way is to translate your vocabulary into input for a scanner generator (for example, you could use flex), generate a scanner, and then run it over your input.

Scanner generators are designed to identify occurrences of tokens in the input, where each token type is described by a regular expression. Flex and similar programs generate scanners quickly. By default, Flex handles up to 8k rules (vocabulary entries, in your case), and this can be extended. The generated scanners run in linear time and in practice are very fast.

Internally, the token regular expressions are converted in the standard Kleene's-theorem pipeline: first to an NFA, then to a DFA. The DFA is then transformed to its unique minimal form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated to goto code, where the DFA state is represented implicitly by the instruction pointer as the code runs.
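As a toy illustration of that last point (hand-written, and not what flex actually emits - flex generates tables), here is a goto-coded DFA in C that reports every occurrence of the single pattern "bad" on stdin; each label is a DFA state, so the instruction pointer is the state.

/* Goto-coded DFA for the single pattern "bad": each label is a state. */
#include <stdio.h>

#define NEXT() do { c = getchar(); ++pos; if (c == EOF) return 0; } while (0)

int main(void)
{
    int c;
    long pos = -1;

s0: NEXT();                     /* no progress yet    */
    if (c == 'b') goto s1;
    goto s0;
s1: NEXT();                     /* matched "b" so far */
    if (c == 'a') goto s2;
    if (c == 'b') goto s1;
    goto s0;
s2: NEXT();                     /* matched "ba" so far */
    if (c == 'd') printf("'bad' ends at offset %ld\n", pos);
    if (c == 'b') goto s1;
    goto s0;
}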

The reason for the caveat about spaces and punctuation is that scanners generated by programs like Flex are normally unable to identify overlapping matches: "strangers" could not match both "strangers" and "range", for example.

Here is a flex scanner that matches the example vocabulary you gave:

%option noyywrap
%%
"good"                      { return 1; }
"bad"                       { return 2; }
"freed"[[:alpha:]]*         { return 3; }
"careless"[[:alpha:]]*      { return 4; }
"great"[[:space:]]+"loss"   { return 5; }
.                           { /* match anything else except newline */ }
"\n"                        { /* match newline */ }
<<EOF>>                     { return -1; }
%%
int main(int argc, char *argv[])
{
    yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
    for (;;) {
        int found = yylex();
        if (found < 0)
            return 0;
        printf("matched pattern %d with '%s'\n", found, yytext);
    }
}

And to run:

$ flex -i foo.l
$ gcc lex.yy.c
$ ./a.out
Good men can only lose freedom to bad
matched pattern 1 with 'Good'
matched pattern 3 with 'freedom'
matched pattern 2 with 'bad'
through carelessness or apathy.
matched pattern 4 with 'carelessness'

+2

This does not exactly answer the algorithm question, but check out the re2 library. There are good interfaces to it in Python, Ruby, and assorted other languages. In my experience, it is blindingly fast and removed fairly similar bottlenecks in my code with little fuss and almost no extra code on my part.

The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able to split the dictionary into a set of regular expressions r_1, r_2, ..., r_k, each of the form \b(foobar|baz baz\S*|...), where each string in the group for r_{i+1} has a prefix in r_i. You can then short-circuit evaluations, since if r_{i+1} matches then r_i must match.
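For concreteness, here is a small C sketch of how the dictionary might be rewritten into one such word-anchored group, assuming an RE2-style syntax where \b and \w are available; the layering into r_1, ..., r_k with the prefix property, and proper escaping of other regex metacharacters, are left out.

/* Build one \b(entry|entry|...) alternation from the dictionary; trailing
   '*' wildcards become \w*. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *dict[] = { "good", "bad", "freed*", "careless*", "great loss" };
    const int n = sizeof dict / sizeof *dict;
    char pattern[1024] = "\\b(";

    for (int i = 0; i < n; ++i) {
        const char *entry = dict[i];
        size_t len = strlen(entry);
        if (i > 0)
            strcat(pattern, "|");
        if (len > 0 && entry[len - 1] == '*') {   /* freed* -> freed\w* */
            strncat(pattern, entry, len - 1);
            strcat(pattern, "\\w*");
        } else {
            strcat(pattern, entry);
        }
    }
    strcat(pattern, ")");

    /* The resulting string is what you would hand to RE2 (or a compatible engine):
       \b(good|bad|freed\w*|careless\w*|great loss)                               */
    printf("%s\n", pattern);
    return 0;
}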

If you are not implementing your algorithm in highly optimized C, I'd bet that this approach will be faster than any of the algorithmically superior answers elsewhere in this thread.

+1

Let me get this straight. You have a large set of queries and one small string, and you want to find instances of all of those queries in that string.

In that case, I suggest you index that small document like crazy, so that your search time is as short as possible. Hell, with that document size I would even consider making small mutations (to account for the wildcards and so on) and indexing those as well.

0

I had a very similar task. Here is how I solved it, and the performance was incredible: http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf

-1