I have a fairly small list of words, about 1000 or so. I want to check whether any of the words in this list occur in an input text, and if so, which ones. Each input text is several hundred words long, and there are many of them: paragraphs scraped from different sites. I am trying to find the best algorithm for this.
I see two obvious ways to do this:
The naive way: search the text for each word from the list.
Build a hash table of the words in the input text, then look up each word from the list in the hash table. This is fast.
Is there a better solution?
I am using Python, although I don't think that changes the algorithm much.
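A minimal sketch of approach 2 in Python, using a set as the hash table (the word list and tokenizer here are illustrative assumptions):

```python
import re

# Hypothetical word list; the real one would have ~1000 entries.
WORD_LIST = {"algorithm", "hash", "python"}

def find_matches(text: str, word_list: set[str]) -> set[str]:
    # Tokenize on letters and lowercase; adjust the regex to your data.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # Set intersection costs O(min(len(tokens), len(word_list))).
    return tokens & word_list

text = "Python makes it easy to hash tokens and test each algorithm."
print(find_matches(text, WORD_LIST))  # -> {'python', 'hash', 'algorithm'}
```

Building the token set is O(n) in the text length, so the whole check is linear per paragraph regardless of the word-list size.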
In addition to optimizing solution 2 above, I would like to save the generated hash table in persistent storage (a DB), so that if the list of words changes I can reuse the hash table instead of building it again. Of course, if the input text changes, I do need to rebuild the hash table. Is it possible to store a hash table in a database? Any recommendations? I am currently using MongoDB for my project, and as far as I know it can only store JSON documents. I am new to MongoDB, have just started working with it, and don't yet fully understand its capabilities.
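Since MongoDB stores JSON-like documents, one way to persist the token set is to serialize it as a sorted array inside a document keyed by a text id. A minimal sketch of that round trip (field names like `text_id` and `tokens` are made up; the `json.dumps` stands in for the actual MongoDB insert):

```python
import json

def tokens_to_doc(text_id: str, tokens: set[str]) -> dict:
    # JSON (and BSON) has no set type, so store a sorted list instead.
    return {"text_id": text_id, "tokens": sorted(tokens)}

def doc_to_tokens(doc: dict) -> set[str]:
    # Restore the hash-table behavior by rebuilding a set on load.
    return set(doc["tokens"])

doc = tokens_to_doc("page-1", {"hash", "python", "algorithm"})
payload = json.dumps(doc)  # what you would insert into a MongoDB collection
restored = doc_to_tokens(json.loads(payload))
```

With pymongo you would pass the dict from `tokens_to_doc` to `collection.replace_one(..., upsert=True)` and rebuild the set from the document returned by `find_one`.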
I searched SO and found two questions along similar lines; one of them suggests a hash table, but I would appreciate any pointers on the optimization I have in mind.
Here are the previously asked SO questions:
Is there an efficient algorithm for performing inverted full-text search?
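Since many texts are checked against the same word list, the inverted index from the linked question may also fit here: map each token to the set of texts containing it, so each word from the list is a single dictionary lookup across all texts. A rough sketch (document ids and tokenizer are illustrative):

```python
import re
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    # Map each token to the set of document ids that contain it.
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in set(re.findall(r"[a-z']+", text.lower())):
            index[token].add(doc_id)
    return index

docs = {"d1": "Python hashing is fast", "d2": "an inverted index lookup"}
idx = build_inverted_index(docs)
# Checking a word from the list across all texts is one dict access:
print(idx["python"])  # -> {'d1'}
```

The index is built once per batch of texts; each list word then costs O(1) to check, instead of re-scanning every paragraph.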