I have a fairly small list of words, about 1000 or so. I want to check whether any of the words in this list occur in an input text, and if so, which ones. Each input text is several hundred words long, and there are many of them: paragraphs scraped from different sites. I am trying to find the best algorithm for this.
I see two obvious ways to do this:
The naive way: search the text for each word from the list.
Build a hash table of the words in the input text, then look up each word from the list in the hash table. This is fast.
Is there a better solution?
I am using Python, although I don't think that changes the algorithm much.
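A minimal sketch of approach 2 in Python, using a set as the hash table (the word list and tokenizer here are illustrative assumptions):

```python
import re

# Hypothetical word list; the real one would have ~1000 entries.
WORD_LIST = {"algorithm", "hash", "python"}

def find_matches(text: str, word_list: set[str]) -> set[str]:
    # Tokenize on letters and lowercase; adjust the regex to your data.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # Set intersection costs O(min(len(tokens), len(word_list))).
    return tokens & word_list

text = "Python makes it easy to hash tokens and test each algorithm."
print(find_matches(text, WORD_LIST))  # -> {'python', 'hash', 'algorithm'}
```

Building the token set is O(n) in the text length, so the whole check is linear per paragraph regardless of the word-list size.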
In addition to optimizing solution 2 above, I would like to save the generated hash table in persistent storage (a DB), so that if the list of words changes I can reuse the hash table instead of building it again. Of course, if the input text changes, I do need to rebuild the hash table. Is it possible to store a hash table in a database? Any recommendations? I am currently using MongoDB for my project, and as far as I know it can only store JSON documents. I am new to MongoDB, have just started working with it, and don't yet fully understand its capabilities.
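Since MongoDB stores JSON-like documents, one way to persist the token set is to serialize it as a sorted array inside a document keyed by a text id. A minimal sketch of that round trip (field names like `text_id` and `tokens` are made up; the `json.dumps` stands in for the actual MongoDB insert):

```python
import json

def tokens_to_doc(text_id: str, tokens: set[str]) -> dict:
    # JSON (and BSON) has no set type, so store a sorted list instead.
    return {"text_id": text_id, "tokens": sorted(tokens)}

def doc_to_tokens(doc: dict) -> set[str]:
    # Restore the hash-table behavior by rebuilding a set on load.
    return set(doc["tokens"])

doc = tokens_to_doc("page-1", {"hash", "python", "algorithm"})
payload = json.dumps(doc)  # what you would insert into a MongoDB collection
restored = doc_to_tokens(json.loads(payload))
```

With pymongo you would pass the dict from `tokens_to_doc` to `collection.replace_one(..., upsert=True)` and rebuild the set from the document returned by `find_one`.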
I searched SO and found two questions along similar lines; one of them suggests a hash table, but I would appreciate any pointers on the optimization I have in mind.
Here are the previously asked SO questions:
Is there an efficient algorithm for performing inverted full-text search?
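Since many texts are checked against the same word list, the inverted index from the linked question may also fit here: map each token to the set of texts containing it, so each word from the list is a single dictionary lookup across all texts. A rough sketch (document ids and tokenizer are illustrative):

```python
import re
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    # Map each token to the set of document ids that contain it.
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in set(re.findall(r"[a-z']+", text.lower())):
            index[token].add(doc_id)
    return index

docs = {"d1": "Python hashing is fast", "d2": "an inverted index lookup"}
idx = build_inverted_index(docs)
# Checking a word from the list across all texts is one dict access:
print(idx["python"])  # -> {'d1'}
```

The index is built once per batch of texts; each list word then costs O(1) to check, instead of re-scanning every paragraph.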