What is the most efficient way to index words in a document?

This arose in another question, but I thought it would be better to ask it separately. Given a large list of sentences (on the order of 100,000):

 [
     "This is sentence 1 as an example",
     "This is sentence 1 as another example",
     "This is sentence 2",
     "This is sentence 3 as another example",
     "This is sentence 4"
 ]

What is the best way to code the following function?

 def GetSentences(word1, word2, position):
     return ""

where, given two words word1 and word2 and a position, the function should return a list of all sentences that satisfy that constraint. For example:

 GetSentences("sentence", "another", 3) 

should return sentences 1 and 3 (their indices in the list of sentences). My current approach uses a dictionary like this:

 Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
 for sentenceIndex, sentence in enumerate(sentences):
     words = sentence.split()
     for index, word in enumerate(words):
         for i, word2 in enumerate(words[index + 1:]):
             Index[word][word2][i + 1].append(sentenceIndex)

But this quickly blows up out of all proportion: on a data set of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I somehow feel that this is a common problem, but cannot find any references on how to solve it efficiently. Any suggestions on how to approach this?

+7
2 answers

Use a database to store values.

  • First, add all sentences to one table (each must have an ID). You can call it, for example, sentences .
  • Second, create a table of all the words contained in the sentences (call it, for example, words , and give each word an ID), plus a junction table linking sentence rows to word rows (call it, for example, sentences_words ; it needs two columns, preferably word_id and sentence_id ).
  • When searching for sentences containing all the mentioned words, the work becomes simpler:

    • First, find the entries in the words table that match exactly the words you are looking for. The query may look like this:


       SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3'); 
    • Second, find the sentence_id values in the sentences_words table whose word_id matches one of the IDs found in the words table. An initial query might look like this:


       SELECT `sentence_id`, `word_id` FROM `sentences_words` WHERE `word_id` IN ([here goes list of words' ids]); 

      which can be folded into a single query:

       SELECT `sentence_id`, `word_id`
       FROM `sentences_words`
       WHERE `word_id` IN (
           SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
       );
    • Filter the result in Python to keep only the sentence_id values that have all of the required word_id values.

This is basically a solution based on storing a large amount of data in the form best suited to it: a database.

EDIT:

  • If you are only looking for two words, you can do even more (almost everything) on the DBMS side.
  • Given that you also need the difference in positions, you should store each word's position in a third column of the sentences_words table (it can be called simply position ), and when searching for matching words compute the difference between this value for the two words.
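Putting the steps above together, here is a minimal sketch using Python's built-in sqlite3 module (an assumption on my part; any DBMS would do). The table and column names ( sentences , words , sentences_words , word_id , sentence_id , position ) follow this answer's suggestions; the helper name get_sentences and the rest are mine.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER, word_id INTEGER, position INTEGER
    );
""")

# Populate the three tables, recording each word's position in its sentence.
for sid, text in enumerate(sentences):
    conn.execute("INSERT INTO sentences VALUES (?, ?)", (sid, text))
    for pos, w in enumerate(text.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (w,))
        (wid,) = conn.execute("SELECT id FROM words WHERE word = ?", (w,)).fetchone()
        conn.execute("INSERT INTO sentences_words VALUES (?, ?, ?)", (sid, wid, pos))

def get_sentences(word1, word2, distance):
    # Self-join: both words occur in the same sentence, exactly `distance` apart.
    rows = conn.execute("""
        SELECT DISTINCT a.sentence_id
        FROM sentences_words a
        JOIN sentences_words b ON a.sentence_id = b.sentence_id
        JOIN words wa ON wa.id = a.word_id
        JOIN words wb ON wb.id = b.word_id
        WHERE wa.word = ? AND wb.word = ? AND b.position - a.position = ?
        ORDER BY a.sentence_id
    """, (word1, word2, distance)).fetchall()
    return [r[0] for r in rows]
```

With the position check done in the self-join, the DBMS does all the filtering; an index on sentences_words(word_id) would keep lookups fast as the corpus grows.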
+14

This is how I did it in Python. That said, if this really needs to be done many times, a DBMS is the right tool for the job; this approach, however, seems to work quite well for me even with a million sentences.

 sentences = [
     "This is sentence 1 as an example",
     "This is sentence 1 as another example",
     "This is sentence 2",
     "This is sentence 3 as another example",
     "This is sentence 4"
 ]
 sentences = sentences * 200 * 1000

 sentencesProcessed = []

 def preprocess():
     global sentences
     global sentencesProcessed
     # may want to do a regex split on whitespace
     sentencesProcessed = [sentence.split(" ") for sentence in sentences]
     # can deallocate sentences now
     sentences = None

 def GetSentences(word1, word2, position):
     results = []
     for sentenceIndex, sentence in enumerate(sentencesProcessed):
         for wordIndex, word in enumerate(sentence[:-position]):
             if word == word1 and sentence[wordIndex + position] == word2:
                 results.append(sentenceIndex)
     return results

 def main():
     preprocess()
     results = GetSentences("sentence", "another", 3)
     print("Got", len(results), "results")

 if __name__ == "__main__":
     main()
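If many queries will be run against the same data, a positional inverted index is a common middle ground between this linear scan and a full DBMS: record each word's (sentence, position) occurrences once, then intersect the two words' occurrence lists at query time instead of rescanning every sentence. A sketch (all names here are my own, not from the answers above):

```python
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]

# word -> list of (sentence_index, word_position) occurrences;
# memory grows with the total word count, not with word *pairs*.
index = defaultdict(list)
for sentence_index, sentence in enumerate(sentences):
    for position, word in enumerate(sentence.split()):
        index[word].append((sentence_index, position))

def get_sentences(word1, word2, distance):
    # A set of word2's occurrences gives O(1) membership tests.
    occurrences2 = set(index[word2])
    return sorted({
        s for s, p in index[word1] if (s, p + distance) in occurrences2
    })
```

Query cost is proportional to how often word1 occurs, which for 100,000 sentences should be far cheaper than scanning them all.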
+2
