This arose in another question, but I thought it would be better to ask it separately. Given a large list of sentences (about 100,000):
```python
[
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]
```
What is the best way to code the following function?
```python
def GetSentences(word1, word2, position):
    return []
```
where `word1` and `word2` are two given words and `position` is the distance between them; the function should return a list of all sentences that satisfy this constraint. For example:
```python
GetSentences("sentence", "another", 3)
```
should return the sentences at indices 1 and 3, since in both of them "another" occurs three words after "sentence". My current approach uses a dictionary like this:
```python
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # record every following word together with its distance from `word`
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)
```
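For reference, here is a minimal, self-contained sketch of building this index over the five example sentences and querying it (the query keys are the two words and their distance):

```python
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]

# word1 -> word2 -> distance -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # every word pairs with every later word in the same sentence
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)

print(Index["sentence"]["another"][3])  # -> [1, 3]
```

Note that each sentence of n words produces on the order of n² entries, which is what makes this index so large.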
But this quickly blows up out of all proportion: with a data set of about 130 MB, my 48 GB of RAM is exhausted in under 5 minutes. I feel this must be a common problem, but I cannot find any references on how to solve it efficiently. Any suggestions on how to approach this?
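One common direction (not from the original post, just a hedged sketch): instead of storing every word pair per sentence, index each word once with its positions, so storage grows linearly with the corpus, and join the two posting lists at query time. The function and variable names below are illustrative:

```python
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]

# word -> list of (sentence index, word position); linear in corpus size
positions = defaultdict(list)
for sentence_index, sentence in enumerate(sentences):
    for word_position, word in enumerate(sentence.split()):
        positions[word].append((sentence_index, word_position))

def get_sentences(word1, word2, distance):
    # Keep sentences where word2 occurs exactly `distance` words after word1.
    occurrences_of_word2 = set(positions[word2])
    return sorted({s for s, p in positions[word1]
                   if (s, p + distance) in occurrences_of_word2})

print(get_sentences("sentence", "another", 3))  # -> [1, 3]
```

The trade-off is a little more work per query (a set intersection) in exchange for an index whose size is proportional to the total number of words rather than to the number of word pairs.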