You should change headwordList to a set.
The test "word in headwordList" will be very slow. It has to do a string comparison against each word in headwordList, one word at a time. The time it takes is proportional to the length of the list; if you double the length of the list, you double the time the test takes (on average).
With a set, the in test takes about the same amount of time no matter how many elements the set holds. So this will be a tremendous speedup.
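If you want to see the difference for yourself, here is a rough sketch. The word list is made up (the word0, word1, ... strings and the size of 100,000 are assumptions for illustration), and the actual timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the real headword list.
headword_list = ["word%d" % i for i in range(100000)]
headwords = set(headword_list)  # same words, stored as a set

# A word that is not present: the list must compare against every element.
probe = "not-a-headword"

list_time = timeit.timeit(lambda: probe in headword_list, number=100)
set_time = timeit.timeit(lambda: probe in headwords, number=100)

print("list: %.4f s, set: %.6f s" % (list_time, set_time))
```

Both tests give the same answer; only the time taken differs.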
Now this whole loop can be simplified:
    for x in headwordList:
        m = SequenceMatcher(None, y.lower(), x)
        if m.ratio() > percentage:
            percentage = m.ratio()
            word = x
    if percentage > 0.86:
        sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio and save it (but only if the ratio exceeds 0.86). Here is a faster way to do this. I am going to change the name headwordList to just headwords, since I want you to make it a set, not a list.
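    def check_ratio(pair):
        # pair is a (SequenceMatcher, word) tuple; rank it by the matcher's ratio
        return pair[0].ratio()

    y = y.lower()
    m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)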
This may seem a bit complicated, but it is the fastest way to do this in Python. We call the built-in max() function to find the SequenceMatcher result with the highest ratio. First, we make a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each one. But when we are done, we also want to know which word it was. So the generator expression makes tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it; we do this by writing a function that returns the value we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which is then unpacked into the variables m and word.
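To make the max()-with-key= idea concrete, here is a small self-contained sketch. The three-word set and the misspelled input "hapy" are made-up sample data, and the names check_ratio and genexp are just illustrative choices.

```python
from difflib import SequenceMatcher

headwords = {"happy", "new", "year"}  # tiny stand-in for the real set
y = "Hapy".lower()                    # made-up misspelled input word

def check_ratio(pair):
    # Each item is a (SequenceMatcher, word) tuple; compare by the ratio only.
    return pair[0].ratio()

# One tuple per headword, produced lazily as max() asks for them.
genexp = ((SequenceMatcher(None, y, w), w) for w in headwords)

# max() returns the single tuple whose key value is largest.
m, word = max(genexp, key=check_ratio)

print(word, m.ratio())
```

With this sample data the best match is "happy", since it shares four of its five letters, in order, with "hapy".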
In Python, the convention is to use variable names like sentence_list rather than sentenceList. Please check out these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not recommended to use an incrementing index variable and assign to indexed positions in the list. Instead, start with an empty list and use the .append() method to add values.
You might also do better to build a dictionary that maps words to their ratios.
Note that your original code seems to have a bug: as soon as any word has a percentage greater than 0.86, all words are stored in sentenceList regardless of their own ratio. The code I wrote above only stores a word when that word's own ratio is high enough.
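As a sketch of both suggestions together, assuming a made-up three-word input and the same 0.86 threshold: the list is grown with .append() and the dictionary records the ratio for each matched headword. Here the ratio is placed first in each tuple, so max() needs no key= function.

```python
from difflib import SequenceMatcher

headwords = {"happy", "new", "year"}       # tiny stand-in set
sentence_words = ["Hapy", "new", "yaer"]   # made-up input words

sentence_list = []  # grown with .append(); no index counter needed
ratios = {}         # maps each matched headword to its ratio

for y in sentence_words:
    y = y.lower()
    # Best (ratio, headword) pair for this one input word.
    ratio, word = max((SequenceMatcher(None, y, w).ratio(), w) for w in headwords)
    if ratio > 0.86:
        sentence_list.append(word)
        ratios[word] = ratio

print(sentence_list)
print(ratios)
```

Because the best ratio is computed fresh for each input word, a weak match (like "yaer" here) is not stored just because an earlier word matched well.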
EDIT: This is in answer to the question about the generator expression needing to be enclosed in parentheses.
Whenever I get that error message, I usually split the generator expression out on its own and assign it to a variable. Like this:
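    def check_ratio(pair):
        # pair is a (SequenceMatcher, word) tuple; rank it by the matcher's ratio
        return pair[0].ratio()

    y = y.lower()
    genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
    m, word = max(genexp, key=check_ratio)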
This is what I suggest. But if you don't mind a complicated line that looks even more busy, you can simply add an extra pair of parentheses, as the error message suggests, so the generator expression is completely enclosed in parentheses. For example:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python allows you to skip explicit parentheses around a generator expression when passing an expression to a function, but only if that is the only argument to that function. Since we also pass the key= argument, we need the full expression in parentheses.
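A quick way to see the rule in action, using a made-up word list: the first call passes the generator expression as the only argument, the second adds key= and so needs the extra parentheses, and the third shows that the unparenthesized form does not even compile.

```python
words = ["pear", "fig", "banana"]  # made-up sample data

# Sole argument: the generator expression needs no extra parentheses.
total_len = sum(len(w) for w in words)

# With a second argument (key=), the generator expression must be parenthesized.
longest = max((w for w in words), key=len)

# Without the parentheses, Python rejects the call as a syntax error.
try:
    compile("max(w for w in words, key=len)", "<example>", "eval")
    compiled = True
except SyntaxError:
    compiled = False

print(total_len, longest, compiled)
```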
But I think it is easier to read if you split the genexp out onto its own line.
EDIT: @Peter Wood pointed out that the documentation suggests reusing SequenceMatcher for speed. I don't have time to check this out, but I think this is the right way to do this.
Fortunately, the code has become easier! Always a good sign.
EDIT: I just checked the code. This code works for me; see if it works for you.
    from difflib import SequenceMatcher

    headwords = {
        # This is a set of 650,000 words
        # Dummy set:
        "happy",
        "new",
        "year",
    }

    def words_from_file(filename):
        with open(filename, "rt") as f:
            for line in f:
                for word in line.split():
                    yield word

    def _match(matcher, s):
        matcher.set_seq2(s)
        return (matcher.ratio(), s)

    ratios = {}
    best_ratio = 0
    matcher = SequenceMatcher()

    for word in words_from_file("sentences.txt"):
        matcher.set_seq1(word.lower())
        if word not in headwords:
            ratio, word = max(_match(matcher, word.lower()) for word in headwords)
            best_ratio = max(best_ratio, ratio)  # remember best ratio
            if ratio > 0.86:
                ratios[word] = ratio

    print(best_ratio)
    print(ratios)
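One further variation you could try, offered only as a sketch. The difflib documentation says SequenceMatcher computes and caches detailed information about the second sequence, and recommends calling set_seq2() once for the commonly used sequence and set_seq1() for each of the others. That suggests making the input word the second sequence and looping over the headwords as the first. The helper name best_match and the sample data are made up, I have not timed this on a large word list, and swapping the argument order can change ratio() for some pairs, so compare results before relying on it.

```python
from difflib import SequenceMatcher

def best_match(word, headwords):
    """Return (ratio, headword) for the headword most similar to word."""
    matcher = SequenceMatcher()
    matcher.set_seq2(word)  # analyzed once, then reused for every headword
    best_ratio, best_word = 0.0, None
    for candidate in headwords:
        matcher.set_seq1(candidate)  # only the first sequence changes
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_ratio, best_word = ratio, candidate
    return best_ratio, best_word

ratio, word = best_match("hapy", {"happy", "new", "year"})
print(ratio, word)
```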