Python: faster lookup in a very large list

I ran into a speed problem while searching through a very large list. I have a file full of misspellings and very strange words. I am trying to use difflib to find the closest match for each one in a dictionary file of 650,000 words. The approach below works, but it is very slow, and I was wondering if there is a better way to attack this problem. This is the code:

    from difflib import SequenceMatcher

    headwordList = [  # This is a list of 650,000 words
    ]

    openFile = open("sentences.txt", "r")
    sentenceList = []
    for line in openFile:
        sentenceList.append(line)

    percentage = 0
    count = 0
    for y in sentenceList:
        if y not in headwordList:
            for x in headwordList:
                m = SequenceMatcher(None, y.lower(), x)
                if m.ratio() > percentage:
                    percentage = m.ratio()
                    word = x
            if percentage > 0.86:
                sentenceList[count] = word
        count = count + 1

Thanks for the help; software development is not even close to my strong suit. Much appreciated.

+8
optimization python list memory text
4 answers

Two things that may help a bit:

1) Use the approach in this SO answer to read your large file as efficiently as possible.

2) Change your code from

    for x in headwordList:
        m = SequenceMatcher(None, y.lower(), x)

to

    yLower = y.lower()
    for x in headwordList:
        m = SequenceMatcher(None, yLower, x)

You were lowercasing each sentence more than 650,000 times. There is no need for that.

+4

1) I would store headwordList as a set, not a list. Since a set is a hashed data structure, membership tests are much faster.

2) You define sentenceList as a list, but then try to use it like a dictionary with sentenceList[x] = y. I would define a separate structure specifically for those results.

3) You build sentenceList, which is unnecessary:

    for line in file:
        if line not in headwordList:
            ...

4) You never strip() the line, which means you store the entire line, up to and including the newline, in sentenceList and then test whether that is in the word list.
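Putting points 1-4 together, a minimal sketch (the three-word set is a dummy stand-in for the real 650,000-word list):

```python
# A set gives O(1) membership tests; strip() removes the trailing
# newline so "year\n" does not fail the lookup.
headwords = {"happy", "new", "year"}  # dummy stand-in for 650,000 words

def unknown_words(lines, headwords):
    """Yield lowercased, stripped lines that are not known headwords."""
    for line in lines:
        word = line.strip().lower()
        if word and word not in headwords:
            yield word

lines = ["happy\n", "yeer\n", "new\n"]
print(list(unknown_words(lines, headwords)))  # only "yeer" is unknown
```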

+3

You should change headwordList to a set.

The test word in headwordList will be very slow. It must perform a string comparison against each word in headwordList, one word at a time. It takes time proportional to the length of the list; if you double the length of the list, you double the time the test takes (on average).

With a set, the in test always takes the same amount of time; it does not depend on the number of elements in the set. So it will be a tremendous speedup.
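A quick, unscientific way to see the difference yourself (absolute timings will vary by machine, but the gap is dramatic):

```python
import timeit

words = [str(i) for i in range(100_000)]
word_list = list(words)
word_set = set(words)

# Membership test for an element near the end: the list scans every
# element, the set does a single hash lookup.
t_list = timeit.timeit(lambda: "99999" in word_list, number=100)
t_set = timeit.timeit(lambda: "99999" in word_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```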

Now this whole cycle can be simplified:

    for x in headwordList:
        m = SequenceMatcher(None, y.lower(), x)
        if m.ratio() > percentage:
            percentage = m.ratio()
            word = x
    if percentage > 0.86:
        sentenceList[count] = word

All this loop does is find the word from headwordList with the highest ratio, and keep that word (but only keep it if the ratio is over 0.86). Here is a faster way to do it. I am renaming headwordList to just headwords, since I want you to use a set, not a list.

    def check_ratio(pair):
        matcher, word = pair
        return matcher.ratio()

    y = y.lower()  # do the .lower() call one time
    m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords),
                  key=check_ratio)
    percentage = max(percentage, m.ratio())  # remember best ratio
    if m.ratio() > 0.86:
        sentence_list.append(word)

This may look a bit complicated, but it is the fastest way to do this in Python. We call the built-in max() function to find the SequenceMatcher result with the highest ratio. First, we build a "generator expression" that tries every word in headwords, calling SequenceMatcher() for each. But when we are done, we also want to know which word it was, so the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that it is the ratio we care about, so we have to tell it; we do that by writing a function that extracts the value we care about, and passing that function as the key= argument. Now max() finds the tuple with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which is then unpacked into the m and word variables.
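A tiny standalone demonstration of the key= idea, using made-up (word, ratio) pairs instead of SequenceMatcher objects:

```python
# max() ranks items by whatever key() returns; here we tell it to
# compare the tuples by their second element, the ratio.
pairs = [("apple", 0.41), ("ample", 0.77), ("maple", 0.65)]

def score(pair):
    return pair[1]

best_word, best_score = max(pairs, key=score)
print(best_word, best_score)  # → ample 0.77
```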

In Python, it is better to use variable names like sentence_list rather than sentenceList. Please check out these guidelines: http://www.python.org/dev/peps/pep-0008/

It is also not recommended to keep an incrementing index variable and assign into indexed positions in the list. Rather, start with an empty list and add values with the .append() method.

In addition, you would be better off building a dictionary mapping words to their ratios.

Note that your original code seems to have a bug: once any word scores a percentage greater than 0.86, all subsequent words are stored in sentenceList regardless of their own ratio, because percentage is never reset. The code I wrote above only stores a word when that word's own ratio was high enough.

EDIT: This answers the follow-up question about the generator expression needing to be enclosed in parentheses.

Whenever I get that error message, I usually split the generator expression out onto its own line and assign it to a variable. Like this:

    def check_ratio(pair):
        matcher, word = pair
        return matcher.ratio()

    y = y.lower()  # do the .lower() call one time
    genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
    m, word = max(genexp, key=check_ratio)
    percentage = max(percentage, m.ratio())  # remember best ratio
    if m.ratio() > 0.86:
        sentence_list.append(word)

That is what I suggest. But if you don't mind a line that looks even busier, you can simply add an extra pair of parentheses, as the error message suggests, so the generator expression is fully parenthesized. Like so:

    m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords),
                  key=check_ratio)

Python lets you omit the explicit parentheses around a generator expression when you pass it to a function, but only if it is the sole argument to that function. Since we also pass the key= argument, we need the fully parenthesized form.
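A minimal illustration of that rule, with made-up data:

```python
# Sole argument: the parentheses around the genexp may be omitted.
total = sum(n * n for n in range(4))  # 0 + 1 + 4 + 9

# A second (keyword) argument is present, so the genexp needs its own
# parentheses; removing the inner pair raises a SyntaxError.
words = ["year", "new", "happy"]
longest = max((w for w in words), key=len)
print(total, longest)  # → 14 happy
```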

But I think it reads more easily if you split genexp out onto its own line.

EDIT: @Peter Wood pointed out that the documentation suggests reusing the SequenceMatcher for speed. I haven't had time to benchmark it, but I believe it is the right way to do this.

Happily, the code got simpler! Always a good sign.
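For reference, the reuse pattern the difflib docs describe looks roughly like this (dummy three-word list; the docs note that SequenceMatcher caches detailed information about its second sequence, so fix the one query word as seq2 and vary seq1 across the many headwords):

```python
from difflib import SequenceMatcher

headwords = ["happy", "new", "year"]  # dummy stand-in for 650,000 words
unknown = "yeer"

# One SequenceMatcher, reused: seq2 is set once, seq1 changes per word.
matcher = SequenceMatcher(None)
matcher.set_seq2(unknown)

best_word, best_ratio = None, 0.0
for w in headwords:
    matcher.set_seq1(w)  # only seq1 changes; seq2's cached data is reused
    r = matcher.ratio()
    if r > best_ratio:
        best_word, best_ratio = w, r

print(best_word, round(best_ratio, 2))  # → year 0.75
```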

EDIT: I just checked the code. This code works for me; see if it works for you.

    from difflib import SequenceMatcher

    headwords = [
        # This is a list of 650,000 words
        # Dummy list:
        "happy", "new", "year",
    ]

    def words_from_file(filename):
        with open(filename, "rt") as f:
            for line in f:
                for word in line.split():
                    yield word

    def _match(matcher, s):
        matcher.set_seq2(s)
        return (matcher.ratio(), s)

    ratios = {}
    best_ratio = 0
    matcher = SequenceMatcher()

    for word in words_from_file("sentences.txt"):
        matcher.set_seq1(word.lower())
        if word not in headwords:
            ratio, word = max(_match(matcher, w.lower()) for w in headwords)
            best_ratio = max(best_ratio, ratio)  # remember best ratio
            if ratio > 0.86:
                ratios[word] = ratio

    print(best_ratio)
    print(ratios)
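Worth noting as an alternative (not part of the answer above): difflib already ships a helper, get_close_matches(), that wraps this whole best-match loop, including the cutoff:

```python
from difflib import get_close_matches

headwords = ["happy", "new", "year"]  # dummy stand-in for the real list

# n=1 returns only the single best match; cutoff discards weak matches.
print(get_close_matches("yeer", headwords, n=1, cutoff=0.7))  # → ['year']
```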
+3

This is a data structure issue. What you want to do is turn your list into something with faster element lookup; a binary search tree, for example, would work fine: lookup takes only O(log n) time rather than the O(n) of a list, which is an enormous difference at 650,000 words.

There is a fairly simple explanation here:

http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html

But if you are not familiar with trees, you may want to start a few chapters earlier:

http://interactivepython.org/runestone/static/pythonds/Trees/trees.html
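If you don't want to implement a tree yourself, Python's standard bisect module gives the same O(log n) lookup on a plain sorted list (a sketch with a dummy word list):

```python
import bisect

headwords = sorted(["zebra", "happy", "new", "year", "apple"])

def contains(sorted_words, word):
    # Binary search: O(log n) comparisons instead of scanning all O(n).
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(contains(headwords, "year"), contains(headwords, "yeer"))  # → True False
```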

0
