How to make this list function faster?

    def removeDuplicatesFromList(seq):
        # Not order preserving
        keys = {}
        for e in seq:
            keys[e] = 1
        return keys.keys()

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        wordmap = {}
        unique_words = removeDuplicatesFromList(li)
        for w in unique_words:
            distances = [i+1 for i,x in enumerate(li) if x == w]
            wordmap[w] = float(sum(distances)) / float(len(distances))  # take average
        return wordmap

How to make this function faster?

+7
8 answers
    import collections

    def countWordDistances(li):
        wordmap = collections.defaultdict(list)
        for i, w in enumerate(li, 1):
            wordmap[w].append(i)
        for k, v in wordmap.iteritems():
            wordmap[k] = sum(v) / float(len(v))
        return wordmap

This makes only one pass through the list and keeps the number of operations to a minimum. I timed it on a list of 1.1 million words with 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10,000 words with 2,000 unique, it was more than 300 times faster than the OP's code.
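If you want to time implementations the same way, here is a minimal sketch using timeit; the generated word list is placeholder data, not the corpus used for the numbers above:

    import random
    import string
    import timeit

    # Placeholder data: ~10,000 words drawn from ~2,000 unique 5-letter tokens.
    vocab = [''.join(random.choice(string.ascii_lowercase) for _ in range(5))
             for _ in range(2000)]
    words = [random.choice(vocab) for _ in range(10000)]

    # Compare implementations by passing each one the same list.
    print(timeit.timeit(lambda: countWordDistances(words), number=10))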

To make Python code faster, you need to keep two rules in mind: use the best algorithm and avoid Python.

On the algorithm front, going through the list once instead of N+1 times (N = number of unique words) is the main thing that speeds it up.

On the "avoid Python" front, I mean: you want your code to be executing in C as much as possible. So using a defaultdict is better than a dict where you explicitly check whether the key is present: the defaultdict still does that check, but it does it in C, in the dict implementation, rather than in Python. enumerate is better than for i in range(len(li)), again because it takes fewer Python steps. And enumerate(li, 1) makes the counting start at 1 instead of needing a Python +1 somewhere in the loop.
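For comparison, here is a rough sketch of the explicit-check version that the defaultdict replaces (countWordDistances_explicit is just an illustrative name, not code from the answer):

    def countWordDistances_explicit(li):
        # The membership test and empty-list creation run as Python bytecode
        # on every word, instead of inside the C dict implementation.
        wordmap = {}
        for i, w in enumerate(li, 1):
            if w not in wordmap:
                wordmap[w] = []
            wordmap[w].append(i)
        for k, v in wordmap.items():
            wordmap[k] = sum(v) / float(len(v))
        return wordmap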

Edit: Third rule: use PyPy. My code runs twice as fast on PyPy as it does on CPython 2.7.

+15

Based on @Ned Batchelder's solution, but without building a list of positions for every word:

    import collections

    def countWordDistances(li):
        wordmap = collections.defaultdict(lambda: [0.0, 0.0])
        for i, w in enumerate(li, 1):
            wordmap[w][0] += i
            wordmap[w][1] += 1.0
        for k, (t, n) in wordmap.iteritems():
            wordmap[k] = t / n
        return wordmap
+5

I'm not sure if this will be faster than using a set, but this only requires one pass through the list:

    def countWordDistances(li):
        wordmap = {}
        for i in range(len(li)):
            pos = i + 1  # 1-based position, as in the question
            if li[i] in wordmap:
                avg, num = wordmap[li[i]]
                new_avg = avg*(num/(num+1.0)) + (1.0/(num+1.0))*pos
                wordmap[li[i]] = new_avg, num+1
            else:
                wordmap[li[i]] = (pos, 1)
        return wordmap

This returns a modified version of wordmap: the value associated with each key is a tuple of the average position and the number of occurrences. You can easily convert this to the form of the original output, but that conversion takes a little extra time.

Basically, the code keeps a running average as it iterates through the list, updating it each time with a weighted average.
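The update step is the standard incremental-mean formula: with n occurrences seen so far at average avg, a new 1-based position p gives avg*(n/(n+1)) + p/(n+1). A tiny standalone sketch of that rule (the helper name is just illustrative):

    def update_running_average(avg, n, p):
        # Fold a new position p into an average built from n previous positions.
        return avg * (n / (n + 1.0)) + p / (n + 1.0)

    # 'that' at positions 1 and 6: the average moves from 1.0 to 3.5.
    print(update_running_average(1.0, 1, 6))  # 3.5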

+3

Use a set:

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        wordmap = {}
        unique_words = set(li)
        for w in unique_words:
            distances = [i+1 for i,x in enumerate(li) if x == w]
            wordmap[w] = float(sum(distances)) / float(len(distances))  # take average
        return wordmap
+1

The first thing that comes to mind is to use a set to remove duplicate words:

 unique_words = set(li) 

In general, though, if you are worried about speed, you need to profile the function to see where the bottleneck is, and then try to reduce that bottleneck.
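A minimal way to do that profiling with the standard-library cProfile module (the words list here is placeholder data, not from the question):

    import cProfile

    words = ['that', 'sank', 'into', 'the', 'ocean'] * 2000  # placeholder input

    # Sort by cumulative time to see which calls dominate the runtime.
    cProfile.run('countWordDistances(words)', sort='cumulative')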

+1

Use a frozenset instead of a dict, since you are not doing anything with the values:

    def removeDuplicatesFromList(seq):
        return frozenset(seq)
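Since the caller only iterates over the returned collection, the frozenset is a drop-in replacement for the old dict-of-keys trick; a quick check:

    # The original countWordDistances loops over this result unchanged.
    unique_words = removeDuplicatesFromList(['that', 'sank', 'that'])
    print(sorted(unique_words))  # ['sank', 'that']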
+1

Using list comprehensions:

    def countWordDistances(l):
        unique_words = set(l)
        idx = [[i for i,x in enumerate(l) if x==item] for item in unique_words]
        return {item: 1.*sum(idx[i])/len(idx[i]) + 1. for i,item in enumerate(unique_words)}

    li = ['that','sank','into','the','ocean']
    countWordDistances(li)
    # {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 1.0, 'the': 4.0}

    li2 = ['that','sank','into','the','ocean', 'that']
    countWordDistances(li2)
    # {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 3.5, 'the': 4.0}
0

One-liner:

    from __future__ import division  # no need for this if using py3k

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        return {w: sum(dist)/len(dist)
                for w, dist in zip(set(li),
                                   ([i+1 for i,x in enumerate(li) if x==w]
                                    for w in set(li)))}

What I'm doing on the last line is a dictionary comprehension, similar to a list comprehension.
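Note that the zip pairs one iteration over set(li) with a second, independently built set(li); both happen to iterate in the same order here, but a variant that computes each word's positions directly avoids depending on that (a sketch with the same output):

    from __future__ import division  # not needed on Python 3

    def countWordDistances(li):
        # Pair each word with its own 1-based positions inside the
        # comprehension, so nothing relies on set iteration order.
        return {w: sum(d) / len(d)
                for w, d in ((w, [i + 1 for i, x in enumerate(li) if x == w])
                             for w in set(li))}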

-1
