How to make this list function faster?

    def removeDuplicatesFromList(seq):
        # Not order preserving
        keys = {}
        for e in seq:
            keys[e] = 1
        return keys.keys()

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        wordmap = {}
        unique_words = removeDuplicatesFromList(li)
        for w in unique_words:
            distances = [i+1 for i,x in enumerate(li) if x == w]
            wordmap[w] = float(sum(distances)) / float(len(distances))  # take average
        return wordmap

How to make this function faster?

+7
8 answers
    import collections

    def countWordDistances(li):
        wordmap = collections.defaultdict(list)
        for i, w in enumerate(li, 1):
            wordmap[w].append(i)
        for k, v in wordmap.iteritems():
            wordmap[k] = sum(v) / float(len(v))
        return wordmap

This makes only one pass through the list and keeps the number of operations to a minimum. I timed it on a list of 1.1 million words with 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10,000 words with 2,000 unique, it was more than 300 times faster than the OP's code.
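If you want to time implementations the same way, here is a minimal sketch using timeit; the generated word list is placeholder data, not the corpus used for the numbers above:

    import random
    import string
    import timeit

    # Placeholder data: ~10,000 words drawn from ~2,000 unique 5-letter tokens.
    vocab = [''.join(random.choice(string.ascii_lowercase) for _ in range(5))
             for _ in range(2000)]
    words = [random.choice(vocab) for _ in range(10000)]

    # Compare implementations by passing each one the same list.
    print(timeit.timeit(lambda: countWordDistances(words), number=10))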

To make Python code faster, you need to keep two rules in mind: use the best algorithm and avoid Python.

On the algorithm front, going through the list once instead of N+1 times (N = number of unique words) is the main thing that speeds it up.

On the "avoid Python" front, I mean: you want your code to be executing in C as much as possible. So using a defaultdict is better than a dict where you explicitly check whether the key is present: the defaultdict still does that check, but it does it in C, in the dict implementation, rather than in Python. enumerate is better than for i in range(len(li)), again because it takes fewer Python steps. And enumerate(li, 1) makes the counting start at 1 instead of needing a Python +1 somewhere in the loop.
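For comparison, here is a rough sketch of the explicit-check version that the defaultdict replaces (countWordDistances_explicit is just an illustrative name, not code from the answer):

    def countWordDistances_explicit(li):
        # The membership test and empty-list creation run as Python bytecode
        # on every word, instead of inside the C dict implementation.
        wordmap = {}
        for i, w in enumerate(li, 1):
            if w not in wordmap:
                wordmap[w] = []
            wordmap[w].append(i)
        for k, v in wordmap.items():
            wordmap[k] = sum(v) / float(len(v))
        return wordmap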

Edit: Third rule: use PyPy. My code runs twice as fast on PyPy as it does on CPython 2.7.

+15

Based on @Ned Batchelder's solution, but without building a list of positions for every word:

    import collections

    def countWordDistances(li):
        wordmap = collections.defaultdict(lambda: [0.0, 0.0])
        for i, w in enumerate(li, 1):
            wordmap[w][0] += i
            wordmap[w][1] += 1.0
        for k, (t, n) in wordmap.iteritems():
            wordmap[k] = t / n
        return wordmap
+5

I'm not sure if this will be faster than using a set, but this only requires one pass through the list:

    def countWordDistances(li):
        wordmap = {}
        for i in range(len(li)):
            pos = i + 1  # 1-based position, as in the question
            if li[i] in wordmap:
                avg, num = wordmap[li[i]]
                new_avg = avg*(num/(num+1.0)) + (1.0/(num+1.0))*pos
                wordmap[li[i]] = new_avg, num+1
            else:
                wordmap[li[i]] = (pos, 1)
        return wordmap

This returns a modified version of wordmap: the value associated with each key is a tuple of the average position and the number of occurrences. You can easily convert this to the form of the original output, but that conversion takes a little extra time.

Basically, the code keeps a running average as it iterates through the list, updating it each time with a weighted average.
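The update step is the standard incremental-mean formula: with n occurrences seen so far at average avg, a new 1-based position p gives avg*(n/(n+1)) + p/(n+1). A tiny standalone sketch of that rule (the helper name is just illustrative):

    def update_running_average(avg, n, p):
        # Fold a new position p into an average built from n previous positions.
        return avg * (n / (n + 1.0)) + p / (n + 1.0)

    # 'that' at positions 1 and 6: the average moves from 1.0 to 3.5.
    print(update_running_average(1.0, 1, 6))  # 3.5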

+3

Use a set:

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        wordmap = {}
        unique_words = set(li)
        for w in unique_words:
            distances = [i+1 for i,x in enumerate(li) if x == w]
            wordmap[w] = float(sum(distances)) / float(len(distances))  # take average
        return wordmap
+1

The first thing that comes to mind is to use a set to remove duplicate words:

 unique_words = set(li) 

In general, though, if you are worried about speed, you need to profile the function to see where the bottleneck is, and then try to reduce that bottleneck.
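A minimal way to do that profiling with the standard-library cProfile module (the words list here is placeholder data, not from the question):

    import cProfile

    words = ['that', 'sank', 'into', 'the', 'ocean'] * 2000  # placeholder input

    # Sort by cumulative time to see which calls dominate the runtime.
    cProfile.run('countWordDistances(words)', sort='cumulative')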

+1

Use a frozenset instead of a dict, since you are not doing anything with the values:

    def removeDuplicatesFromList(seq):
        return frozenset(seq)
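Since the caller only iterates over the returned collection, the frozenset is a drop-in replacement for the old dict-of-keys trick; a quick check:

    # The original countWordDistances loops over this result unchanged.
    unique_words = removeDuplicatesFromList(['that', 'sank', 'that'])
    print(sorted(unique_words))  # ['sank', 'that']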
+1

Using list comprehensions:

    def countWordDistances(l):
        unique_words = set(l)
        idx = [[i for i,x in enumerate(l) if x==item] for item in unique_words]
        return {item: 1.*sum(idx[i])/len(idx[i]) + 1. for i,item in enumerate(unique_words)}

    li = ['that','sank','into','the','ocean']
    countWordDistances(li)
    # {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 1.0, 'the': 4.0}

    li2 = ['that','sank','into','the','ocean', 'that']
    countWordDistances(li2)
    # {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 3.5, 'the': 4.0}
0

One-liner:

    from __future__ import division  # no need for this if using py3k

    def countWordDistances(li):
        '''
        If li = ['that','sank','into','the','ocean']
        This function would return:
        { that:1, sank:2, into:3, the:4, ocean:5 }
        However, if there is a duplicate term, take the average of their positions
        '''
        return {w: sum(dist)/len(dist)
                for w, dist in zip(set(li),
                                   ([i+1 for i,x in enumerate(li) if x==w]
                                    for w in set(li)))}

What I'm doing on the last line is a dictionary comprehension, similar to a list comprehension.
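Note that the zip pairs one iteration over set(li) with a second, independently built set(li); both happen to iterate in the same order here, but a variant that computes each word's positions directly avoids depending on that (a sketch with the same output):

    from __future__ import division  # not needed on Python 3

    def countWordDistances(li):
        # Pair each word with its own 1-based positions inside the
        # comprehension, so nothing relies on set iteration order.
        return {w: sum(d) / len(d)
                for w, d in ((w, [i + 1 for i, x in enumerate(li) if x == w])
                             for w in set(li))}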

-1
