import collections

def countWordDistances(li):
    # Record every (1-based) position at which each word occurs.
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(li, 1):
        wordmap[w].append(i)
    # Replace each list of positions with its average.
    for k, v in wordmap.iteritems():
        wordmap[k] = sum(v) / float(len(v))
    return wordmap
This makes only one pass through the list and keeps the number of operations to a minimum. I timed it on a list of words with 1.1M elements and 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10 thousand words with 2k unique, it was more than 300 times faster than the OP's code.
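If you want to sanity-check numbers like these yourself, a rough harness along these lines would do it; the randomly generated corpus and the use of timeit here are my assumptions, not the exact setup behind the figures above.

# Hypothetical benchmark sketch (assumes the Python 2 countWordDistances above is in scope).
import random
import string
import timeit

def make_words(n_words, n_unique):
    # Build a vocabulary of n_unique random 8-letter words, then sample from it.
    vocab = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
             for _ in range(n_unique)]
    return [random.choice(vocab) for _ in range(n_words)]

words = make_words(10000, 2000)
print(timeit.timeit(lambda: countWordDistances(words), number=10))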
To make Python code faster, you need to keep two rules in mind: use the best algorithm and avoid Python.
On the algorithm front, iterating over the list once instead of N + 1 times (N = number of unique words) is the main thing that speeds it up.
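To make the contrast concrete, here is a sketch of the slower, multi-pass shape (my reconstruction for illustration, not the OP's actual code): it scans the whole list once per unique word, whereas the function above records every position in a single pass.

def countWordDistancesSlow(li):
    # One full scan of li per unique word: roughly N + 1 passes in total.
    result = {}
    for w in set(li):
        positions = [i for i, x in enumerate(li, 1) if x == w]
        result[w] = sum(positions) / float(len(positions))
    return result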
On the “avoid Python” front, I mean: you want your code to be executing in C as much as possible. So using defaultdict is better than a dict where you explicitly check whether the key is present: defaultdict does that check for you, but does it in C, inside the Python implementation. enumerate is better than for i in range(len(li)), again because there are fewer Python steps. And enumerate(li, 1) makes the counting start at 1 instead of needing a Python +1 somewhere in the loop.
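As a small illustration of that point (a sketch of the contrast, not code from the original answer): the explicit-check version runs its membership test and its +1 as Python bytecode for every element, while defaultdict and enumerate(li, 1) push both into C.

import collections

li = ["a", "b", "a", "c", "b", "a"]

# Explicit check: the `if` and the manual counter run as Python steps for every element.
positions = {}
i = 0
for w in li:
    i += 1
    if w not in positions:
        positions[w] = []
    positions[w].append(i)

# defaultdict + enumerate(li, 1): missing-key handling and 1-based counting happen in C.
positions2 = collections.defaultdict(list)
for i, w in enumerate(li, 1):
    positions2[w].append(i)

assert positions == dict(positions2)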
Edited to add: third rule: use PyPy. My code runs twice as fast on PyPy as on CPython 2.7.
Ned Batchelder