One step toward improving speed is choosing the right counter.
For example, with your txt (802 characters),
mycounter = Counter(txt.split())
creates the same result as your word_counter, but in about a third of the time.
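To make that concrete, here is a minimal sketch with a made-up sample string standing in for your 802-character txt; the one Counter call does all the counting that the manual dict-and-loop version of word_counter would:

```python
from collections import Counter

# Toy text standing in for your `txt` (the real one is 802 characters).
txt = "the cat sat on the mat the cat"

# One constructor call replaces the manual dict-and-loop word_counter.
mycounter = Counter(txt.split())

print(mycounter.most_common())  # pairs sorted by descending count
```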
Or, if you need to read text line by line from a file, use:
word_freq = Counter()
for line in txt.splitlines():
    word_freq.update(line.split())
Similarly, an ordered dictionary can be created without a loop:
mydict = OrderedDict(sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True))
Here I call sorted the same way most_common does (per your link), and pass the sorted list of items directly to the OrderedDict constructor.
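As a quick check that the two orderings agree — a small sketch with a made-up counter; when all the counts are distinct, the explicit sorted call and most_common produce exactly the same list, so either can feed OrderedDict directly:

```python
import operator
from collections import Counter, OrderedDict

mycounter = Counter("a b b c c c".split())

# Explicit sort by count, descending -- the same key most_common uses.
by_sorted = sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True)

# The sorted item list goes straight into the OrderedDict constructor.
mydict = OrderedDict(by_sorted)

print(list(mydict.items()))  # [('c', 3), ('b', 2), ('a', 1)]
```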
When I display mycounter in ipython, I get the values in sorted order:
In [160]: mycounter
Out[160]: Counter({'the': 13, ',': 10, 'of': 9, 'a': 7, '.': 4, 'in': 4, 'to': 3, 'have': 3, 'session': 3, "'": 3, 'on': 3, 'you': 3, 'I': 3, 'that': 2, 'requested': 2, 'like': 2, 'European': 2, 'this': 2, 'countries': 2, 'as': 2, 'number': 2, 's': 1, 'various': 1, 'wish': 1, 'will': 1, 'Parliament': 1, 'meantime': 1, 'Resumption': 1, 'natural': 1, 'days': 1, 'debate': 1, 'You': 1, 'Members': 1, 'next': 1, '@-@': 1, 'hope': 1, 'enjoyed': 1, 'December': 1, 'victims': 1, 'particularly': 1, 'millennium': 1, ..., 'behalf': 1, 'were': 1, 'failed': 1})
This is because its __repr__ method calls most_common; again, this is from your link:
items = ', '.join(map('%r: %r'.__mod__, self.most_common()))
With further testing, I see that calling sorted directly does not save time:
In [166]: timeit mycounter.most_common()
10000 loops, best of 3: 31.1 µs per loop

In [167]: timeit sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True)
10000 loops, best of 3: 30.5 µs per loop

In [168]: timeit OrderedDict(mycounter.most_common())
1000 loops, best of 3: 225 µs per loop
So constructing the OrderedDict directly does not save time here either. Your explicit loop takes just as long:
In [174]: %%timeit
     ...: sorteddict = OrderedDict()
     ...: for word, freq in word_freq.most_common():
     ...:     sorteddict[word] = freq
     ...:
1000 loops, best of 3: 224 µs per loop
For this example, np.argsort does not help time-wise either. Just calling argsort is already slower than most_common:
In [178]: timeit np.argsort(list(mycounter.values()))
10000 loops, best of 3: 34.2 µs per loop
Most of that time is spent converting the list to an array. If x = np.array(list(mycounter.values())) already exists, then np.argsort(x) is much faster. This applies to many numpy functions: when working on arrays, numpy is fast, but converting lists to arrays carries a lot of overhead.
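A small sketch of that point, again with a made-up counter: pay the list-to-array conversion once, then the pure-numpy call is cheap. This assumes keys() and values() iterate in matching order, which holds in modern Python:

```python
import numpy as np
from collections import Counter

mycounter = Counter("a b b c c c".split())

# Pay the list-to-array conversion cost once, up front ...
x = np.array(list(mycounter.values()))

# ... then argsort works on a plain array; it returns ascending indices.
idx = np.argsort(x)

# Map the index order back to words; reverse to mimic most_common().
words = list(mycounter.keys())
ranked = [words[i] for i in idx[::-1]]
print(ranked)  # ['c', 'b', 'a']
```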
I can create an OrderedDict via numpy on one line with:
OrderedDict(np.sort(np.array(list(mycounter.items()), dtype='a12,i'), order='f1')[::-1])
or in pieces:
lla = np.array(list(mycounter.items()), dtype='a12,i')
lla.sort(order='f1')
OrderedDict(lla[::-1])
I make a structured array from items(), sort it by the second field, and then create the dictionary. However, there is no time saving here either. See fooobar.com/questions/1227937 / ... for another recent example of using order to sort a structured array.
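The structured-array route can be sketched end to end with a made-up counter. One caveat to be aware of: under Python 3 the 'a12' field stores bytes, so the resulting keys come out as b'...' rather than str:

```python
import numpy as np
from collections import Counter, OrderedDict

mycounter = Counter("a b b c c c".split())

# Structured array: field f0 is the word (12-byte string), f1 the count.
lla = np.array(list(mycounter.items()), dtype='a12,i')

lla.sort(order='f1')          # in-place ascending sort on the count field
od = OrderedDict(lla[::-1])   # reversed view gives descending counts

print(list(od.items()))       # note the byte-string keys under Python 3
```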