Fastest way to sort corpus dictionary in OrderedDict - python

Question

Fastest way to sort corpus dictionary in OrderedDict - python

Given the body / text as such:

Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period . Although , as you will have seen , the dreaded &apos; millennium bug &apos; failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful . You have requested a debate on this subject in the course of the next few days , during this part @ -@ session . In the meantime , I should like to observe a minute &apos; s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .

I could just do this to get a dictionary with frequency layers:

 >>> word_freq = Counter() >>> for line in text.split('\n'): ... for word in line.split(): ... word_freq[word]+=1 ...

But if the goal is to achieve an ordered dictionary from the highest to the lowest frequency, I will have to do this:

 >>> from collections import OrderedDict >>> sorted_word_freq = OrderedDict() >>> for word, freq in word_freq.most_common(): ... sorted_word_freq[word] = freq ...

Suppose I have 1 billion keys in a Counter object, iterating through most_common() will have the difficulty of passing through the body (not unique instances) once and the vocabulary (unique key).

Note: Counter.most_common() will call ad-hoc sorted() , see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472

Given this, I saw the following code that uses numpy.argsort() :

 >>> import numpy as np >>> words = word_freq.keys() >>> freqs = word_freq.values() >>> sorted_word_index = np.argsort(freqs) # lowest to highest >>> sorted_word_freq_with_numpy = OrderedDict() >>> for idx in reversed(sorted_word_index): ... sorted_word_freq_with_numpy[words[idx]] = freqs[idx] ...

Which is faster?

Is there any other faster way to get such an OrderedDict from Counter ?

Other OrderedDict , are there other python objects that reach the same sorted key-value pair?

Assume memory is not a problem. Given 120 GB of RAM, there shouldn't be a lot of problems to save 1 billion key-value pairs correctly? Assume an average of 20 characters per key for 1 billion keys and one integer for each value.

+5

python dictionary numpy counter ordereddictionary

alvas Aug 2 '15 at 10:45

source share

2 answers

One step to improving speed is the optimal choice of meter.

For example, with your txt (802 char).

 mycounter=Counter(txt.split())

creates the same as your word_counter , but 1/3 times.

Or, if you need to read text line by line from a file, use:

 word_freq=Counter() for line in txt.splitlines(): word_freq.update(line.split())

Similarly, an ordered dictionary can be created without a loop:

 mydict = OrderedDict(sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True))

Here I call sorted in the same way as most_common (via your link). And I pass the list of sorted items directly to the OrderedDict creator.

When I look at mycounter in ipython , I get the values in sorted order:

 In [160]: mycounter Out[160]: Counter({'the': 13, ',': 10, 'of': 9, 'a': 7, '.': 4, 'in': 4, 'to': 3, 'have': 3, 'session': 3, '&apos;': 3, 'on': 3, 'you': 3, 'I': 3, 'that': 2, 'requested': 2, 'like': 2, 'European': 2, 'this': 2, 'countries': 2, 'as': 2, 'number': 2, 's': 1, 'various': 1, 'wish': 1, 'will': 1, 'Parliament': 1, 'meantime': 1, 'Resumption': 1, 'natural': 1, 'days': 1, 'debate': 1, 'You': 1, 'Members': 1, 'next': 1, '@ -@ ': 1, 'hope': 1, 'enjoyed': 1, 'December': 1, 'victims': 1, 'particularly': 1, 'millennium': 1, .... 'behalf': 1, 'were': 1, 'failed': 1})

This is because its __repr__ method calls most_common . Again this is from your link.

 items = ', '.join(map('%r: %r'.__mod__, self.most_common()))

With further testing, I see that calling sorted directly does not save time:

 In [166]: timeit mycounter.most_common() 10000 loops, best of 3: 31.1 µs per loop In [167]: timeit sorted(mycounter.items(),key=operator.itemgetter(1),reverse=True) 10000 loops, best of 3: 30.5 µs per loop In [168]: timeit OrderedDict(mycounter.most_common()) 1000 loops, best of 3: 225 µs per loop

In this case, loading the dictionary directly does not save time. Your iteration does the same:

 In [174]: %%timeit .....: sorteddict=OrderedDict() .....: for word,freq in word_freq.most_common(): sorteddict[word]=freq .....: 1000 loops, best of 3: 224 µs per loop

For this example, using np.argsort does not help (in time). Just calling argsort is slower than most_common .

 In [178]: timeit np.argsort(list(mycounter.values())) 10000 loops, best of 3: 34.2 µs per loop

In most cases, this is converting the list to an array, x=np.array(list(mycounter.values())) . np.argsort(x) is much faster. This applies to many numpy functions. When working on arrays, numpy is fast. But when converting lists to arrays there is a lot of overhead.

I can create an OrderedDict via numpy on one line with:

 OrderedDict(np.sort(np.array(list(mycounter.items()), dtype='a12,i'), order='f1')[::-1])

or in pieces:

 lla = np.array(list(mycounter.items()),dtype='a12,i') lla.sort(order='f1') OrderedDict(lla[::-1])

I make a structured array from items() , sorting it by the second field, and then creating a dictionary. However, there is no time saving. See fooobar.com/questions/1227937 / ... for another recent example of using order to sort a structured array.

+2

hpaulj Aug 08 '15 at 16:51

source share

Joecondron · Accepted Answer · 2015-08-08T17:42:53+0000

The Pandas Series object is an array of key-value pairs (which may have non-unique keys) that may be of interest. It has a sort method that sorts by values and is implemented in Cython. Here's an example of sorting an array of one million long:

 In [39]: import pandas as pd import numpy as np arr = np.arange(1e6) np.random.shuffle(arr) s = pd.Series(arr, index=np.arange(1e6)) %timeit s.sort() %timeit sorted(arr) 1 loops, best of 3: 85.8 ms per loop 1 loops, best of 3: 1.15 s per loop

For a regular Python dict you can build a Series by calling:

 my_series = pd.Series(my_dict)

Then sort by value

 my_series.sort()

Fastest way to sort corpus dictionary in OrderedDict - python

More articles: