Fastest way to sort corpus dictionary in OrderedDict - python

Given the body / text as such:

Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period . Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful . You have requested a debate on this subject in the course of the next few days , during this part @ -@ session . In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union . 

I could just do this to get a dictionary with frequency layers:

 >>> word_freq = Counter() >>> for line in text.split('\n'): ... for word in line.split(): ... word_freq[word]+=1 ... 

But if the goal is to achieve an ordered dictionary from the highest to the lowest frequency, I will have to do this:

 >>> from collections import OrderedDict >>> sorted_word_freq = OrderedDict() >>> for word, freq in word_freq.most_common(): ... sorted_word_freq[word] = freq ... 

Suppose I have 1 billion keys in a Counter object, iterating through most_common() will have the difficulty of passing through the body (not unique instances) once and the vocabulary (unique key).

Note: Counter.most_common() will call ad-hoc sorted() , see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472

Given this, I saw the following code that uses numpy.argsort() :

 >>> import numpy as np >>> words = word_freq.keys() >>> freqs = word_freq.values() >>> sorted_word_index = np.argsort(freqs) # lowest to highest >>> sorted_word_freq_with_numpy = OrderedDict() >>> for idx in reversed(sorted_word_index): ... sorted_word_freq_with_numpy[words[idx]] = freqs[idx] ... 

Which is faster?

Is there any other faster way to get such an OrderedDict from Counter ?

Other OrderedDict , are there other python objects that reach the same sorted key-value pair?

Assume memory is not a problem. Given 120 GB of RAM, there shouldn't be a lot of problems to save 1 billion key-value pairs correctly? Assume an average of 20 characters per key for 1 billion keys and one integer for each value.

+5
source share
2 answers

The Pandas Series object is an array of key-value pairs (which may have non-unique keys) that may be of interest. It has a sort method that sorts by values ​​and is implemented in Cython. Here's an example of sorting an array of one million long:

 In [39]: import pandas as pd import numpy as np arr = np.arange(1e6) np.random.shuffle(arr) s = pd.Series(arr, index=np.arange(1e6)) %timeit s.sort() %timeit sorted(arr) 1 loops, best of 3: 85.8 ms per loop 1 loops, best of 3: 1.15 s per loop 

For a regular Python dict you can build a Series by calling:

 my_series = pd.Series(my_dict) 

Then sort by value

 my_series.sort() 
+3
source

One step to improving speed is the optimal choice of meter.

For example, with your txt (802 char).

 mycounter=Counter(txt.split()) 

creates the same as your word_counter , but 1/3 times.

Or, if you need to read text line by line from a file, use:

 word_freq=Counter() for line in txt.splitlines(): word_freq.update(line.split()) 

Similarly, an ordered dictionary can be created without a loop:

 mydict = OrderedDict(sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True)) 

Here I call sorted in the same way as most_common (via your link). And I pass the list of sorted items directly to the OrderedDict creator.

When I look at mycounter in ipython , I get the values ​​in sorted order:

 In [160]: mycounter Out[160]: Counter({'the': 13, ',': 10, 'of': 9, 'a': 7, '.': 4, 'in': 4, 'to': 3, 'have': 3, 'session': 3, ''': 3, 'on': 3, 'you': 3, 'I': 3, 'that': 2, 'requested': 2, 'like': 2, 'European': 2, 'this': 2, 'countries': 2, 'as': 2, 'number': 2, 's': 1, 'various': 1, 'wish': 1, 'will': 1, 'Parliament': 1, 'meantime': 1, 'Resumption': 1, 'natural': 1, 'days': 1, 'debate': 1, 'You': 1, 'Members': 1, 'next': 1, '@ -@ ': 1, 'hope': 1, 'enjoyed': 1, 'December': 1, 'victims': 1, 'particularly': 1, 'millennium': 1, .... 'behalf': 1, 'were': 1, 'failed': 1}) 

This is because its __repr__ method calls most_common . Again this is from your link.

 items = ', '.join(map('%r: %r'.__mod__, self.most_common())) 

With further testing, I see that calling sorted directly does not save time:

 In [166]: timeit mycounter.most_common() 10000 loops, best of 3: 31.1 µs per loop In [167]: timeit sorted(mycounter.items(),key=operator.itemgetter(1),reverse=True) 10000 loops, best of 3: 30.5 µs per loop In [168]: timeit OrderedDict(mycounter.most_common()) 1000 loops, best of 3: 225 µs per loop 

In this case, loading the dictionary directly does not save time. Your iteration does the same:

 In [174]: %%timeit .....: sorteddict=OrderedDict() .....: for word,freq in word_freq.most_common(): sorteddict[word]=freq .....: 1000 loops, best of 3: 224 µs per loop 

For this example, using np.argsort does not help (in time). Just calling argsort is slower than most_common .

 In [178]: timeit np.argsort(list(mycounter.values())) 10000 loops, best of 3: 34.2 µs per loop 

In most cases, this is converting the list to an array, x=np.array(list(mycounter.values())) . np.argsort(x) is much faster. This applies to many numpy functions. When working on arrays, numpy is fast. But when converting lists to arrays there is a lot of overhead.

I can create an OrderedDict via numpy on one line with:

 OrderedDict(np.sort(np.array(list(mycounter.items()), dtype='a12,i'), order='f1')[::-1]) 

or in pieces:

 lla = np.array(list(mycounter.items()),dtype='a12,i') lla.sort(order='f1') OrderedDict(lla[::-1]) 

I make a structured array from items() , sorting it by the second field, and then creating a dictionary. However, there is no time saving. See fooobar.com/questions/1227937 / ... for another recent example of using order to sort a structured array.

+2
source

All Articles