Efficient way to count an element in a dictionary in Python using a loop

I have a list of values. I want to count, inside a loop, the number of elements in each class (i.e. 1, 2, 3, 4, 5).

 mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
 mydict = dict()
 for index in mylist:
     mydict[index] = +1
 mydict
 Out[344]: {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}

I want to get this result

 Out[344]: {1: 6, 2: 5, 3: 3, 4: 1, 5: 4} 
+7
performance python dictionary coding-style
5 answers

For your smaller example, with a limited variety of elements, you can use a set and a dict comprehension:

 >>> mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
 >>> {k:mylist.count(k) for k in set(mylist)}
 {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}

To break it down, set(mylist) removes the duplicates from the list, leaving only the unique values:

 >>> set(mylist)
 set([1, 2, 3, 4, 5])

The dict comprehension then goes through the unique values and sets each one's count from the list.

On this small input it is also significantly faster than using Counter and faster than using setdefault:

 from __future__ import print_function
 from collections import Counter
 from collections import defaultdict
 import random

 mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]*10

 def s1(mylist):
     # set and dict comprehension
     return {k: mylist.count(k) for k in set(mylist)}

 def s2(mylist):
     # Counter
     return Counter(mylist)

 def s3(mylist):
     # setdefault
     mydict = dict()
     for index in mylist:
         mydict[index] = mydict.setdefault(index, 0) + 1
     return mydict

 def s4(mylist):
     # fromkeys / count
     mydict = {}.fromkeys(mylist, 0)
     for k in mydict:
         mydict[k] = mylist.count(k)
     return mydict

 def s5(mylist):
     # dict.get
     mydict = {}
     for k in mylist:
         mydict[k] = mydict.get(k, 0) + 1
     return mydict

 def s6(mylist):
     # defaultdict
     mydict = defaultdict(int)
     for i in mylist:
         mydict[i] += 1
     return mydict

 def s7(mylist):
     # fromkeys / loop
     mydict = {}.fromkeys(mylist, 0)
     for e in mylist:
         mydict[e] += 1
     return mydict

 if __name__ == '__main__':
     import timeit
     n = 1000000
     print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist", number=n))
     print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter", number=n))
     print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist", number=n))
     print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist", number=n))
     print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist", number=n))
     print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict", number=n))
     print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist", number=n))

On my machine that prints (Python 3):

 18.123854104997008   # set and dict comprehension
 78.54796334600542    # Counter
 33.98185228800867    # setdefault
 19.0563529439969     # fromkeys / count
 34.54294775899325    # dict.get
 21.134678319009254   # defaultdict
 22.760544238000875   # fromkeys / loop

For larger lists, such as 10 million integers, with more diverse elements (about 1,500 distinct random values), use defaultdict or fromkeys in a loop:

 from __future__ import print_function
 from collections import Counter
 from collections import defaultdict
 import random

 mylist = [random.randint(0,1500) for _ in range(10000000)]

 def s1(mylist):
     # set and dict comprehension
     return {k: mylist.count(k) for k in set(mylist)}

 def s2(mylist):
     # Counter
     return Counter(mylist)

 def s3(mylist):
     # setdefault
     mydict = dict()
     for index in mylist:
         mydict[index] = mydict.setdefault(index, 0) + 1
     return mydict

 def s4(mylist):
     # fromkeys / count
     mydict = {}.fromkeys(mylist, 0)
     for k in mydict:
         mydict[k] = mylist.count(k)
     return mydict

 def s5(mylist):
     # dict.get
     mydict = {}
     for k in mylist:
         mydict[k] = mydict.get(k, 0) + 1
     return mydict

 def s6(mylist):
     # defaultdict
     mydict = defaultdict(int)
     for i in mylist:
         mydict[i] += 1
     return mydict

 def s7(mylist):
     # fromkeys / loop
     mydict = {}.fromkeys(mylist, 0)
     for e in mylist:
         mydict[e] += 1
     return mydict

 if __name__ == '__main__':
     import timeit
     n = 1
     print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist", number=n))
     print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter", number=n))
     print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist", number=n))
     print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist", number=n))
     print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist", number=n))
     print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict", number=n))
     print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist", number=n))

Prints:

 2825.2697427899984   # set and dict comprehension
 42.607481333994656   # Counter
 22.77713537499949    # setdefault
 2853.11187016801     # fromkeys / count
 23.241977066005347   # dict.get
 15.023175164998975   # defaultdict
 18.28165417900891    # fromkeys / loop

You can see that solutions that rely on count, calling it even a moderate number of times on a large list, suffer badly, even disastrously, compared to the other solutions: every count call scans the entire list.
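The gap follows from how much of the list each approach has to read. Here is a rough sketch of that arithmetic, assuming a synthetic list with 1,500 distinct values; the function names and the data are illustrative and are not part of the benchmark above:

```python
from collections import defaultdict


def count_with_list_count(values):
    # One full scan of `values` per distinct key: about k * n element visits.
    return {key: values.count(key) for key in set(values)}


def count_in_one_pass(values):
    # A single scan of `values`: n element visits, however many keys there are.
    counts = defaultdict(int)
    for value in values:
        counts[value] += 1
    return dict(counts)


# Synthetic data (an assumption for illustration): 15,000 items, 1,500 distinct values.
values = [i % 1500 for i in range(15_000)]

same_result = count_with_list_count(values) == count_in_one_pass(values)
visits_count_based = len(set(values)) * len(values)  # 1,500 scans of 15,000 items
visits_one_pass = len(values)

print(same_result)         # True
print(visits_count_based)  # 22500000
print(visits_one_pass)     # 15000
```

Both functions return the same counts; only the amount of scanning differs, and it grows with the number of distinct keys for the count-based version.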

+13

Try collections.Counter:

 >>> from collections import Counter
 >>> Counter([1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5])
 Counter({1: 6, 2: 5, 5: 4, 3: 3, 4: 1})

In your code, you can basically replace mydict with Counter and write

 mydict[index] += 1 

instead of

 mydict[index] = +1 
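
The two spellings are easy to confuse: `= +1` is an assignment of the value `+1` (a unary plus), while `+= 1` increments the stored value. A short sketch of both, using the list from the question (the names `overwritten` and `counted` are just for illustration):

```python
from collections import Counter

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

overwritten = {}
for index in mylist:
    overwritten[index] = +1   # same as `= 1`: every visit resets the value

counted = Counter()
for index in mylist:
    counted[index] += 1       # a missing key reads as 0 in a Counter

print(overwritten)    # {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
print(dict(counted))  # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
```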
+6

A counterpart to the setdefault approach is collections.defaultdict. It is a little faster.

 from collections import defaultdict

 def foo(mylist):
     # int() returns 0, so unseen keys start counting from zero
     d = defaultdict(int)
     for i in mylist:
         d[i] += 1
     return d

itertools.groupby provides another option. Its speed is about the same as Counter (at least on 2.7).

 {x[0]:len(list(x[1])) for x in itertools.groupby(sorted(mylist))} 
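
The `sorted` call in that one-liner matters, because `groupby` only groups adjacent equal items. A small sketch with the list from the question shows what happens with and without it (the variable names are illustrative):

```python
from itertools import groupby

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

# Sorted input: each value forms exactly one run, so each key is counted once.
grouped_sorted = {key: len(list(group)) for key, group in groupby(sorted(mylist))}

# Unsorted input: 2 and 3 appear in several runs, and later runs overwrite earlier ones.
grouped_unsorted = {key: len(list(group)) for key, group in groupby(mylist)}

print(grouped_sorted)    # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
print(grouped_unsorted)  # {1: 6, 2: 4, 3: 2, 4: 1, 5: 4}
```

Sorting makes this approach O(n log n), and it needs a sorted copy of the data in memory.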

However, timings on this short test list may not carry over to the 32 GB of data that the OP mentions in a comment.
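
For data of that size, one option is to feed the counter piece by piece rather than build one giant list. This is only a sketch under assumptions: `count_stream` and `fake_chunks` are hypothetical names, and the generator stands in for whatever actually reads the OP's data.

```python
from collections import Counter


def count_stream(chunks):
    """Accumulate counts from an iterable of chunks (each chunk is an iterable)."""
    totals = Counter()
    for chunk in chunks:
        totals.update(chunk)   # adds this chunk's element counts to the running totals
    return totals


def fake_chunks():
    # Hypothetical stand-in for reading a large file one piece at a time.
    yield [1, 1, 1, 2, 3]
    yield [2, 2, 3, 5]
    yield [5, 5, 4]


totals = count_stream(fake_chunks())
print(dict(totals))  # {1: 3, 2: 3, 3: 2, 5: 3, 4: 1}
```

Memory use then depends on the number of distinct keys and the chunk size, not on the total amount of data. Whether this is fast enough for 32 GB has not been measured here.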


I ran several of these options on a word-counting case, in the question "python top N word count, why is the multiprocessor slower than one process".

There, the OP used Counter and tried to speed things up using multiprocessing. With a text file of 1.2 MB, the counter using defaultdict was fast, taking 0.2 seconds. Sorting the output to get the top 40 words took as long as the counting itself.
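
Since the final sort cost as much as the counting in that test, it may be worth selecting only the top N instead of sorting everything. A minimal sketch, assuming a tiny stand-in string in place of the 1.2 MB file and N = 2 instead of 40; whether `heapq.nlargest` actually wins on the original data was not measured here:

```python
import heapq
from collections import defaultdict

text = "the cat and the dog and the bird"   # hypothetical stand-in for the real file

counts = defaultdict(int)
for word in text.split():
    counts[word] += 1

top_n = 2  # the test described above used 40

# Full sort of every (word, count) pair, then slice.
by_full_sort = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Partial selection: keeps only the top_n largest pairs while scanning.
by_heap = heapq.nlargest(top_n, counts.items(), key=lambda item: item[1])

print(by_full_sort)  # [('the', 3), ('and', 2)]
print(by_heap)       # [('the', 3), ('and', 2)]
```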

Counter was a bit slower on 3.2 and much slower on 2.7. This is because 3.2 has a compiled version (a .so file).

But the counter using mylist.count bogs down when processing a large list, taking almost 200 sec. It has to scan this large list many times: once to collect the keys, and then once more for each key when it counts.

+4

To fix the code:

 mydict[index] = +1 

it should be:

 mydict[index] = mydict.setdefault(index, 0) + 1 
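
This works because `setdefault` returns the value already stored for the key, inserting the default first if the key is missing. A short sketch with the list from the question, followed by a comparison with `dict.get`, which returns the default without inserting it (`probe` is an illustrative name):

```python
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

mydict = dict()
for index in mylist:
    # setdefault returns the stored value, inserting 0 first if the key is new
    mydict[index] = mydict.setdefault(index, 0) + 1

print(mydict)  # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}

probe = {}
print(probe.get("x", 0), probe)         # 0 {}
print(probe.setdefault("x", 0), probe)  # 0 {'x': 0}
```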
+3

Your code assigns a value of 1 to each key. Replace mydict[index] = +1 with mydict[index] = mylist.count(index).

This should work:

 mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
 mydict = dict()
 for index in mylist:
     mydict[index] = mylist.count(index)
 mydict
+1
