Count how many times each row appears in a numpy.array

I am trying to count the number of times each row appears in an np.array, for example:

    import numpy as np

    my_array = np.array([[1, 2, 0, 1, 1, 1],
                         [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                         [9, 7, 5, 3, 2, 1],
                         [1, 1, 1, 0, 0, 0],
                         [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                         [1, 1, 1, 1, 1, 0]])

The row [1, 2, 0, 1, 1, 1] appears 3 times.

A simple, naive solution involves converting all my rows to tuples and using collections.Counter, for example:

    from collections import Counter

    def row_counter(my_array):
        list_of_tups = [tuple(ele) for ele in my_array]
        return Counter(list_of_tups)

Which gives:

    In [2]: row_counter(my_array)
    Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach, and perhaps there is a library that provides a built-in way to do this. I tagged the question as pandas because I think pandas may have the tool I'm looking for.

+7
python arrays numpy pandas
5 answers

You can use the answer to this other question to get the number of unique elements.

As of NumPy 1.9, np.unique has an optional return_counts keyword argument, so you can just do:

    >>> my_array
    array([[1, 2, 0, 1, 1, 1],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1],
           [1, 1, 1, 0, 0, 0],
           [1, 2, 0, 1, 1, 1],
           [1, 1, 1, 1, 1, 0]])
    >>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
    >>> b = np.ascontiguousarray(my_array).view(dt)
    >>> unq, cnt = np.unique(b, return_counts=True)
    >>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
    >>> unq
    array([[1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1]])
    >>> cnt
    array([1, 1, 3, 1])

In earlier versions, you can do this as:

    >>> unq, _ = np.unique(b, return_inverse=True)
    >>> cnt = np.bincount(_)
    >>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
    >>> unq
    array([[1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1]])
    >>> cnt
    array([1, 1, 3, 1])
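
For readers on a newer NumPy: as far as I know, np.unique has accepted an axis argument since version 1.13, which would make the void-view step unnecessary. A minimal sketch, assuming that version or later and the same my_array as in the question:

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# axis=0 treats each row as a single element, so no void view is needed
# (assumes NumPy >= 1.13).
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

print(unq)  # unique rows, in sorted order
print(cnt)  # number of occurrences of each unique row
```

The rows and counts come back in the same sorted order as in the void-view version above.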
+9

(This assumes the array is quite small, for example, fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears:

    >>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
    array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
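
To see why the one-liner works, it can help to split it into its three steps. This is only an illustrative breakdown; the intermediate names elementwise and rows_equal are made up for the example:

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Shape (6, 1, 6) compared with shape (6, 6) broadcasts to (6, 6, 6):
# elementwise[i, j, k] is True when my_array[i, k] == my_array[j, k].
elementwise = my_array[:, np.newaxis] == my_array

# Rows i and j are equal only if every column matches: shape (6, 6).
rows_equal = elementwise.all(axis=2)

# Summing along axis 1 counts the rows equal to row i, including itself.
counts = rows_equal.sum(axis=1)
print(counts)
```

The intermediate boolean array holds n * n * m values for an n-by-m input, which is why this approach only suits small arrays.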

+4

Your solution is not bad, but if your matrix is large, you will probably want to hash the rows with something more efficient (compared to what Counter does by default) before counting. You can do this with joblib:

    A = np.random.rand(5, 10000)

    %timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
    10000 loops, best of 3: 132 µs per loop

    %timeit Counter(joblib.hash(row) for row in A).values()
    1000 loops, best of 3: 1.37 ms per loop

    %timeit Counter(tuple(ele) for ele in A).values()
    100 loops, best of 3: 3.75 ms per loop

    %timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
    1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing, but slower than numpy:

    numpy:  100000 loops, best of 3: 15.1 µs per loop
    joblib: 1000 loops, best of 3: 885 µs per loop
    tuple:  10000 loops, best of 3: 27 µs per loop
    pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows, you can probably find a better replacement for Counter to compute the hash frequencies.

Edit: Added numpy timings for @acjr's solution on my system to make comparison easier. The numpy solution is the fastest in both cases.
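
If you want to rerun this kind of comparison outside IPython, the standard timeit module can stand in for %timeit. A rough sketch: the helper names count_broadcast and count_tuples are made up for illustration, joblib and pandas are left out to keep it dependency-light, and the numbers will differ from machine to machine:

```python
import timeit
from collections import Counter

import numpy as np


def count_broadcast(a):
    """Per-row counts via pairwise comparison (the NumPy one-liner)."""
    return (a[:, np.newaxis, :] == a).all(axis=2).sum(axis=1)


def count_tuples(a):
    """Counts per distinct row via tuples and Counter."""
    return Counter(tuple(row) for row in a)


A = np.random.rand(5, 10000)  # same shape as in the timings above

for func in (count_broadcast, count_tuples):
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(A), number=20, repeat=3)) / 20
    print(f"{func.__name__}: {best * 1e6:.1f} microseconds per call")
```

Note that the two helpers return different shapes of result: one count per input row versus one count per distinct row.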

+3
A pandas approach might look like this:

    import pandas as pd

    df = pd.DataFrame(my_array, columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    df.groupby(['c1', 'c2', 'c3', 'c4', 'c5', 'c6']).size()

Note: supplying column names is not required.
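
For instance, here is a sketch that skips the names and groups on whatever default labels pandas assigns (the integers 0 to 5 for this array), assuming the same my_array as in the question:

```python
import numpy as np
import pandas as pd

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# No explicit names: the default integer column labels are used.
df = pd.DataFrame(my_array)

# Group on every column; size() gives the number of rows in each group.
counts = df.groupby(list(df.columns)).size()
print(counts)
```

The result is a Series indexed by the distinct rows, so counts.loc[(1, 2, 0, 1, 1, 1)] looks up the count for a single row.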

+2

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am the author):

    import numpy_indexed as npi

    npi.count(my_array)
0