Get non-duplicated rows from a NumPy array

Let's say I have a NumPy array of the form

x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])

What I would like to get from this is an array that contains only rows that are NOT duplicated, i.e. I expect from this example

array([[4, 5],
       [1, 4]])

I am looking for a method that is reasonably fast and scales well. The only way I can think of is:

  • First find the set of unique rows in x, as a new array y.
  • Create a new array z with those unique rows of y removed from x, so z holds only the duplicated rows of x.
  • Take the set difference of x and z.

It seems terribly inefficient. Does anyone have a better way?
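For reference, the three steps above can be collapsed into a single counting pass with a hash table. This is a plain-Python sketch (not vectorized), counting each row once and keeping rows seen exactly once:

```python
import numpy as np
from collections import Counter

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Hash rows as tuples, count occurrences, keep rows that occur exactly once.
counts = Counter(map(tuple, x))
out = np.array([row for row in x if counts[tuple(row)] == 1])
print(out)  # [[4 5]
            #  [1 4]]
```

This preserves the input's row order, but the per-row Python loop is why the vectorized answers below are preferable at scale.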

Approach #1

Here's one approach using np.unique -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the correspnding rows from input as the final output
out = x[idx[count==1]]
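Run on the sample x from the question, this produces the expected rows, though in the order of the sorted linear indices rather than the input order:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
_, idx, count = np.unique(lid, return_index=True, return_counts=True)
out = x[idx[count == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```

Note that return_counts requires NumPy >= 1.9.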

Alternatively, lid can be computed with a dot product and np.cumprod, like so -

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
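A quick sanity check on the sample x that both formulations yield identical linear indices:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# ravel_multi_index vs. the manual dot-product/cumprod equivalent
lid_a = np.ravel_multi_index(x.T, x.max(0) + 1)
lid_b = x.dot(np.append(1, (x.max(0) + 1)[::-1][:-1].cumprod())[::-1])
print((lid_a == lid_b).all())  # True
```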

Approach #2

Here's another approach using np.bincount -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
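This variant avoids return_counts (useful on NumPy < 1.9); on the sample x it gives the same result as the first approach:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
_, unq_idx, tag_idx = np.unique(lid, return_index=True, return_inverse=True)
# bincount over the inverse indices reproduces the per-row occurrence counts
out = x[unq_idx[np.bincount(tag_idx) == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```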

Approach #3

Here's an approach using convolution, i.e. np.convolve -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
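On the sample x, the convolution picks out the two runs of length one in the sorted lid, selecting the same rows:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
sidx = lid.argsort()
# Boundaries between runs of equal sorted lid values
mask = np.hstack((True, np.diff(lid[sidx]) != 0, True))
# A run of length one has two consecutive True boundaries
out = x[sidx[(np.convolve(mask, [1, 1]) > 1)[1:-1]]]
print(out)  # [[1 4]
            #  [4 5]]
```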

Using pandas:

pd.DataFrame(x).drop_duplicates(keep=False).to_numpy()

#array([[4, 5],
#       [1, 4]])
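Equivalently, a boolean mask from duplicated(keep=False) keeps the result in the input's row order (a sketch assuming pandas >= 0.24 for to_numpy):

```python
import numpy as np
import pandas as pd

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# True for every occurrence of a duplicated row; invert to keep singletons.
dup = pd.DataFrame(x).duplicated(keep=False).to_numpy()
out = x[~dup]
print(out)  # [[4 5]
            #  [1 4]]
```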

Another approach (fine for small arrays): compare every row with every other row via broadcasting and count the matching elements:

b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])

Each row always matches itself, which produces the 2s on the diagonal. Zero them out:

np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])

Now take the column-wise maximum:

c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])

Rows where this maximum is != 2 have no full match elsewhere, so select them:

x[c != 2]
array([[4, 5],
       [1, 4]])
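Note that this builds an n x n comparison table, so it is quadratic in time and memory and only practical for small arrays. A compact restatement (the comparison against 2 works because the rows here have two columns; in general compare against x.shape[1]):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

b = np.sum(x[:, None, :] == x, axis=2)  # pairwise count of matching elements
np.fill_diagonal(b, 0)                  # ignore each row matching itself
out = x[b.max(axis=0) != x.shape[1]]    # rows with no full match elsewhere
print(out)  # [[4 5]
            #  [1 4]]
```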

For completeness, see also exercise 78 at http://www.labri.fr/perso/nrougier/teaching/numpy.100/


This problem can be solved efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
x[npi.multiplicity(x) == 1]

Not only is this solution very readable, it is also efficient and works for any number of columns and any dtype.
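For comparison, NumPy >= 1.13 can also do this without extra dependencies via np.unique over rows (note the result comes back in sorted row order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Treat axis 0 entries (rows) as the units of uniqueness
rows, counts = np.unique(x, axis=0, return_counts=True)
out = rows[counts == 1]
print(out)  # [[1 4]
            #  [4 5]]
```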
