Set difference of matrix rows as a boolean vector

I have an m x 3 matrix A and a subset B of its rows (n x 3). Both are sets of indices into another large 4D matrix; their data type is dtype('int64'). I would like to create a boolean vector x with x[i] = True if B does not contain the row A[i,:].

There are no duplicate rows in A or B.

Is there an efficient way to do this in NumPy? I found a somewhat related answer: qaru.site/questions/532451/... ; however, it returns the actual rows rather than a boolean vector.
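
For concreteness, here is a tiny illustration of the desired behaviour (the values are made up, not from the original question):

import numpy as np

A = np.array([[0, 0, 0],
              [1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 2, 3]])   # a subset of A's rows

# x[i] should be True exactly when A[i, :] is NOT a row of B,
# i.e. here x == array([ True, False,  True])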

3 answers

Following jterrace's answer, you can view each row as a single element and use np.in1d instead of np.setdiff1d:

import numpy as np
np.random.seed(2015)

m, n = 10, 5
A = np.random.randint(10, size=(m,3))
B = A[np.random.choice(m, n, replace=False)]
print(A)
# [[2 2 9]
#  [6 8 5]
#  [7 8 0]
#  [6 7 8]
#  [3 8 6]
#  [9 2 3]
#  [1 2 6]
#  [2 9 8]
#  [5 8 4]
#  [8 9 1]]

print(B)
# [[2 2 9]
#  [1 2 6]
#  [2 9 8]
#  [3 8 6]
#  [9 2 3]]

def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)

print(using_view(A, B, assume_unique=True))

[False  True  True  True False False False False  True  True]

assume_unique=True can be used here (to speed things up) because the rows of A and B are unique; it lets np.in1d skip its internal deduplication step.
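
To see what the view does: each length-3 row becomes a single structured element, so np.in1d can test whole-row membership. A quick check with the A above:

Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
print(Ad.shape)   # (10, 1): one structured element per row
print(Ad.dtype)   # three int64 fields: [('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]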


Note: if A.view(...) raises

ValueError: new type not compatible with array.

check A.flags['C_CONTIGUOUS']: if it is False (i.e. A is not C-contiguous), the view cannot be taken directly. Wrapping the array with np.ascontiguousarray(A) before calling view, as above, fixes this.
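
A minimal reproduction of that situation (my own sketch; the non-contiguous slice mirrors the benchmark below):

A2 = np.arange(60).reshape(20, 3)[::2]    # every other row -> not C-contiguous
print(A2.flags['C_CONTIGUOUS'])           # False
# A2.view([('', A2.dtype)] * A2.shape[1]) # would raise the ValueError above
Ad2 = np.ascontiguousarray(A2).view([('', A2.dtype)] * A2.shape[1])  # works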


B.M. suggested viewing each row as a single element of "void" dtype instead:

def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd, assume_unique=True)

Be careful with float dtypes, however, since the void view compares raw bytes. For example,

In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
Out[342]: array([False], dtype=bool)

so np.in1d may return incorrect results when the arrays contain floats and the void view is used.
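
If you do need this with float data, one possible workaround (my suggestion, not part of the original answer) is to normalise signed zeros before taking the void view, since -0.0 + 0.0 == 0.0:

A_f = np.array([[-0.0, 1.0, 2.0]])
B_f = np.array([[ 0.0, 1.0, 2.0]])
void = 'V{}'.format(A_f.dtype.itemsize * A_f.shape[-1])
Ad = np.ascontiguousarray(A_f + 0.0).view(void)   # -0.0 becomes +0.0
Bd = np.ascontiguousarray(B_f + 0.0).view(void)
print(np.in1d(Ad, Bd))   # [ True]: the rows now compare equal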


Here is a benchmark comparing the approaches:

import numpy as np
np.random.seed(2015)

m, n = 10000, 5000
# Note A may contain duplicate rows, 
# so don't use assume_unique=True for these benchmarks. 
# In this case, using assume_unique=False does not improve the speed much anyway.
A = np.random.randint(10, size=(2*m,3))
# make A not C_CONTIGUOUS; the view methods fail for non-contiguous arrays
A = A[::2]  
B = A[np.random.choice(m, n, replace=False)]

def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)

from scipy.spatial import distance
def using_distance(A, B):
    return ~np.any(distance.cdist(A,B)==0,1)

from functools import reduce 
def using_loop(A, B):
    pred = lambda i: A[:, i:i+1] == B[:, i]
    return ~reduce(np.logical_and, map(pred, range(A.shape[1]))).any(axis=1)

from pandas.core.groupby import get_group_index, _int64_overflow_possible
from functools import partial
def using_pandas(A, B):
    shape = [1 + max(A[:, i].max(), B[:, i].max()) for i in range(A.shape[1])]
    assert not _int64_overflow_possible(shape)

    encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
    a1, b1 = map(encode, (A.T, B.T))
    return ~np.in1d(a1, b1)

def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd)

# Sanity check: make sure all the functions return the same result
for func in (using_distance, using_loop, using_pandas, using_void):
    assert (func(A, B) == using_view(A, B)).all()

In [384]: %timeit using_pandas(A, B)
100 loops, best of 3: 1.99 ms per loop

In [381]: %timeit using_void(A, B)
100 loops, best of 3: 6.72 ms per loop

In [378]: %timeit using_view(A, B)
10 loops, best of 3: 35.6 ms per loop

In [383]: %timeit using_loop(A, B)
1 loops, best of 3: 342 ms per loop

In [379]: %timeit using_distance(A, B)
1 loops, best of 3: 502 ms per loop

Since the rows have only 3 columns, you can compare them column by column and combine the results:

>>> a
array([[2, 2, 9],
       [6, 8, 5],
       [7, 8, 0],
       [6, 7, 8],
       [3, 8, 6],
       [9, 2, 3],
       [1, 2, 6],
       [2, 9, 8],
       [5, 8, 4],
       [8, 9, 1]])
>>> b
array([[2, 2, 9],
       [1, 2, 6],
       [2, 9, 8],
       [3, 8, 6],
       [9, 2, 3]])

>>> from functools import reduce
>>> pred = lambda i: a[:, i:i+1] == b[:,i]
>>> reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)
array([ True, False, False, False,  True,  True,  True,  True, False, False], dtype=bool)

Note that this returns True where a row of a appears in b; negate it (~), as in the benchmark above, to get the vector asked for in the question. It also builds m x n intermediate arrays, which can be a problem when a and b are large.

In that case, you can encode each row as a single integer using pandas.core.groupby.get_group_index (the same machinery pandas uses internally for groupby) and then apply np.in1d to the encoded values:

>>> from pandas.core.groupby import get_group_index, _int64_overflow_possible
>>> from functools import partial

>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> assert not _int64_overflow_possible(shape)

>>> encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
>>> a1, b1 = map(encode, (a.T, b.T))
>>> np.in1d(a1, b1)
array([ True, False, False, False,  True,  True,  True,  True, False, False], dtype=bool)
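
get_group_index and _int64_overflow_possible are private pandas internals and may move between versions; a similar row-to-integer encoding can be done with public NumPy API via np.ravel_multi_index (my own sketch of the same idea):

shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
a1 = np.ravel_multi_index(a.T, shape)   # one integer per row of a
b1 = np.ravel_multi_index(b.T, shape)
np.in1d(a1, b1)      # True where a row of a appears in b; use ~ for the complement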

You can treat A and B as two sets of XYZ points and compute the Euclidean distances between them using scipy.spatial.distance.cdist. We are interested in the zero distances, which mark the rows of A that also appear in B. The distance computation itself is quite efficient, so this should give a reasonably fast solution. The implementation looks like this:

from scipy.spatial import distance

out = ~np.any(distance.cdist(A,B)==0,1)
# OR np.all(distance.cdist(A,B)!=0,1)

Sample run -

In [582]: A
Out[582]: 
array([[0, 2, 2],
       [1, 0, 3],
       [3, 3, 3],
       [2, 0, 3],
       [2, 0, 1],
       [1, 1, 1]])

In [583]: B
Out[583]: 
array([[2, 0, 3],
       [2, 3, 3],
       [1, 1, 3],
       [2, 0, 1],
       [0, 2, 2],
       [2, 2, 2],
       [1, 2, 3]])

In [584]: out
Out[584]: array([False,  True,  True, False, False,  True], dtype=bool)
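
One caveat (my note, not from the answer): cdist materialises the full m-by-n distance matrix, which can get large. If memory is a concern, the same comparison can be done in row chunks, for example:

import numpy as np
from scipy.spatial import distance

def using_distance_chunked(A, B, chunk=1024):
    # same result as ~np.any(distance.cdist(A, B) == 0, 1), with bounded memory
    out = np.empty(len(A), dtype=bool)
    for start in range(0, len(A), chunk):
        block = A[start:start + chunk]
        out[start:start + chunk] = ~np.any(distance.cdist(block, B) == 0, axis=1)
    return out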