I am trying to calculate a simple point product, but leave the nonzero values from the original matrix unchanged. Toy example:
import numpy as np A = np.array([[2, 1, 1, 2], [0, 2, 1, 0], [1, 0, 1, 1], [2, 2, 1, 0]]) B = np.array([[ 0.54331039, 0.41018682, 0.1582158 , 0.3486124 ], [ 0.68804647, 0.29520239, 0.40654206, 0.20473451], [ 0.69857579, 0.38958572, 0.30361365, 0.32256483], [ 0.46195299, 0.79863505, 0.22431876, 0.59054473]])
Desired Result:
C = np.array([[ 2. , 1. , 1. , 2. ], [ 2.07466874, 2. , 1. , 0.73203386], [ 1. , 1.5984076 , 1. , 1. ], [ 2. , 2. , 1. , 1.42925865]])
The actual matrices in question, however, are sparse and more like this:
A = sparse.rand(250000, 1700, density=0.001, format='csr') B = sparse.rand(1700, 1700, density=0.02, format='csr')
One easy way is to simply set the values using the mask index, for example:
mask = A != 0 C = A.dot(B) C[mask] = A[mask]
However, my original arrays are sparse and quite large, so changing them using index assignment is very slow. Converting to lil matrix helps a bit, but then again, converting takes a lot of time.
Another obvious approach, I think, would be to just resort to iteration and skip masked values, but I would not want to throw away the advantages of multiplying the numpy / scipy-optimized array.
Some explanations: I am really interested in some special case where B always a square, and therefore A and C have the same shape. Therefore, if there is a solution that does not work on arbitrary arrays, but is suitable for my case, this is fine.
UPDATE: Some attempts:
from scipy import sparse import numpy as np def naive(A, B): mask = A != 0 out = A.dot(B).tolil() out[mask] = A[mask] return out.tocsr() def proposed(A, B): Az = A == 0 R, C = np.where(Az) out = A.copy() out[Az] = np.einsum('ij,ji->i', A[R], B[:, C]) return out %timeit naive(A, B) 1 loops, best of 3: 4.04 s per loop %timeit proposed(A, B) /usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py:215: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead. /usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.pyc in __init__(self, arg1, shape, dtype, copy) 173 self.shape = M.shape 174 --> 175 self.row, self.col = M.nonzero() 176 self.data = M[self.row, self.col] 177 self.has_canonical_format = True MemoryError:
OTHER UPDATE:
Can't do anything more or less useful from Cython, at least not going too far from Python. The idea was to leave the point product in scipy and just try to set these initial values as quickly as possible, something like this:
cimport cython @cython.cdivision(True) @cython.boundscheck(False) @cython.wraparound(False) cpdef coo_replace(int [:] row1, int [:] col1, float [:] data1, int[:] row2, int[:] col2, float[:] data2): cdef int N = row1.shape[0] cdef int M = row2.shape[0] cdef int i, j cdef dict d = {} for i in range(M): d[(row2[i], col2[i])] = data2[i] for j in range(N): if (row1[j], col1[j]) in d: data1[j] = d[(row1[j], col1[j])]
It was a little better than my preliminary “naive” implementation (using .tolil() ), but after the approach hpaulj lil can be thrown away. Perhaps replacing the python dict with something like std :: map will help.