Calculation of the Numpy / Scipy scalar product for specific elements

I have a sparse matrix A,

and a dataframe (df) with the rows that should be used to calculate the scalar product:

    Row1   Row2   Value
    2      147    scalar product of the vectors at Row1 and Row2 in matrix A

Can I do this in a vectorized/broadcast way, without loops?

In my case A has size 1M * 100k and the dataframe has 10M rows.
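A rough sketch of the setup, with the sizes scaled far down (the real A is about 1M x 100k and df has about 10M pairs); the random data here is only a stand-in:

    import numpy as np
    import pandas as pd
    from scipy import sparse

    # Stand-in data, much smaller than the real problem
    A = sparse.random(1000, 100, density=0.01, format='csr')
    df = pd.DataFrame({'Row1': np.random.randint(0, 1000, size=10000),
                       'Row2': np.random.randint(0, 1000, size=10000)})
    # Wanted: for each row of df, the scalar product of A[Row1] and A[Row2],
    # computed without an explicit Python loop.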

3 answers

Start with a small “sparse” matrix (csr is best for math):

    In [167]: A = sparse.csr_matrix([[1, 2, 3],   # Vadim's example
         ...:                        [2, 1, 4],
         ...:                        [0, 2, 2],
         ...:                        [3, 0, 3]])
    In [168]: AA = A.A                             # dense equivalent
    In [169]: idx = np.array([[1,1,0,3],[3,0,0,2]]).T   # index pairs

I will stick with the numpy version (Pandas is built on top of numpy)

We could take all of the row products and select the subset defined by idx:

    In [170]: (AA.dot(AA.T))[idx[:,0], idx[:,1]]
    Out[170]: array([18, 16, 14,  6], dtype=int32)

The sparse matrix product, A.dot(A.T), also works:

    In [171]: (A*A.T)[idx[:,0], idx[:,1]]
    Out[171]: matrix([[18, 16, 14,  6]], dtype=int32)

Or we can select the rows first and then take the sum of the products. We do not want dot here, since that would compute all pairwise combinations.

    In [172]: (AA[idx[:,0]]*AA[idx[:,1]]).sum(axis=1)
    Out[172]: array([18, 16, 14,  6], dtype=int32)

The einsum version of this expression:

    In [180]: np.einsum('ij,ij->i', AA[idx[:,0]], AA[idx[:,1]])
    Out[180]: array([18, 16, 14,  6], dtype=int32)

sparse can do the same (* is matrix product, .multiply is element-by-element):

    In [173]: (A[idx[:,0]].multiply(A[idx[:,1]])).sum(axis=1)
    Out[173]: matrix([[18],
                      [16],
                      [14],
                      [ 6]], dtype=int32)

In this small case, the dense versions are faster. Indexing sparse rows is slow.

    In [181]: timeit np.einsum('ij,ij->i', AA[idx[:,0]], AA[idx[:,1]])
    100000 loops, best of 3: 18.1 µs per loop
    In [182]: timeit (A[idx[:,0]].multiply(A[idx[:,1]])).sum(axis=1)
    1000 loops, best of 3: 1.32 ms per loop
    In [184]: timeit (AA.dot(AA.T))[idx[:,0], idx[:,1]]
    100000 loops, best of 3: 9.62 µs per loop
    In [185]: timeit (A*A.T)[idx[:,0], idx[:,1]]
    1000 loops, best of 3: 689 µs per loop

I almost forgot the iterative versions:

    In [191]: timeit [AA[i].dot(AA[j]) for i,j in idx]
    10000 loops, best of 3: 38.4 µs per loop
    In [192]: timeit [A[i].multiply(A[j]).sum() for i,j in idx]
    100 loops, best of 3: 2.58 ms per loop

Row indexing is faster with the lil format:

    In [207]: Al = A.tolil()
    In [208]: timeit A[idx[:,0]]
    1000 loops, best of 3: 476 µs per loop
    In [209]: timeit Al[idx[:,0]]
    1000 loops, best of 3: 234 µs per loop

But by the time the rows are converted back to csr for the multiplication, this may not save any time.
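For illustration, a rough sketch of that round trip (the variable names are my own): the lil indexing is faster, but the tocsr() conversions before the multiply add the overhead back:

    # lil round trip: fast row selection, then back to csr for the math
    Al = A.tolil()
    rows0 = Al[idx[:, 0]].tocsr()
    rows1 = Al[idx[:, 1]].tocsr()
    res = rows0.multiply(rows1).sum(axis=1)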

================

In other recent SO questions I have explored faster ways of indexing sparse rows or columns. But in those the ultimate goal was to sum over a selected set of rows or columns, and for that the fastest approach is a matrix product with a matrix of 1s and 0s. Applying that idea here is a bit more complicated.
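As a reminder of that idea (a minimal sketch, not taken from those questions): summing a chosen set of rows is just a matrix product with a 0/1 selector:

    import numpy as np
    from scipy import sparse

    A = sparse.csr_matrix([[1, 2, 3], [2, 1, 4], [0, 2, 2], [3, 0, 3]])
    # A 0/1 selector picking rows 0 and 2; the product sums exactly those rows.
    S = sparse.csr_matrix(np.array([[1, 0, 1, 0]]))
    (S * A).toarray()          # equivalent to A[[0, 2], :].sum(axis=0)
    # array([[1, 4, 5]])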

Looking at the csr.__getitem__ indexing code, I found that it actually implements A[idx,:] with a matrix product. It builds an extractor matrix with a function like:

    def extractor(indices, N):
        """Return a sparse matrix P so that P*self implements
        slicing of the form self[[1,2,3],:]
        """
        indptr = np.arange(len(indices)+1, dtype=int)
        data = np.ones(len(indices), dtype=int)
        shape = (len(indices), N)
        return sparse.csr_matrix((data, indices, indptr), shape=shape)

    In [328]: %%timeit
         ...: A1 = extractor(idx[:,0], 4) * A
         ...: A2 = extractor(idx[:,1], 4) * A
         ...: (A1.multiply(A2)).sum(axis=1)
         ...:
    1000 loops, best of 3: 1.14 ms per loop

This time is slightly better than the A[idx[:,0],:] indexing (In [182] above), presumably because it skips some of the indexing overhead. It should scale in the same way.

This works because the extractor built from [1, 1, 0, 3] is a 0/1 selection matrix:

    In [330]: extractor(idx[:,0], 4).A
    Out[330]:
    array([[0, 1, 0, 0],
           [0, 1, 0, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 1]])
    In [296]: A[idx[:,0],:].A
    Out[296]:
    array([[2, 1, 4],
           [2, 1, 4],
           [1, 2, 3],
           [3, 0, 3]], dtype=int32)
    In [331]: (extractor(idx[:,0], 4)*A).A
    Out[331]:
    array([[2, 1, 4],
           [2, 1, 4],
           [1, 2, 3],
           [3, 0, 3]], dtype=int32)

=================

In general, if the problem is too large to work with a dense array directly, the expression that should scale best to the large sparse case is:

    (A[idx[:,0]].multiply(A[idx[:,1]])).sum(axis=1)

If this still runs into memory errors, then iterate, perhaps over chunks of the idx array (or dataframe), as sketched below.
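A minimal sketch of that chunked fallback, assuming idx is the (N, 2) integer array of row pairs; the chunk size is an arbitrary choice:

    chunk = 100000
    out = []
    for start in range(0, len(idx), chunk):
        sl = idx[start:start + chunk]
        # sparse row products for this chunk only, to bound memory use
        prod = (A[sl[:, 0]].multiply(A[sl[:, 1]])).sum(axis=1)
        out.append(np.asarray(prod).ravel())
    result = np.concatenate(out)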


If I understand your question correctly, you can use the dot function in Pandas to calculate the dot product between two rows:

    A['Row1'].dot(A['Row2'])

Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dot.html
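For illustration only, a minimal sketch of Series.dot with the two vectors stored in a dense DataFrame (the names here are made up); with the sparse matrix from the question you would index A directly, as in the other answers:

    import pandas as pd

    # Two rows stored as a dense DataFrame; Series.dot gives their scalar product.
    vecs = pd.DataFrame([[1, 2, 3], [2, 1, 4]], index=['Row1', 'Row2'])
    value = vecs.loc['Row1'].dot(vecs.loc['Row2'])   # 1*2 + 2*1 + 3*4 = 16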


I think .assign() and .apply() (for pandas >= 0.16.0) are suitable:

    import numpy as np
    from pandas import DataFrame
    from scipy.sparse import bsr_matrix

    df = DataFrame(np.random.randint(4, size=(4, 2)), columns=['Row1', 'Row2'])
    A = bsr_matrix([[1, 2, 3],
                    [2, 1, 4],
                    [0, 2, 2],
                    [3, 0, 3]])
    A = A.tocsr()  # Skip this if your matrix is a csc_, csr_, dok_ or lil_matrix

    df.assign(Value=df.apply(lambda row: A[row[0]].dot(A[row[1]].transpose())[0, 0], axis=1))

    Out[15]:
       Row1  Row2  Value
    0     1     3     18
    1     1     0     16
    2     0     0     14
    3     3     2      6
