Extremely slow row row operation in sparse LIL matrix in Python

Question

Extremely slow row row operation in sparse LIL matrix in Python

I wrote this code in Python that gives the expected results, but is extremely slow. The bottleneck is the summation of several scipy.sparse.lil_matrix rows. How can I do it fast?

# D1 is a 1.5M x 1.3M sparse matrix, read as scipy.sparse.lil_matrix. # D2 is a 1.5M x 111 matrix, read as numpy.array # F1 is a csv file, read using csv.reader for row in F1: user_id = row[0] clust = D2[user_id, 110] neighbors = D2[ D2[:, 110] == clust][:,1] score = np.zeros(1300000) for neigh in neighbors: score = score + D1 [neigh, :] # the most expensive operation toBeWritten = np.argsort(score)[:,::-1].A[0,:]

Please let me know if there is anything else that is not very optimal.

+1

python scipy sparse-matrix

Sonu mishra May 14, '16 at 20:21

source share

1 answer

hpaulj · Accepted Answer · 2016-05-14T21:06:54+0000

First demo with a very small matrix

 In [523]: idx=np.arange(0,8,2) In [526]: D=np.arange(24).reshape(8,3) In [527]: Dll=sparse.lil_matrix(D) In [528]: D[idx,:].sum(axis=0) Out[528]: array([36, 40, 44]) In [529]: Dll[idx,:].sum(axis=0) Out[529]: matrix([[36, 40, 44]], dtype=int32) In [530]: timeit D[idx,:].sum(axis=0) 100000 loops, best of 3: 17.3 µs per loop In [531]: timeit Dll[idx,:].sum(axis=0) 1000 loops, best of 3: 1.16 ms per loop In [532]: score=np.zeros(3) # your looping version In [533]: for i in idx: .....: score = score + Dll[i,:] In [534]: score Out[534]: matrix([[ 36., 40., 44.]]) In [535]: %%timeit .....: score=np.zeros(3) .....: for i in idx: score = score + Dll[i,:] .....: 100 loops, best of 3: 2.76 ms per loop

For some operations, the csr format is faster:

 In [537]: timeit Dll.tocsr()[idx,:].sum(axis=0) 1000 loops, best of 3: 955 µs per loop

or if I pre-convert to csr:

 In [538]: Dcsr=Dll.tocsr() In [539]: timeit Dcsr[idx,:].sum(axis=0) 1000 loops, best of 3: 724 µs per loop

Still slow relatively dense.

I was going to talk about working with sparse matrix data attributes as a way to quickly select rows. But if the only purpose for selecting these lines is to summarize their values, we do not need to do this.

The sum of sparse matrices in rows or columns, performing a matrix product with a column or row matrix of them. And I answered another question with the same answer.

fooobar.com/questions/1006020 / ... Efficiently compute columnwise sum of sparse array where every non-zero element is 1

For instance:

 In [588]: I=np.asmatrix(np.zeros((1,Dll.shape[0]))) In [589]: I[:,idx]=1 In [590]: I Out[590]: matrix([[ 1., 0., 1., 0., 1., 0., 1., 0.]]) In [591]: I*Dll Out[591]: matrix([[ 36., 40., 44.]]) In [592]: %%timeit I=np.asmatrix(np.zeros((1,Dll.shape[0]))) I[:,idx]=1 I*Dll .....: 1000 loops, best of 3: 919 µs per loop

For this small matrix, this did not help the speed, but with Dcsr time drops to 215 µs (this is much better for mathematics). With large matrices, this version of the product will improve.

==================

I just found out in another question that the row selection A_csr[[1,1,0,3],:] actually done using the matrix product. It creates a csr extractor matrix that looks like

 matrix([[0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]])

fooobar.com/questions/1006022 / ...

Extremely slow row row operation in sparse LIL matrix in Python

More articles: