Efficiently accumulating a collection of sparse scipy matrices

I have a collection of O(N) NxN scipy.sparse.csr_matrix matrices, each with on the order of N nonzero elements. I want to add them all together to get a regular NxN numpy array (N is of the order of 1000). The arrangement of the nonzero elements is such that the resulting sum is certainly not sparse (in fact, there are practically no zero elements left).

Right now I'm just doing

 reduce(lambda x,y: x+y,[m.toarray() for m in my_sparse_matrices]) 

which works, but is a little slow: of course the huge amount of pointless processing of zeros going on in there is absolutely horrifying.

Is there a better way? There's nothing obvious to me in the docs.

Update: at user545424's suggestion, I tried an alternative scheme of summing the sparse matrices directly, and also summing the sparse matrices onto a dense array. The code below shows all the approaches running in comparable time (Python 2.6.6 on amd64 Debian/Squeeze on a quad-core i7):

    import numpy as np
    import numpy.random
    import scipy
    import scipy.sparse
    import time

    N=768
    S=768
    D=3

    def mkrandomsparse():
        m=np.zeros((S,S),dtype=np.float32)
        r=np.random.random_integers(0,S-1,D*S)
        c=np.random.random_integers(0,S-1,D*S)
        for e in zip(r,c):
            m[e[0],e[1]]=1.0
        return scipy.sparse.csr_matrix(m)

    M=[mkrandomsparse() for i in xrange(N)]

    def plus_dense():
        return reduce(lambda x,y: x+y,[m.toarray() for m in M])

    def plus_sparse():
        return reduce(lambda x,y: x+y,M).toarray()

    def sum_dense():
        return sum([m.toarray() for m in M])

    def sum_sparse():
        return sum(M[1:],M[0]).toarray()

    def sum_combo():
        # Sum the sparse matrices 'onto' a dense matrix?
        return sum(M,np.zeros((S,S),dtype=np.float32))

    def benchmark(fn):
        t0=time.time()
        fn()
        t1=time.time()
        print "{0:16}: {1:.3f}s".format(fn.__name__,t1-t0)

    for i in xrange(4):
        benchmark(plus_dense)
        benchmark(plus_sparse)
        benchmark(sum_dense)
        benchmark(sum_sparse)
        benchmark(sum_combo)
        print

and prints out

    plus_dense      : 1.368s
    plus_sparse     : 1.405s
    sum_dense       : 1.368s
    sum_sparse      : 1.406s
    sum_combo       : 1.039s

although you can get one approach or another to come out ahead by a factor of 2 or so by messing around with the N, S, D parameters... but it's nothing like the order-of-magnitude improvement you'd hope to see, given the number of additions of zero that ought to be skippable.

4 answers

I think I found a way to speed it up 10 times if your matrices are very sparse.

    In [1]: from scipy.sparse import csr_matrix

    In [2]: def sum_sparse(m):
       ...:     x = np.zeros(m[0].shape)
       ...:     for a in m:
       ...:         ri = np.repeat(np.arange(a.shape[0]),np.diff(a.indptr))
       ...:         x[ri,a.indices] += a.data
       ...:     return x
       ...:

    In [6]: m = [np.zeros((100,100)) for i in range(1000)]

    In [7]: for x in m:
       ...:     x.ravel()[np.random.randint(0,x.size,10)] = 1.0
       ...: m = [csr_matrix(x) for x in m]

    In [17]: (sum(m[1:],m[0]).todense() == sum_sparse(m)).all()
    Out[17]: True

    In [18]: %timeit sum(m[1:],m[0]).todense()
    10 loops, best of 3: 145 ms per loop

    In [19]: %timeit sum_sparse(m)
    100 loops, best of 3: 18.5 ms per loop
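
For readers unfamiliar with the CSR internals used here: `a.indptr[i]:a.indptr[i+1]` delimits the slice of `a.indices`/`a.data` belonging to row i, so `np.diff(a.indptr)` gives the number of stored entries per row, and `np.repeat` expands that into an explicit row index for every stored entry. A small illustration on a toy matrix of my own (not part of the answer above):

    import numpy as np
    from scipy.sparse import csr_matrix

    a = csr_matrix(np.array([[0, 2, 0],
                             [3, 0, 4],
                             [0, 0, 5]]))

    print a.indptr    # [0 1 3 4] : row i occupies data[indptr[i]:indptr[i+1]]
    print a.indices   # [1 0 2 2] : column index of each stored entry
    print a.data      # [2 3 4 5] : the stored values

    # Reconstruct an explicit row index for every stored entry:
    ri = np.repeat(np.arange(a.shape[0]), np.diff(a.indptr))
    print ri          # [0 1 1 2]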

@user545424 has already posted what is probably the fastest solution. Here is something in the same spirit which is more readable and about the same speed... nonzero() has all sorts of useful applications.

    def sum_sparse(m):
        x = np.zeros(m[0].shape,m[0].dtype)
        for a in m:
            # old lines
            #ri = np.repeat(np.arange(a.shape[0]),np.diff(a.indptr))
            #x[ri,a.indices] += a.data
            # new line
            x[a.nonzero()] += a.data
        return x
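
One caveat worth noting (my observation, not the answer's): `x[a.nonzero()] += a.data` relies on `nonzero()` returning indices in the same order as `a.data`, which holds for CSR matrices in canonical form with no explicitly stored zeros (`nonzero()` filters those out, `a.data` does not). A quick sanity check along those lines, with made-up toy data:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy data: a handful of small, very sparse matrices.
    dense = [np.zeros((50, 50)) for i in range(20)]
    for d in dense:
        d.ravel()[np.random.randint(0, d.size, 30)] = np.random.rand(30)
    m = [csr_matrix(d) for d in dense]

    def sum_sparse(m):
        x = np.zeros(m[0].shape, m[0].dtype)
        for a in m:
            x[a.nonzero()] += a.data
        return x

    # Agrees with the straightforward dense sum on this data.
    assert np.allclose(sum_sparse(m), sum(dense))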

Can't you just add them together before converting to a dense matrix?

 >>> sum(my_sparse_matrices[1:],my_sparse_matrices[0]).todense() 

This is not a complete answer (and I'd also like to see a more detailed one), but you can get an easy factor-of-two or better improvement by not building the intermediate list:

    def sum_dense():
        return sum([m.toarray() for m in M])

    def sum_dense2():
        return sum((m.toarray() for m in M))

On my machine (YMMV) this speeds up the calculation. By putting the expression in () rather than [], we build a generator and sum from it as we go, instead of constructing the entire list of dense arrays before summing.
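
In the same spirit, if you also want to avoid the temporary array that each addition inside sum() allocates, a plain loop accumulating in place with += onto one preallocated dense array is another option. This is just a sketch of the idea, not benchmarked here; it assumes the M, S and np names from the question's benchmark script, and sum_dense3 is my own name:

    def sum_dense3():
        # Accumulate in place: each m.toarray() still allocates a dense
        # temporary, but no new array is created for the addition itself.
        acc = np.zeros((S, S), dtype=np.float32)
        for m in M:
            acc += m.toarray()
        return acc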

