How to create a huge sparse matrix in scipy

I am trying to create a very large sparse matrix with shape (447957347, 5027974). It contains 3,289,288,566 non-zero elements.

But when I create the csr_matrix using scipy.sparse, it returns something like this:

 <447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>' with -1005678730 stored elements in Compressed Sparse Row format> 

Source code for creating a matrix:

 import numpy as np
 from scipy.sparse import csr_matrix

 indptr = np.array(a, dtype=np.uint32)   # a is a Python array('L') containing the row pointer information
 indices = np.array(b, dtype=np.uint32)  # b is a Python array('L') containing the column index information
 data = np.ones((len(indices),), dtype=np.uint32)
 test = csr_matrix((data, indices, indptr), shape=(len(indptr)-1, 5027974), dtype=np.uint32)
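For reference, here is a tiny self-contained version of the same (data, indices, indptr) construction, with made-up values, showing what each of the three arrays means in the CSR format:

 import numpy as np
 from scipy.sparse import csr_matrix

 indptr = np.array([0, 2, 3, 5])      # row i occupies indices[indptr[i]:indptr[i+1]]
 indices = np.array([0, 3, 1, 0, 4])  # column index of each stored element
 data = np.ones(5, dtype=np.uint32)   # the stored values themselves
 m = csr_matrix((data, indices, indptr), shape=(3, 5027974), dtype=np.uint32)
 print(repr(m))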

I also found that when I convert a Python array with 3 billion elements to a NumPy array, it throws an error:

 ValueError: setting an array element with a sequence 

But when I create three Python arrays of 1 billion elements each, convert them to NumPy arrays, and combine them, it works fine.
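A minimal sketch of that chunked workaround, with small made-up sizes standing in for the billions of elements:

 import numpy as np
 from array import array

 a = array('L', range(30))   # stands in for the huge Python array
 chunk_size = 10             # stands in for ~1 billion elements
 parts = [np.array(a[i:i + chunk_size], dtype=np.uint32)
          for i in range(0, len(a), chunk_size)]
 combined = np.concatenate(parts)
 print(combined.shape, combined.dtype)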

I am confused.

1 answer

You are using an older version of SciPy. In the original implementation of sparse matrices, the index arrays are stored as int32, even on 64-bit systems. Even if you define them as uint32, as you do, they get cast. So whenever your matrix has more than 2^31 - 1 non-zero entries, as in your case, the indexing overflows and many bad things happen. Note that this explains the strange negative element count in your case:

 >>> np.int32(np.int64(3289288566))
 -1005678730
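The arithmetic behind that value: a count above 2^31 - 1 stored in a 32-bit signed integer wraps around modulo 2^32, so the reported element count is your true count shifted down by 2^32:

 >>> 3289288566 - 2**32
 -1005678730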

The good news is that this has already been fixed. I think this is the relevant PR, although a few more fixes landed after it. In any case, if you use the latest release candidate of SciPy 0.14, your problem should go away.
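As a minimal sketch of how to check your installation (assuming the fixed behavior, where index arrays are upcast to int64 only once the values no longer fit in int32; small matrices still use int32):

 import scipy
 import numpy as np
 from scipy.sparse import csr_matrix

 print(scipy.__version__)  # needs to be >= 0.14 for 64-bit sparse indices

 # On a fixed SciPy, a small matrix still gets int32 index arrays;
 # only matrices too large for int32 indexing get int64 arrays.
 m = csr_matrix((np.ones(3), [0, 1, 2], [0, 1, 2, 3]), shape=(3, 5))
 print(m.indptr.dtype, m.indices.dtype)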
