Save / load scipy sparse csr_matrix in portable data format

How do you save / load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) and needs to run on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but this did not work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), and I got the error:

 TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')). 

I then tried numpy.save and numpy.load, as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

+76
python numpy scipy
Jan 21 '12 at 18:20
10 answers

Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

    from scipy import sparse

    sparse.save_npz("yourmatrix.npz", your_matrix)
    your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument can also be a file object (i.e. the result of open) instead of the file name.
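A minimal sketch of the file-object variant, reusing your_matrix from the snippet above and opening the files in binary mode:

    from scipy import sparse

    with open("yourmatrix.npz", "wb") as f:
        sparse.save_npz(f, your_matrix)

    with open("yourmatrix.npz", "rb") as f:
        your_matrix_back = sparse.load_npz(f)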




Got a response from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then re-create the sparse matrix object with:

 new_csr = csr_matrix((data, indices, indptr), shape=(M, N)) 

For example:

    def save_sparse_csr(filename, array):
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        loader = np.load(filename)
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])
+105
Jan 23

Although you write that scipy.io.mmwrite and scipy.io.mmread did not work for you, I just want to add how they work. This question is the no. 1 Google hit, so I started with np.savez and pickle.dump before moving to the simple and obvious scipy functions. They work for me and shouldn't be dismissed by those who haven't tried them yet.

    from scipy import sparse, io

    m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
    m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

    io.mmwrite("test.mtx", m)
    del m

    newm = io.mmread("test.mtx")
    newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
    newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
    newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
+36
Mar 11 '15 at 21:55

Below is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

    from scipy.sparse import random

    matrix = random(1000000, 100000, density=0.001, format='csr')
    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
     with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite / io.mmread

    from scipy import io

    %time io.mmwrite('test_io.mtx', matrix)
    CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
    Wall time: 4min 39s

    %time matrix = io.mmread('test_io.mtx')
    CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
    Wall time: 2min 43s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
     with 100000000 stored elements in COOrdinate format>

    Filesize: 3.0G.

(note that the format has been changed from csr to coo).
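If you need a CSR matrix again after io.mmread, the conversion back is one call (as in the earlier mmread example); its cost is not included in the timings above:

    matrix = matrix.tocsr()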

np.savez / np.load

    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        # note that .npz extension is added automatically
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        # here we need to add .npz extension manually
        loader = np.load(filename + '.npz')
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])

    %time save_sparse_csr('test_savez', matrix)
    CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
    Wall time: 2.74 s

    %time matrix = load_sparse_csr('test_savez')
    CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
    Wall time: 1.73 s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
     with 100000000 stored elements in Compressed Sparse Row format>

    Filesize: 1.1G.

cPickle

    import cPickle as pickle

    def save_pickle(matrix, filename):
        with open(filename, 'wb') as outfile:
            pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

    def load_pickle(filename):
        with open(filename, 'rb') as infile:
            matrix = pickle.load(infile)
        return matrix

    %time save_pickle(matrix, 'test_pickle.mtx')
    CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
    Wall time: 1.15 s

    %time matrix = load_pickle('test_pickle.mtx')
    CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
    Wall time: 1.37 s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
     with 100000000 stored elements in Compressed Sparse Row format>

    Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it did not work for a 2.7M x 50K matrix with 270M non-zero values. np.savez worked fine.

Conclusion

(based on this simple test for CSR matrices) cPickle is the fastest method, but it does not work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file and restores to the wrong format. So np.savez is the winner here.

+25
Feb 07 '17 at 23:06

Assuming you have scipy on both machines, you can just use pickle .

However, be sure to specify the binary protocol when pickling numpy arrays. Otherwise you will end up with a huge file.

Anyway, you should do this:

    import cPickle as pickle
    import numpy as np
    import scipy.sparse

    # Just for testing, let's make a dense array and convert it to a csr_matrix
    x = np.random.random((10,10))
    x = scipy.sparse.csr_matrix(x)

    with open('test_sparse_array.dat', 'wb') as outfile:
        pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

Then you can load it with:

    import cPickle as pickle

    with open('test_sparse_array.dat', 'rb') as infile:
        x = pickle.load(infile)
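Since the whole point is moving between Python 3 and Python 2, a version-agnostic import (my addition, not part of the original answer) keeps the same code working on both:

    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3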
+11
Jan 21

With scipy 0.19.0, you can save and load sparse matrices as follows:

    from scipy import sparse

    data = sparse.csr_matrix((3, 4))

    # Save
    sparse.save_npz('data_sparse.npz', data)

    # Load
    data = sparse.load_npz("data_sparse.npz")
+9
Apr 28 '17 at 10:22

Adding my two cents: for me, npz is not portable, since I cannot use it to export the matrix easily to non-Python clients (e.g. PostgreSQL; happy to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you get when you print() a sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code spits out CSV output. You can adapt it to other representations.

    import numpy as np

    def csr_matrix_tuples(m):
        # not using unique will lag on empty elements
        uindptr, uindptr_i = np.unique(m.indptr, return_index=True)
        for i, (start_index, end_index) in zip(uindptr_i, zip(uindptr[:-1], uindptr[1:])):
            for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
                yield (i, j, data)

    for i, j, data in csr_matrix_tuples(my_csr_matrix):
        print(i, j, data, sep=',')

This is about 2 times slower than save_npz in the current implementation, from what I tested.

+2
Apr 09 '19 at 0:09

This is what I used to save a lil_matrix.

    import numpy as np
    from scipy.sparse import lil_matrix

    def save_sparse_lil(filename, array):
        # use np.savez_compressed(..) for compression
        np.savez(filename, dtype=array.dtype.str, data=array.data,
                 rows=array.rows, shape=array.shape)

    def load_sparse_lil(filename):
        loader = np.load(filename)
        result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
        result.data = loader["data"]
        result.rows = loader["rows"]
        return result
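A caveat not in the original answer: .data and .rows of a lil_matrix are object arrays, which np.savez stores in pickled form, so on newer NumPy (1.16.3+) loading them back requires passing allow_pickle=True inside load_sparse_lil:

    loader = np.load(filename, allow_pickle=True)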

I have to say, though, that I found NumPy's np.load(..) very slow here. The following is my current solution, which I find much faster:

    from scipy.sparse import lil_matrix
    import numpy as np
    import json

    def lil_matrix_to_dict(myarray):
        # .tolist() so the object arrays are JSON-serializable
        result = {
            "dtype": myarray.dtype.str,
            "shape": myarray.shape,
            "data": myarray.data.tolist(),
            "rows": myarray.rows.tolist()
        }
        return result

    def lil_matrix_from_dict(mydict):
        result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
        result.data = np.array(mydict["data"])
        result.rows = np.array(mydict["rows"])
        return result

    def load_lil_matrix(filename):
        result = None
        with open(filename, "r", encoding="utf-8") as infile:
            mydict = json.load(infile)
            result = lil_matrix_from_dict(mydict)
        return result

    def save_lil_matrix(filename, myarray):
        with open(filename, "w", encoding="utf-8") as outfile:
            mydict = lil_matrix_to_dict(myarray)
            json.dump(mydict, outfile)
+1
Dec 27 '16 at 18:31

I was asked to send the matrix in a simple and general format:

 <x,y,value> 

I ended up with this:

    import numpy as np

    def save_sparse_matrix(m, filename):
        thefile = open(filename, 'w')
        nonZeros = np.array(m.nonzero())
        for entry in range(nonZeros.shape[1]):
            thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry],
                                          m[nonZeros[0, entry], nonZeros[1, entry]]))
        thefile.close()
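The answer only shows the writer; a minimal loader sketch (a hypothetical helper, assuming the same x,y,value text format and that the target shape is known) could look like this:

    import numpy as np
    from scipy.sparse import csr_matrix

    def load_sparse_matrix(filename, shape):
        # read back the x,y,value triples written by save_sparse_matrix
        rows, cols, vals = [], [], []
        with open(filename) as infile:
            for line in infile:
                r, c, v = line.strip().split(',')
                rows.append(int(r))
                cols.append(int(c))
                vals.append(float(v))
        return csr_matrix((vals, (rows, cols)), shape=shape)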
0
Jan 15 '17 at 12:45

This works for me:

    import numpy as np
    import scipy.sparse as sp

    x = sp.csr_matrix([1,2,3])
    y = sp.csr_matrix([2,3,4])
    np.savez(file, x=x, y=y)
    npz = np.load(file)

    >>> npz['x'].tolist()
    <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>

    >>> npz['x'].tolist().toarray()
    array([[1, 2, 3]], dtype=int64)

The trick is to call .tolist() to convert the 0-dimensional object array back into the original object.
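A caveat not in the original answer: saving whole sparse matrices with np.savez pickles them as object arrays, so with NumPy 1.16.3+ the load call needs allow_pickle=True:

    npz = np.load(file, allow_pickle=True)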

0
Aug 26 '19 at 14:34


