Ignoring duplicate entries in a sparse matrix

I tried to initialize csc_matrix and csr_matrix from a list of values (data, (rows, cols)), as the documentation suggests.

 sparse = csc_matrix((data, (rows, cols)), shape=(n, n)) 

The problem is that the method I actually use for generating data, rows and cols introduces duplicates for some points. By default, scipy sums the values of duplicate entries. However, in my case these duplicates have exactly the same value in data for a given (row, col).

What I'm trying to achieve is to make scipy ignore a duplicate entry if one already exists, rather than sum them.

Setting aside the fact that I could improve the generation algorithm to avoid producing duplicates: is there a parameter or some other way to create a sparse matrix that ignores duplicates?

For example, the two entries data = [4, 4]; cols = [1, 1]; rows = [1, 1] generate a sparse matrix whose value at (1, 1) is 8, while the desired value is 4.

 >>> c = csc_matrix(([4, 4], ([1, 1], [1, 1])), shape=(3, 3))
 >>> c.todense()
 matrix([[0, 0, 0],
         [0, 8, 0],
         [0, 0, 0]])

I also know that I could filter them out with numpy's unique on the 2-dimensional index pairs, but the lists are quite large, so this is not really a viable option.

Another way to answer the question would be: is there a way to specify what to do with duplicates, i.e. keep the min or max instead of the default sum?

1 answer

Creating an intermediate dok matrix works in your example:

 In [410]: c = sparse.coo_matrix((data, (cols, rows)), shape=(3, 3)).todok().tocsc()
 In [411]: c.A
 Out[411]:
 array([[0, 0, 0],
        [0, 4, 0],
        [0, 0, 0]], dtype=int32)
A coo matrix puts your input arrays into its data, col and row attributes unchanged. Summation does not occur until it is converted to csc.
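A minimal sketch of this (with data/rows/cols taken from the question's example) shows the duplicates sitting untouched in the coo attributes until conversion:

```python
import numpy as np
from scipy import sparse

data = np.array([4, 4])
rows = np.array([1, 1])
cols = np.array([1, 1])

coo = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
print(coo.data)      # both duplicate entries still present: [4 4]

csc = coo.tocsc()    # conversion is where the duplicates get summed
print(csc[1, 1])     # 8
```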

todok loads the dictionary directly from coo attributes. It creates an empty dok matrix and fills it:

 dok.update(izip(izip(self.row,self.col),self.data)) 

So, if there are duplicate (row, col) keys, the last value is the one that remains (izip here is Python 2's itertools.izip; on Python 3 it would be the builtin zip). This relies on standard Python dictionary hashing to find the unique keys.
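The same last-one-wins behavior can be seen with a plain dict (a sketch of the idea, not scipy's actual code path):

```python
rows = [0, 1, 2, 1]
cols = [0, 1, 1, 1]
data = [1, 4, 2, 9]

# dict keeps only the last value seen for each (row, col) key
d = dict(zip(zip(rows, cols), data))
print(d[(1, 1)])  # 9: the later duplicate replaced the earlier 4
```

Since the asker's duplicates carry identical values, which copy wins does not matter here.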


You can use np.unique. I had to build a special object array, because unique works on 1d arrays and we are indexing in 2d.

 In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1], [0,1,1,1,2], [0,1,2,1,1]]]
 In [480]: x = np.zeros(cols.shape, dtype=object)
 In [481]: x[:] = list(zip(rows, cols))
 In [482]: x
 Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)
 In [483]: i = np.unique(x, return_index=True)[1]
 In [484]: i
 Out[484]: array([0, 1, 4, 2], dtype=int32)
 In [485]: c1 = sparse.csc_matrix((data[i], (cols[i], rows[i])), shape=(3, 3))
 In [486]: c1.A
 Out[486]:
 array([[1, 0, 0],
        [0, 4, 2],
        [0, 1, 0]], dtype=int32)
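With a newer NumPy (1.13 added an axis parameter to np.unique), the object-array workaround can be avoided; a sketch using the same input arrays:

```python
import numpy as np
from scipy import sparse

data = np.array([1, 4, 2, 4, 1])
cols = np.array([0, 1, 1, 1, 2])
rows = np.array([0, 1, 2, 1, 1])

# unique over rows of the stacked (row, col) pairs
pairs = np.stack([rows, cols], axis=1)
_, i = np.unique(pairs, axis=0, return_index=True)

c1 = sparse.csc_matrix((data[i], (cols[i], rows[i])), shape=(3, 3))
print(c1.toarray())
```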

I have no idea which approach is faster.


An alternative way to get the unique index, from liuengo's link:

 rc = np.vstack([rows, cols]).T.copy()
 dt = rc.dtype.descr * 2
 i = np.unique(rc.view(dt), return_index=True)[1]

rc must own its data in order to change the dtype with a view, hence the .T.copy().

 In [554]: rc.view(dt)
 Out[554]:
 array([[(0, 0)],
        [(1, 1)],
        [(2, 1)],
        [(1, 1)],
        [(1, 2)]], dtype=[('f0', '<i4'), ('f1', '<i4')])
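As for the side question (keeping min or max instead of the sum): there is no built-in parameter for this, but one hedged sketch would aggregate the duplicates before building the matrix, using np.unique with return_inverse plus the unbuffered np.minimum.at (names and the column-encoding trick below are my own, not from scipy):

```python
import numpy as np
from scipy import sparse

rows = np.array([1, 1, 2])
cols = np.array([1, 1, 1])
data = np.array([4, 7, 2])
n = 3  # matrix is n x n

# encode each (row, col) pair as a single integer key
keys = rows.astype(np.int64) * n + cols
uniq, inv = np.unique(keys, return_inverse=True)

# min-scatter the data onto the unique keys
out = np.full(uniq.shape, np.iinfo(data.dtype).max)
np.minimum.at(out, inv, data)

c = sparse.csc_matrix((out, (uniq // n, uniq % n)), shape=(n, n))
print(c[1, 1])  # 4 (min of the duplicates 4 and 7)
```

Replacing np.minimum.at with np.maximum.at (and the fill value with the dtype's min) gives the max variant.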

Source: https://habr.com/ru/post/1214001/

