How to calculate the variance of a sparse matrix column in Scipy?

I have a big scipy.sparse.csc_matrix and I want to normalize it. That is, subtract the column mean from each element and divide by the column's standard deviation (std).

scipy.sparse.csc_matrix has .mean(), but is there an efficient way to calculate the variance or std?

+8
python numpy scipy
2 answers

You can calculate the variance yourself from the mean, using the following formula:

 Var[X] = E[X^2] - (E[X])^2 

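E[X] denotes the mean. To compute E[X^2], square the csc_matrix elementwise (for example with X.multiply(X)) and then take the mean of the result. To get (E[X])^2, simply square the result of the mean function applied to the original matrix. For per-column values, pass axis=0 to mean in both cases. The standard deviation is the square root of the variance.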
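The formula above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the helper name column_variance and the small example matrix are made up for this sketch, and it computes the population variance (dividing by the number of rows), which matches NumPy's default for var().

```python
import numpy as np
from scipy import sparse


def column_variance(X):
    """Per-column population variance of a sparse matrix via E[X^2] - (E[X])^2.

    column_variance is a hypothetical helper name used only for this sketch.
    """
    # Column means; np.asarray(...).ravel() flattens the 1 x n result to shape (n,).
    mean = np.asarray(X.mean(axis=0)).ravel()
    # X.multiply(X) squares elementwise and keeps the result sparse.
    mean_of_squares = np.asarray(X.multiply(X).mean(axis=0)).ravel()
    return mean_of_squares - mean ** 2


# Small made-up example matrix (3 rows, 2 columns).
X = sparse.csc_matrix(np.array([[1.0, 0.0],
                                [0.0, 2.0],
                                [3.0, 0.0]]))

var = column_variance(X)
std = np.sqrt(var)
```

Because the formula subtracts two similar quantities, rounding can in principle produce a tiny negative variance for nearly constant columns. Clipping with np.maximum(var, 0.0) before taking the square root is one way to guard against that.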

+5

An efficient way is actually to convert the entire matrix to a dense array and then standardize it in the usual way, column by column, with

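    X = X.toarray()        # convert to a dense ndarray (assumes a float dtype)
    X -= X.mean(axis=0)    # subtract each column's mean
    X /= X.std(axis=0)     # divide by each column's standard deviation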

As @Sebastian noted in his comments, standardization destroys the sparsity structure at the subtraction stage, because every zero entry becomes minus the column mean, which is generally nonzero. There is therefore no benefit to keeping the matrix in a sparse format.
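A small sketch of why centering removes the sparsity, using a made-up 5 x 5 identity matrix as the example input. Each column has mean 0.2, so after subtraction every entry is either 0.8 or -0.2 and none is zero.

```python
import numpy as np
from scipy import sparse

# Made-up example: a 5 x 5 sparse identity matrix with 5 stored nonzeros.
X = sparse.eye(5, format="csc")
nnz_before = X.nnz

# Center each column on a dense copy of the data.
dense = X.toarray()
centered = dense - dense.mean(axis=0)
nnz_after = np.count_nonzero(centered)

print(nnz_before, nnz_after)
```

On a real matrix the effect is the same: every column with a nonzero mean becomes fully populated after centering.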

+3