Running a PCA on a large sparse matrix using sklearn

I am trying to apply PCA to a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle a sparse matrix in scipy sparse format: Apply PCA on a very large sparse matrix

However, I always get an error. Can someone point out what I am doing wrong?

The input matrix 'X_train' contains numbers in float64:

>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>

I am trying to do:

>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

If I try to convert it to a dense matrix, I run out of memory:

>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
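For scale, a rough back-of-the-envelope estimate (computed from the shape reported above, not taken from the traceback) shows why toarray() cannot succeed:

>>> n_samples, n_features = 2365436, 1617899
>>> # a dense float64 copy needs 8 bytes per entry
>>> n_samples * n_features * 8 / float(2**40)   # size in TiB
27.84...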
1 answer

Due to the nature of the PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:

>>> import numpy as np
>>> from scipy import sparse as sp
>>> from sklearn.decomposition import TruncatedSVD

Create a random sparse matrix with 0.01% of its data as nonzero.

 >>> X = sp.rand(1000, 1000, density=0.0001) 

Apply PCA to it:

>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)

Now check the results:

>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000

which suggests that 95,000 entries are non-zero; however,

>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000

shows that 99,481 entries are close to 0 (< 1e-15), but not exactly 0.

What this means is that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you end up with a 1e8 x n_features dense matrix (in your example, 1e8 x 1617899), which, of course, cannot be held in memory.
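For a sense of scale with a more modest number of components (the value 100 below is an arbitrary illustration, and the sizes are rough estimates based on the shapes quoted in the question):

>>> n_samples, n_features, n_components = 2365436, 1617899, 100
>>> # components_ is a dense (n_components, n_features) array
>>> n_components * n_features * 8 / float(2**30)   # GiB
1.20...
>>> # the transformed data is a dense (n_samples, n_components) array
>>> n_samples * n_components * 8 / float(2**30)    # GiB
1.76...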

I am not an expert statistician, but I believe there is currently no workaround for this in scikit-learn: it is not a problem with scikit-learn's implementation, it is just the mathematical definition of Sparse PCA (by means of sparse SVD) that makes the result dense.

The only workaround that might work for you is to start with a small number of components and increase it until you find a balance between the memory you can afford and the percentage of variance explained, which you can calculate as follows:

 >>> clf.explained_variance_ratio_.sum() 
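A minimal sketch of that trial-and-error loop might look like this (the component counts and the 0.9 threshold are arbitrary illustrative choices, and each iteration refits from scratch, which is simple but not the cheapest approach):

>>> from sklearn.decomposition import TruncatedSVD
>>> for n_components in (100, 200, 500, 1000):
...     clf = TruncatedSVD(n_components)
...     Xpca = clf.fit_transform(X_train)   # X_train is the sparse matrix from the question
...     explained = clf.explained_variance_ratio_.sum()
...     print n_components, explained
...     if explained >= 0.9:                # arbitrary target, adjust to taste
...         break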
