Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
>>> import numpy as np
Create a random sparse matrix with 0.01% of its data as nonzero.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print(np.count_nonzero(Xpca), Xpca.size)
95000 100000
which suggests that 95,000 entries are non-zero. However,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
(99481, 100000)
99,481 of those elements are close to 0 (within 1e-15), but not exactly 0.
This means that, for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you end up with a dense 1e8 x n_features matrix (in your example, 1e8 x 1617899), which, of course, cannot be stored in memory.
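For a rough sense of scale, here is a back-of-the-envelope check (assuming the usual 8-byte float64 entries):
>>> n_components = int(1e8)
>>> n_features = 1617899
>>> n_components * n_features * 8 / 1e15   # size of the dense result in petabytes
1.2943192
That is roughly 1.3 petabytes just for the resulting dense matrix.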
I am not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem with scikit-learn's implementation; it is just the mathematical definition of their sparse PCA (by means of a sparse SVD) that makes the result dense.
The only workaround that may work for you is to start with a small number of components and increase it until you find a balance between the data you can keep in memory and the percentage of the variance explained, which you can calculate as follows:
>>> clf.explained_variance_ratio_.sum()
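As a minimal sketch of that loop (the 0.95 target and the candidate component counts below are arbitrary choices for illustration, not values from the question):
>>> target = 0.95  # hypothetical fraction of variance you want explained
>>> for n_components in (100, 200, 400, 800):  # arbitrary candidates, all < n_features
...     clf = TruncatedSVD(n_components)
...     Xpca = clf.fit_transform(X)  # X is the sparse matrix from above
...     explained = clf.explained_variance_ratio_.sum()
...     print(n_components, explained)
...     if explained >= target:
...         break
Keep in mind that each candidate re-runs the SVD from scratch, and that the size of the dense Xpca grows with n_samples x n_components, so watch its memory footprint as you scale up.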