Python PCA on a matrix too large to fit in memory

I have a csv that is 100,000 rows x 27,000 columns, which I am trying to run PCA on to produce a 100,000 rows x 300 columns matrix. The csv is 9 GB on disk. Here is what I am doing now:

    from sklearn.decomposition import PCA as RandomizedPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    X = pd.DataFrame.from_csv(dataset)
    Y = X.pop("Y_Level")
    X = (X - X.mean()) / (X.max() - X.min())
    Y = list(Y)
    dimensions = 300
    sklearn_pca = RandomizedPCA(n_components=dimensions)
    X_final = sklearn_pca.fit_transform(X)

When I run the above code, my program gets killed during the .from_csv step. I was able to get around this by splitting the csv into chunks of 10,000 rows, reading them in one by one, and then calling pd.concat. That gets me as far as the normalization step (X - X.mean()) ... before it gets killed again. Is my data simply too big for my MacBook, or is there a better way to do this? I would really like to use all the data I have for my machine learning application.
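In case it helps, here is roughly what my chunked-reading workaround looks like (just a sketch; the "Y_Level" label column is the same one as in the snippet above):

    import sys
    import pandas as pd

    dataset = sys.argv[1]

    # Read the 9 GB csv 10,000 rows at a time instead of in one call
    chunks = []
    for chunk in pd.read_csv(dataset, chunksize=10000):
        chunks.append(chunk)

    # pd.concat still builds the full 100,000 x 27,000 DataFrame in memory,
    # which is why the normalization step below is where it now gets killed
    X = pd.concat(chunks)
    Y = X.pop("Y_Level")
    X = (X - X.mean()) / (X.max() - X.min())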


If I wanted to use incremental PCA, as suggested in the answer below, is this how I would do it?

    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000  # total_size is 100000
    dimensions = 300

    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)
    Y = []
    for chunk in reader:
        y = chunk.pop("virginica")
        Y = Y + list(y)
        sklearn_pca.partial_fit(chunk)

    X = ???  # This is where I'm stuck: how do I take my final pca and output it to X?
             # The normal transform method takes in an X, which I don't have because I
             # couldn't fit it into memory.

I cannot find good examples of this online.

+5
2 answers

Try splitting your data, or loading it in batches in your script, and fit your PCA incrementally with IncrementalPCA, using its partial_fit method on each batch.

    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 5 * 25000
    dimensions = 300

    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)

    # First pass: fit the PCA incrementally, one chunk at a time
    for chunk in reader:
        y = chunk.pop("Y")
        sklearn_pca.partial_fit(chunk)

    # Computed mean per feature
    mean = sklearn_pca.mean_
    # and stddev
    stddev = np.sqrt(sklearn_pca.var_)

    # Second pass: transform each chunk and stack the results
    Xtransformed = None
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        y = chunk.pop("Y")
        Xchunk = sklearn_pca.transform(chunk)
        if Xtransformed is None:
            Xtransformed = Xchunk
        else:
            Xtransformed = np.vstack((Xtransformed, Xchunk))
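As a side note, calling np.vstack repeatedly copies the whole accumulated array on every chunk. A sketch of the same second pass that collects the transformed chunks in a list and stacks them once at the end (reusing the variable names from the snippet above):

    # Same second pass, but with a single stacking step at the end
    transformed_chunks = []
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        chunk.pop("Y")
        # transform() centers each chunk with the mean learned by partial_fit
        transformed_chunks.append(sklearn_pca.transform(chunk))
    Xtransformed = np.vstack(transformed_chunks)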

Useful link

+9

PCA needs to compute a correlation matrix, which would be 100,000 x 100,000. If the data is stored as double-precision floats, that is 80 GB. I would wager that your MacBook does not have 80 GB of RAM.
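As a quick sanity check on that number (just the arithmetic, assuming 8 bytes per double):

    # 100,000 x 100,000 doubles at 8 bytes each
    n = 100_000
    print(n * n * 8 / 1e9)   # 80.0 GB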

The PCA transformation matrix is likely to be almost the same if computed on a random subset of reasonable size.
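A sketch of that idea, assuming the label column is "Y_Level" as in the question and using the first 10,000 rows as a stand-in for a proper random sample:

    import sys
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    dataset = sys.argv[1]

    # Fit the components on a subset that fits in memory
    # (nrows takes the first rows; a real random sample could use skiprows)
    subset = pd.read_csv(dataset, nrows=10000)
    subset.pop("Y_Level")
    pca = PCA(n_components=300)
    pca.fit(subset)

    # Apply the learned components to the full file, chunk by chunk
    parts = []
    for chunk in pd.read_csv(dataset, chunksize=10000):
        chunk.pop("Y_Level")
        parts.append(pca.transform(chunk))
    X_final = np.vstack(parts)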

0
