Python PCA on a matrix too large to fit in memory

I have a csv that is 100,000 rows x 27,000 columns, which I am trying to run PCA on to produce a 100,000 rows x 300 columns matrix. The csv is 9 GB on disk. Here is what I am doing now:

    from sklearn.decomposition import PCA as RandomizedPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    X = pd.DataFrame.from_csv(dataset)
    Y = X.pop("Y_Level")
    X = (X - X.mean()) / (X.max() - X.min())
    Y = list(Y)
    dimensions = 300
    sklearn_pca = RandomizedPCA(n_components=dimensions)
    X_final = sklearn_pca.fit_transform(X)

When I run the above code, my program gets killed during the .from_csv step. I was able to get around this by splitting the csv into chunks of 10,000 rows, reading them in one by one, and then calling pd.concat. That gets me as far as the normalization step (X - X.mean()) ... before it gets killed again. Is my data simply too big for my MacBook, or is there a better way to do this? I would really like to use all the data I have for my machine learning application.
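In case it helps, here is roughly what my chunked-reading workaround looks like (just a sketch; the "Y_Level" label column is the same one as in the snippet above):

    import sys
    import pandas as pd

    dataset = sys.argv[1]

    # Read the 9 GB csv 10,000 rows at a time instead of in one call
    chunks = []
    for chunk in pd.read_csv(dataset, chunksize=10000):
        chunks.append(chunk)

    # pd.concat still builds the full 100,000 x 27,000 DataFrame in memory,
    # which is why the normalization step below is where it now gets killed
    X = pd.concat(chunks)
    Y = X.pop("Y_Level")
    X = (X - X.mean()) / (X.max() - X.min())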


If I wanted to use incremental PCA, as suggested in the answer below, is this how I would do it?

    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000  # total_size is 100000
    dimensions = 300

    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)
    Y = []
    for chunk in reader:
        y = chunk.pop("virginica")
        Y = Y + list(y)
        sklearn_pca.partial_fit(chunk)

    X = ???  # This is where I'm stuck: how do I take my final pca and output it to X?
             # The normal transform method takes in an X, which I don't have because I
             # couldn't fit it into memory.

I cannot find good examples of this online.

+5
2 answers

Try splitting your data, or loading it in batches in your script, and fit your PCA incrementally with IncrementalPCA, using its partial_fit method on each batch.

    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 5 * 25000
    dimensions = 300

    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)

    # First pass: fit the PCA incrementally, one chunk at a time
    for chunk in reader:
        y = chunk.pop("Y")
        sklearn_pca.partial_fit(chunk)

    # Computed mean per feature
    mean = sklearn_pca.mean_
    # and stddev
    stddev = np.sqrt(sklearn_pca.var_)

    # Second pass: transform each chunk and stack the results
    Xtransformed = None
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        y = chunk.pop("Y")
        Xchunk = sklearn_pca.transform(chunk)
        if Xtransformed is None:
            Xtransformed = Xchunk
        else:
            Xtransformed = np.vstack((Xtransformed, Xchunk))
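As a side note, calling np.vstack repeatedly copies the whole accumulated array on every chunk. A sketch of the same second pass that collects the transformed chunks in a list and stacks them once at the end (reusing the variable names from the snippet above):

    # Same second pass, but with a single stacking step at the end
    transformed_chunks = []
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        chunk.pop("Y")
        # transform() centers each chunk with the mean learned by partial_fit
        transformed_chunks.append(sklearn_pca.transform(chunk))
    Xtransformed = np.vstack(transformed_chunks)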

Useful link

+9

PCA needs to compute a correlation matrix, which would be 100,000 x 100,000. If the data is stored as double-precision floats, that is 80 GB. I would wager that your MacBook does not have 80 GB of RAM.
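As a quick sanity check on that number (just the arithmetic, assuming 8 bytes per double):

    # 100,000 x 100,000 doubles at 8 bytes each
    n = 100_000
    print(n * n * 8 / 1e9)   # 80.0 GB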

The PCA transformation matrix is likely to be almost the same if computed on a random subset of reasonable size.
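A sketch of that idea, assuming the label column is "Y_Level" as in the question and using the first 10,000 rows as a stand-in for a proper random sample:

    import sys
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    dataset = sys.argv[1]

    # Fit the components on a subset that fits in memory
    # (nrows takes the first rows; a real random sample could use skiprows)
    subset = pd.read_csv(dataset, nrows=10000)
    subset.pop("Y_Level")
    pca = PCA(n_components=300)
    pca.fit(subset)

    # Apply the learned components to the full file, chunk by chunk
    parts = []
    for chunk in pd.read_csv(dataset, chunksize=10000):
        chunk.pop("Y_Level")
        parts.append(pca.transform(chunk))
    X_final = np.vstack(parts)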

0
