I have a csv of 100,000 rows x 27,000 columns, on which I am trying to run PCA to produce a matrix of 100,000 rows x 300 columns. The csv is 9 GB in size. Here is what I am doing now:
    from sklearn.decomposition import PCA as RandomizedPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    # read_csv replaces the removed pd.DataFrame.from_csv; index_col=0 matches its old default
    X = pd.read_csv(dataset, index_col=0)
    Y = X.pop("Y_Level")
    # scale each column to zero mean and unit range
    X = (X - X.mean()) / (X.max() - X.min())
    Y = list(Y)

    dimensions = 300
    # svd_solver="randomized" gives the behaviour of the old RandomizedPCA class
    sklearn_pca = RandomizedPCA(n_components=dimensions, svd_solver="randomized")
    X_final = sklearn_pca.fit_transform(X)
When I run the above code, my program gets killed at the CSV-reading step. I was able to get around this by splitting the csv into chunks of 10,000 rows, reading them in one by one, and then calling pd.concat. That lets me get as far as the normalization step, (X - X.mean()) ..., before the program is killed again. Is my data simply too big for my MacBook, or is there a better way to do this? I would really like to use all the data I have for my machine learning application.
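For reference, the chunked reading I ended up with looks roughly like this (the chunk size is my own choice, and the concat still builds the full 100,000 x 27,000 frame in memory):

    import sys
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000

    # Read the 9 GB csv in 10,000-row chunks, then stitch them back into one frame.
    # This gets past the read step, but the full matrix is still in memory afterwards.
    chunks = pd.read_csv(dataset, index_col=0, chunksize=chunksize_)
    X = pd.concat(chunks)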
If I wanted to use an incremental PCA, as suggested below, how would I actually do it? Here is what I have so far:
    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000
I cannot find good examples of this on the Internet.
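My best guess, from the IncrementalPCA docs, is to make two passes over the file: call partial_fit on each chunk, then transform each chunk and stack the results. I have left out the (X - X.mean()) / (X.max() - X.min()) normalization, since that needs column statistics over the whole matrix and would presumably require an extra pass. Is something like this right?

    from sklearn.decomposition import IncrementalPCA
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000
    dimensions = 300

    sklearn_pca = IncrementalPCA(n_components=dimensions, batch_size=chunksize_)

    # First pass: learn the components incrementally, one 10,000-row chunk at a time.
    for chunk in pd.read_csv(dataset, index_col=0, chunksize=chunksize_):
        features = chunk.drop(columns=["Y_Level"])
        sklearn_pca.partial_fit(features)

    # Second pass: project each chunk onto the 300 components and stack the pieces.
    X_parts, Y = [], []
    for chunk in pd.read_csv(dataset, index_col=0, chunksize=chunksize_):
        Y.extend(chunk.pop("Y_Level"))
        X_parts.append(sklearn_pca.transform(chunk))
    X_final = np.vstack(X_parts)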