Incremental PCA Big Data

I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA before it. My problem is that the matrix I'm trying to load is too large to fit into RAM. It is stored in an HDF5 file as a dataset of shape roughly (1,000,000, 1,000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads data in batches, but it seems to be trying to load the entire dataset, which doesn't help. How is this library intended to be used? Is the HDF5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5", "r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for the help.

+9
python scikit-learn pca hdf5 bigdata
2 answers

You are probably trying to load the entire dataset into RAM. At 4 bytes per float32, 1,000,000 × 1,000 values take about 3.7 GiB. That can be a problem on machines with only 4 GB of RAM. To confirm that this is really the problem, try creating an array of this size:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM or you need to process your dataset one chunk at a time.
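For reference, the back-of-the-envelope arithmetic in plain Python (nothing here beyond the shape and dtype stated in the question):

>>> rows, cols, bytes_per_value = 1000000, 1000, 4
>>> rows * cols * bytes_per_value / float(2 ** 30)   # size in GiB
3.725290298461914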

With h5py datasets, we just need to avoid passing the entire dataset to our methods, and instead pass slices of it, one chunk at a time.

Since I don't have your data, let me start by creating a random dataset of the same size:

import h5py
import numpy as np

h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
# write 1000 rows at a time so the full array is never held in memory
for i in range(1000):
    h5['data'][i * 1000:(i + 1) * 1000] = np.random.rand(1000, 1000)
h5.close()

This creates a file of about 3.8 GB.
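As a quick sanity check (reusing the file we just created), slicing an h5py dataset reads only the requested rows into memory as a NumPy array, so a single chunk stays small:

import h5py

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']      # no data is read yet, this is just a handle
chunk = data[0:1000]   # reads only the first 1000 rows into memory
print(type(chunk), chunk.shape, chunk.nbytes)  # ndarray, (1000, 1000), ~4 MB
h5.close()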

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now, if we try to run your code, we get a MemoryError. (press Ctrl-D to exit the new bash session and reset the limit later)

Now let's solve the problem. We will create an IncrementalPCA object and call its .partial_fit() method many times, feeding it one chunk of the dataset at a time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']  # this is fine, the dataset is not fetched into memory yet
n = data.shape[0]  # how many rows we have in the dataset
chunk_size = 1000  # how many rows we feed to IPCA at a time, a divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n // chunk_size):
    ipca.partial_fit(data[i * chunk_size:(i + 1) * chunk_size])

It seems to work for me, and if I look at what top reports, the memory allocation remains below 200M.
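If you also need the projected data, not just the fitted components, you can transform the dataset chunk by chunk in the same way and write the result to a new HDF5 dataset. A sketch along the same lines (the output file and dataset names are my own choice, and it reuses ipca, data, n and chunk_size from the snippet above):

# write the 10-component projection chunk by chunk to a new HDF5 file
out = h5py.File('rand-1Mx1K-pca.h5', 'w')
reduced = out.create_dataset('data_pca', shape=(n, 10), dtype=np.float32)
for i in range(0, n // chunk_size):
    chunk = data[i * chunk_size:(i + 1) * chunk_size]
    reduced[i * chunk_size:(i + 1) * chunk_size] = ipca.transform(chunk)
out.close()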

+20

I'm sitting with a similar problem: a large dataset, a small laptop, and things to cluster.

However, with this solution I see a potential problem with scaling the data, since scaling can't be done per chunk alone; it has to be computed over the entire dataset. Any ideas?

Nice!
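One way to address the scaling concern above, assuming standardization (zero mean, unit variance) is what's needed: in recent scikit-learn versions, StandardScaler also has a partial_fit method, so the global statistics can be accumulated chunk by chunk in a first pass and applied in a second pass before feeding IncrementalPCA. A minimal sketch, reusing the file and variable names from the answer above:

import h5py
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']
n, chunk_size = data.shape[0], 1000

# first pass: accumulate mean and variance over the whole dataset
scaler = StandardScaler()
for i in range(0, n // chunk_size):
    scaler.partial_fit(data[i * chunk_size:(i + 1) * chunk_size])

# second pass: scale each chunk with the global statistics, then fit IPCA
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n // chunk_size):
    chunk = scaler.transform(data[i * chunk_size:(i + 1) * chunk_size])
    ipca.partial_fit(chunk)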

0
