You are probably trying to load the entire data set into RAM. At 32 bits (4 bytes) per float32, 1,000,000 × 1,000 values come to about 4 × 10^9 bytes, i.e. roughly 3.7 GiB. That can be a problem on a machine with 4 GiB of RAM. To make sure this is really the problem, try creating an array of that size:
>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)
If you see a MemoryError, you either need more RAM, or you need to process your data set one piece at a time.
With h5py datasets we just need to avoid passing the entire dataset to our methods, and pass slices of it instead, one at a time.
Since I don't have your data, let me start by creating a random data set of the same size:
import h5py
import numpy as np

h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
for i in range(1000):
    # fill the dataset 1000 rows at a time, so we never hold it all in memory
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()
This creates a file of about 3.8 GiB.
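As a quick sanity check, you can open the file and read a single slice. h5py only fetches the rows you index, not the whole dataset (a minimal sketch against the file created above):

import h5py

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']      # no data is read yet, this is just a handle
chunk = data[0:1000]   # only these 1000 rows are loaded into RAM
print(chunk.shape)     # (1000, 1000)
h5.close()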
Now, if we are on Linux, we can limit how much memory is available to our program:
$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152
Now, if we try to run your code, we get a MemoryError. (Press Ctrl-D to exit the new bash session and reset the limit when you are done.)
Now let's try to solve the problem. We will create an IncrementalPCA object, call its .partial_fit() method many times, and feed it a different fragment of the data set on each call.
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']  # it's ok, the dataset is not fetched into memory yet
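From here, a minimal sketch of the incremental fit could look like this. The chunk size of 1,000 rows and n_components=10 are my own choices, not values from your code; pick whatever fits your memory budget and analysis:

n = data.shape[0]   # 1,000,000 rows
chunk_size = 1000   # rows fed to IncrementalPCA per call (assumed; divides n here)
ipca = IncrementalPCA(n_components=10)  # n_components is an assumption

for i in range(n // chunk_size):
    # each slice loads only chunk_size rows from the HDF5 file
    ipca.partial_fit(data[i*chunk_size:(i+1)*chunk_size])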
It seems to work for me: if I watch what top reports while it runs, the memory allocation stays below 200 MB.
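Once the fit is done you will usually want the projected data as well. Here is a sketch of transforming the data chunk by chunk, again without loading it all; it reuses n, chunk_size and ipca from the sketch above, and the output file name is just something I made up:

out = h5py.File('rand-1Mx1K-pca.h5', 'w')  # hypothetical output file name
reduced = out.create_dataset('reduced', shape=(n, ipca.n_components_), dtype=np.float32)

for i in range(n // chunk_size):
    # project one chunk at a time onto the learned components
    chunk = data[i*chunk_size:(i+1)*chunk_size]
    reduced[i*chunk_size:(i+1)*chunk_size] = ipca.transform(chunk)

out.close()
h5.close()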