How to load large .mat files in python?

I have a very large .mat file (~1.3 GB) that I am trying to load in my Python code (IPython notebook). I tried:

    import scipy.io as sio
    very_large = sio.loadmat('very_large.mat')

My laptop, which has 8 GB of RAM, hangs. I left the system monitor open and watched memory consumption climb steadily to 7 GB, at which point the system froze.

What am I doing wrong? Any suggestion / work around?

EDIT:

Data Details: Here is the data link: http://ufldl.stanford.edu/housenumbers/

The specific file I am interested in is extra_32x32.mat. From the description: Loading the .mat files creates two variables: X, which is a four-dimensional matrix containing the images, and y, which is a vector of class labels. To access the images, X(:,:,:,i) gives the i-th 32-by-32 RGB image, with class label y(i).

So, for example, a smaller .mat file from the same page (test_32x32.mat) when loaded as follows:

    import numpy as np
    import scipy.io as sio

    SVHN_full_test_data = sio.loadmat('test_32x32.mat')
    print("\nData set = SVHN_full_test_data")
    for key, value in SVHN_full_test_data.items():
        print("Type of", key, ":", type(value))
        # Arrays are summarized by shape; metadata entries are printed in full.
        if isinstance(value, np.ndarray):
            print("Shape of", key, ":", value.shape)
        else:
            print("Content:", value)

gives:

    Data set = SVHN_full_test_data
    Type of y : <type 'numpy.ndarray'>
    Shape of y : (26032, 1)
    Type of X : <type 'numpy.ndarray'>
    Shape of X : (32, 32, 3, 26032)
    Type of __version__ : <type 'str'>
    Content: 1.0
    Type of __header__ : <type 'str'>
    Content: MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Mon Dec 5 21:18:15 2011
    Type of __globals__ : <type 'list'>
    Content: []
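
The same kind of inspection can probably be done without pulling every array into memory. The following is a minimal sketch, assuming a MAT 5.0 file like the one in the header above; `summarize_mat` and `load_only` are hypothetical helper names, while `scipy.io.whosmat` and the `variable_names` argument of `loadmat` are existing SciPy features.

```python
import scipy.io as sio


def summarize_mat(path):
    """Map each variable name to (shape, MATLAB class name) via whosmat."""
    return {name: (shape, dtype) for name, shape, dtype in sio.whosmat(path)}


def load_only(path, names):
    """Load just the named variables, dropping the __header__-style entries."""
    data = sio.loadmat(path, variable_names=names)
    return {k: v for k, v in data.items() if not k.startswith("__")}
```

For example, `summarize_mat('test_32x32.mat')` should report the shapes of X and y, and `load_only('test_32x32.mat', ['y'])` should return only the label vector. This does not make X itself any smaller, so it helps with checking sizes rather than with loading the full image array.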
1 answer

This answer depends on two assumptions:

  • The .mat file is saved in the MAT v7.3 format (which appears to be hdf5-compatible, although MathWorks stops short of guaranteeing it), or can be saved by writing to hdf5 directly (for example with MATLAB's h5write(), or hdf5write() in older releases).

  • You can import and use third-party packages in Python, namely pandas.

An approach

Given these assumptions, the approach I would use is:

  • Make sure the .mat file is saved in an hdf5-compatible format. This may mean converting it with MATLAB's matfile(), which does not load the whole file into memory, or doing the conversion once on a machine with a large amount of RAM.

  • Use pandas to read part of the hdf5-format .mat file into a data frame.

  • Use the data frame for your subsequent analysis in Python.
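
The "read only part of the file" step can be sketched as follows. Note that this sketch swaps in h5py rather than pandas: `pandas.read_hdf` generally expects the PyTables layout that pandas itself writes, so a raw v7.3 file is usually easier to open with h5py. The helper name `read_batch` and the dataset name `"X"` are assumptions for illustration.

```python
import h5py


def read_batch(path, start, stop, name="X"):
    """Read rows [start, stop) along the first axis of one HDF5 dataset."""
    with h5py.File(path, "r") as f:
        dset = f[name]  # a handle to the on-disk dataset; no data is read yet
        # Slicing reads only the requested rows and returns a NumPy array.
        return dset[start:stop]
```

One thing to check before relying on this: MATLAB v7.3 files typically show up in h5py with the dimension order reversed, so an array that is 32 x 32 x 3 x N in MATLAB would be expected to appear as (N, 3, 32, 32), which is what makes slicing the first axis select a batch of images.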

Notes:

Pandas data frames generally work very well with numpy and scipy. So if you can read your data into a data frame, you can probably do what you want from there.

The answer to this SO question shows how to read only part of an hdf5 data file into memory (as a pandas data frame) at a time, based on a condition (an index range, or a logical condition like WHERE something = somethingelse).
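
A minimal sketch of that kind of partial read, assuming the data was written by pandas itself in table format (which needs the PyTables package installed). The key `"labels"`, the column name `label`, and both helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def write_labels_table(path, labels, key="labels"):
    """Store labels as a queryable table (format='table', indexed columns)."""
    df = pd.DataFrame({"label": np.asarray(labels).ravel()})
    df.to_hdf(path, key=key, format="table", data_columns=True)


def read_label_rows(path, where=None, start=None, stop=None, key="labels"):
    """Read only the rows matching a condition and/or a row range."""
    return pd.read_hdf(path, key=key, where=where, start=start, stop=stop)
```

For instance, `read_label_rows(path, where="label == 3")` selects by condition, and `read_label_rows(path, start=0, stop=1000)` selects by row range; in both cases only the selected rows end up in the returned data frame.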

Mini rant

MATLAB has supported v7.3 MAT-files for more than 12 years, yet it still does not use them as the default format for saving (v7.3 files take up more disk space, but are more widely usable). As a result, anyone using MATLAB's default settings will not produce v7.3 MAT-files. Twelve years on, we have plenty of disk space, but this still causes problems. It is time to update the default flag, MathWorks!

Hope this helps,

Tom
