Efficient storage of scipy / numpy arrays in dictionaries

BACKGROUND

The problem I'm working with is the following:

  • As part of the experiment I am developing for my research, I create a large number of large (4M length) arrays that are somewhat sparse, and can therefore be stored either as scipy.sparse.lil_matrix instances or simply as scipy.array instances (the gain / loss of space is not the issue here).

  • Each of these arrays has to be paired with a string (namely, a word) for the data to make sense, since they are semantic vectors representing the meaning of that string. I need to preserve this pairing.

  • The vectors for each word in the list are built one after another and saved to disk before moving on to the next word.

  • They should be stored on disk in a way that lets them be retrieved later with dictionary syntax. For example, if all the words are stored in a DB-like file, I need to be able to open this file and do something like vector = wordDB[word] .

CURRENT APPROACH

What I am doing now:

  • Using shelve to open a shelf named wordDB

  • Each time the vector (currently a lil_matrix from scipy.sparse ) for a word is built, storing it in the shelf: wordDB[word] = vector

  • When I need to use the vectors during the evaluation, doing the reverse: opening the shelf and retrieving each vector with vector = wordDB[word] as it is needed, so that not all vectors have to be held in RAM at once (which would be impossible).
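The workflow above can be sketched as follows. This is a minimal illustration, not my actual code: build_vector and the word list are hypothetical stand-ins, and the vectors here are tiny dense arrays rather than the real 4M-length sparse ones.

```python
import shelve
import numpy as np

# Hypothetical stand-in for building a semantic vector for a word;
# the real vectors are 4M-length sparse arrays.
def build_vector(word):
    return np.zeros(8)

words = ["cat", "dog"]

# Build each vector and write it to the shelf before moving on.
wordDB = shelve.open("wordDB")
for word in words:
    wordDB[word] = build_vector(word)
wordDB.close()

# Later, during evaluation, re-open the shelf and look vectors up
# one at a time, so they are never all in RAM at once.
wordDB = shelve.open("wordDB")
vector = wordDB["cat"]
wordDB.close()
```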

The above "solution" fits my needs in the sense that it solves the problem as stated. The trouble is that when I use this method to create and store vectors for a large number of words, I simply run out of disk space.

This is, as far as I can tell, because shelve pickles the stored data, which is not an efficient way to store large arrays, making this storage problem intractable with shelve for the number of words I need to handle.

PROBLEM

The question is: is there a way to store my set of arrays that would:

  • Save the arrays themselves in a compact binary format, akin to the .npy files generated by scipy.save ?

  • Satisfy my requirement that the data be read from disk as a dictionary, maintaining the mapping between words and arrays?

4 answers

As JoshAdel already suggested, I would go for HDF5; the easiest way to use it is h5py:

http://h5py.alfven.org/

You can attach several attributes to an array with a dictionary-like syntax:

 dset.attrs["Name"] = "My Dataset" 

where dset is your dataset, which can be sliced exactly like a numpy array, but in the background it does not load the whole array into memory.
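A minimal sketch of this idea, assuming one HDF5 dataset per word (the file name, word, and data here are hypothetical examples):

```python
import numpy as np
import h5py

# One dataset per word; the word itself can serve as the dataset name,
# and extra metadata goes into .attrs, which behaves like a dictionary.
with h5py.File("wordDB.h5", "w") as f:
    dset = f.create_dataset("house", data=np.arange(8, dtype=np.float64),
                            compression="gzip")
    dset.attrs["Name"] = "My Dataset"

# Reading back: slicing a dataset reads only the requested part
# from disk, not the whole array.
with h5py.File("wordDB.h5", "r") as f:
    first_half = f["house"][:4]          # partial read
    name = f["house"].attrs["Name"]
```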


I would suggest using scipy.save and keeping a dictionary mapping each word to its file name.
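A sketch of this approach, using numpy.save (of which scipy.save was a thin alias) plus a small pickled index; the directory layout and file names are hypothetical:

```python
import os
import pickle
import numpy as np

# Hypothetical layout: one .npy file per word, plus a small pickled
# index mapping each word to its file name.
outdir = "vectors"
os.makedirs(outdir, exist_ok=True)

index = {}
for i, word in enumerate(["cat", "dog"]):
    fname = os.path.join(outdir, "vec_%d.npy" % i)
    np.save(fname, np.full(8, i, dtype=np.float64))  # stand-in vector
    index[word] = fname

with open(os.path.join(outdir, "index.pkl"), "wb") as f:
    pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

# Dictionary-style lookup: consult the index, then load only that file.
with open(os.path.join(outdir, "index.pkl"), "rb") as f:
    index = pickle.load(f)
vector = np.load(index["dog"])
```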


Have you tried just pickling the dictionary directly with cPickle , using:

 import cPickle
 DD = dict()
 f = open('testfile.pkl', 'wb')
 cPickle.dump(DD, f, -1)
 f.close()

Alternatively, I would just store the vectors in one large multidimensional array using hdf5 or netcdf, if necessary, since this lets you open a large array without loading it all into memory at once and then grab the slices you need. You can then store the words as an additional group in the netcdf4 / hdf5 file and use shared indices to quickly associate the matching slice from each group, or simply name each group after a word and then store the data as a vector. You would have to play around to see which is more efficient.

http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
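The one-big-array variant can be sketched like this with h5py (the file name, word list, and the tiny dimension are hypothetical; the real vectors would be 4M long):

```python
import numpy as np
import h5py

words = ["cat", "dog", "house"]
dim = 8          # stand-in for the real 4M dimension

with h5py.File("matrix.h5", "w") as f:
    # One large 2-D dataset holding all vectors, one row per word.
    dset = f.create_dataset("vectors", shape=(len(words), dim),
                            dtype="f8", compression="gzip")
    for row, word in enumerate(words):
        dset[row, :] = np.full(dim, row)   # write one vector at a time
    # Store the word list alongside, so that row i belongs to words[i].
    f.create_dataset("words", data=np.array(words, dtype="S"))

with h5py.File("matrix.h5", "r") as f:
    words_back = [w.decode() for w in f["words"][:]]
    row = words_back.index("dog")
    vector = f["vectors"][row, :]     # only this row is read from disk
```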

Pytables can also be a useful storage layer on top of HDF5:

http://www.pytables.org


Avoid using shelve : it is buggy and has cross-platform problems.

The disk-space issue, however, has nothing to do with shelve . Numpy arrays provide an efficient implementation of the pickle protocol, and there is little space overhead in cPickle.dumps(protocol=-1) compared to binary .npy (basically just extra headers in the pickle).

So, if binary pickle is not compact enough, you have to go for compression. Take a look at pytables or h5py (and at the differences between the two).

If the binary pickle protocol is good enough, you may consider something simpler than hdf5: check out sqlitedict as a replacement for shelve . It has no additional dependencies.
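The idea behind sqlitedict can be sketched with only the standard library: a key/value table in SQLite holding pickled values, giving shelve-like access. This is an illustrative stand-in, not sqlitedict's actual implementation, and the file name and helper names are hypothetical:

```python
import pickle
import sqlite3
import numpy as np

# A key/value table in SQLite with pickled values: shelve-like
# word -> vector access with no extra dependencies.
conn = sqlite3.connect("wordDB.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS kv (word TEXT PRIMARY KEY, vec BLOB)")

def put(word, array):
    blob = pickle.dumps(array, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("REPLACE INTO kv VALUES (?, ?)", (word, blob))
    conn.commit()

def get(word):
    row = conn.execute("SELECT vec FROM kv WHERE word = ?", (word,)).fetchone()
    return pickle.loads(row[0])

put("cat", np.arange(8, dtype=np.float64))
vector = get("cat")
conn.close()
```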

