Efficient storage of scipy / numpy arrays in dictionaries

BACKGROUND

The problem I'm working with is the following:

  • As part of the experiment I am developing for my research, I create a large number of large (4M length) arrays that are somewhat sparse, and can therefore be stored either as scipy.sparse.lil_matrix instances or simply as scipy.array instances (the gain / loss of space is not the issue here).

  • Each of these arrays has to be paired with a string (namely, a word) for the data to make sense, since they are semantic vectors representing the meaning of that string. I need to preserve this pairing.

  • The vectors for each word in the list are built one after another and saved to disk before moving on to the next word.

  • They should be stored on disk in a way that lets them be retrieved later with dictionary syntax. For example, if all the words are stored in a DB-like file, I need to be able to open this file and do something like vector = wordDB[word] .

CURRENT APPROACH

What I am doing now:

  • Using shelve to open a shelf named wordDB

  • Each time the vector (currently a lil_matrix from scipy.sparse ) for a word is built, storing it in the shelf: wordDB[word] = vector

  • When I need to use the vectors during the evaluation, doing the reverse: opening the shelf and retrieving each vector with vector = wordDB[word] as it is needed, so that not all vectors have to be held in RAM at once (which would be impossible).
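The workflow above can be sketched as follows. This is a minimal illustration, not my actual code: build_vector and the word list are hypothetical stand-ins, and the vectors here are tiny dense arrays rather than the real 4M-length sparse ones.

```python
import shelve
import numpy as np

# Hypothetical stand-in for building a semantic vector for a word;
# the real vectors are 4M-length sparse arrays.
def build_vector(word):
    return np.zeros(8)

words = ["cat", "dog"]

# Build each vector and write it to the shelf before moving on.
wordDB = shelve.open("wordDB")
for word in words:
    wordDB[word] = build_vector(word)
wordDB.close()

# Later, during evaluation, re-open the shelf and look vectors up
# one at a time, so they are never all in RAM at once.
wordDB = shelve.open("wordDB")
vector = wordDB["cat"]
wordDB.close()
```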

The above "solution" fits my needs in the sense that it solves the problem as stated. The trouble is that when I use this method to create and store vectors for a large number of words, I simply run out of disk space.

This is, as far as I can tell, because shelve pickles the stored data, which is not an efficient way to store large arrays, making this storage problem intractable with shelve for the number of words I need to handle.

PROBLEM

The question is: is there a way to store my set of arrays that would:

  • Save the arrays themselves in a compact binary format, akin to the .npy files generated by scipy.save ?

  • Satisfy my requirement that the data be read from disk as a dictionary, maintaining the mapping between words and arrays?

4 answers

As JoshAdel already suggested, I would go for HDF5; the easiest way to use it is h5py:

http://h5py.alfven.org/

You can attach several attributes to an array with a dictionary-like syntax:

 dset.attrs["Name"] = "My Dataset" 

where dset is your dataset, which can be sliced exactly like a numpy array, but in the background it does not load the whole array into memory.
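A minimal sketch of this idea, assuming one HDF5 dataset per word (the file name, word, and data here are hypothetical examples):

```python
import numpy as np
import h5py

# One dataset per word; the word itself can serve as the dataset name,
# and extra metadata goes into .attrs, which behaves like a dictionary.
with h5py.File("wordDB.h5", "w") as f:
    dset = f.create_dataset("house", data=np.arange(8, dtype=np.float64),
                            compression="gzip")
    dset.attrs["Name"] = "My Dataset"

# Reading back: slicing a dataset reads only the requested part
# from disk, not the whole array.
with h5py.File("wordDB.h5", "r") as f:
    first_half = f["house"][:4]          # partial read
    name = f["house"].attrs["Name"]
```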


I would suggest using scipy.save and keeping a dictionary mapping each word to its file name.
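A sketch of this approach, using numpy.save (of which scipy.save was a thin alias) plus a small pickled index; the directory layout and file names are hypothetical:

```python
import os
import pickle
import numpy as np

# Hypothetical layout: one .npy file per word, plus a small pickled
# index mapping each word to its file name.
outdir = "vectors"
os.makedirs(outdir, exist_ok=True)

index = {}
for i, word in enumerate(["cat", "dog"]):
    fname = os.path.join(outdir, "vec_%d.npy" % i)
    np.save(fname, np.full(8, i, dtype=np.float64))  # stand-in vector
    index[word] = fname

with open(os.path.join(outdir, "index.pkl"), "wb") as f:
    pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

# Dictionary-style lookup: consult the index, then load only that file.
with open(os.path.join(outdir, "index.pkl"), "rb") as f:
    index = pickle.load(f)
vector = np.load(index["dog"])
```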


Have you tried just pickling the dictionary directly with cPickle , using:

 import cPickle
 DD = dict()
 f = open('testfile.pkl', 'wb')
 cPickle.dump(DD, f, -1)
 f.close()

Alternatively, I would just store the vectors in one large multidimensional array using hdf5 or netcdf, if necessary, since this lets you open a large array without loading it all into memory at once and then grab the slices you need. You can then store the words as an additional group in the netcdf4 / hdf5 file and use shared indices to quickly associate the matching slice from each group, or simply name each group after a word and then store the data as a vector. You would have to play around to see which is more efficient.

http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
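The one-big-array variant can be sketched like this with h5py (the file name, word list, and the tiny dimension are hypothetical; the real vectors would be 4M long):

```python
import numpy as np
import h5py

words = ["cat", "dog", "house"]
dim = 8          # stand-in for the real 4M dimension

with h5py.File("matrix.h5", "w") as f:
    # One large 2-D dataset holding all vectors, one row per word.
    dset = f.create_dataset("vectors", shape=(len(words), dim),
                            dtype="f8", compression="gzip")
    for row, word in enumerate(words):
        dset[row, :] = np.full(dim, row)   # write one vector at a time
    # Store the word list alongside, so that row i belongs to words[i].
    f.create_dataset("words", data=np.array(words, dtype="S"))

with h5py.File("matrix.h5", "r") as f:
    words_back = [w.decode() for w in f["words"][:]]
    row = words_back.index("dog")
    vector = f["vectors"][row, :]     # only this row is read from disk
```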

Pytables can also be a useful storage layer on top of HDF5:

http://www.pytables.org


Avoid using shelve : it is buggy and has cross-platform problems.

The disk-space issue, however, has nothing to do with shelve . Numpy arrays provide an efficient implementation of the pickle protocol, and there is little space overhead in cPickle.dumps(protocol=-1) compared to binary .npy (basically just extra headers in the pickle).

So, if binary pickle is not compact enough, you have to go for compression. Take a look at pytables or h5py (and at the differences between the two).

If the binary pickle protocol is good enough, you may consider something simpler than hdf5: check out sqlitedict as a replacement for shelve . It has no additional dependencies.
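The idea behind sqlitedict can be sketched with only the standard library: a key/value table in SQLite holding pickled values, giving shelve-like access. This is an illustrative stand-in, not sqlitedict's actual implementation, and the file name and helper names are hypothetical:

```python
import pickle
import sqlite3
import numpy as np

# A key/value table in SQLite with pickled values: shelve-like
# word -> vector access with no extra dependencies.
conn = sqlite3.connect("wordDB.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS kv (word TEXT PRIMARY KEY, vec BLOB)")

def put(word, array):
    blob = pickle.dumps(array, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("REPLACE INTO kv VALUES (?, ?)", (word, blob))
    conn.commit()

def get(word):
    row = conn.execute("SELECT vec FROM kv WHERE word = ?", (word,)).fetchone()
    return pickle.loads(row[0])

put("cat", np.arange(8, dtype=np.float64))
vector = get("cat")
conn.close()
```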

