BACKGROUND
The problem I'm working with is the following:
As part of an experiment I am developing for my research, I create a large number of large (length 4M) arrays. They are somewhat sparse, so they can be stored either as scipy.sparse.lil_matrix instances or simply as scipy.array instances (the gain or loss in space is not an issue here).
Each of these arrays must be paired with a string (namely, a word) for the data to make sense, since they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in the list are built one after another and saved to disk before moving on to the next word.
They must be stored on disk in a way that allows them to be retrieved with dictionary-like syntax. For example, if all the words are stored in a DB-like file, I need to be able to open this file and do something like vector = wordDB[word].
CURRENT APPROACH
What I am doing now:
Using shelve to open a shelf named wordDB
Each time a vector (currently a lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during evaluation, I do the reverse: open the shelf, then retrieve each vector with vector = wordDB[word] as it is needed, so that not all vectors have to be held in RAM (which would be impossible).
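A minimal sketch of the workflow described above might look like the following. The build_vector function, the short vector length, the two example words, and the temporary file path are all hypothetical placeholders for illustration; the real vectors have length 4M and are built by the experiment itself.

```python
import os
import shelve
import tempfile

import numpy as np
from scipy.sparse import lil_matrix


def build_vector(word, length=1000):
    """Hypothetical stand-in for the real semantic-vector construction."""
    vec = lil_matrix((1, length), dtype=np.float64)
    vec[0, len(word)] = 1.0  # arbitrary single non-zero entry
    return vec


# Placeholder location for the shelf; a real run would use a fixed path.
db_path = os.path.join(tempfile.mkdtemp(), "wordDB")
words = ["apple", "pear"]

# Build phase: create each vector and write it to the shelf immediately,
# so only one vector is held in memory at a time.
with shelve.open(db_path) as wordDB:
    for word in words:
        wordDB[word] = build_vector(word)

# Evaluation phase: reopen the shelf and fetch vectors on demand.
with shelve.open(db_path, flag="r") as wordDB:
    vector = wordDB["apple"]
    loaded_nonzero = vector.nonzero()[1].tolist()
```

Each assignment to wordDB[word] pickles the lil_matrix and writes it under the word as its key, and each lookup unpickles it again.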
The above "solution" fits my needs in terms of solving the problem as specified. The trouble is that when I use this method to build and store vectors for a large number of words, I simply run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays. That makes this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is: is there a way of serializing my set of arrays that will:
Save the arrays themselves in a compressed binary format, akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk like a dictionary, maintaining the association between words and arrays?
Edward Grefenstette