How to load one line at a time from pickle file?

I have a large dataset: 20,000 x 40,000 as a numpy array. I saved it as a pickle file.

Instead of reading this huge dataset into memory, I would like to read only a few (say, 100) rows at a time, for use as a mini-batch.

How can I read only a few randomly selected (without replacement) rows from a pickle file?
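To make the access pattern concrete, something like this is what I have in mind (load_rows is just a placeholder for whatever mechanism an answer provides):

import numpy

n_rows = 20000
batch_size = 100

# sample 100 distinct row indices for one mini-batch
batch_indices = numpy.random.choice(n_rows, size=batch_size, replace=False)

# mini_batch = load_rows(batch_indices)  # hypothetical loader that reads only these rows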

+5
3 answers

You can write pickles incrementally to a file, which allows you to load them incrementally as well.

Take the following example. Here we iterate over the items of a list and pickle each one in turn.

>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()

Now we can do the same process in reverse and load each object as needed. For example, let's say we just want the first element and don't want to iterate over the whole file.

>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1

At this point, the file stream has only advanced as far as the first object. The remaining objects were not loaded, which is exactly the behavior you want. As proof, you can read the rest of the file and see that the remaining data is still sitting there.

>>> f.read()
'I2\n.I3\n.'
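Applied to your array, a minimal sketch of the same idea (Python 3, so pickle instead of cPickle) would dump each row as its own object; note that an Unpickler can only move forward through the file, so you can read a prefix cheaply but not jump to arbitrary rows:

import pickle
import numpy

a = numpy.arange(12).reshape(4, 3)

# write each row as a separate pickle object
with open('rows.pkl', 'wb') as f:
    pickler = pickle.Pickler(f)
    for row in a:
        pickler.dump(row)

# read rows back one at a time; only the requested prefix is deserialized
wanted = 2  # load rows 0 and 1, leave the rest on disk
with open('rows.pkl', 'rb') as f:
    unpickler = pickle.Unpickler(f)
    rows = [unpickler.load() for _ in range(wanted)]

print(rows)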
+4

Since you do not know the internal workings of pickle, you need to use a different storage method. The script below uses tobytes() to save each row of the data in a raw file.

Since the length of each row is known, its offset in the file can be calculated and the row fetched with seek() and read(). After that, it is converted back to an array with frombuffer().

A big caveat, however, is that the size of the array is not saved (it could be added, but that requires some extra bookkeeping) and that this method may not be as portable as a pickled array.
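For illustration only, the shape and dtype could be stored in a small header written before the rows. This is just a sketch of that extra bookkeeping; it is not part of the script below, whose offsets would also have to be shifted by the header size:

import struct
import numpy

def write_header(fd, a):
    # rows and cols as 64-bit ints, then the length of the dtype name and the name itself
    name = a.dtype.name.encode('ascii')
    fd.write(struct.pack('<qqq', a.shape[0], a.shape[1], len(name)))
    fd.write(name)

def read_header(fd):
    # read back rows, cols and dtype before fetching any rows
    rows, cols, name_len = struct.unpack('<qqq', fd.read(24))
    dtype = numpy.dtype(fd.read(name_len).decode('ascii'))
    return rows, cols, dtype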

As @PadraicCunningham pointed out in a comment, memmap would likely be an alternative and elegant solution.

Performance note: after reading the comments, I ran a short test. On my machine (16 GB of RAM, encrypted SSD), I was able to read 40,000 random rows in 24 seconds (with a 20,000 x 40,000 matrix, of course, not the 10x10 from the example).

from __future__ import print_function

import numpy
import random


def dumparray(a, path):
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i, ...].tobytes())


class RandomLineAccess(object):
    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols * dtype.itemsize

    def read_line(self, line):
        offset = line * self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)
        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()


def main():
    lines = 10
    cols = 10
    path = '/tmp/array'

    a = numpy.zeros((lines, cols))
    dtype = a.dtype

    for i in range(lines):
        # add some data to distinguish lines
        numpy.ndarray.fill(a[i, ...], i)

    dumparray(a, path)

    rla = RandomLineAccess(path, cols, dtype)

    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))


if __name__ == '__main__':
    main()
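To illustrate the memmap alternative mentioned above: assuming the same raw file written by dumparray() and a known shape and dtype, a rough equivalent would be:

import numpy

# map the raw file without reading it into memory;
# shape and dtype must be known, since the raw file does not store them
m = numpy.memmap('/tmp/array', dtype=numpy.float64, mode='r', shape=(10, 10))

# only the touched rows are actually paged in from disk
print(m[3])          # a single row
print(m[[1, 4, 7]])  # a few rows at once (fancy indexing copies them into RAM)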
+3

Thanks, everyone. I ended up going with a workaround (a machine with a larger amount of RAM, so that I could actually load the dataset into memory).

0
