Lazy evaluation for iterating over large NumPy arrays

I have a Python program that processes fairly large NumPy arrays (hundreds of megabytes each) stored on disk in pickle files (one array of roughly 100 MB per file). When I want to run a query on the data, I load the entire array via pickle and then execute the query, so from the program's point of view the whole array is in memory, even if the OS swaps parts of it out. I did this mainly because vectorized operations on NumPy arrays are significantly faster than for loops over each element.

I run this on a web server whose memory constraints I quickly hit. I run many different queries on the data, so writing "chunking" code that loads parts of the data from individual pickle files, processes them, and then moves on to the next chunk would likely add a lot of complexity. Ideally, the chunking would be transparent to any function that processes these large arrays.

The ideal solution seems to be something like a generator that periodically loads a block of data from disk and then yields the array's values one by one. This would significantly reduce the program's memory footprint without requiring extra work in the individual query functions. Is it possible to do something like this?

3 answers

PyTables is a package for managing hierarchical data sets. It is designed to solve this problem for you.
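For illustration, here is a minimal sketch (assuming the current PyTables API; the file name data.h5 and the 10-column shape are hypothetical) of storing a large array in HDF5 as an extendable array and then reading it back one slice at a time, so only the requested rows are loaded from disk:

    import numpy as np
    import tables

    # write a large array into an HDF5 file as an extendable array
    with tables.open_file("data.h5", mode="w") as f:
        earr = f.create_earray(f.root, "data", tables.Float64Atom(), shape=(0, 10))
        earr.append(np.random.rand(1000, 10))  # rows can be appended in chunks

    # later: slicing reads only the requested rows from disk
    with tables.open_file("data.h5", mode="r") as f:
        chunk = f.root.data[100:200]  # a plain NumPy array of 100 rows

Because HDF5 storage is chunked, PyTables can also iterate over an array row by row without ever holding the whole thing in memory.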


A NumPy memory-mapped array (memmap) may be a good choice here.

It gives you access to a NumPy array backed by a binary file on disk, without loading the entire file into memory at once.

(Note: I believe, though I'm not certain, that NumPy's memmap object is not the same as Python's mmap; in particular, NumPy's behaves like an array, while Python's behaves like a file.)

Method signature:

    A = NP.memmap(filename, dtype, mode, shape, order='C')

All of the arguments are straightforward (i.e., they have the same meaning as elsewhere in NumPy) except for order, which refers to the memory layout of the ndarray. The default is "C", and the only other option is "F", for Fortran; as elsewhere, these correspond to row-major and column-major order, respectively.

Two methods worth knowing:

flush (which writes to disk any changes you make to the array); and

close (which flushes any pending changes and releases the array-like memory map of the data stored on disk)

Usage example:

    import numpy as NP
    from tempfile import mkdtemp
    import os.path as PH

    my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
    my_data = NP.array(my_data, dtype="float")

    fname = PH.join(mkdtemp(), 'tempfile.dat')

    # create a memmap backed by a file on disk
    mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))

    # now write the data to the memmap array:
    mm_obj[:] = my_data[:]

    # reload the memmap:
    mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

    # verify that it's there!:
    print(mm_obj[:20, :])
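To make the chunking transparent, as the question asks, a memmap can be wrapped in a generator. A minimal sketch, reusing the fname and shape from the example above (iter_blocks and block_rows are hypothetical names, not part of the NumPy API):

    def iter_blocks(fname, shape, dtype="float32", block_rows=100):
        # open the file as a read-only memmap; the OS pages in
        # only the rows that each slice actually touches
        mm = NP.memmap(fname, dtype=dtype, mode="r", shape=shape)
        for start in range(0, shape[0], block_rows):
            yield mm[start:start + block_rows]

    # any vectorized query can now run block by block:
    total = sum(block.sum() for block in iter_blocks(fname, (1000, 10)))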

The ideal solution seems to be something like a generator that periodically loads a block of data from disk and then yields the array's values one by one. This would significantly reduce the program's memory footprint without requiring extra work in the individual query functions. Is it possible to do something like this?

Yes, but not by storing the arrays on disk in a single pickle: the pickle protocol simply isn't designed for "incremental deserialization".

You can, however, write multiple pickles to the same open file, one after another (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to call pickle.load each time.

Example code (Python 3.1; in 2.x you would want cPickle instead of pickle, protocol -1, and so on):

    >>> import pickle
    >>> lol = [range(i) for i in range(5)]
    >>> fp = open('/tmp/bah.dat', 'wb')
    >>> for subl in lol: pickle.dump(subl, fp)
    ...
    >>> fp.close()
    >>> fp = open('/tmp/bah.dat', 'rb')
    >>> def lazy(fp):
    ...     while True:
    ...         try: yield pickle.load(fp)
    ...         except EOFError: break
    ...
    >>> list(lazy(fp))
    [range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
    >>> fp.close()
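The same pattern carries over to NumPy arrays. A sketch of the idea (the file name /tmp/chunks.dat, the helper lazy_rows, and the 100-row chunk size are illustrative choices, not part of the answer above):

    import pickle
    import numpy as np

    big = np.arange(10000, dtype="float32").reshape(1000, 10)

    # stream the array to disk as a sequence of 100-row pickles
    with open('/tmp/chunks.dat', 'wb') as fp:
        for start in range(0, big.shape[0], 100):
            pickle.dump(big[start:start + 100], fp)

    def lazy_rows(path):
        # load one 100-row chunk at a time, yielding individual rows
        with open(path, 'rb') as fp:
            while True:
                try:
                    chunk = pickle.load(fp)
                except EOFError:
                    return
                for row in chunk:
                    yield row

    # the caller never sees more than one chunk in memory at once
    total = sum(row.sum() for row in lazy_rows('/tmp/chunks.dat'))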
