How can I read sequential arrays from a binary file using `np.fromfile`?

I want to read a binary file in Python, the exact layout of which is stored in the file itself.

The file contains a sequence of two-dimensional arrays. Each array is preceded by a pair of integers giving its number of rows and columns. I want to read all the arrays in the file sequentially.

I know this can be done with `f = open("myfile", "rb")` and `f.read(numberofbytes)`, but that is pretty awkward because I would then need to convert the raw bytes into meaningful data structures myself. I would like to use `np.fromfile` with a custom dtype, but I did not find a way to read part of the file, leave it open, and then continue reading with a modified dtype.
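To illustrate what I mean by awkward, here is a minimal sketch of the manual approach for a single array. It assumes the two size integers are 8-byte native-endian signed integers and the data are 8-byte floats; the function name `read_one_array_manually` is only for illustration.

```python
import struct
import numpy as np

def read_one_array_manually(f):
    """Read one (rows, cols, data) record using plain f.read calls."""
    # Assumption: the header is two 8-byte signed integers in native byte order.
    header = f.read(16)
    rows, cols = struct.unpack("=qq", header)
    # Assumption: the contents are float64 values, 8 bytes each.
    raw = f.read(rows * cols * 8)
    # Convert the raw bytes into an array by hand.
    return np.frombuffer(raw, dtype=np.float64).reshape(rows, cols)
```

Every size and format string here has to be kept in sync by hand, which is what I would like to avoid.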

I know that I can call `f.seek(numberofbytes, os.SEEK_SET)` before `np.fromfile` several times, but that would mean a lot of unnecessary jumps in the file.

In short, I want something like MATLAB's `fread` (or at least something like the `read` method of a C++ `ifstream`).

What is the best way to do this?

1 answer

You can pass an open file object to `np.fromfile`, read the dimensions of the first array, then read the contents of the array (again using `np.fromfile`), and repeat the process for any additional arrays in the same file. Each call continues from the file's current position, so no explicit seeking is needed.

For instance:

    import os
    import numpy as np

    # np.int_ replaces the old np.int alias, which newer NumPy versions no longer provide
    def iter_arrays(fname, array_ndim=2, dim_dtype=np.int_, array_dtype=np.double):
        with open(fname, 'rb') as f:
            fsize = os.fstat(f.fileno()).st_size

            # while we haven't yet reached the end of the file...
            while f.tell() < fsize:

                # get the dimensions for this array
                dims = np.fromfile(f, dim_dtype, array_ndim)

                # get the array contents
                yield np.fromfile(f, array_dtype, np.prod(dims)).reshape(dims)

Usage example:

    # write some random arrays to an example binary file
    x = np.random.randn(100, 200)
    y = np.random.randn(300, 400)

    with open('/tmp/testbin', 'wb') as f:
        np.array(x.shape).tofile(f)
        x.tofile(f)
        np.array(y.shape).tofile(f)
        y.tofile(f)

    # read the contents back
    x1, y1 = iter_arrays('/tmp/testbin')

    # check that they match the input arrays
    assert np.allclose(x, x1) and np.allclose(y, y1)

If the arrays are large, you can use `np.memmap` with the `offset=` parameter instead of `np.fromfile` to get the contents of the arrays as memory-mapped arrays, rather than loading them into RAM.
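A minimal sketch of that memory-mapped variant might look like the following. The name `iter_memmaps` and the manual offset bookkeeping are my own illustration, not a NumPy API, and the sketch assumes the same file layout as above (dimensions stored as `np.int_`, contents as `np.double`, every array non-empty).

```python
import os
import numpy as np

def iter_memmaps(fname, array_ndim=2, dim_dtype=np.int_, array_dtype=np.double):
    """Yield each array in the file as a read-only np.memmap (hypothetical helper)."""
    dim_dtype = np.dtype(dim_dtype)
    array_dtype = np.dtype(array_dtype)
    fsize = os.path.getsize(fname)
    offset = 0  # byte position of the next header

    with open(fname, 'rb') as f:
        while offset < fsize:
            # only the small header is actually read into memory
            f.seek(offset)
            dims = np.fromfile(f, dim_dtype, array_ndim)
            offset += dims.nbytes

            # map the array contents instead of loading them
            shape = tuple(int(d) for d in dims)
            yield np.memmap(fname, dtype=array_dtype, mode='r',
                            offset=offset, shape=shape)

            # skip past the contents to reach the next header
            offset += int(np.prod(shape)) * array_dtype.itemsize
```

Unlike `iter_arrays`, this version does seek once per array, but only the headers are read eagerly; the array data is paged in by the operating system when it is accessed.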
