Is it possible to map non-contiguous data on disk to an array with Python?

I want to map a large Fortran record (12 GB) on the hard drive to a numpy array. (Mapping instead of loading, to save memory.)

The data stored in a Fortran record is not contiguous, because it is divided by record markers. The structure of the record is "marker, data, marker, data, ..., data, marker". The lengths of the data areas and of the markers are known.

The length of the data between the markers is not a multiple of 4 bytes; otherwise I could map each data area to its own array.

The first marker can be skipped by setting the offset in memmap. Is it possible to skip the other markers as well and map the data to an array?
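For reference, skipping the leading marker this way looks roughly like the sketch below (the file name and the 4-byte marker size are placeholders, not the actual setup):

    import numpy as np

    marker = 4  # assumed size of one record marker, in bytes

    # Skip the opening marker by starting the map 4 bytes into the file.
    # Mapping as raw bytes avoids alignment issues; with no shape given,
    # memmap maps everything from the offset to the end of the file,
    # which still includes all the later markers -- that is the problem.
    rec = np.memmap('data.bin', dtype='uint8', mode='r', offset=marker)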

Apologies for any ambiguity, and thanks for any solution or suggestion.




Edit (May 15):

These are Fortran unformatted files. The data stored in the record is an array of (1024^3) * 3 float32 values (12 GB).

Here is how a variable-length record larger than 2 gigabytes is stored:

[Figure: data structure of a variable-length record, split into subrecords each bounded by a start marker and an end marker]

(For more details, see here → section [Record Types] → [Variable Length Records].)

In my case, every subrecord except the last is 2147483639 bytes long, and consecutive subrecords are separated by 8 bytes (as you can see in the figure above: the end marker of the previous subrecord plus the start marker of the next one, 8 bytes in total).

We can see that the first subrecord ends with the first three bytes of some floating-point number, and the second subrecord starts with its remaining byte, since 2147483639 mod 4 = 3.
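Just to make the layout concrete, here is a small sketch of how the byte offsets and lengths of the data areas could be computed from the numbers above (4-byte markers and 2147483639-byte subrecords are taken from the description; the code itself is only an illustration):

    MARKER = 4                      # bytes per record marker
    SUBRECORD = 2147483639          # bytes of data per full subrecord
    TOTAL = 1024**3 * 3 * 4         # 12 GB of float32 payload

    # (offset, length) of each data area, skipping the 8 bytes of
    # end-marker + start-marker that separate consecutive subrecords.
    areas = []
    pos = MARKER                    # skip the opening marker
    remaining = TOTAL
    while remaining > 0:
        length = min(SUBRECORD, remaining)
        areas.append((pos, length))
        pos += length + 2 * MARKER
        remaining -= length

Each of these areas could then be mapped with numpy.memmap as raw bytes (dtype='uint8'), since the lengths are not multiples of 4.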

python arrays numpy fortran hdf5
May 13, '13 at 5:37
1 answer

I wrote another answer because the numpy.memmap example given here works:

    import numpy as np

    offset = 0
    data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                      offset=0, shape=(size1,))
    offset += size1 * byte_size
    data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                      offset=offset, shape=(size2,))
    offset += size2 * byte_size
    data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                      offset=offset, shape=(size3,))

For int32, byte_size = 32/8 = 4; for int16, byte_size = 16/8 = 2, and so on.
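As a small convenience, numpy can also report the element size directly, which avoids computing byte_size by hand:

    import numpy as np

    byte_size = np.dtype('int32').itemsize   # 4
    byte_size = np.dtype('int16').itemsize   # 2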

If the dimensions are constant, you can load the data into a 2D array, for example:

    shape = (total_length // size, size)
    data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)

You can modify the memmap objects as much as you want. It is even possible to create arrays that point to the same elements; in that case, changes made in one are automatically reflected in the other.
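A tiny illustration of that behaviour (the toy file below is created only for the demo and has nothing to do with the 12 GB case):

    import numpy as np

    # Create a small file for the demo.
    np.arange(10, dtype='int32').tofile('tmp')

    # Two memmaps over the same bytes on disk.
    a = np.memmap('tmp', dtype='int32', mode='r+', shape=(10,))
    b = np.memmap('tmp', dtype='int32', mode='r+', shape=(10,))

    a[0] = 99
    a.flush()
    print(b[0])   # 99 -- both maps refer to the same data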


May 16 '13 at 21:20