Fastest way to read a binary file with a specified format?

I have large binary data files that have a predefined format, originally written by a Fortran program as little endian. I would like to read these files in the fastest, most efficient way, so the array package seemed right up my alley, as suggested here: /questions/399237/improve-speed-of-reading-and-converting-from-binary-file.

The problem is that the given format is heterogeneous. It looks something like this: ['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']

with every integer i occupying 4 bytes and every double d occupying 8 bytes.
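For concreteness (and assuming standard struct sizes, matching the chunks above), the twelve chunks concatenate into a single little-endian format string describing one record; a quick sketch to compute the full record size:

import struct

chunks = ['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']
record_fmt = '<' + ''.join(c.lstrip('<') for c in chunks)   # '<2i5d2idi3d2i3didi3d'
record_size = struct.calcsize(record_fmt)                   # 9 ints + 16 doubles = 164 bytes
print(record_fmt, record_size)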

Is there any way I can still use the super-efficient array package (or another suggestion), but with the correct format?

0
5 answers

There are many good and useful answers here, but I think the best solution needs a more detailed explanation. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and building the array separately, but it is also a bit more fiddly.

import numpy as np

line_cols = 20                    # For example
line_rows = 40000                 # For example
data_fmt = 15*'f8,' + 5*'f4,'     # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5           # Bytes per record for the example format above

with open(filename, 'rb') as f:
    # Read the whole file into a buffer-backed structured ndarray, cast every
    # field to an 8-byte double, then view/reshape it as a plain 2-D array.
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize)
                      )[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:, :-1]

Here we open the file as a binary file using the 'rb' option of open. Then we construct our ndarray with the correct shape and dtype to fit our read buffer. We then reduce the ndarray to a 1-D array by taking its zeroth index, which is where all our data is hiding. Finally, we reshape the array using np.astype, np.view and np.reshape. This is because np.reshape doesn't like data with mixed dtypes, and I'm fine with having my integers expressed as doubles.

This method is ~100x faster than looping line-by-line through the data, and it can potentially be compressed into a single line of code.
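As a simplified illustration of the same one-pass idea (the file name 'mixed.bin' and the 3-double + 2-float layout are made up for the sketch, not my actual format), you can hand f.read() straight to a buffer-backed structured ndarray and pull the fields out afterwards:

import numpy as np

n_rows = 1000
rec_dt = np.dtype('3f8,2f4')        # one record: 3 doubles + 2 floats = 32 bytes

with open('mixed.bin', 'rb') as f:
    recs = np.ndarray(shape=(n_rows,),
                      dtype=rec_dt,
                      buffer=f.read(n_rows * rec_dt.itemsize))

doubles = recs['f0']                # shape (n_rows, 3), float64
floats = recs['f1'].astype('f8')    # the float32 field, upcast to float64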

In the future I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if it will be faster, but it may be worth a try.

0

Use struct. In particular, struct.unpack.

 result = struct.unpack("<2i5d...", buffer) 

Here buffer contains the given binary data.
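A more complete sketch of this approach, assuming the record layout from the question (the file name is a placeholder), pre-compiles the format with struct.Struct and walks the whole file one record at a time with iter_unpack():

import struct

# The 12 chunks from the question concatenated into one little-endian format string
record = struct.Struct("<2i5d2idi3d2i3didi3d")
print(record.size)                    # 164 bytes: 9 ints + 16 doubles

with open('binary_file', 'rb') as f:  # placeholder file name
    blob = f.read()

# iter_unpack() yields one tuple of 25 values per 164-byte record
records = list(record.iter_unpack(blob))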

+3

It is not clear from your question whether you are worried about the actual speed of reading the file (and building the data structure in memory), or about the speed of processing the data later.

If you are reading only once and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with the same format), parse it with struct.unpack and append it to a [double] array:

import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)
        data.extend(values)

This assumes you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory (~22% for the record from your question: 25 doubles at 8 bytes each is 200 bytes, vs. 164 bytes in the original layout).

If you are reading the data from the file many times, it would be worthwhile to convert everything into one large array of doubles (like above) and write it back to another file, from which you can later read with array.fromfile():

import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
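The one-time preprocessing pass that produces such a 'preprocessed' file is not spelled out above; a minimal sketch, assuming the record format from the question and the placeholder file names used in the listings, just combines the first listing with array.tofile():

import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8                 # 9 ints + 16 doubles
with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        data.extend(struct.unpack("<2i5d2idi3d2i3didi3d", record))

with open('preprocessed', 'wb') as fout:
    data.tofile(fout)                             # homogeneous array of doubles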

Update. Thanks to a nice benchmark by @martineau, we now know for a fact that preprocessing the data and turning it into a homogeneous array of doubles ensures that loading such data from a file (with array.fromfile()) is ~20x to ~40x faster than reading it record by record, unpacking and appending to an array (as shown in the first code listing above).

A faster (and more standard) variation of record-by-record reading in @martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.

+3

Major update: Modified to use the proper code for reading in a preprocessed array file (the using_preprocessed_file() function below), which dramatically changed the results.

To determine which method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction I'm only posting the code tested and the related results. (If there's sufficient interest in the methodology, I'll post the whole script.)

Here are the snippets of code that were compared:

@TESTCASE('Read and constuct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures

@TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size
    return structures

@TESTCASE('Convert to array (@randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles (standard sizes)
    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)
    return data

@TESTCASE('Read array file (@randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
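The snippets refer to a few names (fmt1, test_filenames, and the TESTCASE decorator) defined in the part of the script I'm not posting; a minimal, hypothetical stand-in, just so the snippets above can run, could look something like this (it is not the actual benchmark harness):

import array
import os
import struct
from functools import partial

# Stand-in definitions (placeholders only):
fmt1 = struct.Struct("<2i5d2idi3d2i3didi3d")              # the 164-byte record format
test_filenames = ['test_data.bin', 'preprocessed.bin']    # placeholder file names

def TESTCASE(label, setup=None):
    """Tag a function with a display label and an optional setup-function name."""
    def decorator(func):
        func.label, func.setup = label, setup
        return func
    return decorator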

And here are the results of running them on my system:

Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)

Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.06430 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative  6.16x ( 516.36% slower)
Read and constuct piecemeal with struct: 0.43283 secs, relative  6.73x ( 573.09% slower)
    Convert to array (@randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)

Interestingly, most of the snippets are actually faster in Python 2...

Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)

Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.03586 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative  7.77x ( 677.17% slower)
Read and constuct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
    Convert to array (@randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
+2

Take a look at the documentation for the numpy fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing

The simplest example:

import numpy as np

data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))

Read more about "Structured Arrays" in numpy and how to specify their data types here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
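For the record layout from the question, a structured dtype built from a list of (name, type, shape) tuples avoids any ambiguity in the format string; the field names below are made up for the example, and 'binary_file' is a placeholder:

import numpy as np

# One 164-byte record: 9 little-endian int32s and 16 little-endian float64s,
# grouped as in the question. Field names are arbitrary placeholders.
record_dtype = np.dtype([
    ('ints1', '<i4', 2), ('dbls1', '<f8', 5),
    ('ints2', '<i4', 2), ('dbl1',  '<f8'),
    ('int1',  '<i4'),    ('dbls2', '<f8', 3),
    ('ints3', '<i4', 2), ('dbls3', '<f8', 3),
    ('int2',  '<i4'),    ('dbl2',  '<f8'),
    ('int3',  '<i4'),    ('dbls4', '<f8', 3),
])

data = np.fromfile('binary_file', dtype=record_dtype)   # one row per record
print(record_dtype.itemsize)                            # 164
doubles1 = data['dbls1']                                # shape (n_records, 5)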

0
