NumPy: efficiently reading a large array

I have a binary file containing a dense n*m matrix of 32-bit floats. What is the most efficient way to read it into a Fortran-ordered NumPy array?

The file is several gigabytes in size. I control the format, but it should be compact (i.e., about 4*n*m bytes long) and easy to produce from non-Python code.

Edit: it is imperative that the method directly produce a Fortran-ordered matrix (because of the size of the data, I cannot afford to create a C-ordered matrix and then convert it into a separate Fortran-ordered copy).

performance python numpy scipy large-files
2 answers

NumPy provides fromfile() for reading binary data.

 a = numpy.fromfile("filename", dtype=numpy.float32) 

will create a one-dimensional array containing your data. To access it as a two-dimensional, Fortran-ordered n x m matrix, you can reshape it:

 a = a.reshape((n, m), order="F")

[EDIT: reshape() can copy the data in this case, depending on the NumPy version (see comments). To do this without copying, use

 a = a.reshape((m, n)).T 

Thanks to Joe Kington for this.]
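The no-copy approach above can be checked on a small stand-in file. The sketch below is only illustrative: the file name, the dimensions, and the temporary directory are assumptions, and the file is assumed to hold the matrix column by column. It also shows numpy.memmap with order="F" as a possible alternative that maps the file instead of loading it.

```python
import os
import tempfile

import numpy as np

n, m = 3, 4  # small stand-in for the real matrix dimensions

# Reference matrix; the file will hold it column by column (Fortran order).
expected = np.arange(n * m, dtype=np.float32).reshape(n, m)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.bin")  # hypothetical file name
    # tofile() always writes in C order, so writing the transpose
    # emits the columns of `expected` one after another.
    expected.T.tofile(path)

    # Read the raw floats, then reinterpret them without a second copy.
    flat = np.fromfile(path, dtype=np.float32)
    a = flat.reshape((m, n)).T

    # Alternative: map the file instead of reading it into memory.
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n, m), order="F")
    mm_matches = bool(np.array_equal(mm, expected))
    del mm  # release the mapping before the directory is removed

is_view = bool(np.shares_memory(a, flat))    # True if no copy was made
is_fortran = bool(a.flags["F_CONTIGUOUS"])   # True for column-major layout
print(a.shape, is_view, is_fortran, mm_matches)
```

Checking np.shares_memory() and the F_CONTIGUOUS flag on your own NumPy version is a cheap way to confirm that no hidden copy was made before trying this on a multi-gigabyte file.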

But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both tools have FAQ entries comparing them with each other. I generally prefer h5py, although PyTables seems to be more commonly used (and the scopes of the two projects are slightly different).

HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I don't know which language you want to use to write the data...


Basically, NumPy stores arrays as flat vectors. Multiple dimensions are just an illusion created by the different views and strides that the NumPy iterator uses.
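To make the "flat vector plus strides" idea concrete, here is a minimal sketch that computes where an element lives in the flat buffer from the strides alone. The array contents and the chosen indices are arbitrary illustrations.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C-ordered, strides (32, 8)
flat = a.ravel()  # for a C-contiguous array this is the flat storage order

# Element a[i, j] sits at byte offset i*strides[0] + j*strides[1].
i, j = 1, 2
offset_bytes = i * a.strides[0] + j * a.strides[1]
value_from_flat = flat[offset_bytes // a.itemsize]

# The transpose is the same buffer with the strides swapped.
t = a.T
t_offset_bytes = j * t.strides[0] + i * t.strides[1]
value_from_transpose = flat[t_offset_bytes // t.itemsize]

print(a[i, j], value_from_flat, value_from_transpose)
```

Both lookups land on the same position in the flat buffer, which is why a transpose costs nothing: only the stride bookkeeping changes.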

For a thorough but easy-to-follow explanation of how NumPy works, see the excellent chapter 19 of the book Beautiful Code.

At least NumPy's array() and reshape() take an order argument that accepts C ('C'), Fortran ('F'), or preserved ('A') order. See also the question "How to force the order of a numpy array for fortran style?"
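As a small illustration of the order argument, the following sketch builds the same values in C and Fortran layout and inspects the contiguity flags. The variable names are just for this example.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]], order="C")  # row-major layout
f = np.array(c, order="F")  # copies the values into column-major layout
k = np.array(f, order="A")  # 'A' keeps Fortran layout for F-contiguous input

same_values = bool(np.array_equal(c, f))
c_flags = (bool(c.flags["C_CONTIGUOUS"]), bool(c.flags["F_CONTIGUOUS"]))
f_flags = (bool(f.flags["C_CONTIGUOUS"]), bool(f.flags["F_CONTIGUOUS"]))
k_is_fortran = bool(k.flags["F_CONTIGUOUS"])

print(same_values, c_flags, f_flags, k_is_fortran)
```

Note that np.array(c, order="F") makes a full copy, which is exactly what the question wants to avoid for a multi-gigabyte matrix; it is shown here only to explain what the argument means.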

Example with the default C indexing (row-major order):

 >>> a = np.arange(12).reshape(3,4) # <- C order by default
 >>> a
 array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
 >>> a[1]
 array([4, 5, 6, 7])
 >>> a.strides
 (32, 8)

Indexing using Fortran order (column-major order):

 >>> a = np.arange(12).reshape(3,4, order='F')
 >>> a
 array([[ 0,  3,  6,  9],
        [ 1,  4,  7, 10],
        [ 2,  5,  8, 11]])
 >>> a[1]
 array([ 1,  4,  7, 10])
 >>> a.strides
 (8, 24)

Another view

In addition, you can always get a transposed view using the array attribute T:

 >>> a = np.arange(12).reshape(3,4, order='C')
 >>> a.T
 array([[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]])
 >>> a = np.arange(12).reshape(3,4, order='F')
 >>> a.T
 array([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]])

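You can also set the strides manually (note that the NumPy documentation discourages assigning to strides directly and recommends numpy.lib.stride_tricks.as_strided instead):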

 >>> a = np.arange(12).reshape(3,4, order='C')
 >>> a
 array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
 >>> a.strides
 (32, 8)
 >>> a.strides = (8, 24)
 >>> a
 array([[ 0,  3,  6,  9],
        [ 1,  4,  7, 10],
        [ 2,  5,  8, 11]])
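The same reinterpretation can be sketched without assigning to the strides attribute, using numpy.lib.stride_tricks.as_strided. This is an illustrative alternative, not part of the original answer; as_strided performs no bounds checking, so the shape and strides must be chosen carefully.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C order, strides (32, 8)
item = a.itemsize

# Reinterpret the same buffer as a 3x4 column-major matrix:
# step one item down a column, three items across a row.
b = as_strided(a, shape=(3, 4), strides=(item, 3 * item))

reference = np.arange(12, dtype=np.int64).reshape(3, 4, order="F")
matches = bool(np.array_equal(b, reference))
shares = bool(np.shares_memory(a, b))  # same memory, only the view differs
print(b.strides, matches, shares)
```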

Source: https://habr.com/ru/post/649865/

