Numpy: efficiently reading a large array

Question

I have a binary file containing a dense n*m matrix of 32-bit floats. What is the most efficient way to read it into a Fortran-ordered NumPy array?

The file is several gigabytes in size. I control the format, but it should be compact (i.e., about 4*n*m bytes in length) and easy to create from non-Python code.

Edit: it is imperative that the method directly produce a Fortran-ordered matrix (because of the size of the data, I can't afford to create a C-ordered matrix and then convert it into a separate Fortran-ordered copy).

performance python numpy scipy large-files

NPE Dec 6 '10 at 11:36

2 answers

Basically, NumPy stores arrays as flat vectors. Multiple dimensions are just an illusion created by the different views and strides that the NumPy iterator uses.

For a detailed but approachable explanation of how NumPy works, see the excellent chapter 19 of the book Beautiful Code.

At least NumPy's array() and reshape() take an order argument: C ('C'), Fortran ('F'), or preserve the existing order ('A'). See also the question "How to force the order of a numpy array for fortran style?"
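To check which layout an array actually has, the flags attribute can be inspected. The following is only a minimal sketch; the variable names c, f and g are made up for illustration:

```python
import numpy as np

c = np.arange(12).reshape(3, 4)              # C (row-major) order by default
f = np.arange(12).reshape(3, 4, order='F')   # Fortran (column-major) order

# The flags attribute reports the memory layout of each array.
print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])

# np.asfortranarray converts a C-ordered array by making a copy,
# which is exactly what the question wants to avoid for huge data.
g = np.asfortranarray(c)
print(np.shares_memory(g, c))
```

For a multi-gigabyte matrix, a conversion like the last one doubles the memory use, so it is better to get the order right at read time.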

Example with the default C (row-major) order:

 >>> a = np.arange(12).reshape(3,4) # <- C order by default
 >>> a
 array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
 >>> a[1]
 array([4, 5, 6, 7])
 >>> a.strides
 (32, 8)

Indexing using Fortran (column-major) order:

 >>> a = np.arange(12).reshape(3,4, order='F')
 >>> a
 array([[ 0,  3,  6,  9],
        [ 1,  4,  7, 10],
        [ 2,  5,  8, 11]])
 >>> a[1]
 array([ 1,  4,  7, 10])
 >>> a.strides
 (8, 24)

Another view

In addition, you can always get a different view using the array attribute T (the transpose):

 >>> a = np.arange(12).reshape(3,4, order='C')
 >>> a.T
 array([[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]])
 >>> a = np.arange(12).reshape(3,4, order='F')
 >>> a.T
 array([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]])

You can also set the strides manually:

 >>> a = np.arange(12).reshape(3,4, order='C')
 >>> a
 array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
 >>> a.strides
 (32, 8)
 >>> a.strides = (8, 24)
 >>> a
 array([[ 0,  3,  6,  9],
        [ 1,  4,  7, 10],
        [ 2,  5,  8, 11]])
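The stride values shown above assume 8-byte integers, which is not the case on every platform. As a rough sketch, the same reinterpretation can be done without modifying the original array by using as_strided from numpy.lib.stride_tricks and deriving the strides from itemsize (the names item and b are only illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12).reshape(3, 4)   # C order
item = a.itemsize                 # bytes per element (platform dependent)

# Column-major strides for a 3x4 array: one element between rows,
# three elements (one full column) between columns.
b = as_strided(a, shape=(3, 4), strides=(item, 3 * item))
print(b)
print(np.shares_memory(a, b))     # b is a view on the same buffer, not a copy
```

as_strided performs no bounds checking, so wrong shape or stride values can read memory outside the buffer; it is best treated as a low-level tool.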

peterhil Jul 27 '12 at 1:39

Sven Marnach · Accepted Answer · 2010-12-06T12:28:27+0000

NumPy provides fromfile() for reading binary data.

 a = numpy.fromfile("filename", dtype=numpy.float32)

will create a one-dimensional array containing your data. To access it as a two-dimensional, Fortran-ordered n x m matrix, you can reshape it:

 a = a.reshape((n, m), order="F")

[EDIT: reshape() actually copies the data in this case (see comments). To do this without copying, use

 a = a.reshape((m, n)).T

Thanks to Joe Kington for this.]
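Putting the pieces together, here is a small self-contained sketch. It assumes the file stores the matrix column by column (the order Fortran code would write it in); the file name matrix.bin and the tiny dimensions are placeholders. It also shows np.memmap as a possible alternative that is not part of the answer above: it maps the file instead of reading it all at once.

```python
import os
import tempfile
import numpy as np

n, m = 3, 4  # small stand-in dimensions; the real file would be far larger

# Build a test file laid out column by column (assumed on-disk format).
expected = np.arange(n * m, dtype=np.float32).reshape(n, m)
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")  # hypothetical file name
expected.ravel(order="F").tofile(path)

# Option 1: read everything into memory, then reinterpret without copying.
flat = np.fromfile(path, dtype=np.float32)
a = flat.reshape((m, n)).T           # n x m view on the same buffer
print(a.flags['F_CONTIGUOUS'], np.shares_memory(a, flat))

# Option 2: memory-map the file; data is paged in only when accessed.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n, m), order="F")
print(np.array_equal(mm, a))
```

Both variants give an n x m array whose element [i, j] sits at position j*n + i in the file, so the file is exactly 4*n*m bytes with no header.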

But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both tools have FAQ entries comparing the tool with the other one. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of the two projects are slightly different).

HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I don't actually know which language you want to use to write the data ...
