How can numpy be much faster than my Fortran program?

I get an array of 512^3 values representing the temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that is about 1/2 GB in size. I need to know the minimum, maximum, and mean of this array, and since I will soon need to understand the Fortran code anyway, I decided to give it a go and came up with the following very simple routine.

 integer gridsize,unit,j
 real mini,maxi
 double precision mean

 gridsize=512
 unit=40

 open(unit=unit,file='T.out',status='old',access='stream',&
      form='unformatted',action='read')
 read(unit=unit) tmp
 mini=tmp
 maxi=tmp
 mean=tmp
 do j=2,gridsize**3
   read(unit=unit) tmp
   if(tmp>maxi)then
     maxi=tmp
   elseif(tmp<mini)then
     mini=tmp
   end if
   mean=mean+tmp
 end do
 mean=mean/gridsize**3
 close(unit=unit)

It takes about 25 seconds per file on the machine I use. This seemed pretty long to me, so I went ahead and did the following in Python:

  import numpy

  mmap = numpy.memmap('T.out', dtype='float32', mode='r', offset=4,
                      shape=(512, 512, 512), order='F')
  mini = numpy.amin(mmap)
  maxi = numpy.amax(mmap)
  mean = numpy.mean(mmap)

Now, I expected this to be faster, but I was really blown away: it needs less than a second. The mean deviates from the one my Fortran routine gives (which I also ran with 128-bit floats, so I somewhat trust it more), but only in the 7th significant digit or so.

How can numpy be so fast? I mean, you have to look at every entry of the array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?

EDIT:

To answer questions in the comments:

  • Yes, I also ran the Fortran routine with 32-bit and 64-bit floats, but that did not affect performance.
  • I used iso_fortran_env, which provides 128-bit floats (a minimal declaration sketch follows this list).
  • Using 32-bit floats, the computed mean is quite far off, so accuracy really is an issue.
  • I ran both routines on different files and in a different order, so the caching should have been fair in the comparison, I think?
  • I actually tried OpenMP, reading from the file at different positions at the same time. Having read your comments and answers, that sounds really stupid now, and it made the routine take a lot longer as well. I might give it a try on the array operations, but maybe that won't even be necessary.
  • The files are actually 1/2 GB in size; that was a typo, thanks.
  • I will now try the array implementation.
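
For reference, a minimal sketch of how such a 128-bit accumulator can be declared through iso_fortran_env (assuming the compiler actually provides a REAL128 kind; the variable name mean_q is only illustrative):

 ...
 use iso_fortran_env, only: real128
 real(real128) :: mean_q               ! quad-precision accumulator
 ...
 mean_q = mean_q + tmp                 ! accumulate inside the read loop
 ...
 mean_q = mean_q/real(gridsize,real128)**3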

EDIT 2:

I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy, but now I ran into the accuracy problem that @Luaan pointed out I might get. Using a 32-bit float array, the mean computed by sum is about 20% off. Doing

 ...
 real, allocatable :: tmp (:,:,:)
 double precision, allocatable :: tmp2(:,:,:)
 ...
 tmp2 = tmp
 mean = sum(tmp2)/size(tmp)
 ...

solves the issue, but increases the computational time (not by much, but noticeably). Is there a better way around this? I could not find a way to read singles from the file directly into doubles. And how does numpy avoid this?

Thanks for all the help so far.

+76
performance python arrays numpy fortran
Nov 15 '15 at 19:10
2 answers

There are two main drawbacks to your Fortran implementation:

  • You mix IO and computations (and read from the file entry by entry).
  • You do not use vector/matrix operations.

This implementation performs the same operation as yours and runs 20 times faster on my machine:

 program test
   integer gridsize,unit
   real mini,maxi,mean
   real, allocatable :: tmp (:,:,:)

   gridsize=512
   unit=40

   allocate( tmp(gridsize, gridsize, gridsize))

   open(unit=unit,file='T.out',status='old',access='stream',&
        form='unformatted',action='read')
   read(unit=unit) tmp
   close(unit=unit)

   mini = minval(tmp)
   maxi = maxval(tmp)
   mean = sum(tmp)/gridsize**3

   print *, mini, maxi, mean
 end program

The idea is to read the whole file into one array tmp in one go. Then I can use the intrinsic functions MAXVAL, MINVAL, and SUM directly on the array.




For the accuracy issue: simply using double precision values and doing the conversion on the fly as

 mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0)) 

only slightly increases the calculation time. I tried doing the operation element-wise and in slices, but that only increased the required time at the default optimization level.

With -O3, the element-wise addition performs about 3% better than the array operation. The difference between double and single precision operations is less than 2% on my machine, on average (the individual runs deviate by far more).
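
If the double precision accumulation is to be avoided entirely, compensated (Kahan) summation is another option: the running sum stays in single precision and the rounding error is carried in a separate correction term. A minimal sketch, assuming tmp and gridsize as in the program above; it is not part of the timings reported here, and aggressive optimization flags that reorder floating-point arithmetic (e.g. -ffast-math) can defeat the correction:

 real :: s, c, y, t
 integer :: i, j, k

 s = 0.0
 c = 0.0
 do k = 1, gridsize
   do j = 1, gridsize
     do i = 1, gridsize
       y = tmp(i,j,k) - c     ! apply the carried correction
       t = s + y              ! new (rounded) partial sum
       c = (t - s) - y        ! recover the rounding error just made
       s = t
     end do
   end do
 end do
 mean = s/gridsize**3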




Here is a very fast implementation using LAPACK:

 program test
   integer gridsize,unit, i, j
   real mini,maxi
   integer :: t1, t2, rate
   real, allocatable :: tmp (:,:,:)
   real, allocatable :: work(:)
!  double precision :: mean
   real :: mean
   real :: slange

   call system_clock(count_rate=rate)
   call system_clock(t1)

   gridsize=512
   unit=40

   allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

   open(unit=unit,file='T.out',status='old',access='stream',&
        form='unformatted',action='read')
   read(unit=unit) tmp
   close(unit=unit)

   mini = minval(tmp)
   maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
   mean = 0.d0
   do j=1,gridsize
     do i=1,gridsize
       mean = mean + slange('1', gridsize, 1, tmp(:,i,j), gridsize, work)
     enddo !i
   enddo !j
   mean = mean / gridsize**3

   print *, mini, maxi, mean
   call system_clock(t2)
   print *, real(t2-t1)/real(rate)
 end program

This uses the single-precision matrix 1-norm SLANGE on the matrix columns. The run-time is even faster than the approach with the single-precision intrinsics, and the precision issue is not present.
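
Note that SLANGE is an external LAPACK routine rather than an intrinsic: the real :: slange declaration above supplies its result type, and the program has to be linked against a LAPACK library (with GNU tooling that would be something like gfortran test.f90 -llapack; the compiler name and link flag here are assumptions about the toolchain).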

+108
Nov 15 '15 at 20:07

Your numpy is faster because you wrote much more efficient code in Python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

Look at your Python code. You load the entire array at once, and then call functions that can work with the array.

Look at your Fortran code. You read one value at a time and do some branching logic with it.

Most of your disparity comes from the fragmented IO you have written in Fortran.

You can write your Fortran in much the same way that you wrote the Python, and you will find that it runs much faster that way.

 program test
   implicit none
   integer :: gridsize, unit
   real :: mini, maxi, mean
   real, allocatable :: array(:,:,:)

   gridsize=512
   allocate(array(gridsize,gridsize,gridsize))
   unit=40

   open(unit=unit, file='T.out', status='old', access='stream',&
        form='unformatted', action='read')
   read(unit) array
   maxi = maxval(array)
   mini = minval(array)
   mean = sum(array)/size(array)
   close(unit)
 end program test
+54
Nov 15 '15 at 20:18


