Fortran, Numpy, Cython, and Numexpr Performance Comparison

I have the following function:

    def get_denom(n_comp, qs, x, cp, cs):
        '''
        len(n_comp) = 1        # number of proteins
        len(cp) = n_comp       # protein concentration
        len(qs) = n_comp       # protein capacity
        len(x) = 3*n_comp + 1  # fit parameters
        len(cs) = 1
        '''
        k = x[0:n_comp]
        sigma = x[n_comp:2*n_comp]
        z = x[2*n_comp:3*n_comp]

        a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
        denom = np.sum(a) + cs
        return denom

I compare it with a Fortran implementation (my first Fortran function ever):

    subroutine get_denom (qs,x,cp,cs,n_comp,denom)
        ! Calculates the denominator in the SMA model (Brooks and Cramer 1992)
        ! The function is called at a specific salt concentration and isotherm point
        ! It loops over the number of components
        implicit none

        ! declaration of input variables
        integer, intent(in) :: n_comp                               ! number of components
        double precision, intent(in) :: cs,qs                       ! salt concentration, free ligand concentration
        double precision, dimension(n_comp), intent(in) :: cp       ! protein concentration
        double precision, dimension(3*n_comp + 1), intent(in) :: x  ! parameters

        ! declaration of local variables
        double precision, dimension(n_comp) :: k,sigma,z
        double precision :: a
        integer :: i

        ! declaration of output variables
        double precision, intent(out) :: denom

        k = x(1:n_comp)                ! equilibrium constant
        sigma = x(n_comp+1:2*n_comp)   ! steric hindrance factor
        z = x(2*n_comp+1:3*n_comp)     ! charge of protein

        a = 0.
        do i = 1,n_comp
            a = a + (sigma(i) + z(i))*(k(i)*(qs/cs)**(z(i)-1.))*cp(i)
        end do

        denom = a + cs

    end subroutine get_denom

I compiled the .f95 file using:

1) f2py -c -m get_denom get_denom.f95 --fcompiler=gfortran

2) f2py -c -m get_denom_vec get_denom.f95 --fcompiler=gfortran --f90flags='-msse2' (The last option should include auto-vectorization)

I test the functions as follows:

    import numpy as np
    import get_denom as fort_denom
    import get_denom_vec as fort_denom_vec
    from matplotlib import pyplot as plt
    %matplotlib inline

    def get_denom(n_comp, qs, x, cp, cs):
        k = x[0:n_comp]
        sigma = x[n_comp:2*n_comp]
        z = x[2*n_comp:3*n_comp]
        # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
        a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
        denom = np.sum(a) + cs
        return denom

    n_comp = 100
    cp = np.tile(1.243, n_comp)
    cs = 100.
    qs = np.tile(1100., n_comp)
    x = np.random.rand(3*n_comp + 1)
    denom = np.empty(1)

    %timeit get_denom(n_comp, qs, x, cp, cs)
    %timeit fort_denom.get_denom(qs, x, cp, cs, n_comp)
    %timeit fort_denom_vec.get_denom(qs, x, cp, cs, n_comp)

I added the following Cython code:

    import cython
    # import both numpy and the Cython declarations for numpy
    import numpy as np
    cimport numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def get_denom(int n_comp,
                  np.ndarray[double, ndim=1, mode='c'] qs,
                  np.ndarray[double, ndim=1, mode='c'] x,
                  np.ndarray[double, ndim=1, mode='c'] cp,
                  double cs):

        cdef int i
        cdef double a
        cdef double denom

        cdef double[:] k = x[0:n_comp]
        cdef double[:] sigma = x[n_comp:2*n_comp]
        cdef double[:] z = x[2*n_comp:3*n_comp]

        # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
        a = 0.
        for i in range(n_comp):
            #a += (sigma[i] + z[i])*( pow( k[i]*(qs[i]/cs), (z[i]-1) ) )*cp[i]
            a += (sigma[i] + z[i])*( k[i]*(qs[i]/cs)**(z[i]-1) )*cp[i]

        denom = a + cs
        return denom

EDIT:

Added Numexpr using a single thread:

    import numexpr as ne

    def get_denom_numexp(n_comp, qs, x, cp, cs):
        k = x[0:n_comp]
        sigma = x[n_comp:2*n_comp]
        z = x[2*n_comp:3*n_comp]
        # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
        a = ne.evaluate('(sigma + z)*( k*(qs/cs)**(z-1) )*cp')
        return cs + np.sum(a)

    ne.set_num_threads(1)  # use just one thread
    %timeit get_denom_numexp(n_comp, qs, x, cp, cs)

Results:

(plot of timing results vs. n_comp)

Why is Fortran speed approaching Numpy with an increase in array size? And how could I speed up Cython? Using pointers?

+7
python numpy fortran cython
4 answers

Sussed it.

OK, we were finally allowed to install Numpy etc. on one of our boxes, and that made it possible to fully explain your original post.

The short answer is that your original question is, in a sense, the "wrong question". In addition, one of the participants spread a fair amount of misinformation, and those errors and fabrications deserve attention so that no one is misled by taking them at face value.

I have also decided to post this as a separate answer, rather than editing my April 14th answer, for the reasons given below and for propriety.

Part A: Response to OP

First, to the question in the original post: recall that I could previously only comment on the Fortran side, since our rules are strict about what software may be installed where on our machines, and we did not have Python etc. (until just now). I had also repeatedly noted that the character of your results was interesting in terms of what we might call their curvature, or possibly concavity.

In addition, working exclusively with "relative" results (since you did not post absolute timings, and I did not have Numpy at the time), I pointed out several times that important information might be hidden in them.

That is exactly the case.

First, I wanted to be sure I could reproduce your results. Since we do not normally use Python/f2py, it was not clear which compiler options etc. were implied in your results, so I ran a variety of tests to make sure we were comparing apples to apples (my April 14th answer showed that Debug vs Release/O2 matters a great deal).

Figure 1 shows my timeit results for just three cases: your original Python/Numpy routine (call it P; I simply cut/pasted your original), your original Do-loop-based Fortran s/r as posted (call it FDo, again copied/pasted as before), and one of the variants I suggested earlier, relying on array sections and therefore requiring Sum() (call it FSAS, created by editing the original FDo). Figure 1 shows the absolute timings via timeit.

(Figure 1: absolute timings)

Figure 2 shows the relative results, following your strategy of normalizing by the Python/Numpy (P) timings. Only the two (relative) Fortran variants are shown.

(Figure 2: relative timings)

These clearly reproduce the character of your original plot, so we can be confident that my tests are consistent with yours.

Now your original question: "Why is Fortran speed approaching Numpy with an increase in the size of arrays?"

It is actually not. It is a pure artifact/distortion of relying solely on "relative" timings, which can give that impression.

Looking at Figure 1, with absolute timings, it is clear that Numpy and Fortran diverge. So in fact Fortran's results are moving away from Numpy's, or vice versa if you like.

The better question, and one raised repeatedly in my earlier comments, is why these curves bend upward in the first place (compared with, say, linear behaviour). My earlier Fortran-only results showed a "mostly" flat relative performance ratio (the yellow lines in my April 14th chart/answer), so I speculated that something interesting was happening on the Python side and suggested checking it.

One way to show this is with a different relative measure. I divided each (absolute) series by its own largest value (i.e. at n_comp = 10k) to see how this "internal relative" performance unfolds (the 10k values being the timings at n_comp = 10,000). Figure 3 shows these results for P, FDo, and FSAS as P/P10k, FDo/FDo10k, and FSAS/FSAS10k. For clarity, the y-axis is on a logarithmic scale. It is clear that the Fortran variants do relatively much better as n_comp decreases, compared with the P results (e.g. the annotated red circle).

(Figure 3: internal relative timings, log y-axis)

Put another way, Fortran becomes very much (and non-linearly) more efficient as the array size decreases. It is not entirely clear why Python does so much worse as n_comp decreases... it may be down to internal overheads/setup etc., as well as differences between interpreters and compilers, etc.

So it is not that Fortran is "catching up" with Python; quite the contrary, it keeps pulling further away from Python (see Figure 1). However, the different non-linearities of Python and Fortran as n_comp decreases produce "relative" timings with an apparently counter-intuitive and non-linear character.

Thus, as n_comp increases and each method "settles" into a more or less linear regime, the curves straighten out, their differences grow linearly, and the relative ratios converge towards an approximate constant (ignoring memory issues, SMP effects, etc.)... this is easier to see if n_comp is allowed to go beyond 10k, but the yellow line in the April 14th answer already shows this for the Fortran-only s/r's.

As an aside, I prefer to write my own timing routines/functions. timeit is convenient, but a lot goes on inside that black box. Setting up your own loops and structures, and knowing the properties/resolution of your timing functions, is important for a proper assessment.
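For instance, a minimal hand-rolled timing harness in Fortran might look like the sketch below. It is purely illustrative: it assumes the get_denom s/r from the original post is compiled and linked alongside it, and the problem size and repetition count are arbitrary.

    program time_denom
        ! Minimal sketch of a hand-rolled timing loop using system_clock.
        implicit none
        integer, parameter :: n_comp = 1000, n_rep = 1000
        double precision :: x(3*n_comp+1), cp(n_comp), cs, qs, denom
        integer(kind=8) :: t0, t1, rate
        integer :: r

        call random_number(x)
        cp = 1.243d0
        cs = 100.d0
        qs = 1100.d0

        call system_clock(t0, rate)
        do r = 1, n_rep
            call get_denom(qs, x, cp, cs, n_comp, denom)
        end do
        call system_clock(t1)

        print *, 'seconds per call:', dble(t1 - t0)/dble(rate)/dble(n_rep)
    end program time_denom

Knowing the clock resolution (count_rate) and controlling the repetition count yourself is precisely the point being made above.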

+3

Having been called out in another answer, I have to respond.

I know this does not really answer the original question, but the original poster encouraged pursuing this direction in the comments.

My points are as follows:

1. I do not think the array intrinsics are better optimized. If you are lucky, they translate into the same loop code as manual loops. If not, performance problems can arise, so one must be careful: temporary arrays may be created.

I translated the proposed SAS array expression into a regular do loop; call it DOS. I demonstrate that the DO loop is no slower; both routines compile to more or less the same code in this case.

    qsDcs = qs/cs

    denom = 0
    do j = 1, n_comp
        denom = denom + (x(n_comp+j) + x(2*n_comp+j)) * (x(j)*(qsDcs)**(x(2*n_comp+j)-1))*cp(j)
    end do
    denom = denom + cs

It is important to say that I do not think this version is any less readable just because it is a line or two longer. It is actually quite easy to see what is going on here.

Now the timings for these:

    f2py -c -m sas sas.f90 --opt='-Ofast'
    f2py -c -m dos dos.f90 --opt='-Ofast'

    In [24]: %timeit test_sas(10000)
    1000 loops, best of 3: 796 µs per loop

    In [25]: %timeit test_sas(10000)
    1000 loops, best of 3: 793 µs per loop

    In [26]: %timeit test_dos(10000)
    1000 loops, best of 3: 795 µs per loop

    In [27]: %timeit test_dos(10000)
    1000 loops, best of 3: 797 µs per loop

They are the same. Array intrinsics and array expression arithmetic carry no hidden performance magic; in this case they simply translate into loops under the hood.

If you inspect the generated GIMPLE code (e.g. with gfortran's -fdump-tree-optimized), both SAS and DOS are translated into the same main block of optimized code; no magic library version of SUM() is called here:

    <bb 8>:
      # val.8_59 = PHI <val.8_49(9), 0.0(7)>
      # ivtmp.18_123 = PHI <ivtmp.18_122(9), 0(7)>
      # ivtmp.25_121 = PHI <ivtmp.25_120(9), ivtmp.25_117(7)>
      # ivtmp.28_116 = PHI <ivtmp.28_115(9), ivtmp.28_112(7)>
      _111 = (void *) ivtmp.25_121;
      _32 = MEM[base: _111, index: _106, step: 8, offset: 0B];
      _36 = MEM[base: _111, index: _99, step: 8, offset: 0B];
      _37 = _36 + _32;
      _40 = MEM[base: _111, offset: 0B];
      _41 = _36 - 1.0e+0;
      _42 = __builtin_pow (qsdcs_18, _41);
      _97 = (void *) ivtmp.28_116;
      _47 = MEM[base: _97, offset: 0B];
      _43 = _40 * _47;
      _44 = _43 * _42;
      _48 = _44 * _37;
      val.8_49 = val.8_59 + _48;
      ivtmp.18_122 = ivtmp.18_123 + 1;
      ivtmp.25_120 = ivtmp.25_121 + _118;
      ivtmp.28_115 = ivtmp.28_116 + _113;
      if (ivtmp.18_122 == _96)
        goto <bb 10>;
      else
        goto <bb 9>;

    <bb 9>:
      goto <bb 8>;

    <bb 10>:
      # val.8_13 = PHI <val.8_49(8), 0.0(6)>
      _51 = val.8_13 + _17;
      *denom_52(D) = _51;

The code is functionally identical to the do-loop version; only the variable names differ.


2. It was suggested that assumed-shape array(:) arguments could slow things down. Whereas an argument received as assumed-size (*) or with explicit size (n) is always simply contiguous, an assumed-shape argument theoretically need not be, and the compiler must be prepared for that. Therefore I always recommend the contiguous attribute for your assumed-shape arguments wherever you know you will be calling them with contiguous arrays.

In the other answer, the change was largely pointless anyway, since it did not exploit the real advantage of assumed-shape arguments: namely, that you no longer need to pass the array dimensions as arguments and can use intrinsics such as size() and shape() instead. A sketch combining both points follows.
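Purely as a hedged illustration of those two points (this is not code from either answer; the routine name get_denom_contig is made up for the example), an assumed-shape variant with the contiguous attribute and size() might look like this:

    subroutine get_denom_contig (qs, x, cp, cs, denom)
        ! Illustrative sketch only: assumed-shape, contiguous dummy arguments
        ! (Fortran 2008), with n_comp recovered via size() instead of being passed.
        implicit none
        double precision, intent(in) :: qs, cs
        double precision, intent(in), contiguous :: x(:), cp(:)
        double precision, intent(out) :: denom
        integer :: n_comp

        n_comp = size(cp)
        denom = Sum( (x(n_comp+1:2*n_comp) + x(2*n_comp+1:3*n_comp)) &
                   * (x(1:n_comp)*(qs/cs)**(x(2*n_comp+1:3*n_comp)-1)) * cp ) + cs
    end subroutine get_denom_contig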


Here are the results of a comprehensive comparison. I made it as fair as I could. The Fortran files are compiled with -Ofast, as shown above:

    import numpy as np
    import dos as dos
    import sas as sas
    from matplotlib import pyplot as plt
    import timeit
    import numexpr as ne
    #%matplotlib inline

    ne.set_num_threads(1)

    def test_n(n_comp):
        cp = np.tile(1.243, n_comp)
        cs = 100.
        qs = np.tile(1100., n_comp)
        x = np.random.rand(3*n_comp + 1)

        def test_dos():
            denom = np.empty(1)
            dos.get_denomsas(qs, x, cp, cs, n_comp)

        def test_sas():
            denom = np.empty(1)
            sas.get_denomsas(qs, x, cp, cs, n_comp)

        def get_denom():
            k = x[0:n_comp]
            sigma = x[n_comp:2*n_comp]
            z = x[2*n_comp:3*n_comp]
            # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
            a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
            denom = np.sum(a) + cs
            return denom

        def get_denom_numexp():
            k = x[0:n_comp]
            sigma = x[n_comp:2*n_comp]
            z = x[2*n_comp:3*n_comp]
            loc_cp = cp
            loc_cs = cs
            loc_qs = qs
            # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
            a = ne.evaluate('(sigma + z)*( k*(loc_qs/loc_cs)**(z-1) )*loc_cp')
            return cs + np.sum(a)

        print 'py', timeit.Timer(get_denom).timeit(1000000/n_comp)
        print 'dos', timeit.Timer(test_dos).timeit(1000000/n_comp)
        print 'sas', timeit.Timer(test_sas).timeit(1000000/n_comp)
        print 'ne', timeit.Timer(get_denom_numexp).timeit(1000000/n_comp)

    def test():
        for n in [10, 100, 1000, 10000, 100000, 1000000]:
            print "-----"
            print n
            test_n(n)

Results:

    n         py             dos            sas            numexpr
    10        11.2188110352  1.8704519272   1.8659651279   28.6881871223
    100       1.6688809395   0.6675260067   0.667083025    3.4943861961
    1000      0.7014708519   0.5406000614   0.5441288948   0.9069931507
    10000     0.5825948715   0.5269498825   0.5309231281   0.6178650856
    100000    0.5736029148   0.526198864    0.5304090977   0.5886831284
    1000000   0.6355218887   0.5294830799   0.5366530418   0.5983200073
    10000000  0.7903120518   0.5301260948   0.5367569923   0.6030929089

(speed comparison plot)

You can see there is very little difference between the two Fortran versions. The array syntax is marginally slower, but nothing worth talking about.

Conclusion 1: In this comparison the overhead should be fair to everyone, and you can see that for ideal vector lengths Numpy and Numexpr CAN come close to Fortran's performance, but when the vector is too small (or perhaps too large), the overhead of the Python solutions prevails.

Conclusion 2: The apparently higher performance of the SAS version in the other comparison comes from comparing it against the OP's original version, which is not equivalent. The equivalent, optimized DO-loop version is included in my answer above.

+2

Further to my previous answer and Vladimir's speculation, I set up two s/r's: one as originally given, and one using array sections and Sum(). I also wanted to test the suggestion that the optimization of the Do loop is weak.

A point I usually make about benchmarking is that the n_comp size in the example above is too small. In the results below, each of the "original" and the "better" SumArraySection (SAS) variants is run in loops of 1000 repetitions inside the timing calls, so the results are for 1000 evaluations of each s/r. If your timings are fractions of a second, they are probably unreliable.

There are several variants worth considering, none of which requires explicit pointers. One variant, used for these illustrations, is:

    subroutine get_denomSAS (qs,x,cp,cs,n_comp,denom)
        ! Calculates the denominator in the SMA model (Brooks and Cramer 1992)
        ! The function is called at a specific salt concentration and isotherm point
        ! It loops over the number of components
        implicit none

        ! declaration of input variables
        integer, intent(in) :: n_comp            ! number of components
        double precision, intent(in) :: cs,qs    ! salt concentration, free ligand concentration
        double precision, intent(in) :: cp(:)    ! protein concentration
        double precision, intent(in) :: x(:)     ! parameters

        ! declaration of local variables
        integer :: i
        double precision :: qsDcs

        ! declaration of output variables
        double precision, intent(out) :: denom

        qsDcs = qs/cs

        denom = Sum( (x(n_comp+1:2*n_comp) + x(2*n_comp+1:3*n_comp))*(x(1:n_comp)*(qsDcs) &
                **(x(2*n_comp+1:3*n_comp)-1))*cp(1:n_comp) ) + cs

    end subroutine get_denomSAS

The key differences are:

a) the arrays are passed as assumed-shape (:) arguments;
b) there are no array assignments inside the s/r; array sections (in effect, "built-in pointers") are used instead;
c) Sum() is used instead of a Do loop.

I then tried two different compiler optimization levels to demonstrate their effect.

As the two charts show, the original code (blue diamonds) is much slower than SAS (red squares) at low optimization. SAS is still better at high optimization, but the gap narrows. This is partly because Sum() is already "optimized" even at low compiler optimization.

(charts of timings at low and high compiler optimization)

The yellow lines show the ratio between the two s/r timings. Ignore the yellow-line value at "0" in the upper chart (n_comp too small, so one of the timings is unstable).

Since I do not have Numpy to hand, I can only assert that the SAS curve on the OP's chart should lie below his current Fortran results, and may be flatter or even trend downward.

In other words, the divergence observed in the original post may not really exist, or at least not to that extent.

... although further experiments would be useful to demonstrate the other comments/answers already given.

Dear Moritz: oops, I forgot to address your question about pointers. As stated earlier, the key reason for the improvement with the SAS variant is that it makes better use of "effective pointers", in the sense that it avoids reassigning the x() array into three new local arrays (i.e. since x is passed by reference, using array sections is a kind of pointer approach built into Fortran, so there is no need for explicit pointers), but it then requires Sum() or Dot_Product() or the like.

Alternatively, you can keep the Do loop and achieve something similar by changing x either into an n_comp x 3 2D array, or by passing three explicit 1D arrays of length n_comp directly. That decision will most likely be driven by the size and complexity of your code, since it requires rewriting the call/s-r statements etc. everywhere else x() is used. Some of our projects are > 300,000 lines of code, so in such cases it is much cheaper to change the code locally, e.g. to SAS, etc. A sketch of the 2D-array alternative is given below.
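As a rough sketch of that 2D-array alternative (illustrative only; the caller would have to supply x reshaped into an n_comp x 3 array with columns k, sigma, z, and the name get_denom2D is made up):

    subroutine get_denom2D (qs, xm, cp, cs, n_comp, denom)
        ! Illustrative sketch: x reshaped by the caller into xm(n_comp,3),
        ! columns = k, sigma, z, so the Do loop reads whole columns
        ! rather than offset slices of a single 1D x().
        implicit none
        integer, intent(in) :: n_comp
        double precision, intent(in) :: qs, cs
        double precision, intent(in) :: xm(n_comp,3)
        double precision, intent(in) :: cp(n_comp)
        double precision, intent(out) :: denom
        integer :: i
        double precision :: qsDcs

        qsDcs = qs/cs
        denom = 0.d0
        do i = 1, n_comp
            denom = denom + (xm(i,2) + xm(i,3))*(xm(i,1)*qsDcs**(xm(i,3)-1.d0))*cp(i)
        end do
        denom = denom + cs
    end subroutine get_denom2D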

I am still waiting for permission to install Numpy on one of our boxes. As noted earlier, I am curious why your relative timings suggest that Numpy improves as n_comp increases...

Of course, the comments about "proper" benchmarking etc., and the question of which compiler switches are implied in your use of f2py, still apply, since they can materially change the character of your results.

I would be interested to see your results if they were updated for these permutations.

0

There is not enough information in the post to be sure, but some of the following may help:

1) Fortran has optimized intrinsic functions such as Sum() and Dot_Product(), which you can use instead of a Do loop for summation etc.

In some cases (not necessarily here), it may be "better" to use ForAll or the like to create the intermediate arrays to be summed, and then apply the summation to those arrays.

However, Fortran allows array sections, so you do not need to create the automatic/intermediate arrays sigma, k and z at all, with their associated overhead. Instead it could be something like:

    n_compP1 = n_comp+1
    n_compT2 = n_comp*2
    a = Sum( x(1:n_comp) + 2*x(n_compP1:n_compT2) )   ! ... just for example

2) Sometimes (depending on the compiler, machine, etc.) there can be memory-striding effects if the array sizes do not fall on certain "binary" boundaries (e.g. 1024 vs 1000), etc.

You may want to repeat your experiments at a few more points on your chart (i.e. at various other values of n_comp), especially near such "boundaries".

3) It is not possible to tell whether you are using full compiler optimization (flags) when compiling the Fortran code. You may want to look into the various -O flags, etc.

4) You might want to add OpenMP directives (or at least enable OpenMP in your compiler flags). This can sometimes alleviate certain overhead issues, even if your loops do not explicitly use OpenMP directives. A hedged sketch follows.
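As an illustration only (not from the original post; the routine name is made up, and whether this pays off depends entirely on n_comp and thread overhead), an OpenMP reduction over the same Do loop might look like this, compiled with -fopenmp (for f2py, roughly --f90flags='-fopenmp' -lgomp):

    subroutine get_denom_omp (qs, x, cp, cs, n_comp, denom)
        ! Illustrative sketch: the same Do loop with an OpenMP sum reduction.
        implicit none
        integer, intent(in) :: n_comp
        double precision, intent(in) :: qs, cs
        double precision, intent(in) :: x(3*n_comp+1), cp(n_comp)
        double precision, intent(out) :: denom
        integer :: i
        double precision :: qsDcs

        qsDcs = qs/cs
        denom = 0.d0
        !$omp parallel do reduction(+:denom)
        do i = 1, n_comp
            denom = denom + (x(n_comp+i) + x(2*n_comp+i))*(x(i)*qsDcs**(x(2*n_comp+i)-1.d0))*cp(i)
        end do
        !$omp end parallel do
        denom = denom + cs
    end subroutine get_denom_omp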

5) In general (this probably applies to each of your methods where loops are used):

a) loop-invariant operations in the summation formula can be moved outside the loop, e.g. create something like qsDcs = qs/cs and use qsDcs inside the loop.

b) Similarly, it is sometimes useful to create something like zM1(:) = z(:) - 1 and use zM1(:) inside the loop instead.

-2
