DO loop performance degraded by potential modification of associated variables

In the following code, we pass two arrays to a subroutine and perform some computations inside DO loops. We consider three cases in which different statements appear in the loop body: Case 1 = no extra statement, Cases 2 and 3 = an assignment to a pointer variable.

!------------------------------------------------------------------------
module mymod
    implicit none
    integer, pointer :: n_mod
    integer :: nloop
contains
    !.........................................................
    subroutine test_2dim ( a, b, n )
        integer :: n
        real :: a(n,n), b(n,n)
        integer, pointer :: n_ptr
        integer i1, i2, iloop

        n_ptr => n_mod

        do iloop = 1, nloop
        do i2 = 1, n
        do i1 = 1, n
            b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop

            !(nothing here)   !! Case 1 : gfort => 3.6 sec, ifort => 3.2 sec
            ! n_ptr = n       !! Case 2 : gfort => 15.9 sec, ifort => 6.2 sec
            ! n_ptr = n_mod   !! Case 3 : gfort => 3.6 sec, ifort => 3.5 sec
        enddo
        enddo
        enddo
    endsubroutine
endmodule

!------------------------------------------------------------------------
program main
    use mymod
    implicit none
    integer, target :: n
    real, allocatable :: a(:,:), b(:,:)

    nloop = 10000 ; n = 1000
    allocate( a( n, n ), b( n, n ) )
    a = 0.0 ; b = 0.0

    n_mod => n
    call test_2dim ( a, b, n )

    print *, a(n,n), b(n,n)  !! for check
end

Note that the pointer is associated with the upper bound of the DO loops through a module variable (n_mod), so assigning to the pointer inside the loop could, in principle, modify the loop bound. Keep in mind, however, that we never actually change the bound in practice (the assignment merely copies a value). gfortran 4.8 and ifort 14.0 with -O3 gave the timings shown above. It is remarkable that Case 2 is very slow compared to Case 1, even though the net computation appears to be the same. I suspected this could be because the compiler cannot determine whether the upper bound of the innermost loop (over i1) might be changed by the pointer assignment, and so avoids aggressive optimizations. To test this, I timed the following routine in place of test_2dim():

    subroutine test_1dim ( a, b, n )
        integer :: n
        real :: a(n * n), b(n * n)
        integer, pointer :: n_ptr
        integer iloop, i

        n_ptr => n_mod

        do iloop = 1, nloop
        do i = 1, n * n
            b( i ) = a( i ) + b( i ) + iloop

            ! (nothing here)  !! Case 1 : gfort => 3.6 sec, ifort => 2.3 sec
            ! n_ptr = n       !! Case 2 : gfort => 15.9 sec, ifort => 6.0 sec
            ! n_ptr = n_mod   !! Case 3 : gfort => 3.6 sec, ifort => 6.1 sec
        enddo
        enddo
    endsubroutine

The only difference between test_1dim() and test_2dim() is whether arrays a and b are indexed with one or two subscripts (the amount of computation is essentially the same). Surprisingly, Case 2 is again slow, even though there is now only a single inner DO loop. Because a Fortran DO loop determines its upper bound on entry to the loop [Ref], I expected test_1dim() not to be affected by the pointer assignment, but that was not the case. So, is there a reasonable explanation for this behavior? (I hope I am not making some silly mistake that accounts for the timing difference.)


My motivation for this question: I have made heavy use of derived types to hold the bounds of multidimensional loops, e.g.

    module Grid_mod
        type Grid_t
            integer :: N1, N2, N3
        endtype
        ....
        subroutine some_calc ( vector, grid )
            type(Grid_t) :: grid
            ....
            do i3 = 1, grid % N3
            do i2 = 1, grid % N2
            do i1 = 1, grid % N1
                (... various operations ...)
            enddo
            enddo
            enddo

So far, I have not paid much attention to whether Grid_t objects carry the TARGET or POINTER attribute (assuming it has almost no effect on performance). Now, however, I suspect this can lead to poor performance if the compiler cannot determine whether the upper bounds remain constant inside the loops (even though I never change the bounds in real code). I would therefore appreciate any advice on whether I should be more careful about attaching TARGET or POINTER attributes to associated variables (including the derived-type components used as bounds in the grid object above).


Update

Following the suggestion of @francescalus, I tried adding "intent(in), value" to the dummy argument n. The result is as follows:

    test_1dim():  Case 1 : gfort => 3.6 s, ifort => 2.3 s
                  Case 2 : gfort => 3.6 s, ifort => 3.1 s
                  Case 3 : gfort => 3.6 s, ifort => 3.4 s

    test_2dim():  Case 1 : gfort => 3.7 s, ifort => 3.1 s
                  Case 2 : gfort => 3.7 s, ifort => 3.1 s
                  Case 3 : gfort => 3.7 s, ifort => 6.4 s

Although ifort gives a somewhat irregular result (6.4 s) for Case 3 of test_2dim(), all the other cases now essentially match the fast timings. This suggests that the compiler's analysis, rather than the cost of the pointer assignment itself, is what affects performance. Since it seems important to tell the compiler that the bounds are constant, I also tried copying the dummy argument n (this time without "intent(in), value") to a local variable n_ and using that as the loop bounds:

    integer :: n    !! dummy argument
    integer :: n_   !! a local variable
    ...
    n_ = n
    do i2 = 1, n_
    do i1 = 1, n_
        b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop
    ...

The result for test_2dim () is as follows:

    test_2dim():  Case 1 : gfort => 3.6 s, ifort => 3.1 s
                  Case 2 : gfort => 15.9 s, ifort => 6.2 s
                  Case 3 : gfort => 3.7 s, ifort => 6.4 s

Unfortunately (and contrary to my expectation), Case 2 has not improved at all... Although copying n to the local n_ should guarantee that n_ is constant inside the DO loops, the compiler still seems unhappy, presumably because the array shape is still declared with n, not n_, and so it still avoids aggressive optimization (just my hunch).


Update2

Following @innoSPG's suggestion, I also changed n to n_ on the right-hand side of the Case 2 assignment inside the DO loops, and the code then turned out to be as fast as Case 1! Specifically:

    n_ = n
    do i2 = 1, n_
    do i1 = 1, n_
        b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop
        n_ptr = n_   !! Case 2 : gfort => 3.7 sec, ifort => 3.1 sec

But, as the answer suggests, this speed-up may simply be because the assignment is eliminated entirely by the compiler. So I believe I need to examine more realistic (less trivial) code to assess the effect of pointers or pointer components on loop optimization...

(... I apologize for the very long question ...)

1 answer

When optimizing, the compiler spends time analyzing your program so that it can eliminate unnecessary computations and exploit the target architecture. The outcome therefore depends heavily on both the compiler and the architecture.

So my guess is that, for Case 3, the compiler sees that n_ptr and n_mod are associated with exactly the same target and does not even spend time on the assignment. Something similar applies to Case 2 when n is intent(in), value: the compiler can see that the assignment need not be executed inside the loop; it only has to be done once, because n_ptr takes part in no other computation in the routine. I suspect ifort misses this point. In addition, you may be running on Intel hardware, which gives ifort some extra advantage.

In Case 2, when n is not intent(in), value, the compiler must assume that the target can be modified in many other ways, and it has no information at all about which pointers point to it. The pointer dereference in the assignment then adds to the computation time: loading or storing a value through a pointer variable can take roughly twice as long as accessing an ordinary variable. I do not know how pointers are actually implemented in Fortran, so I cannot give firm numbers for the timing factors; it depends heavily on how pointers and targets are implemented.

I have not tried it myself, but I suggest that, in your test with the local variable n_, you also change the right-hand side of the Case 2 assignment to the local n_. I strongly believe you will get the same timing as Case 1, because the compiler can then see that the assignment need not be performed inside the loop.

    n_ = n
    do iloop = 1, nloop
    do i2 = 1, n_
    do i1 = 1, n_
        b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop

        !(nothing here)  !! Case (1)
        !n_ptr = n_      !! Case (2)
        !n_ptr = n_mod   !! Case (3)
    enddo
    enddo
    enddo