Vectorized sum in Fortran

I compile my Fortran code with -mavx and have verified via objdump that some instructions are vectorized, but I am not getting the speed improvement I expected, so I want to make sure the following statement is vectorized (this single statement accounts for ~50% of the runtime).

I know that some operations can be vectorized and others cannot, so I want to make sure this one can be:

sum(A(i1:i2,ir))

Again, this single line takes about 50% of the execution time, since I am doing this over a very large matrix. I can give more background on why I do this, but suffice it to say that it is necessary, although if needed I can restructure the memory layout (for example, I could compute the sum as sum(A(ir,i1:i2)) if that could be vectorized).

Is this line vectorized? How can I tell? And how can I force vectorization if it is not?

EDIT: Thanks to the comments, I now understand that I can check the vectorization of this sum via -ftree-vectorizer-verbose, and I can see that it is not vectorized. I changed the code as follows:

    tsum = 0.0d0
    tn = i2 - i1 + 1
    tvec(1:tn) = A(i1:i2, ir)
    do ii = 1,tn
       tsum = tsum + tvec(ii)
    enddo

and this is vectorized ONLY when -funsafe-math-optimizations is enabled, but then I see another 70% increase in speed thanks to vectorization. The question still remains: why can't sum(A(i1:i2,ir)) be vectorized, and how can I get a simple sum to vectorize?
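For reference, this is roughly how a vectorization report can be requested from gfortran (a sketch only; exact flag names and report format vary by compiler version, and mysum.f90 is a placeholder filename):

```shell
# Older gfortran: -ftree-vectorizer-verbose=2 reports which loops vectorized.
# Newer gfortran (4.9+): -fopt-info-vec / -fopt-info-vec-missed do the same job.
gfortran -O3 -mavx -funsafe-math-optimizations -fopt-info-vec mysum.f90 -o mysum
```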

2 answers

It turns out that the sum cannot be vectorized unless -ffast-math or -funsafe-math-optimizations is enabled.

The two code snippets I played with are the following:

    tsum = 0.0d0
    tvec(1:n) = A(i1:i2, ir)
    do ii = 1,n
       tsum = tsum + tvec(ii)
    enddo

and

 tsum = sum(A(i1:i2,ir)) 

and here are the timings I get when I run the first code fragment with various compilation options:

    10.62 sec ... None
    10.35 sec ... -mtune=native -mavx
     7.44 sec ... -mtune=native -mavx -ffast-math
     7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations

Finally, with the same optimizations, tsum = sum(A(i1:i2,ir)) also vectorizes, giving:

     7.96 sec ... None
     8.41 sec ... -mtune=native -mavx
     5.06 sec ... -mtune=native -mavx -ffast-math
     4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations

Comparing sum with -mtune=native -mavx against -mtune=native -mavx -funsafe-math-optimizations shows a speedup of 70%. (Note that these were each run only once; before publishing we will do real benchmarking over several runs.)

However, I am a bit concerned: my values change slightly when the -f options are used. Without them, the errors for my variables (v1, v2) are:

    v1 ... 5.60663e-15  9.71445e-17  1.05471e-15
    v2 ... 5.11674e-14  1.79301e-14  2.58127e-15

but with the optimizations enabled, the errors are:

    v1 ... 7.11931e-15  5.39846e-15  3.33067e-16
    v2 ... 1.97273e-13  6.98608e-14  2.17742e-14

which indicates that something really is being computed differently.


Your explicit loop version still adds the FP values in a different order than the vectorized version does. The vectorized version uses 4 accumulators, each of which sums every 4th element of the array.

You can write your source to match what the vectorized version will do:

    tsum0 = 0.0d0
    tsum1 = 0.0d0
    tsum2 = 0.0d0
    tsum3 = 0.0d0
    tn = i2 - i1 + 1
    tvec(1:tn) = A(i1:i2, ir)
    do ii = 1,tn,4   ! count by 4
       tsum0 = tsum0 + tvec(ii)
       tsum1 = tsum1 + tvec(ii+1)
       tsum2 = tsum2 + tvec(ii+2)
       tsum3 = tsum3 + tvec(ii+3)
    enddo
    tsum = (tsum0 + tsum1) + (tsum2 + tsum3)

It can be vectorized without -ffast-math .

FP add has multi-cycle latency, but a throughput of one or two per clock, so the asm needs to use multiple vector accumulators to saturate the FP add unit(s). Skylake can do two FP adds per clock, with latency = 4. Previous Intel CPUs do one per clock, with latency = 3. So on Skylake you need 8 vector accumulators to saturate the FP units. And of course they should be 256b vectors, because AVX instructions are just as fast but do twice as much work as SSE vector instructions.

Writing source with 8 * 8 accumulator variables would be ridiculous, so I think you do need -ffast-math, or an OpenMP pragma that tells the compiler that different orders of operations are OK.
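As a hedged sketch of that pragma route (assuming OpenMP 4.0 SIMD support, e.g. gfortran with -fopenmp or -fopenmp-simd; tvec, tn, and tsum are the variables from the snippets above):

```fortran
    ! Sketch: declare the reduction so the compiler may reassociate this one
    ! sum and use multiple vector accumulators, without global -ffast-math.
    tsum = 0.0d0
    !$omp simd reduction(+:tsum)
    do ii = 1, tn
       tsum = tsum + tvec(ii)
    enddo
```

This scopes the "reordering is allowed" permission to a single loop instead of the whole compilation unit.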

Unrolling the source explicitly means you have to handle trip counts that are not a multiple of the vector width times the unroll factor. Giving the compiler guarantees about the trip count can help it avoid generating multiple versions of the loop, or extra loop setup/cleanup code.
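A hedged sketch of that cleanup handling (same tvec/tn as the snippets above; nmain is a name introduced here for illustration):

```fortran
    ! Sketch: unrolled sum with a scalar cleanup loop for tn not divisible by 4.
    nmain = tn - mod(tn, 4)
    tsum0 = 0.0d0; tsum1 = 0.0d0; tsum2 = 0.0d0; tsum3 = 0.0d0
    do ii = 1, nmain, 4
       tsum0 = tsum0 + tvec(ii)
       tsum1 = tsum1 + tvec(ii+1)
       tsum2 = tsum2 + tvec(ii+2)
       tsum3 = tsum3 + tvec(ii+3)
    enddo
    tsum = (tsum0 + tsum1) + (tsum2 + tsum3)
    do ii = nmain+1, tn        ! scalar cleanup for the remaining 0-3 elements
       tsum = tsum + tvec(ii)
    enddo
```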

