GCC auto-vectorization has no effect on runtime, even though it is reported as "profitable"

I have spent the last few days experimenting with auto-vectorization in gcc 4.7. I followed some examples I found online and the setup seems to be correct, but when I actually run the code and compare runtimes with vectorization enabled and disabled, there is no noticeable difference.

Here is the code I worked with:

#include <string.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv)
{
    long b = strtol(argv[2], NULL, 0);
    unsigned long long int i;
    unsigned long long int n = (int)pow(2, 29);
    float total = 0;

    float *__restrict__ x1;
    float *__restrict__ y1;
    posix_memalign((void **)&x1, 16, sizeof(float) * n);
    posix_memalign((void **)&y1, 16, sizeof(float) * n);
    float *__restrict__ x = __builtin_assume_aligned(x1, 16);
    float *__restrict__ y = __builtin_assume_aligned(y1, 16);

    for (i = 0; i < n; i++) {
        x[i] = i;
        y[i] = i;
    }

    for (i = 0; i < n; i++) {   /* the loop I want vectorized */
        y[i] += x[i];
    }

    printf("y[%li]: \t\t\t\t%f\n", b, y[b]);
    printf("correct answer: \t\t\t%f\n", (double)b * 2);  /* cast so the value matches %f */
    return 0;
}

Some of this seems redundant to me, but it is necessary for the compiler to understand what is going on (in particular, that the data is aligned). The variable b, read from the command line, exists only because I was paranoid about the compiler optimizing the loop away entirely.

Here is the compile command with vectorization enabled:

 gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -ftree-vectorize -o v

Basically, this is equivalent to just using -O3. I spelled out the flags myself so that, to check the result without vectorization, all I had to do was drop -ftree-vectorize.
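For reference, the same build should be expressible more compactly as something like the following (an untested sketch, assuming the flags above really do match what -O3 adds on this compiler):

 gcc47 -O3 -ftree-vectorizer-verbose=3 -msse2 test.c -lm -o v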

Here is the output of the vectorizer's verbose flag, showing that the loop really is vectorized:

 Analyzing loop at test.c:29

 29: vect_model_load_cost: aligned.
 29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
 29: vect_model_load_cost: aligned.
 29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
 29: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
 29: vect_model_store_cost: aligned.
 29: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
 29: cost model: Adding cost of checks for loop versioning aliasing.
 29: Cost model analysis:
   Vector inside of loop cost: 4
   Vector outside of loop cost: 4
   Scalar iteration cost: 4
   Scalar outside cost: 1
   prologue iterations: 0
   epilogue iterations: 0
   Calculated minimum iters for profitability: 2
 29:   Profitability threshold = 3

 Vectorizing loop at test.c:29

 29: Profitability threshold is 3 loop iterations.
 29: created 1 versioning for alias checks.
 29: LOOP VECTORIZED.

 Analyzing loop at test.c:24

 24: vect_model_induction_cost: inside_cost = 2, outside_cost = 2 .
 24: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
 24: not vectorized: relevant stmt not supported: D.5806_18 = (float) D.5823_58;

 test.c:7: note: vectorized 1 loops in function.

Note that vectorization becomes profitable after three iterations, and I'm running with 2^29 ≈ 500,000,000 iterations, so I should expect a significantly different runtime with vectorization disabled, right?

Well, here are the execution times (I ran the code 20 times in a row):

 59.082s 79.385s 57.557s 57.264s 53.588s 54.300s 53.645s 69.044s 57.238s 59.366s 56.314s 55.224s 57.308s 57.682s 56.083s 369.590s 59.963s 55.683s 54.979s 62.309s 

Throwing out the strange ~370 s outlier, that gives a mean runtime of 58.7 s with a standard deviation of 6.0 s.

Then I compile with the same command as before, but without the vectorization flag:

 gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -o nov 

Running this version 20 times in a row gives the following timings:

 69.471s 57.134s 56.240s 57.040s 55.787s 56.530s 60.010s 60.187s 324.227s 56.377s 55.337s 54.110s 56.164s 59.919s 493.468s 63.876s 57.389s 55.553s 54.908s 56.828s 

Throwing out the outliers again, this gives a mean runtime of 57.9 s with a standard deviation of 3.6 s.

So the two versions have statistically indistinguishable runtimes.

Can someone tell me what I am doing wrong? Does the compiler's "profitability threshold" not mean what I think it means? I appreciate any help; I've been trying to figure this out for the last week.

EDIT

I implemented the change that @nilspipenbrinck suggested and it seems to have worked. I moved the vectorized loop into a function and called that function a large number of times. The runtimes are now 24.0 s (sigma < 0.1 s) without vectorization versus 20.8 s (sigma < 0.2 s) with vectorization, a 13% speedup. Not as much as I'd hoped, but at least now I know it's working! Thank you for taking the time to look at my question and write an answer, I'm very grateful.
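For anyone curious, here is a minimal sketch of what such a refactored benchmark might look like; it is my guess at the change described above, not the original code. The add loop lives in its own function that is called many times over the same small, cache-resident arrays (the function name, array size, and repetition count are my own choices). It would be compiled with the same flags as before.

 /* Sketch only: loop moved into a function, called repeatedly over hot data. */
 #include <stdlib.h>
 #include <stdio.h>

 static void add_arrays(float *__restrict__ y1, const float *__restrict__ x1,
                        size_t n)
 {
     size_t i;
     float *y = __builtin_assume_aligned(y1, 16);
     const float *x = __builtin_assume_aligned(x1, 16);
     for (i = 0; i < n; i++)
         y[i] += x[i];                 /* the loop that gets vectorized */
 }

 int main(void)
 {
     const size_t n = 4096;            /* two arrays of 4096 floats = 32 KB */
     const long reps = 1L << 17;       /* call the function many times */
     size_t i;
     long r;
     float *x, *y;

     posix_memalign((void **)&x, 16, n * sizeof(float));
     posix_memalign((void **)&y, 16, n * sizeof(float));
     for (i = 0; i < n; i++) { x[i] = (float)i; y[i] = 0.0f; }

     for (r = 0; r < reps; r++)
         add_arrays(y, x, n);          /* data stays hot after the first call */

     printf("y[17] = %f\n", y[17]);    /* keep the result observable */
     free(x);
     free(y);
     return 0;
 }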

2 answers

You don't do much arithmetic, so the execution time of your test code is memory-bound: you spend most of the time moving data between the CPU and memory.

Also, your n is very large at 2^29 elements, so the first- and second-level caches can't help you.

If you want to see improvements from SSE, use a smaller n so that you only touch 8 or 16 kilobytes of data. Also make sure the data is "hot", i.e. the processor has accessed it recently. That way the data does not have to be fetched from main memory but comes from the caches, which are several times faster.
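As a rough sizing sketch (my own numbers, not from the answer): with two float arrays the working set is 2 * n * sizeof(float) bytes, so n in the low thousands keeps everything inside a typical L1 data cache.

 /* Working-set arithmetic, illustrative only:
  *   n = 1024  ->  2 * 1024 * 4 bytes =  8 KB
  *   n = 2048  ->  2 * 2048 * 4 bytes = 16 KB
  */
 enum { CACHE_BUDGET_BYTES = 16 * 1024 };
 enum { N_SMALL = CACHE_BUDGET_BYTES / (2 * sizeof(float)) };  /* = 2048 */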

Alternatively, do a lot more arithmetic per element. That gives the memory prefetcher a chance to fetch data from main memory in the background while the CPU is busy with the math.
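For illustration, here is a sketch of that idea; the particular polynomial is arbitrary (my own choice, not from the answer) and exists only to add floating-point work per element loaded.

 /* Same traversal as the original add loop, but with several extra
  * multiply-adds per element, raising the arithmetic-to-memory ratio.
  */
 static void add_with_more_math(float *__restrict__ y,
                                const float *__restrict__ x,
                                unsigned long long n)
 {
     unsigned long long i;
     for (i = 0; i < n; i++) {
         float v = x[i];
         y[i] += ((0.5f * v + 1.0f) * v + 2.0f) * v;  /* extra FLOPs per load */
     }
 }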

To summarize: if the arithmetic is faster than your system can move data around, you won't see any benefit. Memory access time is the bottleneck, and the few cycles you save with the SSE instructions get lost in the memory-access noise.


There are several factors that determine how beneficial vectorizing the code will be. In this case (based on your output) the compiler only vectorizes one loop; I believe it is the second one, since the first one is typically skipped because it doesn't contain enough computation to make vectorization profitable.

The runtime you report is for the whole program, not a single loop, so the vectorized loop accounts for only part of the total run time. If you really want to see how much improvement you get from vectorization, I would suggest running a profiler such as AMD CodeXL, Intel VTune, OProfile, etc. It will tell you, for this specific loop, how much improvement you get in time and performance.
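If a profiler is not handy, one lightweight alternative (a sketch of my own, not something the answer prescribes) is to time just the loop of interest with POSIX clock_gettime and print the elapsed time:

 #include <stdio.h>
 #include <time.h>   /* clock_gettime; older glibc may need linking with -lrt */

 static double seconds_now(void)
 {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
 }

 /* Usage around the loop being measured:
  *
  *     double t0 = seconds_now();
  *     for (i = 0; i < n; i++)
  *         y[i] += x[i];
  *     double t1 = seconds_now();
  *     printf("loop time: %.3f s\n", t1 - t0);
  */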

I currently work on evaluating compiler vectorization, and I have seen loops speed up by as much as 60x from vectorization, while other speedups are far less impressive; it all depends on the loop, the compiler, and the architecture you use.

