I have spent the last few days reading about autovectorization with gcc 4.7. I followed some examples that I found on the Internet, and the setup seems to be correct. But when I actually run the code and compare it with vectorization enabled or disabled, there is no noticeable difference in run time.
Here is the code I worked with:
#include <string.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {

    long b = strtol(argv[2], NULL, 0);

    unsigned long long int i;
    unsigned long long int n = (int)pow(2,29);
    float total = 0;

    float *__restrict__ x1;
    float *__restrict__ y1;

    posix_memalign((void *)&x1, 16, sizeof(float)*n);
    posix_memalign((void *)&y1, 16, sizeof(float)*n);

    float *__restrict__ x = __builtin_assume_aligned(x1,16);
    float *__restrict__ y = __builtin_assume_aligned(y1,16);

    for (i=0;i<n;i++) {
        x[i] = i;
        y[i] = i;
    }

    for (i=0; i<n; i++) {
        y[i] += x[i];
    }

    printf("y[%li]: \t\t\t\t%f\n", b,y[b]);
    printf("correct answer: \t\t\t%f\n", (b)*2);

    return 0;
}
Some of this seems redundant to me, but it is necessary for the compiler to understand what is going on (especially the fact that the data is aligned). The variable "b", which is read from the command line, exists only because I was paranoid about the compiler optimizing the loop away entirely.
Here is the compiler command with vectorization enabled:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -ftree-vectorize -o v
Basically, this is equivalent to just using -O3. I spelled the flags out myself so that all I need to do is remove "-ftree-vectorize" to test the result without vectorization.
Here is the output of the -ftree-vectorizer-verbose flag, showing that the code is in fact being vectorized:
Analyzing loop at test.c:29

29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_store_cost: aligned.
29: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
29: cost model: Adding cost of checks for loop versioning aliasing.

29: Cost model analysis:
  Vector inside of loop cost: 4
  Vector outside of loop cost: 4
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2

29:   Profitability threshold = 3

Vectorizing loop at test.c:29

29: Profitability threshold is 3 loop iterations.
29: created 1 versioning for alias checks.

29: LOOP VECTORIZED.
Analyzing loop at test.c:24

24: vect_model_induction_cost: inside_cost = 2, outside_cost = 2 .
24: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
24: not vectorized: relevant stmt not supported: D.5806_18 = (float) D.5823_58;

test.c:7: note: vectorized 1 loops in function.
Note that vectorization is reported as profitable after 3 iterations, and I am running 2^29 ≈ 500,000,000 iterations. So I should expect the run time to be substantially different with vectorization disabled, right?
Well, here are the run times for the code (I ran it 20 times in a row):
59.082s
79.385s
57.557s
57.264s
53.588s
54.300s
53.645s
69.044s
57.238s
59.366s
56.314s
55.224s
57.308s
57.682s
56.083s
369.590s
59.963s
55.683s
54.979s
62.309s
Throwing away that strange ~370 s outlier, this gives an average run time of 58.7 s with a standard deviation of 6.0 s.
Next, I compile with the same command as before, but without the -ftree-vectorize flag:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -o nov
Again running the program 20 times in a row gives the following times:
69.471s
57.134s
56.240s
57.040s
55.787s
56.530s
60.010s
60.187s
324.227s
56.377s
55.337s
54.110s
56.164s
59.919s
493.468s
63.876s
57.389s
55.553s
54.908s
56.828s
Again throwing out the outliers, this gives an average run time of 57.9 s with a standard deviation of 3.6 s.
So these two versions have statistically indistinguishable run times.
Can someone tell me what I am doing wrong? Does the "profitability threshold" reported by the compiler not mean what I think it means? I really appreciate any help people can give me; I have been trying to figure this out for the past week.
EDIT
I implemented the change that @nilspipenbrinck suggested, and it seems to work. I moved the vectorized loop into a function and called that function a boatload of times. The respective run times are now 24.0 s (sigma < 0.1 s) without vectorization versus 20.8 s (sigma < 0.2 s) with vectorization, a 13% speed improvement. Not as much as I had hoped for, but at least now I know it works! Thank you for taking the time to look at my question and write an answer, I am very grateful.