I don't think your triple loop will be auto-vectorized. IMO the problems are:
- Memory is accessed through an object of type std::vector. AFAIK no compiler will auto-vectorize std::vector code unless the access operators [] or () are inlined, and even then it is not clear to me that the loop will be vectorized.
- Your code suffers from memory aliasing, i.e. the compiler does not know whether the memory you access through img is also accessed through another pointer, and this will most likely block vectorization. Basically, you need to use a plain double array and hint to the compiler that no other pointer refers to the same location. I think you can do this with __restrict. __restrict tells the compiler that this pointer is the only one pointing to this memory location, so writes through it cannot have side effects on data read through other pointers.
- The memory is not aligned by default, and even if the compiler manages to auto-vectorize, vectorized access to unaligned memory can be much slower than access to aligned memory. To get the most out of auto-vectorization, your memory must be aligned to a 32-byte boundary for AVX and to a 16-byte boundary for SSE, so simply always align to 32 bytes. This can be done dynamically with:
    double* buffer = NULL;
    posix_memalign((void**) &buffer, 32, size*sizeof(double));
    ...
    free(buffer);
In MSVC you can do this for a statically sized array with __declspec(align(32)) double array[size], but check the documentation of the specific compiler you use to make sure you are using the correct alignment directives.
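To make the aliasing and alignment points concrete, here is a minimal sketch, assuming a POSIX system and a compiler that accepts the __restrict spelling (GCC, Clang and MSVC do). The names alloc_aligned32 and scale_add are made up for illustration and are not from the question; whether the loop is actually vectorized still depends on your compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX, not standard C++)

// Hypothetical helper: allocate n doubles on a 32-byte boundary (enough for AVX).
// Returns nullptr on failure; release the buffer with free().
double* alloc_aligned32(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0) {
        return nullptr;
    }
    // Debug-only sanity check that the address really is 32-byte aligned.
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
    return static_cast<double*>(p);
}

// Hypothetical kernel: out[i] += a * in[i].
// __restrict is a promise to the compiler that out and in never overlap,
// so it does not have to assume a store to out[i] may change in[j].
void scale_add(double* __restrict out, const double* __restrict in,
               double a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] += a * in[i];
    }
}
```

If the data has to stay in a std::vector, one option is to pass v.data() to such a kernel, keeping in mind that the default allocator does not promise 32-byte alignment.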
Another important thing: if you use the GNU compiler, use the -ftree-vectorizer-verbose=6 flag to check whether your loop was auto-vectorized (newer GCC releases replace this flag with -fopt-info-vec). If you are using the Intel compiler, use -vec-report5. Note that there are several verbosity levels, which is what the numbers 6 and 5 select, so check the compiler documentation. The higher the verbosity level, the more vectorization information you get for each loop in your code, but the slower the compiler will be in Release mode.
In general, I have always been surprised at how hard it is to get the compiler to auto-vectorize. A common mistake is to assume that because a loop looks canonical, the compiler will vectorize it automatically.
UPDATE: one more thing: make sure your img is actually page-aligned, with posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double)); (which implies AVX and SSE alignment). The problem is that if you have a large image, this loop is likely to cross page boundaries at run time, which is also very expensive. I think these are the so-called TLB misses.
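The page-aligned allocation can be wrapped up as follows. This is only a sketch under the same POSIX assumption; alloc_page_aligned and is_aligned are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <unistd.h>   // sysconf, _SC_PAGESIZE (POSIX)

// Hypothetical helper: true if p is a multiple of alignment (alignment > 0).
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: allocate n doubles starting on a page boundary.
// Returns nullptr on failure; release the buffer with free().
double* alloc_page_aligned(std::size_t n) {
    const long page = sysconf(_SC_PAGESIZE);   // page size in bytes, -1 on error
    if (page <= 0) {
        return nullptr;
    }
    void* p = nullptr;
    if (posix_memalign(&p, static_cast<std::size_t>(page),
                       n * sizeof(double)) != 0) {
        return nullptr;
    }
    // Pages are far larger than 32 bytes, so this buffer is also aligned
    // for AVX (32 bytes) and SSE (16 bytes); checked in debug builds only.
    assert(is_aligned(p, 32));
    return static_cast<double*>(p);
}
```

Page alignment only fixes where the buffer starts; a large image still spans many pages, so the access pattern of the loop matters as well.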