Explicit vectorization

As I understand it, most modern compilers will automatically use SIMD instructions for loops if I set the corresponding compiler flag. But since the compiler may only vectorize when it can prove that doing so does not change the program's semantics, it will skip vectorization in cases where I know it is safe but the compiler, for various reasons, cannot convince itself of that.
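
A typical case is pointer aliasing. In a loop like the illustrative snippet below (not code from my project), the compiler cannot prove that dest does not overlap src1 or src2, so it may refuse to vectorize, or only vectorize behind a runtime overlap check:

    // Illustration: the compiler cannot assume dest does not alias src1/src2,
    // so it must either skip vectorization or guard it with a runtime check,
    // even if the caller knows the arrays never overlap.
    void add(double* dest, const double* src1, const double* src2, int n) {
        for (int i = 0; i < n; ++i) {
            dest[i] = src1[i] + src2[i];
        }
    }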

Are there explicit vectorization constructs that I can use in plain C++, without libraries, that let me process the data in vectorized form myself rather than relying on the compiler? I imagine it would look something like this:

    double* dest;
    const double *src1, *src2;
    // ...
    for (std::size_t i = 0; i < n; i += vectorization_size / sizeof(double)) {
        vectorized_add(&dest[i], &src1[i], &src2[i]);
    }
2 answers

Plain C++? No. std::valarray can lead your compiler to SIMD water, but it cannot make it drink.
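
For what it's worth, here is a minimal sketch of the std::valarray route; the element-wise arithmetic gives the compiler an obvious vectorization opportunity, but the standard makes no promise that it will take it:

    #include <valarray>

    // Element-wise addition via std::valarray; whether this becomes SIMD
    // code is entirely up to the compiler and its optimization flags.
    std::valarray<double> add(const std::valarray<double>& a,
                              const std::valarray<double>& b) {
        return a + b;
    }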

OpenMP is the least "library-like" library: it is more of a language extension than a library, and all major C++ compilers support it. While it is primarily and historically used for multi-core parallelism, OpenMP 4.0 introduced SIMD-specific constructs that can at least strongly encourage your compiler to vectorize clearly vectorizable operations, even inside otherwise explicitly scalar routines. It can also help you identify the aspects of your code that are preventing compiler vectorization. (And besides... are you sure you don't also want multi-core parallelism?)

    double* dest;
    const double *src1, *src2;
    // compile with -fopenmp (or -fopenmp-simd for just the SIMD pragmas)
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        dest[i] = src1[i] + src2[i];
    }

Going the last mile, with reduced-precision operations, multi-level aggregation, branchless masking, and the like, really does require coding to the underlying instruction set explicitly and is impossible with anything close to "plain C++". But OpenMP can get you pretty far before that point.
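
For the record, "coding to the underlying instruction set" looks roughly like this sketch using x86 AVX intrinsics; it assumes an AVX-capable CPU, n a multiple of 4, and a flag such as -mavx, and it is neither portable nor plain C++ in any meaningful sense:

    #include <immintrin.h>  // x86 AVX intrinsics

    // Adds 4 doubles per iteration with explicit 256-bit vector instructions.
    // Assumes n is a multiple of 4; unaligned loads/stores keep it simple.
    void add_avx(double* dest, const double* src1, const double* src2, int n) {
        for (int i = 0; i < n; i += 4) {
            __m256d a = _mm256_loadu_pd(src1 + i);
            __m256d b = _mm256_loadu_pd(src2 + i);
            _mm256_storeu_pd(dest + i, _mm256_add_pd(a, b));
        }
    }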


TL;DR: There are no guarantees, but keep it simple (KISS) and you are likely to get highly optimized code. Measure, and inspect the generated code, before rewriting anything by hand.

You can experiment with this in online compilers; for example, gcc.godbolt shows that GCC 5.2 with -O3 vectorizes the following simple call to std::transform:

    #include <algorithm>

    const int sz = 1024;

    void f(double* src1, double* src2, double* dest) {
        std::transform(src1 + 0, src1 + sz, src2, dest,
                       [](double lhs, double rhs) { return lhs + rhs; });
    }

There was a similar Q&A just this week. The main theme seems to be that on modern processors and compilers, the simpler your code (a plain call to a standard algorithm), the more likely you are to get highly optimized (vectorized, unrolled) code.
